diff --git a/language-tips/R/scalability.md b/language-tips/R/scalability.md
deleted file mode 100644
index e69de29..0000000
diff --git a/language-tips/python/missingness.md b/language-tips/python/missingness.md
deleted file mode 100644
index 2a89adb..0000000
--- a/language-tips/python/missingness.md
+++ /dev/null
@@ -1,86 +0,0 @@
-### Missing Values in Python
-Datasets often have missing values, and different languages handle missingness
-differently. This doc is intended as an introduction to how Python, numpy, and
-pandas represent missing data. A more in-depth guide can be found
-[here](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html).
-
-#### Pandas
-Technically, pandas doesn't have its own implementation of a missing value;
-instead, it uses two existing null values as sentinels for missingness:
-
-1. numpy's `np.nan` float
-2. Python's `None` object
-
-By using two different types of sentinel values, a float and an object, pandas
-covers missingness for most data types we might encounter in a dataset with
-little additional overhead.
-
-#### [`np.nan`](https://numpy.org/doc/stable/user/misc.html)
-"NaN" means "not a number", and numpy uses it to label a missing numeric
-datapoint. `np.nan` is itself a floating-point value, so an array holding it
-must use a float dtype even when the surrounding values are integers.
-
-##### Why is that helpful?
-We aren't always going to store integer data, and when we store numeric data
-of non-integer type, we can be certain our missing token has been allocated
-enough room in the data structure. You probably wouldn't consciously notice
-this at runtime, but converting from a smaller dtype that uses little memory
-to one that needs more space creates a delay while the structure makes room.
-If, on the other hand, our placeholder reserves the slightly larger slot up
-front, the operational cost is minimal.
-
-In case that is still too technical, think about it like reserving a meeting
-room. Let's say we have a team of 10 people working on a project, and we
-decide to meet in person to discuss it. Initially, only 4 people say they can
-make it, so the smaller meeting room is chosen. But when it's meeting day,
-2 more people say they can make it, and now you need room for 6. Depending on
-how many meeting rooms are available, this could be a big problem, but most
-likely it just takes some shuffling around and a slight delay to the meeting.
-_Floating-point numbers leave more room for numeric info than integers do, so
-the `np.nan` placeholder is read as a float._
-
-#### [`None`](https://docs.python.org/3/c-api/none.html)
-`None` is Python's built-in singleton meaning "no value". Because it is a
-Python object rather than a numeric type, pandas only uses it as a sentinel in
-arrays of dtype `object`. Operations on object arrays fall back to slow
-Python-level loops, and aggregations like `sum()` on a numpy object array
-containing `None` raise a `TypeError`, which is why pandas prefers `np.nan`
-wherever the data allow it.
-
-#### upcasting
-In some cases, pandas will switch between the two chosen sentinel values when
-an alternate might be more efficient, and this is helpful to know when we're
-manipulating our data.
-
-Let's think about an example of when this might happen. Say we have an array
-full of integers, and we want to insert a placeholder for a value we don't
-have yet.
-
-At first, we have an array with dtype `integer`. But after we insert the
-placeholder, `np.nan`, and re-evaluate the dtype, we find it has been changed
-to `float`. Remember how `np.nan` is stored as a float? The dtype of the whole
-array gets upcast by this operation to accommodate the floating-point sentinel.
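-
-A minimal sketch of that upcast in action (the dtype names assume a 64-bit
-platform):
-
-```
-import numpy as np
-import pandas as pd
-
-arr = pd.Series([1, 2, 3])
-print(arr.dtype)    # int64: plain integer storage
-
-arr = pd.Series([1, np.nan, 3])
-print(arr.dtype)    # float64: upcast to make room for the NaN sentinel
-```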
-
-Here is a table summarizing the upcasting scenarios:
-
-| arr.dtype before | arr.dtype after | sentinel used |
-| --- | --- | --- |
-| `float` | no change | `np.nan` |
-| `object` | no change | `np.nan`, `None` |
-| `integer` | `float` | `np.nan` |
-| `boolean` | `object` | `np.nan`, `None` |
-
-#### operations
-These methods are available on both `pd.Series` and `pd.DataFrame`; the
-masking example below reads most naturally with `data` as a `Series`. A short
-demo follows the list.
-
-I. Arithmetic
-  - `np.nan` propagates through arithmetic: any operation involving it (e.g.
-    `1 + np.nan`) returns `nan`, while pandas aggregations such as `sum()`
-    and `mean()` skip missing values by default
-II. Boolean detection
-  - `data.isnull()`
-    - returns a Boolean mask over the data: `True` wherever a value is missing
-  - `data.notnull()`
-    - the inverse: `True` wherever a value is present
-  - `data[data.notnull()]`
-    - uses the `notnull()` mask as an index and returns only the non-null
-      datapoints
-III. Convenience
-  - `data.dropna()`
-    - removes missing values; on a DataFrame, it drops any row containing a
      null in any column
-    - **Note:** there are some useful optional parameters to pass to
      `dropna()` if the default approach is not ideal, namely `how` and
      `thresh`
-  - `data.fillna()`
-    - fills missing values with the value passed as an argument
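-
-A quick demonstration of these operations on a small `Series` (the values are
-made up for illustration):
-
-```
-import numpy as np
-import pandas as pd
-
-data = pd.Series([1, np.nan, 2, None])
-
-print(data.isnull())           # True at positions 1 and 3
-print(data[data.notnull()])    # just the values 1.0 and 2.0
-print(data.dropna())           # same result as the masking above
-print(data.fillna(0))          # missing values replaced with 0
-```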
-
-##### done.
diff --git a/language-tips/python/scalability.md b/language-tips/python/scalability.md
deleted file mode 100644
index 96f0544..0000000
--- a/language-tips/python/scalability.md
+++ /dev/null
@@ -1,130 +0,0 @@
-### Python
-Because Python is dynamically typed, the interpreter has to resolve types at
-runtime. This can result in significant performance bottlenecks if we aren't
-careful about exactly how our code is implemented.
-
-This document shows some samples from:
-- [Python Performance Tuning: 20 Simple Tips](https://stackify.com/20-simple-python-performance-tuning-tips/)
-- [High Performance Pandas: eval() and query()](https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html)
-
-## *Tips*
-- Purge unused dependencies
-- Use as few global variables as possible
-- [Built-ins](https://docs.python.org/3/library/functions.html)
-  - Don't write your own version of a built-in method that does exactly the
    same thing! The built-in version will be faster and include better error
    handling than a custom implementation.
-- Utilize memory profilers to identify bottlenecks in your code
-
-## *Tricks*
-
-> Goal: Make a list of integers in a given range
-
-Possible solution:
-```
-indices = []
-for i in range(len(some_list)):
-    indices.append(i)
-```
-Better solution:
-```
-indices = [i for i in range(len(some_list))]
-```
-&nbsp;
-
-> Goal: Check if an exact match for a value is in a list
-
-Possible solution:
-```
-target = 5
-for val in some_list:
-    if val == target:
-        pass  # do work
-```
-OR
-```
-target = 5
-if target in set(some_list):
-    pass  # do work
-```
-Better solution:
-```
-target = 5
-if target in some_list:
-    pass  # do work
-```
-&nbsp;
-
-> Goal: Find values in one list that are also present in another
-
-Possible solution:
-```
-dupes = []
-for x in left_list:
-    for y in right_list:
-        if x == y:
-            dupes.append(x)
-```
-Better solution:
-```
-dupes = set(left_list) & set(right_list)
-```
-&nbsp;
-
-> Goal: Assign multiple values in one call
-
-Possible solution:
-```
-def format_full_name(some_name):
-    lower_name = some_name.lower()
-    return lower_name.split(" ")
-
-name_list = format_full_name("Some Guys Name")
-first = name_list[0]
-middle = name_list[1]
-last = name_list[2]
-```
-Better solution:
-```
-def format_full_name(some_name):
-    lower_name = some_name.lower()
-    return lower_name.split(" ")
-
-first, middle, last = format_full_name("Some Guys Name")
-```
-&nbsp;
-
-> Goal: Swap the contents of two variables
-
-Possible solution:
-```
-temp = x
-x = y
-y = temp
-```
-Better solution:
-```
-x, y = y, x
-```
-&nbsp;
-
-> Goal: Combine multiple string values
-
-Possible solution:
-```
-full_name = first + " " + middle + " " + last
-```
-OR
-```
-def rebuild_full_name(a_first, a_middle, a_last):
-    return a_first + " " + a_middle + " " + a_last
-
-full_name = rebuild_full_name(first, middle, last)
-```
-Better solution:
-```
-def rebuild_full_name(a_first, a_middle, a_last):
-    return " ".join([a_first, a_middle, a_last])
-
-full_name = rebuild_full_name(first, middle, last)
-```
-
-# done.
diff --git a/language-tips/python/set-operations.md b/language-tips/python/set-operations.md
deleted file mode 100644
index 0821e5a..0000000
--- a/language-tips/python/set-operations.md
+++ /dev/null
@@ -1,19 +0,0 @@
-### Set operations in python
-Set objects are useful for a number of reasons, and they come with some handy
-operations for exploring relational data.
-
-#### Features
-In the simplest case, let's say we are iterating through some data and we
-want to capture all unique values we find that meet a particular condition.
-
-If we use a `list`, appending a new item has the lowest overhead,
-[O(1)](https://wiki.python.org/moin/TimeComplexity), but we will have to
-deduplicate the list contents after building the collection.
-
-If we use a `dict`, we will see decent performance, average O(1) insertion
-with keys deduplicated for free, but we waste space storing placeholder
-values we never need. A `set` gives us the same average O(1) insertion and
-automatic deduplication without the dummy values, which makes it the natural
-fit here.
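-
-A small sketch of the trade-off; `records` and the condition are stand-ins
-for whatever data and filter you actually have:
-
-```
-records = ["a", "b", "a", "c", "b"]
-
-# list approach: cheap appends, but needs a deduplication pass at the end
-found = []
-for r in records:
-    if r != "c":          # some condition of interest
-        found.append(r)
-unique_found = list(set(found))
-
-# set approach: deduplication happens on every add, no cleanup pass needed
-found = set()
-for r in records:
-    if r != "c":
-        found.add(r)
-```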
-
-**Note**: Sets and dicts both use `{}` as their literal syntax, so Python
-decides which type you get from the shape of the literal itself: key-value
-pairs (`{'a': 1}`) produce a dict, bare values (`{'a', 'b'}`) produce a set,
-and an empty `{}` is always a dict.
-As an example of this issue, let's say we want to use a dictionary as a
-template for the info we want to capture:
-`info = {'name', 'address', 'phone'}`
-What's wrong with this implementation? What is going to happen when we run
-`info['name'] = 'Kenny'`?
-`TypeError: 'set' object does not support item assignment`
-Because the literal contains bare values, `info` is a set, not a dict.
-Instead, when we want to outline the keys a dictionary should have before we
-have corresponding values, we need to give each key a placeholder value, e.g.
-`info = {'name': None, 'address': None, 'phone': None}`, or equivalently
-`info = dict.fromkeys(['name', 'address', 'phone'])`.
-
-#### Functions
-Several operations are available that come with statistical context and map
-directly onto the relations of mathematical set theory: union (`a | b`),
-intersection (`a & b`), difference (`a - b`), and symmetric difference
-(`a ^ b`), each with a named-method equivalent such as `a.union(b)`.
diff --git a/tips/files/parquet.md b/tips/files/parquet.md
new file mode 100644
index 0000000..d1aa71a
--- /dev/null
+++ b/tips/files/parquet.md
@@ -0,0 +1,27 @@
+Author: LB
+Maintainer: BP
+
+## Parquet Files
+What is Parquet? According to Databricks.com, "Apache Parquet is an open
+source, column-oriented data file format designed for efficient data storage
+and retrieval. It provides efficient data compression and encoding schemes
+with enhanced performance to handle complex data in bulk".
+
+## What this means
+Parquet files read, store, and write data by column rather than by row. It
+turns out that passing data this way is less expensive than working in a
+row-oriented direction.
+
+## Benefits
+According to Russell Jurney, a column-oriented format stores each column of
+data together and can load columns one at a time. This leads to two
+performance optimizations:
+ 1. You only pay for the columns you load. This is called columnar storage.
+    Let m be the total number of columns in a file and n be the number of
+    columns requested by the user. Loading only n columns results in just
+    n/m of the raw I/O volume.
+ 2. The similarity of values within a column results in more efficient
+    compression. This is called columnar compression.
+
+Note the event_type column in both row- and column-oriented formats in the
+diagram below. A compression algorithm will have a much easier time
+compressing repeats of the value `party` in this column if they make up the
+entire stored run, as in the column-oriented format. By contrast, the
+row-oriented format requires the compression algorithm to figure out that
+repeats occur at some offset in the row, an offset that varies with the
+values in the previous columns. This is a much more difficult task.
+
+(diagram: row-oriented vs. column-oriented storage layouts)
+
+The column-oriented storage format can load just the columns of interest.
+Within those columns, similar or repeated values such as `party` within the
+`event_type` column compress more efficiently. Columnar storage combines with
+columnar compression to produce dramatic performance improvements for most
+applications that do not require every column in the file. I have often used
+PySpark to load CSV or JSON data that took a long time to load, converted it
+to Parquet format, and found that using it afterward with PySpark, or even on
+a single computer in Pandas, became quick and painless.
+
+More later...
+- Pyarrow
+- Fastparquet
+
+done.
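+
+As a preview of the Pyarrow route mentioned above, a minimal sketch of
+columnar loading; the file name and column names here are hypothetical:
+
+```
+import pyarrow.parquet as pq
+
+# Read only the columns we need: I/O scales with n/m, not with the full file.
+table = pq.read_table("events.parquet", columns=["event_type", "timestamp"])
+df = table.to_pandas()
+```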