Binary file added numpy_and_pandas/figures/row_column_ordering.png
712 changes: 712 additions & 0 deletions numpy_and_pandas/numpy_and_pandas.ipynb

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions pandas_vs_PostgreSQL/README.md
@@ -0,0 +1,17 @@
# pandas vs. PostgreSQL
A benchmark comparing pandas and PostgreSQL for various table sizes and commands.

## Getting Started
To start the benchmark, run `run_script.sh`. The number of test replicates and the size of the table/DataFrame can be adjusted in `run_script.sh`. The tables/DataFrames are created from CSV files. If these CSV files do not exist, `run_script.sh` creates them by calling `create_dataset.py`.
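
For example, assuming a Bash-compatible shell at the root of the repository:

```
bash run_script.sh
```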

### Prerequisites
Apart from having pandas and PostgreSQL installed, you need the following Python packages.

```
contexttimer
numpy
psycopg2
```
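
These can be installed with pip, for example:

```
pip install contexttimer numpy psycopg2
```

Note that building `psycopg2` from source may require the PostgreSQL development headers to be installed.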

### The Benchmark
The benchmark is performed for the following tasks: loading a CSV file, selecting a column, filtering rows, grouping with an aggregate function, and joining two tables. The results are stored in separate JSON files for pandas and PostgreSQL. Each JSON file contains the results for every task and table size. A write-up of the results is found in the `analysis_writeup.ipynb` notebook.
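
For example, the raw timings in the PostgreSQL results file (named following the convention used by the benchmark scripts) can be pretty-printed with Python's built-in `json.tool`:

```
python -m json.tool results/postgres_benchmark.json | head
```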
112 changes: 112 additions & 0 deletions pandas_vs_PostgreSQL/analysis_writeup.ipynb
@@ -0,0 +1,112 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# pandas vs PostgreSQL\n",
"\n",
"Working on large data science projects usually involves accessing, manipulating, and retrieving data on a server. The workflow then moves client-side, where the user applies more refined data analysis and processing, typically tasks that are not possible, or are too clumsy, to perform on the server. SQL (Structured Query Language) is ubiquitous in industry, and data scientists will have to use it in their work to access data on the server.\n",
"\n",
"The line between what data manipulation should be done server-side using SQL and what should be done client-side using a language like Python is not clear. Further, people who are uncomfortable with SQL or dislike using it may be tempted to keep server-side manipulation to a minimum and reserve those actions for the client side. With powerful and popular Python libraries for data wrangling and manipulation, that temptation has only increased.\n",
"\n",
"This article compares the execution time of several typical data manipulation tasks, such as join and group by, using PostgreSQL and pandas. PostgreSQL, often shortened to Postgres, is an object-relational database management system; it is free and open-source and runs on all major operating systems. Pandas is a Python data manipulation library that offers data structures akin to Excel spreadsheets and SQL tables, along with functions for manipulating those data structures.\n",
"\n",
"The performance of both tools will be measured for the following actions:\n",
"\n",
"- select columns\n",
"- filter rows\n",
"- group by and aggregation\n",
"- load a large CSV file\n",
"- join two tables\n",
"\n",
"How these tasks scale as a function of table size will be explored by running the analysis on datasets ranging from ten to ten million rows. These datasets are stored as CSV files and have four columns; the entries of the first two columns are floats, the third are strings, and the last are integers representing a unique id. For joining two tables, a second dataset is used, consisting of a column of integer ids and a column of floats.\n",
"\n",
"For each of the five tasks listed above, the benchmark test was run for one hundred replicates at each dataset size. The Postgres part of the benchmark was performed using pgbench, a command-line program for benchmarking Postgres that accepts custom scripts containing the SQL queries to be timed. The computer used for this study runs Ubuntu 16.04 with 16 GB of RAM and an 8-core processor at 1.8 GHz. The code used for this benchmark can be found on [GitHub](https://github.com/xofbd/pandas_vs_PostgreSQL). The repository contains all the code to run the benchmark, the results as JSON files, and figures comparing the two tools.\n",
"\n",
"It is important for data scientists to know the limitations of their tools and which approaches are optimal in terms of time. Although smaller projects will not benefit much from a speed-up, small percentage gains in more data-intensive applications translate into large absolute time savings."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Benchmark Results\n",
"\n",
"The benchmarking relied on two datasets, referred to as A and B. The headers and data types (in parentheses) for the columns of dataset A are \"score_1\" (float), \"score_2\" (float), \"id\" (integer), and \"section\" (string). For dataset B, the headers and data types are \"score_3\" (float) and \"id\" (integer). The \"id\" column relates the two datasets together. The figures below show the mean execution time as a function of the number of rows in the datasets, using log-log axes. Below each figure, a description of the task and the code used for each tool are provided.\n",
"\n",
"<img src='figures/select_results_plot.png' style=\"width: 600px;\">\n",
"For selecting columns, one column from the table/DataFrame was returned. The code for this task is:\n",
"<br> __pandas__: `df_A['score_1']`\n",
"<br> __Postgres__: `SELECT score_1 FROM test_table_A;`\n",
"\n",
"<img src='figures/filter_results_plot.png' style=\"width: 600px;\">\n",
"For filtering rows, a table/DataFrame was returned containing only the rows meeting a criterion. The code for this task is:\n",
"<br> __pandas__: `df_A[df_A['section'] == 'A']`\n",
"<br> __Postgres__: `SELECT * FROM test_table_A WHERE section = 'A';`\n",
"\n",
"<img src='figures/groupby_agg_results_plot.png' style=\"width: 600px;\">\n",
"For grouping and applying aggregation functions, records are grouped by section, and the mean of score_1 and the maximum of score_2 are reported. The code for this task is:\n",
"<br> __pandas__: `df_A.groupby('section').agg({'score_1': 'mean', 'score_2': 'max'})`\n",
"<br> __Postgres__: `SELECT AVG(score_1), MAX(score_2)` \n",
"<br> `FROM test_table_A` \n",
"<br> `GROUP BY section;`\n",
"\n",
"<img src='figures/join_results_plot.png' style=\"width: 600px;\">\n",
"For joining, the two datasets are joined on their id column and the resulting table/DataFrame is returned. The code for this task is:\n",
"<br> __pandas__: `df_A.merge(df_B, left_on='id', right_on='id')` \n",
"<br> __Postgres__: `SELECT * FROM test_table_A` \n",
"<br> `JOIN test_table_B on test_table_A.id = test_table_B.id;`\n",
"\n",
"<img src='figures/load_results_plot.png' style=\"width: 600px;\">\n",
"For loading a CSV file, dataset A is loaded from disk to create a table/DataFrame. The code for this task is:\n",
"<br> __pandas__: `df_A = pd.read_csv('PATH_TO_CSV_FILE', header=None, index_col=False, names=columns_A)`\n",
"<br> __Postgres__: `COPY test_table_A FROM 'PATH_TO_CSV_FILE' WITH DELIMITER ',';`"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"## Conclusions\n",
"\n",
"Overall, pandas outperformed Postgres, often running five to ten times faster for the larger datasets. The only cases where Postgres performed better were for smaller datasets, typically fewer than a thousand rows. Selecting columns was very efficient in pandas, with O(1) time complexity, because the DataFrame is already stored in memory. In general, loading and joining were the tasks that took the longest, requiring more than a second for large datasets.\n",
"\n",
"For the dataset sizes investigated, pandas is the better tool for the data analysis tasks studied. However, pandas does have its limitations, and there is still a need for SQL. With pandas, the data is stored in memory, and it will be difficult to load a CSV file larger than roughly half of the system's memory. For the ten-million-row dataset, the file size was about 400 MB, but that dataset only had four columns. Datasets often contain hundreds of columns, resulting in file sizes on the order of 10 GB when the dataset has over a million rows.\n",
"\n",
"Postgres and pandas are ultimately different tools with overlapping functionality. Postgres and other SQL-based systems were created to manage databases and to offer users a convenient way to access and retrieve data, especially across multiple tables. The server running Postgres would have all the datasets stored as tables, and it would be impractical for a user to transfer the required tables to their own machine and use pandas to perform tasks such as join and group by client-side. Pandas was created for data manipulation, and its strength lies in complex data analysis operations. One should not view pandas and Postgres as competing entities, but rather as complementary tools that make up the data science computational stack."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
71 changes: 71 additions & 0 deletions pandas_vs_PostgreSQL/benchmark_test.py
@@ -0,0 +1,71 @@
import re
from contexttimer import Timer
from pandas_tasks import PandasTasks
from postgres_tasks import PostgresTasks


def run_test(tool, csv_file_A, csv_file_B, N=10):
"""Return dictionary of benchmark results and number of rows in the dataset.


Positional arguments:
tool: tool to use for benchmark (pandas or postgres)
csv_file_A: csv file name to use for DataFrame/table A
csv_file_B: csv file name to use for DataFrame/table B

Keyword arguments:
N: number of test replicates
"""

# define tool to use
if tool.lower() == 'pandas':
tool_task = PandasTasks(csv_file_A, csv_file_B)
elif tool.lower() in ('postgresql', 'postgres', 'psycopg2'):
tool_task = PostgresTasks(csv_file_A, csv_file_B)
else:
raise ValueError("tool must either be pandas or postgres")

# loop through each task
tasks = ('select', 'filter', 'groupby_agg', 'join', 'load')
benchmark_dict = {}
num_rows = int(re.findall(r'\d+', csv_file_A)[0])

for task in tasks:
print "running " + task + " for " + str(num_rows) + " rows using " + tool
task_time = []

for _ in xrange(N):
with Timer() as t:
getattr(tool_task, task)()
task_time.append(t.elapsed)

benchmark_dict[task] = task_time

tool_task.clean_up()

return benchmark_dict, num_rows
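
# Example invocation (tool and replicate count are illustrative):
#   python benchmark_test.py pandas 100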

if __name__ == '__main__':
import json
import os
import sys

tool = sys.argv[1].lower()
num_reps = int(sys.argv[2])

# get csv files names
files_A = os.listdir('csv/A')
files_B = os.listdir('csv/B')
files_A.sort()
files_B.sort()

result_dict = {}

for f_A, f_B in zip(files_A, files_B):
results, row = run_test(
tool, 'csv/A/' + f_A, 'csv/B/' + f_B, N=num_reps)
result_dict[str(row)] = results

# dump dictionary to json
with open('results/' + tool + '_benchmark.json', 'w') as f:
json.dump(result_dict, f)
29 changes: 29 additions & 0 deletions pandas_vs_PostgreSQL/create_dataset.py
@@ -0,0 +1,29 @@
import numpy as np
import pandas as pd


def create_csv(n=1000):
np.random.seed(1)  # fix the random seed for reproducibility

# generate random dataset
columns = ('id', 'section', 'score_1', 'score_2')
labels = ('A', 'B', 'C', 'D')

id = np.random.choice(range(n), n, replace=False)
section = np.random.choice(labels, n)
score_1 = np.random.rand(n)
score_2 = np.random.rand(n)
score_3 = np.random.rand(n)

# create and dump DataFrame to csv
df_A = pd.DataFrame(dict(zip(columns, (id, section, score_1, score_2))))
df_B = pd.DataFrame(dict(zip(('id', 'score_3'), (id, score_3))))
df_A.to_csv('csv/A/test_A_' + str(n) +
'_rows.csv', index=False, header=False)
df_B.to_csv('csv/B/test_B_' + str(n) +
'_rows.csv', index=False, header=False)
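
# Example invocation (row count is illustrative; assumes the csv/A and csv/B
# directories already exist):
#   python create_dataset.py 100000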

if __name__ == '__main__':
import sys

create_csv(int(sys.argv[1]))
34 changes: 34 additions & 0 deletions pandas_vs_PostgreSQL/create_json.py
@@ -0,0 +1,34 @@
import json
import os
import re
import sys
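
# Collect the per-transaction logs produced by pgbench (moved into log/ by
# pgbench_queries.sh) and aggregate them into a single JSON results file,
# keyed by table size and task.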

csv_files = os.listdir('csv/A/')
log_files = os.listdir('log/')

num_rows = [re.findall(r'\d+', f)[0] for f in csv_files]
results = {}

# loop through each number of rows, loading the appropriate log files for each
# task. The 3rd column of each line of the log is the per-transaction execution
# time in microseconds, which is converted to seconds below.

for n in num_rows:
S = '_' + n + '.log'
results_for_row = {}

for f in [log for log in log_files if S in log]:
task = re.findall(r'(_\w+_)', f)[0][1:-1]
time = []

with open('log/' + f, 'r') as file:
for line in file:
time.append(float(line.split(" ")[2]) / 1E6)

results_for_row[task] = time

results[n] = results_for_row

# dump results to JSON
with open('results/postgres_benchmark.json', 'w') as f:
json.dump(results, f)
38 changes: 38 additions & 0 deletions pandas_vs_PostgreSQL/pandas_tasks.py
@@ -0,0 +1,38 @@
import pandas as pd


class PandasTasks(object):

def __init__(self, csv_file_A, csv_file_B):
self.csv_file_A = csv_file_A
self.csv_file_B = csv_file_B
self.columns_A = ('id', 'score_1', 'score_2', 'section')
self.columns_B = ('id', 'score_3')

self.df_A = pd.read_csv(csv_file_A, header=None, index_col=False,
names=self.columns_A)
self.df_B = pd.read_csv(csv_file_B, header=None, index_col=False,
names=self.columns_B)

def load(self):
self.df_A = pd.read_csv(self.csv_file_A, header=None, index_col=False,
names=self.columns_A)

def select(self):
self.df_A['score_1']

def filter(self):
self.df_A[self.df_A['section'] == 'A']

def groupby_agg(self):
self.df_A.groupby('section').agg({'score_1': 'mean', 'score_2': 'max'})

def join(self):
self.df_A.merge(self.df_B, left_on='id', right_on='id')

def get_num_rows(self):
return len(self.df_A)

def clean_up(self):
del(self.df_A)
del(self.df_B)
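
# Example usage (file names are illustrative, following create_dataset.py's naming):
#   tasks = PandasTasks('csv/A/test_A_1000_rows.csv', 'csv/B/test_B_1000_rows.csv')
#   tasks.groupby_agg()
#   tasks.clean_up()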
53 changes: 53 additions & 0 deletions pandas_vs_PostgreSQL/pgbench_queries.sh
@@ -0,0 +1,53 @@
#!/bin/bash
#
# Run PostgreSQL benchmark using pgbench. The only positional parameter is
# the number of benchmark replicates.

# initialize directories and remove old pgbench_log files
if [ ! -d log ]; then
mkdir log
fi

rm -f pgbench_log*

# loop through all csv files
for file_A in csv/A/*; do

# initialize table and variables
psql -U $USER -d $USER -f queries/init.sql > /dev/null
num_rows=$(echo $file_A | grep -oP '\d+')

# create and run loading csv query files
file_B="csv/B/test_B_"$num_rows"_rows.csv"
query_A="COPY test_table_A FROM '$PWD/$file_A' WITH DELIMITER ',';"
query_B="COPY test_table_B FROM '$PWD/$file_B' WITH DELIMITER ',';"

echo "DELETE FROM test_table_A;" > queries/load_all.sql
echo "DELETE FROM test_table_B;" >> queries/load_all.sql
echo "$query_A" >> queries/load_all.sql
echo "$query_B" >> queries/load_all.sql

echo "DELETE FROM test_table_A;" > queries/load.sql
echo "$query_A" >> queries/load.sql

psql -U $USER -d $USER -f queries/load_all.sql > /dev/null

# benchmark each task
for task in select filter groupby_agg join load; do
echo "running "$task" for "$num_rows" rows using Postgres"
pgbench -ln -t $1 -f queries/$task.sql > /dev/null
mv pgbench_log* log/pgbench_$task"_"$num_rows".log"
done

# drop tables
psql -U $USER -d $USER -f queries/clean_up.sql
done

# format results from logs and clean up
python create_json.py

if [ -d log ]; then
rm -rf log
fi