V0.12.2 (#96)
* JOSS: Badge

* Docs: Runs with TimescaleDB

* Dashboard: Color by number of experiment run

* Dashboard: Client from Microservices

* JOSS: Statement of need clearer and DOIs corrected

* JOSS: Statement of need - DBs
perdelt authored Aug 1, 2022
1 parent 7459ed5 commit a5d7ff4
Showing 8 changed files with 168 additions and 16 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -3,6 +3,7 @@
[![PyPI version](https://badge.fury.io/py/dbmsbenchmarker.svg)](https://badge.fury.io/py/dbmsbenchmarker)
[![.github/workflows/draft-pdf.yml](https://github.com/Beuth-Erdelt/DBMS-Benchmarker/actions/workflows/draft-pdf.yml/badge.svg)](https://github.com/Beuth-Erdelt/DBMS-Benchmarker/actions/workflows/draft-pdf.yml)
[![DOI](https://zenodo.org/badge/213186578.svg)](https://zenodo.org/badge/latestdoi/213186578)
+[![JOSS](https://joss.theoj.org/papers/82d2fa62b5c3ec30014f6307cc731cdd/status.svg)](https://joss.theoj.org/papers/82d2fa62b5c3ec30014f6307cc731cdd)


# DBMS-Benchmarker
@@ -41,7 +42,7 @@ DBMS-Benchmarker
For more information, see a [basic example](#basic-usage) or take a look at the [documentation](https://dbmsbenchmarker.readthedocs.io/en/latest/Docs.html) for a full list of options.

The code uses several Python modules, in particular <a href="https://github.com/baztian/jaydebeapi" target="_blank">jaydebeapi</a> for handling DBMS.
-This module has been tested with Clickhouse, DB2, Exasol, Hyperscale (Citus), Kinetica, MariaDB, MariaDB Columnstore, MemSQL, Mariadb, MonetDB, MySQL, OmniSci, Oracle DB, PostgreSQL, SingleStore, SQL Server, SAP HANA and Vertica.
+This module has been tested with Clickhouse, DB2, Exasol, Hyperscale (Citus), Kinetica, MariaDB, MariaDB Columnstore, MemSQL, MonetDB, MySQL, OmniSci, Oracle DB, PostgreSQL, SingleStore, SQL Server, SAP HANA, TimescaleDB and Vertica.

## Installation

5 changes: 4 additions & 1 deletion dashboard.py
@@ -166,13 +166,16 @@ def get_connections_by_filter(filter_by: str, e: inspector.inspector) -> dict:
    elif filter_by == 'GPU':
        connections_by_filter = e.get_experiment_list_connections_by_hostsystem('GPU')
    elif filter_by == 'Client':
-       connections_by_filter = e.get_experiment_list_connections_by_connectionmanagement('numProcesses')
+       connections_by_filter = e.get_experiment_list_connections_by_parameter('client')
+       #connections_by_filter = e.get_experiment_list_connections_by_connectionmanagement('numProcesses')
    elif filter_by == 'CPU':
        connections_by_filter = e.get_experiment_list_connections_by_hostsystem('CPU')
    elif filter_by == 'CPU Limit':
        connections_by_filter = e.get_experiment_list_connections_by_hostsystem('limits_cpu')
    elif filter_by == 'Docker Image':
        connections_by_filter = e.get_experiment_list_connections_by_parameter('dockerimage')
+   elif filter_by == 'Experiment Run':
+       connections_by_filter = e.get_experiment_list_connections_by_parameter('numExperiment')
    else:
        raise KeyError('filter_by')
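For illustration, here is a minimal sketch of how the new 'Experiment Run' grouping can be reached from the Python inspector API; the result folder and experiment code below are placeholders, not values from this commit:

```python
from dbmsbenchmarker import inspector

# Placeholder result folder and experiment code - adjust to a real experiment.
resultfolder = "results"
code = "1234512345"

# Load a finished experiment for evaluation.
evaluate = inspector.inspector(resultfolder)
evaluate.load_experiment(code)

# Group connections by experiment run, as the new 'Experiment Run' filter does.
connections_by_run = evaluate.get_experiment_list_connections_by_parameter('numExperiment')
for run, connections in connections_by_run.items():
    print(run, connections)
```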

2 changes: 1 addition & 1 deletion dbmsbenchmarker/layout.py
@@ -427,7 +427,7 @@ def serve_layout(preview) -> html.Div:
    html.Label('Color by', id='label_graph_colorby'),
    dcc.Dropdown(
        id='dd_graph_colorby',
-       options=[{'label': x, 'value': x} for x in ['DBMS', 'Node', 'Script', 'CPU Limit', 'Client', 'GPU', 'CPU', 'Docker Image']],
+       options=[{'label': x, 'value': x} for x in ['DBMS', 'Node', 'Script', 'CPU Limit', 'Client', 'GPU', 'CPU', 'Docker Image', 'Experiment Run']],
    ),

    html.Label('Boxpoints', id='label_boxpoints'),
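As context for this change, a minimal runnable sketch of how a Dash dropdown like `dd_graph_colorby` is wired to a callback; the `app` scaffolding and output component are assumptions for illustration, not code from this repository:

```python
from dash import Dash, dcc, html, Input, Output

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(
        id='dd_graph_colorby',
        options=[{'label': x, 'value': x}
                 for x in ['DBMS', 'Node', 'Script', 'CPU Limit', 'Client',
                           'GPU', 'CPU', 'Docker Image', 'Experiment Run']],
    ),
    html.Div(id='colorby_output'),  # hypothetical output component
])

@app.callback(Output('colorby_output', 'children'),
              Input('dd_graph_colorby', 'value'))
def show_colorby(filter_by):
    # The real dashboard passes this value on to get_connections_by_filter().
    return f'Coloring connections by: {filter_by}'

if __name__ == '__main__':
    app.run_server(debug=True)
```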
5 changes: 4 additions & 1 deletion dbmsbenchmarker/scripts/dashboardcli.py
@@ -168,13 +168,16 @@ def get_connections_by_filter(filter_by: str, e: inspector.inspector) -> dict:
    elif filter_by == 'GPU':
        connections_by_filter = e.get_experiment_list_connections_by_hostsystem('GPU')
    elif filter_by == 'Client':
-       connections_by_filter = e.get_experiment_list_connections_by_connectionmanagement('numProcesses')
+       connections_by_filter = e.get_experiment_list_connections_by_parameter('client')
+       #connections_by_filter = e.get_experiment_list_connections_by_connectionmanagement('numProcesses')
    elif filter_by == 'CPU':
        connections_by_filter = e.get_experiment_list_connections_by_hostsystem('CPU')
    elif filter_by == 'CPU Limit':
        connections_by_filter = e.get_experiment_list_connections_by_hostsystem('limits_cpu')
    elif filter_by == 'Docker Image':
        connections_by_filter = e.get_experiment_list_connections_by_parameter('dockerimage')
+   elif filter_by == 'Experiment Run':
+       connections_by_filter = e.get_experiment_list_connections_by_parameter('numExperiment')
    else:
        raise KeyError('filter_by')

22 changes: 22 additions & 0 deletions docs/DBMS.md
@@ -345,6 +345,28 @@ JDBC driver: https://mariadb.com/kb/en/about-mariadb-connector-j/
]
```

## TimescaleDB

https://www.timescale.com/

JDBC driver: https://jdbc.postgresql.org/

```
[
{
'name': 'TimescaleDB',
'info': 'This is a demo of TimescaleDB',
'active': True,
'JDBC': {
'driver': 'org.postgresql.Driver',
'url': 'jdbc:postgresql://localhost:5432/database',
'auth': ['user', 'password'],
'jar': 'jars/postgresql-42.2.5.jar'
}
},
]
```
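As a quick sanity check of such an entry, the same JDBC settings can be exercised directly with jaydebeapi; this is a sketch only, with host, database, credentials and jar path as placeholders:

```python
import jaydebeapi

# Placeholder connection details mirroring the config above - adjust as needed.
conn = jaydebeapi.connect(
    'org.postgresql.Driver',
    'jdbc:postgresql://localhost:5432/database',
    ['user', 'password'],
    'jars/postgresql-42.2.5.jar')
try:
    curs = conn.cursor()
    # TimescaleDB speaks the PostgreSQL wire protocol, so any simple SQL works.
    curs.execute('SELECT version()')
    print(curs.fetchall())
finally:
    conn.close()
```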

## Vertica

https://www.vertica.com/
2 changes: 1 addition & 1 deletion layout.py
@@ -427,7 +427,7 @@ def serve_layout(preview) -> html.Div:
    html.Label('Color by', id='label_graph_colorby'),
    dcc.Dropdown(
        id='dd_graph_colorby',
-       options=[{'label': x, 'value': x} for x in ['DBMS', 'Node', 'Script', 'CPU Limit', 'Client', 'GPU', 'CPU', 'Docker Image']],
+       options=[{'label': x, 'value': x} for x in ['DBMS', 'Node', 'Script', 'CPU Limit', 'Client', 'GPU', 'CPU', 'Docker Image', 'Experiment Run']],
    ),

    html.Label('Boxpoints', id='label_boxpoints'),
113 changes: 112 additions & 1 deletion paper.bib
@@ -156,7 +156,8 @@ @conference{Kluyver2016jupyter
Editor = {F. Loizides and B. Schmidt},
Organization = {IOS Press},
Pages = {87 - 90},
-Year = {2016}
+Year = {2016},
+doi = {10.3233/978-1-61499-649-1-87}
}

@Article{Hunter:2007,
@@ -213,3 +214,113 @@ @book{KounevLK20
biburl = {https://dblp.org/rec/books/sp/KounevLK20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@book{series/utcs/IgualS17,
  author = {Igual, Laura and Seguí, Santi},
  title = {Introduction to Data Science - A Python Approach to Concepts, Techniques and Applications},
  series = {Undergraduate Topics in Computer Science},
  publisher = {Springer},
  pages = {1-215},
  isbn = {978-3-319-50017-1},
  doi = {10.1007/978-3-319-50017-1},
  year = {2017}
}

@article{Waskom2021,
  author = {Michael L. Waskom},
  title = {seaborn: statistical data visualization},
  journal = {Journal of Open Source Software},
  publisher = {The Open Journal},
  year = {2021},
  volume = {6},
  number = {60},
  pages = {3021},
  doi = {10.21105/joss.03021},
  url = {https://doi.org/10.21105/joss.03021}
}

@misc{TIOBE,
  author = {TIOBE},
  title = {{TIOBE Index}},
  year = {2022},
  month = jun,
  note = {[Online; accessed 31. Jul. 2022]},
  url = {https://www.tiobe.com/tiobe-index}
}

@misc{PYPL,
  author = {PYPL},
  title = {{PYPL PopularitY of Programming Language index}},
  year = {2022},
  month = jul,
  note = {[Online; accessed 31. Jul. 2022]},
  url = {https://pypl.github.io/PYPL.html}
}

@inproceedings{10114533389063338912,
  author = {He, Sen and Manns, Glenna and Saunders, John and Wang, Wei and Pollock, Lori and Soffa, Mary Lou},
  title = {A Statistics-Based Performance Testing Methodology for Cloud Applications},
  booktitle = {Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
  series = {ESEC/FSE 2019},
  year = {2019},
  isbn = {9781450355728},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  location = {Tallinn, Estonia},
  pages = {188--199},
  numpages = {12},
  keywords = {cloud computing, resource contention, performance testing, non-parametric statistics},
  doi = {10.1145/3338906.3338912},
  url = {https://doi.org/10.1145/3338906.3338912}
}

@inproceedings{DBLPconfsigmodKerstenKZ18,
  author = {Martin L. Kersten and Panagiotis Koutsourakis and Ying Zhang},
  editor = {Alexander B{\"{o}}hm and Tilmann Rabl},
  title = {Finding the Pitfalls in Query Performance},
  booktitle = {Proceedings of the 7th International Workshop on Testing Database Systems, DBTest@SIGMOD 2018, Houston, TX, USA, June 15, 2018},
  pages = {3:1--3:6},
  publisher = {{ACM}},
  year = {2018},
  doi = {10.1145/3209950.3209951},
  url = {https://doi.org/10.1145/3209950.3209951}
}

@misc{DBDBIO,
  author = {{Carnegie Mellon Database Group}},
  title = {{Database of Databases}},
  year = {2022},
  month = aug,
  note = {[Online; accessed 1. Aug. 2022]},
  url = {https://dbdb.io}
}

@misc{DBEngines,
  author = {{solid IT GmbH}},
  title = {{DB-Engines Ranking}},
  year = {2022},
  month = aug,
  note = {[Online; accessed 1. Aug. 2022]},
  url = {https://db-engines.com/en/ranking}
}

32 changes: 22 additions & 10 deletions paper.md
@@ -28,26 +28,38 @@ Queries can be parametrized and randomized.
Results and evaluations are available via a Python interface and can be inspected with standard Python tools like pandas DataFrames.
An interactive visual dashboard assists in multi-dimensional analysis of the results.

-This module has been tested with Clickhouse, Exasol, Citus Data (Hyperscale), IBM DB2, MariaDB, MariaDB Columnstore, MemSQL (SingleStore), MonetDB, MySQL, OmniSci (HEAVY.AI), Oracle DB, PostgreSQL, SQL Server, SAP HANA and Vertica.
+This module has been tested with Clickhouse, Exasol, Citus Data (Hyperscale), IBM DB2, MariaDB, MariaDB Columnstore, MemSQL (SingleStore), MonetDB, MySQL, OmniSci (HEAVY.AI), Oracle DB, PostgreSQL, SQL Server, SAP HANA, TimescaleDB and Vertica.

See the [homepage](https://github.com/Beuth-Erdelt/DBMS-Benchmarker) and the [documentation](https://dbmsbenchmarker.readthedocs.io/en/latest/Docs.html).

# Statement of Need

-Benchmarking of database management systems (DBMS) is an active research area.
+Performance benchmarking of database management systems (DBMS) is an active research area and has a broad audience. It is used *by DBMS developers to evaluate their work and to find out which algorithm works best in which situation. Benchmarks are used by (potential) customers to evaluate what system or hardware to buy or rent. Benchmarks are used by administrators to find bottlenecks and adjust configurations. Benchmarks are used by users to compare semantically equivalent queries and to find the best formulation alternative*, @10.1007/978-3-030-84924-5_6.
In academia, too, new approaches and their implementations are examined in benchmarks.
There is a wide variety of DBMS products.
For example, @DBEngines ranks 350 DBMS (150 relational) and @DBDBIO lists 850 DBMS (280 relational).
In the following we focus on relational DBMS (RDBMS).
These can be categorized, for example, into row-wise, column-wise, in-memory, distributed and GPU-enhanced systems.
Each of these products has unique characteristics, special use cases, advantages, disadvantages and its own justification.
In order to verify and validate performance measurements, we want to be able to create and repeat benchmarking scenarios.
Repetition and thorough evaluation are crucial, in particular in the age of Cloud-based systems with their diversity of hardware configurations, @Raasveldt2018FBC32099503209955, @DBLPconfsigmodKerstenKZ18, @KounevLK20.

-There is need for a tool to support the repetition and reproducibility of benchmarking situations, and that is capable of connecting to all these systems.
-There is also need for a tool that will help with the statistical and interactive analysis of the results.
-We want to use Python as the common Data Science language.
-This helps to implement the tool into a pipeline.
-Moreover this allows to use common and sophisticated tools to inspect and evaluate the results, like pandas, c.f. @reback2020pandas, @mckinney-proc-scipy-2010, Jupyter notebooks, c.f. @Kluyver2016jupyter, matplotlib, c.f. @Hunter:2007, SciPy, c.f. @2020SciPy-NMeth, or even Machine Learning tools.
-
-To our knowledge there is no other such tool, c.f. @10.1007/978-3-319-67162-8_12, @10.1007/978-3-030-12079-5_4.
+Thus there is a widespread need for a tool that supports the repetition and reproducibility of benchmarking situations and that is capable of connecting to all these systems.

+When we collect a lot of data during benchmarking processes, we also need a tool that helps with the statistical, visual and interactive analysis of the results.
+The authors advocate using Python as a common Data Science language, since
+*it is a mature programming language, easy for newbies, and can be used as a specific platform for data scientists, thanks to its large ecosystem of scientific libraries and its high and vibrant community*, @series/utcs/IgualS17.
+This helps to integrate the tool into a pipeline, for example to make use of closed-loop benchmarking situations, @10114533389063338912, or to closely inspect parts of queries, @DBLPconfsigmodKerstenKZ18.
+It also allows the use of common and sophisticated tools to inspect and evaluate the results.
+To name a few:
+pandas for statistical evaluation of tabular data, @reback2020pandas, @mckinney-proc-scipy-2010,
+SciPy for scientific investigation of data, @2020SciPy-NMeth,
+IPython and Jupyter notebooks for interactive analysis and display of results, @Kluyver2016jupyter,
+Matplotlib and Seaborn for visual analysis, @Hunter:2007, @Waskom2021,
+or even Machine Learning tools.
+Moreover, Python is currently the most popular programming language, @PYPL, @TIOBE.
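For illustration, here is a minimal sketch of such an evaluation step that turns one of the benchmarker's connection groupings into a pandas DataFrame; the result folder and experiment code are placeholders:

```python
import pandas as pd
from dbmsbenchmarker import inspector

# Placeholder result folder and experiment code.
evaluate = inspector.inspector("results")
evaluate.load_experiment("1234512345")

# Group connections by Docker image and evaluate the grouping with pandas.
by_image = evaluate.get_experiment_list_connections_by_parameter('dockerimage')
df = pd.DataFrame(
    [(image, conn) for image, conns in by_image.items() for conn in conns],
    columns=['dockerimage', 'connection'])
print(df.groupby('dockerimage').count())
```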

+To our knowledge there is no other such tool, c.f. the studies in @10.1007/978-3-319-67162-8_12 and @10.1007/978-3-030-12079-5_4.
There are other tools, like Apache JMeter, HammerDB, Sysbench and OLTPBench, that provide very nice features, but none fits these needs.
The design decisions of this tool have been elaborated in more detail in @10.1007/978-3-030-84924-5_6.
DBMS-Benchmarker has been used to support obtaining scientific results about benchmarking DBMS performance in Cloud environments, as in @10.1007/978-3-030-84924-5_6 and @10.1007/978-3-030-94437-7_6.
@@ -64,7 +76,7 @@ DBMS-Benchmarker is Python3-based and helps to **benchmark DBMS**. It
* allows randomized queries to avoid caching side effects
* investigates a number of timing aspects
* investigates a number of other aspects - received result sets, precision, number of clients
-* collects hardware metrics from a Prometheus server, c.f. @208870
+* collects hardware metrics from a Prometheus server, @208870

DBMS-Benchmarker helps to **evaluate results** - by providing

