[turbodbc] Add example "Using CrateDB with turbodbc"
amotl committed Feb 21, 2023
1 parent 748bbb5 commit c68cabe
Showing 12 changed files with 374 additions and 0 deletions.
1 change: 1 addition & 0 deletions by-language/python-turbodbc/.gitignore
@@ -0,0 +1 @@
.venv*
157 changes: 157 additions & 0 deletions by-language/python-turbodbc/README.rst
@@ -0,0 +1,157 @@
###########################
Using CrateDB with turbodbc
###########################


*****
About
*****

This section of the documentation describes how to connect to `CrateDB`_
with `turbodbc`_, by providing a few example programs.

The examples use the `unixODBC`_ implementation of `ODBC`_, and the `PostgreSQL
ODBC driver`_, for connecting to the `PostgreSQL wire protocol`_ interface of
`CrateDB`_.

This folder also contains ``Dockerfile`` files providing environments to
exercise the examples on different operating systems, like Arch Linux,
Red Hat (CentOS), Debian, and SUSE Linux.


************
Introduction
************

`Turbodbc`_ is a Python module to access relational databases via the `Open
Database Connectivity (ODBC)`_ interface. In addition to complying with
the `Python Database API Specification 2.0`_, turbodbc offers built-in `NumPy`_
and `Apache Arrow`_ support for improved performance. Their slogan is:

    Don’t wait minutes for your results, just blink.

*Note: The description texts have been taken from turbodbc's documentation 1:1.*

Description
===========

Its primary target audience is data scientists who use databases for which no
efficient native Python drivers are available.

For maximum compatibility, turbodbc complies with the `Python Database API
Specification 2.0`_ (PEP 249). For maximum performance, turbodbc internally
relies on batched data transfer instead of single-record communication as
other popular ODBC modules do.
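
To make the PEP 249 side of this concrete, here is a minimal sketch of fetching
a result set in batches through the standard cursor interface. It uses the
``sqlite3`` module from the standard library as a stand-in, so that it runs
without an ODBC driver; the helper name ``fetch_in_batches`` is hypothetical.
It only illustrates batching at the API level and is not a description of how
turbodbc batches data internally.

```python
import sqlite3


def fetch_in_batches(cursor, batch_size):
    """Yield lists of rows, using the PEP 249 `fetchmany` call (hypothetical helper)."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# sqlite3 is only a stand-in for a PEP 249 compliant driver.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE testdrive (id INT PRIMARY KEY, data TEXT)")
cursor.executemany(
    "INSERT INTO testdrive VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(10)],
)

cursor.execute("SELECT * FROM testdrive ORDER BY id")
batches = list(fetch_in_batches(cursor, batch_size=4))
print([len(batch) for batch in batches])  # prints [4, 4, 2]

cursor.close()
connection.close()
```

Because turbodbc also follows PEP 249 and uses the ``?`` parameter style, the
same ``execute``, ``executemany``, and ``fetchmany`` calls should carry over
to a turbodbc cursor.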

Why should I use turbodbc instead of other ODBC modules?
========================================================

- Short answer: turbodbc is faster.
- Slightly longer answer: turbodbc is faster, *much* faster if you want to
  work with NumPy.
- Medium-length answer: The author has tested turbodbc and pyodbc (probably
  the most popular Python ODBC module) with various databases (Exasol,
  PostgreSQL, MySQL) and corresponding ODBC drivers. He found turbodbc to be
  consistently faster.

Smooth. What is the trick?
==========================

Turbodbc exploits buffering.

- Turbodbc implements both sending parameters and retrieving result sets using
  buffers of multiple rows/parameter sets. This avoids round trips to the ODBC
  driver and (depending on how well the ODBC driver is written) to the database.
- Multiple buffers are used for asynchronous I/O. This allows interleaving
  Python object conversion and direct database interaction (see performance
  options below).
- Buffers contain binary representations of data. NumPy arrays contain binary
  representations of data. Good thing they are often the same, so instead of
  converting we can just copy data.
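
The effect of buffering on round trips can be sketched with a toy model. The
``FakeDriver`` class and the ``chunked`` helper below are hypothetical and only
count calls; they do not reflect turbodbc's actual implementation or any real
ODBC driver.

```python
from itertools import islice


def chunked(rows, buffer_size):
    """Split an iterable of rows into lists of at most `buffer_size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, buffer_size))
        if not chunk:
            return
        yield chunk


class FakeDriver:
    """Hypothetical stand-in for an ODBC driver that counts round trips."""

    def __init__(self):
        self.round_trips = 0
        self.received = []

    def send(self, parameter_sets):
        # Each call models one round trip, regardless of how many rows it carries.
        self.round_trips += 1
        self.received.extend(parameter_sets)


rows = [(i, f"value-{i}") for i in range(1000)]

# Single-record communication: one round trip per row.
single = FakeDriver()
for row in rows:
    single.send([row])

# Buffered communication: one round trip per buffer of 250 rows.
buffered = FakeDriver()
for chunk in chunked(rows, buffer_size=250):
    buffered.send(chunk)

print(single.round_trips, buffered.round_trips)  # prints 1000 4
```

Both variants deliver the same rows; only the number of calls differs, which
is the saving the first bullet point describes.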


*****
Setup
*****

Install prerequisites
=====================

Arch Linux::

    # See `dockerfiles/archlinux.Dockerfile`.

CentOS Stream::

    dnf install --enablerepo=crb -y boost-devel g++ postgresql-odbc python3 python3-devel python3-pip unixODBC-devel

Debian::

    apt-get install --yes build-essential libboost-dev odbc-postgresql unixodbc-dev

macOS/Homebrew::

    brew install psqlodbc unixodbc

SUSE Linux Enterprise Server::

    # See `dockerfiles/sles.Dockerfile`.

Install Python sandbox
======================
::

    # Create Python virtualenv and install dependency packages.
    python3 -m venv .venv
    source .venv/bin/activate
    pip install --upgrade --requirement=requirements-prereq.txt
    pip install --upgrade --requirement=requirements.txt --verbose

.. note::

    The `turbodbc pip installation documentation`_ says:
    Please ``pip install numpy`` before installing turbodbc, because turbodbc
    will search for the ``numpy`` Python package at installation/compile time.
    If NumPy is not installed, turbodbc will not compile the `NumPy
    support`_ features. Similarly, please ``pip install pyarrow`` before
    installing turbodbc if you would like to use the `Apache Arrow
    support`_.
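
Before installing turbodbc, it can help to verify that both packages are
visible to the interpreter of the active virtualenv. The following is a small
sketch using only the standard library; the function name ``missing_packages``
is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the names of top-level packages that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["numpy", "pyarrow"])
if missing:
    print(f"Install before turbodbc: {', '.join(missing)}")
else:
    print("NumPy and PyArrow found.")
```

The check only tells whether the packages can be located. It does not verify
that turbodbc was actually compiled with NumPy or Arrow support.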


*****
Usage
*****

Run CrateDB::

    docker run --rm -it --publish=4200:4200 --publish=5432:5432 crate \
        -Cdiscovery.type=single-node -Ccluster.routing.allocation.disk.threshold_enabled=false

Invoke demo program on workstation::

    python demo.py

Exercise demo program using Docker, on different operating systems::

    docker build --progress=plain --tag local/python-turbodbc-demo --file=dockerfiles/archlinux.Dockerfile .
    docker build --progress=plain --tag local/python-turbodbc-demo --file=dockerfiles/centos.Dockerfile .
    docker build --progress=plain --tag local/python-turbodbc-demo --file=dockerfiles/debian.Dockerfile .
    docker build --progress=plain --tag local/python-turbodbc-demo --file=dockerfiles/sles.Dockerfile .

    docker run --rm -it --volume=$(pwd):/src --network=host local/python-turbodbc-demo python3 /src/demo.py
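
If the demo program fails to connect, one thing to rule out is that the
published ports are not reachable yet. This is a minimal sketch for probing
them with the standard library, assuming CrateDB was started with the
``docker run`` command above; ``port_is_open`` is a hypothetical helper name.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 4200 is the HTTP port, 5432 the PostgreSQL wire protocol port published above.
for port in (4200, 5432):
    state = "reachable" if port_is_open("localhost", port) else "not reachable"
    print(f"localhost:{port} is {state}")
```

An open port only shows that something is listening. It does not confirm that
CrateDB has finished starting up.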



.. _Apache Arrow: https://en.wikipedia.org/wiki/Apache_Arrow
.. _Apache Arrow support: https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#advanced-usage-arrow
.. _CrateDB: https://github.com/crate/crate
.. _NumPy: https://en.wikipedia.org/wiki/NumPy
.. _NumPy support: https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#advanced-usage-numpy
.. _ODBC: https://en.wikipedia.org/wiki/Open_Database_Connectivity
.. _Open Database Connectivity (ODBC): https://en.wikipedia.org/wiki/Open_Database_Connectivity
.. _PostgreSQL ODBC driver: https://odbc.postgresql.org/
.. _PostgreSQL wire protocol: https://crate.io/docs/crate/reference/en/latest/interfaces/postgres.html
.. _Python Database API Specification 2.0: https://peps.python.org/pep-0249/
.. _turbodbc: https://turbodbc.readthedocs.io/
.. _turbodbc pip installation documentation: https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html#pip
.. _unixODBC: https://www.unixodbc.org/
22 changes: 22 additions & 0 deletions by-language/python-turbodbc/backlog.rst
@@ -0,0 +1,22 @@
#######################
python-turbodbc backlog
#######################

Various ideas for improving this little code example.

- [x] Provide basic example
- [x] Insert multiple records using parameters
- [x] Docs: Add installation on SUSE
- [x] Provide example(s) for different operating systems (Linux, macOS)
- [o] Docs: Drop a note about connecting with driver file vs. connecting via DSN
- [o] Evaluate different ODBC drivers
- [o] Provide an example scenario for running it on Windows
- [o] Exercise advanced NumPy and PyArrow options
- [o] Exchange advanced CrateDB data types like ``OBJECT``, ``ARRAY``, and friends
- [o] Use ``SSLmode = Yes`` to connect to CrateDB Cloud
- [o] Explore other driver options at `Zabbix » Recommended UnixODBC settings for PostgreSQL`_
- [o] Check out https://github.com/dirkjonker/sqlalchemy-turbodbc
- [o] Check out https://docs.devart.com/odbc/postgresql/centos.htm


.. _Zabbix » Recommended UnixODBC settings for PostgreSQL: https://www.zabbix.com/documentation/current/en/manual/config/items/itemtypes/odbc_checks/unixodbc_postgresql
69 changes: 69 additions & 0 deletions by-language/python-turbodbc/demo.py
@@ -0,0 +1,69 @@
import os
import sys

from turbodbc import connect


def demo_pg():
    # Connect to database.
    # https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html#establish-a-connection-with-your-database

    # Either connect via a data source name defined within the ODBC configuration,
    # connection = connect(dsn="postgresql", server="localhost", database="testdrive", uid="crate", pwd=None)

    # or connect via a connection string, referencing a driver file directly.
    if sys.platform == "linux":
        candidates = [
            # archlinux
            "/usr/lib/psqlodbcw.so",
            # Debian
            "/usr/lib/x86_64-linux-gnu/odbc/psqlodbcw.so",
            # Red Hat
            "/usr/lib64/psqlodbcw.so",
        ]
        driver_file = find_program(candidates)
        if driver_file is None:
            raise ValueError(f"Unable to detect driver file at {candidates}")
    elif sys.platform == "darwin":
        driver_file = "/usr/local/lib/psqlodbcw.so"
    else:
        raise NotImplementedError(f"Platform {sys.platform} not supported yet")

    connection_string = f"Driver={driver_file};Server=localhost;Port=5432;Database=testdrive;Uid=crate;Pwd=;"
    print(f"INFO: Connecting to '{connection_string}'")
    connection = connect(connection_string=connection_string)

    # Insert data.
    cursor = connection.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS testdrive (id INT PRIMARY KEY, data TEXT);")
    cursor.execute("DELETE FROM testdrive;")
    cursor.execute("INSERT INTO testdrive VALUES (0, 'zero'), (1, 'one'), (2, 'two');")
    cursor.executemany("INSERT INTO testdrive VALUES (?, ?);", [(3, "three"), (4, "four"), (5, "five")])
    # Make the inserted records visible to the subsequent query.
    cursor.execute("REFRESH TABLE testdrive;")
    cursor.close()

    # Query data.
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM testdrive ORDER BY id")

    print("Column metadata:")
    print(cursor.description)

    print("Results by row:")
    for row in cursor:
        print(row)

    cursor.close()

    # Terminate database connection.
    connection.close()


def find_program(candidates):
    """Return the first path from `candidates` that exists, or None."""
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    demo_pg()
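
The connection string in ``demo.py`` is assembled with a single f-string. As a
sketch of an alternative, the keyword/value pairs could be collected in a
dictionary first, which makes it easier to add or override options. The
function name ``build_connection_string`` and its defaults are hypothetical
and mirror the values used in ``demo.py``.

```python
def build_connection_string(driver, server="localhost", port=5432, database="testdrive", uid="crate", pwd=""):
    """Assemble an ODBC connection string from keyword/value pairs (hypothetical helper)."""
    options = {
        "Driver": driver,
        "Server": server,
        "Port": port,
        "Database": database,
        "Uid": uid,
        "Pwd": pwd,
    }
    # Each pair is rendered as `Key=Value;`, in the same order as in demo.py.
    return "".join(f"{key}={value};" for key, value in options.items())


connection_string = build_connection_string("/usr/lib64/psqlodbcw.so")
print(connection_string)
# prints Driver=/usr/lib64/psqlodbcw.so;Server=localhost;Port=5432;Database=testdrive;Uid=crate;Pwd=;
```

The result can be passed to ``connect(connection_string=...)`` in the same way
as in ``demo_pg()``. Values containing ``;`` would need extra quoting, which
this sketch does not handle.
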
48 changes: 48 additions & 0 deletions by-language/python-turbodbc/dockerfiles/archlinux.Dockerfile
@@ -0,0 +1,48 @@
# ---------------------------
# Setup archlinux environment
# ---------------------------

# Include `yay` for easily installing AUR packages.

FROM archlinux:base-20230205.0.123931 as archlinux-build

# Allow building packages using `makepkg` within Docker container.
# https://blog.ganssle.io/tag/arch-linux.html
RUN pacman -Sy --noconfirm --needed base-devel binutils fakeroot git sudo
RUN useradd --create-home build
RUN echo 'build ALL=NOPASSWD: ALL' >> /etc/sudoers

# Install AUR package helper program `yay`.
# https://aur.archlinux.org/packages/yay
RUN mkdir /yay-bin; chmod ugo+rwX /yay-bin
USER build
RUN \
    git clone https://aur.archlinux.org/yay-bin.git && \
    cd yay-bin && \
    makepkg -si --noconfirm
USER root


# --------------------------
# Setup turbodbc environment
# --------------------------

# Install Python, unixODBC, PostgreSQL ODBC driver, and turbodbc.

FROM archlinux-build

# Install unixODBC.
# https://archlinux.org/packages/core/x86_64/unixodbc/
RUN pacman -Sy --noconfirm --needed unixodbc

# Install PostgreSQL ODBC driver.
# https://aur.archlinux.org/packages/psqlodbc
USER build
RUN yay -S --noconfirm psqlodbc
USER root

# Install NumPy, PyArrow, and turbodbc.
RUN pacman -Sy --noconfirm --needed boost python python-pip python-setuptools
ADD requirements*.txt .
RUN pip install --upgrade --requirement=requirements-prereq.txt
RUN MAKEFLAGS="-j$(nproc)" pip install --upgrade --requirement=requirements.txt --verbose
9 changes: 9 additions & 0 deletions by-language/python-turbodbc/dockerfiles/centos.Dockerfile
@@ -0,0 +1,9 @@
FROM quay.io/centos/centos:stream9

# Install Python, unixODBC, the PostgreSQL ODBC driver, and development libraries.
RUN dnf install --enablerepo=crb -y boost-devel g++ postgresql-odbc python3 python3-devel python3-pip unixODBC-devel

# Install NumPy, PyArrow, and turbodbc.
ADD requirements*.txt .
RUN pip install --upgrade --requirement=requirements-prereq.txt
RUN MAKEFLAGS="-j$(nproc)" pip install --upgrade --requirement=requirements.txt --verbose
12 changes: 12 additions & 0 deletions by-language/python-turbodbc/dockerfiles/debian.Dockerfile
@@ -0,0 +1,12 @@
FROM python:3.11-slim-bullseye

ENV DEBIAN_FRONTEND=noninteractive

# Install prerequisites.
RUN apt-get update
RUN apt-get install --yes build-essential libboost-dev odbc-postgresql unixodbc-dev

# Install NumPy, PyArrow, and turbodbc.
ADD requirements*.txt .
RUN pip install --upgrade --requirement=requirements-prereq.txt
RUN MAKEFLAGS="-j$(nproc)" pip install --upgrade --requirement=requirements.txt --verbose
24 changes: 24 additions & 0 deletions by-language/python-turbodbc/dockerfiles/sles.Dockerfile
@@ -0,0 +1,24 @@
FROM registry.suse.com/suse/sle15

# Add package repository for acquiring `boost-devel`.
# https://software.opensuse.org//download.html?project=home%3Afsirl%3Aboost1651&package=boost
RUN zypper addrepo https://download.opensuse.org/repositories/home:fsirl:boost1651/15.4/home:fsirl:boost1651.repo

# Add package repository for acquiring `python310`.
# https://download.opensuse.org/repositories/devel:/languages:/python:/backports/15.4/
RUN zypper addrepo https://download.opensuse.org/repositories/devel:/languages:/python:/backports/15.4/devel:languages:python:backports.repo

# Activate package repositories.
RUN zypper --gpg-auto-import-keys refresh

# Install Python, unixODBC, the PostgreSQL ODBC driver, and development libraries.
RUN zypper install -y boost-devel gcc-c++ psqlODBC python310 python310-devel python310-pip unixODBC-devel update-alternatives

# Make Python 3.10 the default Python 3, and add an alias `python3`.
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.10 0
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 0

# Install NumPy, PyArrow, and turbodbc.
ADD requirements*.txt .
RUN pip install --upgrade --requirement=requirements-prereq.txt
RUN MAKEFLAGS="-j$(nproc)" pip install --upgrade --requirement=requirements.txt --verbose
19 changes: 19 additions & 0 deletions by-language/python-turbodbc/odbc.ini
@@ -0,0 +1,19 @@
# More options:
# https://www.zabbix.com/documentation/current/en/manual/config/items/itemtypes/odbc_checks/unixodbc_postgresql

[postgresql]
Description = General ODBC for PostgreSQL

# General
FileUsage = 1

# If the driver manager was built with thread support, this entry
# alters the default thread serialization level (available since 1.6).
Threading = 2

# Linux
#Driver = /usr/lib64/libodbcpsql.so
#Setup = /usr/lib64/libodbcpsqlS.so

# macOS
Driver = /usr/local/lib/psqlodbcw.so
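
To check which driver file a data source name points to, the file can be
inspected with Python's ``configparser``. This is a rough sketch that assumes
the simple ``key = value`` layout shown above; the function name
``driver_for_dsn`` is hypothetical, and unixODBC's own parser may treat edge
cases differently.

```python
import configparser

# Abbreviated copy of the odbc.ini content shown above.
ODBC_INI = """
[postgresql]
Description = General ODBC for PostgreSQL
FileUsage = 1
Threading = 2

# Linux
#Driver = /usr/lib64/libodbcpsql.so

# macOS
Driver = /usr/local/lib/psqlodbcw.so
"""


def driver_for_dsn(ini_text, dsn):
    """Return the `Driver` entry of the given DSN section, or None if absent."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.get(dsn, "Driver", fallback=None)


print(driver_for_dsn(ODBC_INI, "postgresql"))  # prints /usr/local/lib/psqlodbcw.so
```

Lines starting with ``#`` are skipped as comments, so only the active
``Driver`` entry is reported.
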
7 changes: 7 additions & 0 deletions by-language/python-turbodbc/pyproject.toml
@@ -0,0 +1,7 @@
[tool.black]
line-length = 120

[tool.isort]
profile = "black"
skip_glob = "**/site-packages/**"
skip_gitignore = false
4 changes: 4 additions & 0 deletions by-language/python-turbodbc/requirements-prereq.txt
@@ -0,0 +1,4 @@
# Turbodbc wants NumPy and PyArrow to be installed upfront.
numpy<1.25
pyarrow<11
wheel
2 changes: 2 additions & 0 deletions by-language/python-turbodbc/requirements.txt
@@ -0,0 +1,2 @@
pytest<8
turbodbc<5
