This repository has been archived by the owner on Jul 16, 2024. It is now read-only.

Add doc for server requirements (#227)
As more (experimental) new features are added, it is getting more and
more complex to set up a server to meet the requirements.

To make the process easier, this patch adds a page `requirements.rst`,
to specify the requirements for client and for server. It is for all
the features including the experimental ones.

This patch also makes minor adjustments to make the interface of the
experimental features more clear and easier to work with.
xuebinsu authored Dec 21, 2023
1 parent 2147a59 commit 3c2b062
Showing 14 changed files with 184 additions and 72 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@
<img src="./doc/images/gppython_logo_text.svg">

GreenplumPython is a Python library that enables the user to interact with Greenplum in a Pythonic way.
GreenplumPython is a Python library that enables the user to interact with database in a Pythonic way.

GreenplumPython provides a [pandas](https://pandas.pydata.org/)-like DataFrame API that
1. looks familiar and intuitive to Python users
1 change: 0 additions & 1 deletion doc/source/db.rst
@@ -1,6 +1,5 @@
Database
========
.. module:: greenplumpython

.. automodule:: db
:members:
3 changes: 2 additions & 1 deletion doc/source/index.rst
@@ -19,6 +19,7 @@ There are explanations about the implementation and examples.
:maxdepth: 2
:caption: Contents:

install
req
req_advanced
tutorials
modules
54 changes: 0 additions & 54 deletions doc/source/install.rst

This file was deleted.

4 changes: 2 additions & 2 deletions doc/source/notebooks/embedding.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# (Experimental) Generating, Indexing and Searching Embeddings\n",
"# Generating, Indexing and Searching Embeddings (Experimental)\n",
"\n",
"**WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.**\n",
"\n",
@@ -314,7 +314,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
"version": "3.9.18"
}
},
"nbformat": 4,
12 changes: 7 additions & 5 deletions doc/source/notebooks/package.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# (Experimental) Installing Python Packages on Server without Internet Access\n",
"# Installing Python Packages on Server without Internet (Experimental)\n",
"\n",
"**WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.**\n",
"\n",
@@ -18,26 +18,28 @@
"\n",
"All these happen automatically and the user only need to declare what packages are needed.\n",
"\n",
"In this way, as long as there is a database connection on a client with Internet access, the user can easily install the required packages, even if the database server cannot access the Internet by itself."
"In this way, as long as there is a database connection on a client with Internet access, the user can easily install the required packages, even if the database server cannot access the Internet by itself.\n",
"\n",
"**NOTE: This function only installs packages on the server host that GreenplumPython directly connects to. If your database server spreads across multiple hosts, additional operations are required to make the packages available on all hosts.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## (Optional) Prerequisite: Setting-up an NFS Mount for Cluster\n",
"## (Optional) Prerequisite: Sharing Python Environments in a Cluster with NFS\n",
"\n",
"Setting up a NFS mount makes it easier to share a Python environment on multiple hosts and containers.\n",
"\n",
"This is important for distributed database systems such as [Greenplum](https://greenplum.org/) because otherwise the same set of pcakges need to be installed on every host in the cluster.\n",
"This is important for distributed database systems such as [Greenplum](https://greenplum.org/) because otherwise the same set of packages needs to be copied to every host in the cluster.\n",
"\n",
"### Starting an NFS server\n",
"\n",
"First, we need to install and start an NFS server on one host. As an example, for Greenplum, we can start it on the coordinator host.\n",
"\n",
"For how to do this, please refer to the documentation of the OS. For example, if you are using [Rocky Linux](https://rockylinux.org/), you might want to refer to [the NFS page](https://docs.rockylinux.org/guides/file_sharing/nfsserver/).\n",
"\n",
"### Mount a Python environment with NFS\n",
"### Mount a Python environment with NFS on Each Host\n",
"\n",
"Next, we can mount a Python environment with NFS and share it to all hosts in the cluster. \n",
"\n",
2 changes: 1 addition & 1 deletion doc/source/op.rst
@@ -1,6 +1,6 @@
Operators and Indexes
======================
.. module:: greenplumpython
module:: greenplumpython

.. automodule:: op
:members:
61 changes: 61 additions & 0 deletions doc/source/req.rst
@@ -0,0 +1,61 @@
Requirements
============

On Client
---------

On the client side, e.g., on our laptop or workstation, installing the `greenplum-python` Python package is all we need:

.. code-block:: bash

    python3 -m pip install greenplum-python

This installs the latest released version. To try the latest development version, we can install it with

.. code-block:: bash

    python3 -m pip install git+https://github.com/greenplum-db/GreenplumPython

Please note that Python 3.9 or later is required for installation.
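
Before installing, it can help to confirm that the client interpreter satisfies the version requirement. The following is a minimal sketch rather than part of GreenplumPython; the helper name ``meets_minimum`` and the constant ``MIN_CLIENT_PYTHON`` are illustrative assumptions.

```python
import sys

# Minimum client-side Python version stated in the requirements above.
MIN_CLIENT_PYTHON = (3, 9)


def meets_minimum(version_info=sys.version_info, minimum=MIN_CLIENT_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if meets_minimum():
        print("This interpreter is new enough to install greenplum-python.")
    else:
        print("Python 3.9 or later is required; found", sys.version.split()[0])
```

Running the script with the same ``python3`` that will later run ``pip`` checks the interpreter that actually matters.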

On Server
---------

GreenplumPython works best with Greenplum. All features will be developed and tested on Greenplum first.

We also try our best to support PostgreSQL and other PostgreSQL-derived databases, but some features might **NOT** be available when working with them.

.. _Getting Started:

Getting Started
^^^^^^^^^^^^^^^

To get started, all we need is a database that we have the permission to access.

After connecting to the database, we can create :class:`~dataframe.DataFrame` objects and manipulate them much as we would with `pandas <https://pandas.pydata.org/>`_.

.. _Creating Functions:

Creating Functions
^^^^^^^^^^^^^^^^^^

Even though we can call existing functions in the database to manipulate DataFrames, sometimes they do not fit our needs and we need to create new user-defined functions (UDFs).

To create a UDF, we need to install the PL/Python package on the server and enable it in the database with SQL:

.. code-block:: sql

    CREATE EXTENSION plpython3u;

There are a few points to note when working with PL/Python:

- To use the extension :code:`plpython3u`, we must log in as a :code:`SUPERUSER`.
  This might raise security concerns. We will remove this limitation soon by supporting
  `PL/Container <https://github.com/greenplum-db/plcontainer>`_.
- Python 3.x is required on the server. It is recommended that the Python version on the server
  be greater than or equal to the one on the client, so that all Python features used
  when writing UDFs on the client are also available on the server.

With all of the above set up, we are ready to go through the :doc:`tutorial <./sql>` to see how GreenplumPython compares with SQL.

For other, more advanced, features, please refer to :doc:`./req_advanced`.
70 changes: 70 additions & 0 deletions doc/source/req_advanced.rst
@@ -0,0 +1,70 @@
Requirements on Server for Advanced Features
============================================

Using Non-Built-in Modules in a UDF
-----------------------------------

Modules importable from :code:`sys.path` on the server are available for use in a UDF. It is recommended to use a dedicated virtual
environment on the server for UDFs. One way to achieve this is to activate the environment before starting the database server.
For example, for PostgreSQL:

.. code-block:: bash

    python3 -m venv /path/to/venv
    source /path/to/venv/bin/activate
    pg_ctl start

In this way, UDFs executed in the PostgreSQL server can only use packages installed in the new virtual environment. This avoids
polluting, or being polluted by, the system environment.
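
To check which environment an interpreter is actually using, one option is to inspect ``sys.prefix`` and ``sys.path``. The sketch below uses only the standard library; the helper names are hypothetical, and whether the values seen inside a PL/Python UDF match those of a plain interpreter started from the same shell is an assumption to verify on your own server.

```python
import sys


def in_virtual_env(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Return True when the interpreter prefix differs from the base installation."""
    return prefix != base_prefix


def describe_environment():
    """Collect the values that decide which packages are importable."""
    return {
        "in_virtual_env": in_virtual_env(),
        "prefix": sys.prefix,
        "sys_path": list(sys.path),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(key, "=", value)
```

If ``in_virtual_env`` is ``False`` after activation, the server was probably started from a shell where the environment was not active.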

Defining Classes and Functions Outside UDFs
-------------------------------------------

GreenplumPython will use the `dill` pickler to serialize and deserialize UDFs if it is available.
Using a pickler like `dill` makes UDFs easier to write and maintain because it allows us to refer to a function or class
defined outside of the UDF, so we don't need to copy it around. To use dill, we need to

- make sure that the Python minor version on the client equals the one on the server, and
- make sure that the dill version on the server is no less than the one on the client, based on
  `dill's statement <https://github.com/uqfoundation/dill/issues/272#issuecomment-400843077>`_ on backward compatibility.
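
The two conditions above can be expressed as a small check. This is a sketch under stated assumptions: the function names are hypothetical, version numbers are supplied by hand (how they are collected from the server is left open), and dill versions are assumed to be plain numeric release strings such as ``0.3.7``.

```python
def parse_version(text):
    """Turn a plain release string such as '0.3.7' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def dill_setup_ok(client_python, server_python, client_dill, server_dill):
    """Check both conditions listed above for using dill with UDFs.

    client_python / server_python: (major, minor) tuples.
    client_dill / server_dill: numeric release strings, e.g. '0.3.7'.
    """
    # Condition 1: the Python major and minor versions must match exactly.
    same_minor = tuple(client_python[:2]) == tuple(server_python[:2])
    # Condition 2: dill on the server must be at least as new as on the client.
    dill_new_enough = parse_version(server_dill) >= parse_version(client_dill)
    return same_minor and dill_new_enough
```

For instance, ``dill_setup_ok((3, 9), (3, 9), "0.3.6", "0.3.7")`` passes both conditions, while a server running Python 3.10 against a 3.9 client fails the first.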

With all in

- the `Using Non-Built-in Modules in a UDF`_ section and
- the `Defining Classes and Functions Outside UDFs`_ section

set up, we are now ready to go through the :doc:`tutorial <./abalone>` on how to do Machine Learning (ML) in database with UDFs.

Creating and Searching Embeddings (Experimental)
------------------------------------------------

Embeddings enable us to search unstructured data, e.g. texts and images, based on semantic similarity.

To create and search embeddings, we will need all in :doc:`./req`, plus the
`sentence-transformers <https://pypi.org/project/sentence-transformers/>`_ package installed
in the server's Python environment.

Please refer to the :doc:`tutorial <./tutorial_embedding>` for a simple working example to validate your setup.

Uploading Data Files from Localhost (Experimental)
--------------------------------------------------

With GreenplumPython, we can upload data files of any format from localhost to server and parse them with a UDF.

This feature requires all in :doc:`./req` to create UDFs.

Please refer to the doc of :meth:`DataFrame.from_files() <dataframe.DataFrame.from_files>` for detailed usage.

Installing Python Packages (Experimental)
-----------------------------------------

With GreenplumPython, we can upload packages from localhost and install them on server.

This can greatly simplify the process when the server cannot access the PyPI service directly.

Since the installation is done by executing a UDF on server, this feature requires all in :doc:`./req`.

Please refer to

- the doc of :meth:`Database.install_packages() <db.Database.install_packages>` for detailed usage, and
- the :doc:`tutorial <./tutorial_package>` for a simple working example.
2 changes: 1 addition & 1 deletion doc/source/tutorial_embedding.rst
@@ -2,6 +2,6 @@

.. toctree::
:maxdepth: 2
:caption: (Experimental) Generating, Indexing and Searching Embeddings
:caption: Generating, Indexing and Searching Embeddings (Experimental)

notebooks/embedding
2 changes: 1 addition & 1 deletion doc/source/tutorial_package.rst
@@ -2,6 +2,6 @@

.. toctree::
:maxdepth: 2
:caption: (Experimental) Installing Python Packages on Server without Internet Access
:caption: Installing Python Packages on Server without Internet (Experimental)

notebooks/package
14 changes: 12 additions & 2 deletions greenplumpython/dataframe.py
@@ -1230,7 +1230,12 @@ def embedding(self) -> "Embedding":
"""
Enable embedding-based similarity search on columns of the current :class:`~DataFrame`.
See :ref:`tutorial-embedding` for more details.
Example:
See :ref:`tutorial-embedding` for more details.
Warning:
This function is currently **experimental** and the interface is
subject to change.
"""
raise NotImplementedError(
"Please import greenplumpython.experimental.embedding to load the implementation."
@@ -1242,14 +1247,19 @@ def from_files(cls, files: list[str], parser: "NormalFunction", db: Database) ->
Create a DataFrame with data read from files.
Args:
files: list of file paths.
files: list of file paths. Each path ends with the path of the
same file on client, without links resolved.
parser: a UDF that parses the given files on server. The UDF is required to
- take the file path as its only argument and
- return a set of parsed records in the returned DataFrame.
db: Database in which the DataFrame is to be created.
Returns:
DataFrame containing the parsed data from the given files.
Warning:
This function is currently **experimental** and the interface is
subject to change.
"""
raise NotImplementedError(
"Please import greenplumpython.experimental.file to load the implementation."
14 changes: 14 additions & 0 deletions greenplumpython/db.py
@@ -257,6 +257,20 @@ def install_packages(self, requirements: str) -> None:
Example:
See :ref:`tutorial-package` for more details.
Note:
This function only installs packages on the server host that
GreenplumPython directly connects to. If your database server
spreads across multiple hosts, additional operations are required
to make the packages available on all hosts.
One simple way to achieve this is to set up an NFS share on all
hosts. Please refer to :ref:`tutorial-package` for a simple working
example.
Warning:
This function is currently **experimental** and the interface is
subject to change.
"""
raise NotImplementedError(
"Please import greenplumpython.experimental.file to load the implementation."
15 changes: 12 additions & 3 deletions greenplumpython/experimental/file.py
@@ -35,12 +35,12 @@ def _extract_files(tmp_archive_name: str, returning: str) -> list[str]:
tmp_archive.extractall(str(extracted_root))
tmp_archive_path.unlink()
if returning == "root":
yield str(extracted_root.resolve())
yield str(extracted_root)
else:
assert returning == "files"
for path in extracted_root.rglob("*"):
if path.is_file() and not path.is_symlink():
yield str(path.resolve())
yield str(path)


def _archive_and_upload(tmp_archive_name: str, files: list[str], db: gp.Database):
@@ -98,6 +98,15 @@ def _install_on_server(pkg_dir: str, requirements: str) -> str:
import sys

assert sys.executable, "Python executable is required to install packages."
try:
exec_version = sp.check_output([sys.executable, "--version"], text=True, stderr=sp.STDOUT)
except sp.CalledProcessError as e:
raise Exception(e.stdout)

lib_version = f"Python {sys.version_info.major}.{sys.version_info.minor}."
assert exec_version.startswith(
lib_version
), f"Python major and minor versions mismatch (executable {exec_version}, library {lib_version})"
cmd = [
sys.executable,
"-m",
@@ -135,7 +144,7 @@ def _install_packages(db: gp.Database, requirements: str):
sp.check_output(cmd, text=True, stderr=sp.STDOUT, input=requirements)
except sp.CalledProcessError as e:
raise e from Exception(e.stdout)
_archive_and_upload(tmp_archive_name, [local_dir.resolve()], db)
_archive_and_upload(tmp_archive_name, [local_dir], db)
extracted = db.apply(lambda: _extract_files(tmp_archive_name, "root"), column_name="cache_dir")
assert len(list(extracted)) == 1
server_dir = (
