Add guide on downloading files in parallel
Add a new page to the user guide that provides instructions to download
the same file by multiple processes that run in parallel, avoiding
multiple downloads of the same file through the usage of lock files. Add
`filelock` as a requirement for building the docs.
santisoler committed May 24, 2024
1 parent 2e47b8d commit 5d0c140
Showing 5 changed files with 55 additions and 0 deletions.
1 change: 1 addition & 0 deletions doc/conf.py
@@ -41,6 +41,7 @@
"python": ("https://docs.python.org/3/", None),
"pandas": ("http://pandas.pydata.org/pandas-docs/stable/", None),
"requests": ("https://requests.readthedocs.io/en/latest/", None),
"filelock": ("https://py-filelock.readthedocs.io/en/latest/", None),
}

# Autosummary pages will be generated by sphinx-autogen instead of sphinx-build
1 change: 1 addition & 0 deletions doc/index.rst
@@ -134,6 +134,7 @@ Are you a **scientist** or researcher? Pooch can help you too!
progressbars.rst
unpacking.rst
decompressing.rst
parallel-downloads.rst

.. toctree::
:caption: Reference
51 changes: 51 additions & 0 deletions doc/parallel-downloads.rst
@@ -0,0 +1,51 @@
.. _paralleldownloads:

Parallel downloads
==================

When running :func:`pooch.retrieve` or :meth:`pooch.Pooch.fetch` in parallel
processes, Pooch will trigger multiple downloads of the same file(s). Although
no `race condition <https://en.wikipedia.org/wiki/Race_condition>`_ occurs in
this process, downloading the same file multiple times is not desirable: it
slows down the fetching process and consumes more bandwidth than necessary.

A solution to this problem is to create a `lock file
<https://en.wikipedia.org/wiki/File_locking#Lock_files>`_ that allows only
one process to download the desired file, and forces all the other processes
to wait until it finishes and then fetch the file directly from the cache.
Lock files can be easily created through the :mod:`filelock` package.
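
To make the idea of a lock file concrete, here is a minimal sketch that uses
only the Python standard library. It is an illustration under stated
assumptions, not how :mod:`filelock` is implemented: the ``simple_lock`` name
and the ``poll_interval`` parameter are hypothetical, and the sketch does not
handle stale lock files left behind by a crashed process, which is one reason
to prefer :mod:`filelock` in practice.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def simple_lock(lock_path, poll_interval=0.1):
    """Hold an exclusive lock by atomically creating ``lock_path``.

    Hypothetical helper for illustration only.
    """
    while True:
        try:
            # O_CREAT | O_EXCL makes the call fail if the file already exists,
            # so only one process can succeed in creating the lock file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            # Another process holds the lock: wait and try again.
            time.sleep(poll_interval)
    try:
        yield
    finally:
        # Release the lock by closing and deleting the lock file.
        os.close(fd)
        os.remove(lock_path)
```

Any code placed inside ``with simple_lock("foo.lock"):`` would run in one
process at a time, while the other processes wait in the retry loop.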

For example, let's create a ``download.py`` file that defines a lock file
before calling the :func:`pooch.retrieve` function.

.. code:: python

    # file: download.py
    import sys

    import filelock
    import pooch

    # Only one process at a time can hold the lock
    lock = filelock.FileLock("foo.lock")
    with lock:
        file_path = pooch.retrieve(
            url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt",
            known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
            path="my_dir",
        )

    # Perform tasks with this file using different parameters passed as argument
    parameter = sys.argv[1]  # get parameter from first argument
    ...  # perform tasks using the file and the parameter

We can run this script in parallel using the Bash ampersand:

.. code:: bash

    python download.py 1 &
    python download.py 2 &
    python download.py 3 &

Since we are using a lock file, only one of these processes will take care of
the download. The rest will wait for it to finish, and then fetch the file
from the cache. Then all further tasks that ``download.py`` performs using
the different arguments will be run in parallel as usual.
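
Where a Bash shell is not available, the same parallel launch can be sketched
from Python with the standard :mod:`subprocess` module. This is only one
possible approach, and the ``run_in_parallel`` helper is a hypothetical name
introduced for illustration:

```python
import subprocess
import sys


def run_in_parallel(script, parameters):
    """Start one Python process per parameter and wait for all of them.

    Hypothetical helper for illustration only.
    """
    # Popen returns immediately, so all the processes run at the same time
    processes = [
        subprocess.Popen([sys.executable, script, str(parameter)])
        for parameter in parameters
    ]
    # wait() blocks until a process exits and returns its exit code
    return [process.wait() for process in processes]


# Example call, mirroring the three Bash commands:
# exit_codes = run_in_parallel("download.py", [1, 2, 3])
```

Each process still goes through the lock in ``download.py``, so only one of
them should perform the download.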
1 change: 1 addition & 0 deletions env/requirements-docs.txt
@@ -2,3 +2,4 @@
sphinx==7.2.*
sphinx-book-theme==1.1.*
sphinx-design==0.5.*
filelock
1 change: 1 addition & 0 deletions environment.yml
@@ -21,6 +21,7 @@ dependencies:
- sphinx==7.2.*
- sphinx-book-theme==1.1.*
- sphinx-design==0.5.*
- filelock
# Style
- pathspec
- black>=20.8b1