Simple tool for bulk downloading OSM and GADM geographic files and doing a minimal amount of preprocessing.
You can install the package in two ways. Please make sure to read the section on dependencies.
Directly from the GitHub repository:
pip install git+https://github.com/mansueto-institute/geopull
By cloning the repo:
git clone https://github.com/mansueto-institute/geopull
pip install -e geopull
If you'd like to contribute, we suggest installing the optional dependencies with the [dev] extra (dynamic metadata for setuptools). You can do this in either of two ways.
Directly from the GitHub repository:
pip install "geopull[dev] @ git+https://github.com/mansueto-institute/geopull"
By cloning the repo:
git clone https://github.com/mansueto-institute/geopull
pip install -e "geopull[dev]"
This will install the linters and the requirements for running the tests. For details on the testing and linting checks applied to the code, refer to the repository's GitHub Actions workflow.
This tool depends on osmium-tool, which you can install with conda.
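A minimal sketch of a dependency check, assuming the binary installed by osmium-tool is named osmium and that the package is available from the conda-forge channel. The helper name check_osmium is hypothetical and not part of geopull.

```shell
# Hypothetical helper (not part of geopull): verify that the osmium binary
# provided by osmium-tool is on PATH before running the extract step.
check_osmium() {
    if command -v osmium >/dev/null 2>&1; then
        echo "osmium found"
    else
        # Assumption: osmium-tool is installed from the conda-forge channel.
        echo "osmium missing; try: conda install -c conda-forge osmium-tool" >&2
        return 1
    fi
}
```

Calling check_osmium before geopull extract gives an early, readable failure if the dependency is not installed.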
The main use of the package is to generate blocks from a given OSM country file. The pipeline has four steps:
- Download a file from Geofabrik. Note: you might also need daylightmap's coastlines, which you can also download with the CLI.
- Extract the necessary features for blocking using osmium-tool.
- Normalize the features using geopandas.
- Generate blocks from the normalized features.
Each of these is a separate command that can be called via the CLI for a set of countries. For example, the commands below run the pipeline for the countries DJI and SYC.
First, to download the daylight project's coastline data, run:
geopull download daylight
Now you can download the country OSM files:
geopull download countries DJI SYC
To extract the necessary features from the OSM data, we use osmium-tool. You can run this through our CLI as well, with our settings:
geopull extract DJI SYC
We normalize some features before blocking. The normalizer is documented in the source code if you want to look at it; for practical purposes, however, you can use the default settings. Run it with:
geopull normalize DJI SYC
Finally, you can run the blocking process for the two countries:
geopull block DJI SYC
This will create two parquet files, one per country, containing the blocks. All files produced by the process are located in ./data by default.
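The five commands above can be chained into a single script. The following is a minimal sketch, not part of geopull: the helpers run and run_pipeline and the DRY_RUN variable are hypothetical names introduced here for illustration. It includes the coastline download as the first step because the walkthrough does; drop that line if you do not need it.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: run each geopull step in order for the given
# country codes, stopping at the first step that fails.
run_pipeline() {
    run geopull download daylight || return 1
    run geopull download countries "$@" || return 1
    run geopull extract "$@" || return 1
    run geopull normalize "$@" || return 1
    run geopull block "$@" || return 1
}
```

For example, `DRY_RUN=1 run_pipeline DJI SYC` prints the commands without executing them, and `run_pipeline DJI SYC` runs them.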
You can also use our blocking process with two files of your choosing. A notebook example describes this usage.
When you download data without telling the program where to keep the files, it creates a data/ directory within your current path. This directory has subdirectories organized by the type of file stored in them, such as .pbf or .geojson.
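To see what a run has produced without knowing the subdirectory names in advance, you can search the data directory by file extension. This is a small sketch using find; the helper name list_by_ext is hypothetical, and the default path ./data is assumed from the description above.

```shell
# Hypothetical helper: list files with a given extension under a directory.
# $1 is the extension without the dot; $2 is the directory (default ./data).
list_by_ext() {
    find "${2:-./data}" -type f -name "*.$1" | sort
}
```

For example, `list_by_ext pbf` lists the downloaded .pbf files and `list_by_ext parquet` lists the generated block files.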