The featurization repository is the first repository a user should use to execute the MOSAIKS approach. It is designed to guide a user through processing satellite images collected by Landsat 8 and Sentinel 2 via the STAC API, and the code is tailored to run on the Planetary Computer. This repository contains code for processing satellite imagery by encoding it with random convolutional features. The methodology and workflow require a connection to a SpatioTemporal Asset Catalog (STAC) such as Microsoft's Planetary Computer (MPC). The base of the code found in this repository was originally created by the team at MPC and is hosted as a tutorial on the MPC Hub. Our team has adapted and extended this codebase to featurize imagery over time (monthly) and to include a larger portion of the electromagnetic spectrum (i.e., more satellite bands).
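The connection itself is only a few lines. Below is a minimal sketch, assuming the `pystac-client` and `planetary-computer` packages (both pre-installed on the MPC Hub); the bounding box and date range are illustrative, and the 10% cloud cover filter mirrors the limit mentioned later in this README.

```python
# Minimal sketch: open the Planetary Computer STAC API and search one of the
# collections used in this repository. The bounding box and date range are
# illustrative values only.
import planetary_computer
import pystac_client

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,  # signs asset URLs so they can be read
)

search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[27.0, -14.0, 28.0, -13.0],           # example area within Zambia (lon/lat)
    datetime="2019-06-01/2019-06-30",
    query={"eo:cloud_cover": {"lt": 10}},      # keep only scenes below 10% cloud cover
)
items = search.item_collection()
print(f"Found {len(items)} matching scenes")
```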
Alternatively, instead of processing satellite images, a user can download pre-processed feature files from the MOSAIKS API, which hosts features computed from private satellite imagery. The features available on the MOSAIKS API are global and therefore not limited to the country of Zambia. To query features from this website, a user uploads a CSV of latitude and longitude points (or creates a bounding box), and the features are processed and sent to the user to download. To merge these features with other data of interest and carry out further analysis, please see the Modeling repository.
Currently this code is adapted to use two satellites, Landsat 8 and Sentinel 2. These satellites were selected for two primary reasons:
- Ideal temporal coverage that overlaps with our crop yield data for the country of Zambia, which allows us to geospatially and temporally join satellite feature data with crop yields to execute a supervised machine learning approach (see the Modeling repository for more information)
- They are public satellite image archives readily available in the MPC data catalog, whereas the existing MOSAIKS pipeline uses private satellite imagery
These satellites provide options for different band combinations, spectral resolutions, and temporal cycles (meaning the time intervals between passes over the country of Zambia). Additionally, these satellites can be used in combination for the years in which they overlap.
Additionally, a user will be required to supply a geospatial polygon in order to make a grid of points at 0.01 degree intervals in WGS84. In place of a polygon, a simple bounding box can also be used. In general, we have been conducting analysis within a single country and use the outline of that country to first create our grid of points.
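As an illustration of this gridding step (not the exact `dense_grid.ipynb` implementation), the sketch below builds 0.01 degree points in WGS84 inside a user-supplied polygon; the boundary file name is hypothetical.

```python
# Sketch of building a 0.01 degree point grid in WGS84 clipped to a polygon.
# "zambia_boundary.geojson" is a hypothetical user-supplied boundary file.
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

aoi = gpd.read_file("zambia_boundary.geojson").to_crs("EPSG:4326")
minx, miny, maxx, maxy = aoi.total_bounds

# Regular lattice of points at 0.01 degree spacing over the bounding box
lons = np.arange(np.floor(minx * 100) / 100, maxx, 0.01)
lats = np.arange(np.floor(miny * 100) / 100, maxy, 0.01)
points = gpd.GeoDataFrame(
    geometry=[Point(lon, lat) for lat in lats for lon in lons], crs="EPSG:4326"
)

# Keep only points that fall inside the polygon, then store plain lon/lat columns
grid = points[points.within(aoi.unary_union)].reset_index(drop=True)
grid["lon"] = grid.geometry.x
grid["lat"] = grid.geometry.y
```

If only a simple bounding box is available, the `within` filter can be skipped and the full lattice used instead.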
More recently, our team has begun by first creating a grid of points that is pre-masked for cropland area in order to only retain the top 10% of points per district with cropland. These points are then uploaded to the data folder and used to match imagery in our featurization pipeline.
The technique of Random Convolutional Features (RCFs), a subset of Random Kitchen Sinks, is a way to encode geospatial locations with information based on the satellite image of that location. These features reflect information such as image colors and image textures. This information could be the delineation between colors (like the edge of a field, forest, or building that appears as a line from space), or combinations of colors such as blue next to green. In practice, knowing the specific nature of the information a feature holds is not necessary for it to be useful. In fact, we generally do not investigate possibilities for what a feature might describe; rather, we use the features, and the relationships between them, to build a model that is capable of predicting what we are interested in.
In the feature data frame that is made from these notebooks, each row represents an image and each column represents a feature. Each cell contains a numerical value for that feature at that location, which is statistically correlated with the numerical value of crop yield data (or other data provided by the user) for that location during the modeling step. Random Convolutional Features can either be created from the featurization repository in this organization or downloaded from the MOSAIKS API. For more information about featurization and the MOSAIKS pipeline, please see this paper by Rolf et al. 2021.
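To make the idea concrete, here is a minimal, self-contained sketch of random convolutional featurization in PyTorch. The filter count, patch size, and use of plain Gaussian filters are simplifications (the MOSAIKS approach samples its filters from real image patches), but the mechanics are the same: fixed random filters, a nonlinearity, and spatial averaging produce one feature column per filter.

```python
# Minimal sketch of Random Convolutional Features: fixed random filters are
# convolved over an image, passed through a nonlinearity, and averaged over
# space, yielding one number per filter. Values here are illustrative.
import torch
import torch.nn.functional as F

num_features = 1000   # one output column per random filter
patch_size = 3        # spatial size of each filter
bands = 3             # e.g. RGB; multiband runs use more channels

torch.manual_seed(0)
filters = torch.randn(num_features, bands, patch_size, patch_size)  # never trained

def featurize(image: torch.Tensor) -> torch.Tensor:
    """image: (bands, height, width) -> (num_features,) feature vector."""
    responses = F.conv2d(image.unsqueeze(0), filters)
    return F.relu(responses).mean(dim=(2, 3)).squeeze(0)  # average over space

fake_image = torch.rand(bands, 256, 256)   # stand-in for one satellite image chip
print(featurize(fake_image).shape)         # torch.Size([1000])
```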
- Connection to a STAC collection provider such as Microsoft Planetary Computer (MPC)
  - If using MPC, see the environment docs for further direction
- Computer with a Graphics Processing Unit (GPU)
  - NVIDIA GPU with CUDA
- Familiarity with Python code, PyTorch in particular
The fastest way to get started is to sign up for a free account with the MPC Hub. This process includes a request form, and approval might take 24 hours or more. With an account, a user is given access to cloud-based virtual computing with pre-configured and managed Python environments. With several options to choose from, it is important to select the GPU - PyTorch option to execute these notebooks. This option has a longer startup time than the others, but it is necessary for the way our convolutional model is configured, and in the end the GPU will speed up the computation compared to the CPU options.
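Once the server starts, a quick sanity check confirms that the CUDA device is actually visible to PyTorch:

```python
# Confirm the GPU - PyTorch environment exposes a CUDA device; if this prints
# False, featurization will fall back to the much slower CPU path.
import torch

print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # name of the NVIDIA GPU on the node
```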
This repository can be cloned into the root directory of the MPC Hub. Within these notebooks, several decisions need to be made in order to pull in data and process the features for the timeline and specific analyses that fit the user's needs. These decisions are described in detail below.
There is an optional land cover notebook, `land_cover_9_class.ipynb`, that can be used to return various land cover / land use percentages at a given point (i.e., the same points which you are interested in featurizing), such as cropped area, forest cover, or built areas. This notebook uses the 10 meter land cover dataset. It is under development and has a known bug that will return NULL values around the UTM zone delineations. It is not recommended to use this notebook until the bug can be fixed.
With MPC access as described above, a user has several options to begin creating their Random Convolutional Features (RCFs). An overview of the steps is as follows:
Use the `dense_grid.ipynb` notebook to execute the following steps:
- Create a uniform grid of points over the region of interest, or load a file containing the pre-produced latitude and longitude points to featurize
- User selects a country or region and a grid will be created
- User can supply geometry, or a country code can be specified to use the `geopandas` shapefiles
- For gridding the country of Zambia specifically, a user has two options: `equal angle` cells versus `equal area` grid cells (see the sketch after this list)
  - `equal angle` grids are produced using the latitude/longitude geodetic coordinate reference system, EPSG 4326, which is based on Earth's center of mass
    - Results in each point representing a 0.01 by 0.01 degree grid cell that will be featurized
    - This is roughly 1 km² (exact area varies by geographic location)
    - This means a pre-processed, user-supplied file should have points with a minimum distance of 0.01 degrees to avoid overlap
  - `equal area` grids are produced using the local coordinate reference system for the region of interest
    - The local EPSG for the country of Zambia is the default, but the relevant EPSG for another region of interest may be supplied by the user
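The sketch below illustrates the `equal area` option under the assumption that UTM zone 35S (EPSG:32735) is used as the local projected CRS for Zambia; the notebook's actual default EPSG may differ, and the boundary file name is hypothetical.

```python
# Sketch of an equal area grid: build points at a fixed spacing in metres in a
# local projected CRS, then convert back to latitude/longitude for featurization.
# EPSG:32735 (UTM zone 35S) is used only as an example local CRS for Zambia.
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

aoi_local = gpd.read_file("zambia_boundary.geojson").to_crs("EPSG:32735")
minx, miny, maxx, maxy = aoi_local.total_bounds

# 1,000 m spacing gives cells of (nearly) equal area regardless of latitude
xs = np.arange(minx, maxx, 1_000)
ys = np.arange(miny, maxy, 1_000)
points = gpd.GeoDataFrame(
    geometry=[Point(x, y) for y in ys for x in xs], crs="EPSG:32735"
)
points = points[points.within(aoi_local.unary_union)].reset_index(drop=True)

# Convert back to WGS84 before matching points to STAC imagery
points_wgs84 = points.to_crs("EPSG:4326")
```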
The featurization notebooks are:
- `rcf_multiband.ipynb`
  - For use with the `landsat-c2-l2` satellite collection or `sentinel-2-l2a`
- `Sentinel_2_RGB.ipynb`
  - For use with `sentinel-2-l2a` in only the visible spectrum
  - MUCH faster than other options
- `s2_l8_multiband.ipynb`
  - For use with the `landsat-8-c2-l2` satellite collection or `sentinel-2-l2a`
  - Note that `landsat-8-c2-l2` is now deprecated in favor of the `landsat-c2-l2` collection
Options include selecting a satellite collection, the number of features to produce, the spectral bands, and the time period. These options are selected in the `rcf_multiband.ipynb`, `Sentinel_2_RGB.ipynb`, or `s2_l8_multiband.ipynb` notebooks; a sketch of collecting these options in one place appears after the list below.
- Select a satellite collection
  - `landsat-8-c2-l2`
  - `sentinel-2-l2a`
- Select the desired number of features
- Defaults to 1000, which has resulted in excellent model performance for our project's goals
- Increasing the number of features increases computational cost and the time it takes to execute the notebook
- Select the relevant bands
  - Naming conventions are unique to each satellite
- Select a time period to featurize
- Constrained by satellite mission timeline:
- Landsat 8: temporal coverage = February 2013 - present
- Sentinel 2: temporal coverage = June 2015 - present
The notebook is configured to account for all of your desired inputs, but compute power may limit the extent of what is possible based on selected options
- For example, trying to featurize too many points in a single run may not only be slow, it may crash the kernel or cause a timeout or disconnect error
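As a rough illustration, the selections above might be collected near the top of a notebook like this; the variable names and values are ours, not the notebooks' exact ones.

```python
# Hypothetical top-of-notebook configuration gathering the user selections
# described above; names and values are illustrative.
satellite = "sentinel-2-l2a"           # or "landsat-c2-l2"
num_features = 1000                    # more features = longer run times
bands = ["B02", "B03", "B04", "B08"]   # example Sentinel-2 names: blue, green, red, NIR
years = [2018, 2019, 2020]             # must fall within the satellite mission timeline
months = list(range(1, 13))            # featurize monthly, January through December
cloud_cover_limit = 10                 # maximum percent cloud cover per scene
```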
The general notebook workflow:
- Find an appropriate STAC item for each point (in parallel, using a spatially partitioned dataset of points)
- Feed the points and STAC items to a custom Dataset that can read imagery given a point and the URL of an overlapping satellite scene (as sketched after this list)
- Use `stackstac` to stack the various bands of interest
- Use a custom Dataloader, which uses our Dataset, to feed our model imagery and save the corresponding features
- Loop through the year and month combinations selected by the user to output feature files in a compressed feather file format
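The sketch below shows how the Dataset and Dataloader steps fit together, assuming signed STAC items have already been matched to points (as in the earlier search example). The buffer size, resolution, band names, and class/variable names are illustrative, not the repository's exact implementation.

```python
# Sketch of the Dataset/DataLoader step: each point reads a small window of the
# requested bands from its matched (signed) STAC item via stackstac, and the
# DataLoader batches those images for the random convolutional model.
import numpy as np
import stackstac
import torch
from torch.utils.data import Dataset, DataLoader

class PointImageDataset(Dataset):
    def __init__(self, points, items, bands, buffer=0.005):
        self.points = points    # list of (lon, lat) tuples
        self.items = items      # one signed STAC item matched to each point
        self.bands = bands      # asset names to stack, e.g. ["B02", "B03", "B04"]
        self.buffer = buffer    # half-width of the window around the point, in degrees

    def __len__(self):
        return len(self.points)

    def __getitem__(self, idx):
        lon, lat = self.points[idx]
        stack = stackstac.stack(
            self.items[idx],
            assets=self.bands,
            epsg=4326,
            resolution=0.0001,   # illustrative (~10 m near the equator)
            bounds=(lon - self.buffer, lat - self.buffer,
                    lon + self.buffer, lat + self.buffer),
        )
        image = stack.squeeze().compute().values        # (bands, height, width)
        return torch.from_numpy(np.nan_to_num(image)).float()

# dataset = PointImageDataset(points, items, ["B02", "B03", "B04"])
# loader = DataLoader(dataset, batch_size=8, num_workers=4)
# for batch in loader:
#     features = featurize_batch(batch)   # hypothetical call to the RCF model
```

Each year/month run's resulting feature matrix can then be written out with, e.g., `pandas.DataFrame.to_feather`, producing the compressed feather files mentioned above.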
- MPC Hub persistent storage
- Limited to 15 gigabytes (with access to large temporary storage: ~200 gigabytes within a single session)
- Exceeding storage limits can cause your environment to not load on the next session
- Download output files and delete from the hub regularly
- MPC compute power
- A generous amount of compute power is provided free of charge, but processing may still be slow
- MPC Hub RAM
- Most runs will push the given memory to its limit, and sometimes past
- We have tried to implement aggressive memory recovery through deletion of objects and garbage collection (see the sketch after this list)
- MPC GPU node limit
  - The MPC Hub provides free GPU access, but access is limited; the GPU nodes are first come, first served, and it can be frustrating to try to access the MPC Hub when all nodes are being consumed by other users
- There are many ways to compute with MPC data, and it may be necessary to use alternate options
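For reference, the memory recovery mentioned above amounts to a pattern like the following between batches or months (the large tensor here is only a stand-in for whatever intermediate objects a run produces):

```python
# Pattern for aggressive memory recovery between loop iterations: drop the
# Python references to large intermediates, then ask CPU and GPU to release memory.
import gc
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
big_intermediate = torch.rand(2_000, 2_000, device=device)  # stand-in for a large object

del big_intermediate          # remove the reference to the large object
gc.collect()                  # reclaim host (CPU) memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached GPU memory to the driver
```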
There are many possible expansions and directions for future research that could contribute to this project of extending the MOSAIKS approach. A few of these ideas include:
- Testing how the cloud cover limit affects results. It is currently set to 10%, and we would recommend testing 15%, 20%, or more
- Alternatively, determining the best method for using every least-cloudy image for any given month and only throwing away 1 km points that do not meet a cloud threshold. In this way, more whole images may be retained, and fewer points would be lost to cloud cover.
- Producing features for regions other than Zambia, such as Tanzania and Nigeria, as those are other countries in sub-Saharan Africa with crop yield data (while the CropMOSAIKS team has access to this crop data, Tanzania and Nigeria were out of scope; with more time and features for these countries, the CropMOSAIKS team aims to eventually model crop yields for regions beyond Zambia)
The capstone project was completed on June 9th, 2022. Some team members will be continuing work in this field, and the repositories are likely to stay active for some time to come. Suggestions for improvements to the code or documentation are welcome and encouraged. Please submit questions, comments, or code via issues or pull requests on either of the repositories. To correspond with the data scientists who produced these materials that extend the MOSAIKS approach, please see their personal GitHub accounts at the bottom of the organization's README and feel free to contact them via email.
If you are interested in processing features for a new region other than Zambia and contributing these features to the MOSAIKS API, please see the GitHub repository here and create a pull request or issue. Additionally, you can contact the authors of the MOSAIKS paper with questions about the process.
For rules and regulations for this organization, please see the Code of Conduct