LSDB
HATS - Hierarchical Adaptive Tiling Scheme
HATS is a directory structure and metadata convention for spatially arranging large catalog survey data. It was originally motivated by the need to perform spatial cross-matching between surveys at large scale, but it applies to a range of spatial analyses and algorithms.
We use HEALPix pixels at various orders to divide the sky into partitions so that each partition holds roughly the same number of objects, rather than covering equal area. Because each partition is therefore roughly the same size on disk, we can expect reasonable performance from parallel operations that run on each partition.
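To make the adaptive idea concrete, here is a minimal sketch of one way to choose partitions: start from the 12 base HEALPix pixels and keep splitting any pixel that holds more than a threshold number of objects. It relies only on the fact that, in the HEALPix nested numbering scheme, pixel `p` at order `k` has children `4p` through `4p + 3` at order `k + 1`. The function name `adaptive_partitions` and its parameters are hypothetical, and this top-down split is an illustration of the concept, not necessarily the algorithm the import tool uses.

```python
from collections import Counter


def adaptive_partitions(pixels, highest_order, threshold):
    """Group objects into nested-scheme HEALPix pixels of varying order.

    pixels: nested pixel index of each object, computed at `highest_order`.
    Returns a dict mapping (order, pixel) -> number of objects.
    (Hypothetical helper for illustration only.)
    """
    # Build a histogram of objects per pixel at every order, finest first.
    # In the nested scheme, the parent of pixel p is p // 4.
    counts = {highest_order: Counter(pixels)}
    for order in range(highest_order - 1, -1, -1):
        coarser = Counter()
        for pix, n in counts[order + 1].items():
            coarser[pix // 4] += n
        counts[order] = coarser

    partitions = {}

    def split(order, pix):
        n = counts[order][pix]
        if n == 0:
            return  # empty regions of sky get no partition
        if n <= threshold or order == highest_order:
            partitions[(order, pix)] = n
            return
        # Too many objects: descend into the four child pixels.
        for child in range(4 * pix, 4 * pix + 4):
            split(order + 1, child)

    # Order 0 consists of the 12 HEALPix base pixels.
    for base in range(12):
        split(0, base)
    return partitions
```

Dense regions of sky end up as many small, high-order pixels and sparse regions as a few large, low-order pixels, which is what keeps the per-partition object counts comparable.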
We use Parquet as the underlying storage format, as it provides efficient storage and retrieval of tabular data.
NB: This was previously named HiPSCat - Hierarchical Partitioned Survey Catalog - Storage for scalable catalog cross-matching
LSDB - Large Survey Database - A framework for scalable spatial analysis
LSDB is an analytics framework built on top of the HATS format. It is a Python library that can read and interpret HATS-formatted catalogs and perform parallel operations on the underlying partitioned data.
LSDB provides a reference implementation of cross-matching at scale using the HATS structure. In addition, we intend to support analysis and filtering of survey data, both before and after cross-matching.
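As a rough illustration of why spatial partitioning helps cross-matching, the sketch below matches two small object lists that are assumed to come from the same sky partition. It is a brute-force nearest-neighbor match using the haversine formula; the function names and the `(id, ra_deg, dec_deg)` tuple layout are hypothetical, and this is not LSDB's actual implementation.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def crossmatch_partition(left, right, radius_arcsec):
    """Match each left object to its nearest right object within a radius.

    left, right: lists of (id, ra_deg, dec_deg) from one shared partition.
    Returns a list of (left_id, right_id, separation_arcsec).
    (Hypothetical helper for illustration only.)
    """
    matches = []
    for lid, lra, ldec in left:
        best = None
        for rid, rra, rdec in right:
            sep = angular_separation_deg(lra, ldec, rra, rdec) * 3600
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (rid, sep)
        if best is not None:
            matches.append((lid, best[0], best[1]))
    return matches
```

Because each call only sees the objects of one partition, the quadratic cost stays bounded by the partition size, and separate partitions can be processed in parallel. This sketch ignores objects that lie close to a partition boundary, whose true nearest neighbor may sit in an adjacent partition.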
What LSDB is NOT
LSDB is NOT a full relational database; it focuses instead on spatial operations and full-survey analytics. At this time, we do not support updates to survey data, and we do not provide heavy optimization for non-spatial queries and filtering.
Status: Active development
Working Group:
- Design doc access through the working group: https://groups.google.com/g/hipscat-wg
- GitHub maintenance through: https://github.com/orgs/astronomy-commons/teams/hipscat-friends
We've implemented a HATS library in Python that reads and interprets catalog metadata (but does not interact with the partitioned Parquet files). (hipscat)
We've implemented LSDB in Python, using Dask as the supporting parallelization framework. (lsdb)
The LSDB library depends on the HATS library for reading catalog metadata.
In addition, we've implemented a hats-import tool, which reads survey data from a variety of existing formats and writes it in HATS format. (hats-import)
Issue Tracker: Joined issue search
Contributing LINCC Frameworks Team Members: Melissa DeLucchi, Mario Juric, Max West, Sam Wyatt, Sean McGuire, Sandro Campos, Konstantin Malanchev
Current / Recent Efforts:
2023 Q4 goals
- hipscat/hipscat-import documentation beta testing
- timedomain MVP (joint effort with LINCC Frameworks TAPE)
- margin productionization
- hipscat (data) feature freeze
- ADASS 23: hipscat/LSDB tutorial; get more feedback from IVOA
- LSDB v0.1 for alpha testing
Past Milestones
- v0.1 alpha (mid-July 2023)
  - hipscat-import
    - Map-reduce-based pipeline for creating the basic file structure from the original survey format (e.g. take a few large CSVs and convert them into structured Parquet with root-level metadata).
    - User guide for getting started with your own datasets.
  - HiPSCat
    - Read catalog metadata, as written by hipscat-import.
    - Perform basic spatial filtering (e.g. cone search) and list relevant Parquet files.
    - Simple Mollweide visualization of partitions.
  - LSDB
    - This is NOT scheduled to have a public release at this time.
- Based on feedback from alpha users, we will revisit our features and priorities for a v0.2 alpha round.
- hipscat-import
- Q1 2023
  - HiPSCat Library MVP (minimum viable product)
  - LSDB MVP
- Q4 2022
  - HiPSCat format prototype
  - Converted Gaia into HiPSCat format
  - Prototyped cross-matching with Dask DataFrames
To keep up to date on the effort, request membership in the working group: https://groups.google.com/g/hipscat-wg