diff --git a/docs/catalogs/arguments.rst b/docs/catalogs/arguments.rst index fa34ff76..df2f0b81 100644 --- a/docs/catalogs/arguments.rst +++ b/docs/catalogs/arguments.rst @@ -86,7 +86,7 @@ When instantiating a pipeline, you can use the ``resume`` flag to indicate that we can resume from an earlier execution of the pipeline. By default, if any resume files are found, we will restore the pipeline's previous progress. -If you want to start the pipeline from scratch you can simply set `resume=False`. +If you want to start the pipeline from scratch you can simply set ``resume=False``. Alternatively, go to the temp directory you've specified and remove any intermediate files created by the previous runs of the ``hats-import`` pipeline. You should also remove the output directory if it has any content. The resume argument performs these @@ -179,7 +179,7 @@ Benefits: 2. If the files are very small, batching them in this way allows the import process to *combine* several small files into a single chunk for processing. This will result in fewer intermediate files during the ``splitting`` stage. -3. If you have a parquet files over a slow networked file system, we support +3. If you have parquet files over a slow networked file system, we support pyarrow's readahead protocol through indexed readers. Warnings: @@ -291,11 +291,6 @@ You must specify where you want your catalog data to be written, using ``output_path``. This path should be the base directory for your catalogs, as the full path for the catalog will take the form of ``output_path/output_artifact_name``. -If there is already catalog data in the indicated directory, you can force a -new catalog to be written in the directory with the ``overwrite`` flag. It's -preferable to delete any existing contents, however, as this may cause -unexpected side effects. - If you're writing to cloud storage, or otherwise have some filesystem credential dict, initialize ``output_path`` using ``universal_pathlib``'s utilities. diff --git a/docs/catalogs/temp_files.rst b/docs/catalogs/temp_files.rst index a752084d..8d23a6a1 100644 --- a/docs/catalogs/temp_files.rst +++ b/docs/catalogs/temp_files.rst @@ -107,12 +107,7 @@ Mapping stage In this stage, we're reading each input file and building a map of how many objects are in each high order pixel. For each input file, once finished, we will write a binary file with -the numpy array representing the number of objects in each pixel. - -.. tip:: - For ``highest_healpix_order=10``, this binary file is 96M. If you know your data will be - partitioned at a lower order (e.g. order 7), using the lower order in the arguments - can improve runtime and disk usage of the pipeline. +a sparse array representing the number of objects in each pixel. Binning stage ............................................................................... diff --git a/docs/guide/contact.rst b/docs/guide/contact.rst index 5645b658..572a7006 100644 --- a/docs/guide/contact.rst +++ b/docs/guide/contact.rst @@ -8,5 +8,30 @@ making our products better, or pretty much anything else, reach out! * Open an issue in our github repo for hats-import * https://github.com/astronomy-commons/hats-import/issues/new -* If you're on LSSTC slack, so are we! - `#lincc-frameworks-qa `_ \ No newline at end of file +* Start a new github discussion + * https://github.com/astronomy-commons/hats-import/discussions/new/choose +* If you're on LSSTC slack, so are we! 
+ * `#lincc-frameworks-qa `_ + * `#lincc-frameworks-lsdb `_ +* Join the working group, where we discuss HATS standardization and early results + * Google group: `hipscat-wg@googlegroups.com `_ + * You can listen in to demo meetings, or ask questions during co-working sessions. + Events are published on a google calendar, embedded below. + * Key: + * "HATS/LSDB Working Meeting" - Drop-in co-working/office hours. + * "HATS/LSDB Europe Working group" - Intended as European time zone friendly + discussion of HATS and LSDB. Generally open. + * "HATS/LSDB Monthly Demos" - A more structured telecon, with updates from + developers and collaborators, and HATS standardization planning. + * "LINCC Tech Talk" - Tech Talk series for software spanning LSST. + * "LINCC Frameworks Office Hours" - General office hours for anything + related to software maintained by LINCC Frameworks, during LINCC + incubators, or general software advice. + +.. raw:: html + + + +However you reach out, we want to make sure that any discourse is open and +inclusive, and we ask that everyone involved read and adhere to the +`LINCC Frameworks Code of Conduct `_ \ No newline at end of file diff --git a/docs/guide/contributing.rst b/docs/guide/contributing.rst index 7aa478a8..1498ec70 100644 --- a/docs/guide/contributing.rst +++ b/docs/guide/contributing.rst @@ -1,63 +1,69 @@ Contributing to hats-import =============================================================================== -Find (or make) a new GitHub issue -------------------------------------------------------------------------------- +HATS, hats-import, and LSDB are primarily written and maintained by LINCC Frameworks, but we +would love to turn it over to the open-source scientific community!! We want to +make sure that any discourse is open and inclusive, and we ask that everyone +involved read and adhere to the +`LINCC Frameworks Code of Conduct `_ -Add yourself as the assignee on an existing issue so that we know who's working -on what. (If you're not actively working on an issue, unassign yourself). +Installation from Source +------------------------ -If there isn't an issue for the work you want to do, please create one and include -a description. +To install the latest development version of hats-import you will want to build it from source. +First, with your virtual environment activated, type in your terminal: -You can reach the team with bug reports, feature requests, and general inquiries -by creating a new GitHub issue. +.. code-block:: bash -.. tip:: - Want to help? + git clone https://github.com/astronomy-commons/hats-import + cd hats-import/ - Do you want to help out, but you're not sure how? :doc:`/guide/contact` +To install the package and dependencies you can run the ``setup_dev`` script which installs all +the requirements to setup a development environment. -Create a branch -------------------------------------------------------------------------------- +.. code-block:: bash -It is preferable that you create a new branch with a name like -``issue/##/``. GitHub makes it pretty easy to associate -branches and tickets, but it's nice when it's in the name. + chmod +x .setup_dev.sh + ./.setup_dev.sh -Setting up a development environment -------------------------------------------------------------------------------- +Finally, to check that your package has been correctly installed, run the package unit tests: -Before installing any dependencies or writing code, it's a great idea to create a -virtual environment. 
LINCC-Frameworks engineers primarily use `conda` to manage virtual -environments. If you have conda installed locally, you can run the following to -create and activate a new environment. +.. code-block:: bash -.. code-block:: console + python -m pytest - >> conda create env -n python=3.10 - >> conda activate +Find (or make) a new GitHub issue +------------------------------------------------------------------------------- + +Add yourself as the assignee on an existing issue so that we know who's working +on what. If you're not actively working on an issue, unassign yourself. +If there isn't an issue for the work you want to do, please create one and include +a description. -Once you have created a new environment, you can install this project for local -development using the following command: +You can reach the team with bug reports, feature requests, and general inquiries +by creating a new GitHub issue. -.. code-block:: console +Note that you may need to make changes in multiple repos to fully implement new +features or bug fixes! See related projects: - >> source .setup_dev.sh +* HATS (`on GitHub `_ + and `on ReadTheDocs `_) +* LSDB (`on GitHub `_ + and `on ReadTheDocs `_) +Fork the repository +------------------------------------------------------------------------------- -Notes: +Contributing to hats-import requires you to `fork `_ +the GitHub repository. The next steps assume the creation of branches and PRs are performed from your fork. -1) The single quotes around ``'[dev]'`` may not be required for your operating system. -2) ``pre-commit install`` will initialize pre-commit for this local repository, so - that a set of tests will be run prior to completing a local commit. For more - information, see the Python Project Template documentation on - `pre-commit `_. -3) Installing ``pandoc`` allows you to verify that automatic rendering of Jupyter notebooks - into documentation for ReadTheDocs works as expected. For more information, see - the Python Project Template documentation on - `Sphinx and Python Notebooks `_. +.. note:: + + If you are (or expect to be) a frequent contributor, you should consider requesting + access to the `hats-friends `_ + working group. Members of this GitHub group should be able to create branches and PRs directly + on LSDB, hats and hats-import, without the need of a fork. Testing ------------------------------------------------------------------------------- @@ -72,43 +78,52 @@ paths. These are defined in ``conftest.py`` files. They're powerful and flexible Please add or update unit tests for all changes made to the codebase. You can run unit tests locally simply with: -.. code-block:: console +.. code-block:: bash - >> pytest + pytest If you're making changes to the sphinx documentation (anything under ``docs``), you can build the documentation locally with a command like: -.. code-block:: console +.. code-block:: bash - >> cd docs - >> make html + cd docs + make html -Create your PR -------------------------------------------------------------------------------- +We also have a handful of automated linters and checks using ``pre-commit``. You +can run against all staged changes with the command: + +.. code-block:: bash -Please use PR best practices, and get someone to review your code. + pre-commit -The LINCC Frameworks guidelines and philosophy on code reviews can be found on -`our wiki `_. 
+Create a branch +------------------------------------------------------------------------------- -We have a suite of continuous integration tests that run on PR creation. Please -follow the recommendations of the linter. +It is preferable that you create a new branch with a name like +``issue/##/``. GitHub makes it pretty easy to associate +branches and tickets, but it's nice when it's in the name. -Merge your PR +Create your PR ------------------------------------------------------------------------------- -The author of the PR is welcome to merge their own PR into the repository. +You will be required to get your code approved before merging into main. +If you're not sure who to send it to, you can use the round-robin assignment +to the ``astronomy-commons/lincc-frameworks`` group. + +We have a suite of continuous integration checks that run on PR creation. Please +follow the code quality recommendations of the linter and formatter, and make sure +every pipeline passes before submitting it for review. -Optional - Release a new version +Merge your PR ------------------------------------------------------------------------------- -Once your PR is merged you can create a new release to make your changes available. -GitHub's `instructions `_ for doing so are here. -Use your best judgement when incrementing the version. i.e. is this a major, minor, or patch fix. +When all the continuous integration checks have passed and upon receiving an +approving review, the author of the PR is welcome to merge it into the repository. -Be kind +Release new version ------------------------------------------------------------------------------- -You are expected to comply with the -`LINCC Frameworks Code of Conduct `_`. \ No newline at end of file +New versions are manually tagged and automatically released to pypi. To request +a new release of LSDB, HATS, and hats-import packages, create a +`release ticket `_. \ No newline at end of file diff --git a/docs/guide/hipscat_conversion.rst b/docs/guide/hipscat_conversion.rst index 20cdd8ab..84ae27fb 100644 --- a/docs/guide/hipscat_conversion.rst +++ b/docs/guide/hipscat_conversion.rst @@ -112,11 +112,6 @@ You must specify where you want your HATS table to be written, using ``output_path``. This path should be the base directory for your catalogs, as the full path for the HATS table will take the form of ``output_path/output_artifact_name``. -If there is already catalog data in the indicated directory, you can -force new data to be written in the directory with the ``overwrite`` flag. It's -preferable to delete any existing contents, however, as this may cause -unexpected side effects. - If you're writing to cloud storage, or otherwise have some filesystem credential dict, initialize ``output_path`` using ``universal_pathlib``'s utilities. @@ -134,4 +129,4 @@ What next? You can validate that your new HATS catalog meets both the HATS/LSDB expectations, as well as your own expectations of the data contents. You can follow along with the -`Manual catalog verification `_. +`Manual catalog verification `_. diff --git a/docs/guide/index_table.rst b/docs/guide/index_table.rst index 9bff8ea1..4050761b 100644 --- a/docs/guide/index_table.rst +++ b/docs/guide/index_table.rst @@ -157,30 +157,17 @@ list along to your ``ImportArguments``! indexing_column="target_id" ## you might not need to change anything after that. 
- total_metadata = file_io.read_parquet_metadata(os.path.join(input_catalog_path, "_metadata")) - - # This block just finds the indexing column within the _metadata file - first_row_group = total_metadata.row_group(0) - index_column_idx = -1 - for i in range(0, first_row_group.num_columns): - column = first_row_group.column(i) - if column.path_in_schema == indexing_column: - index_column_idx = i - - # Now loop through all of the partitions in the input data and find the - # overall bounds of the indexing_column. - num_row_groups = total_metadata.num_row_groups - global_min = total_metadata.row_group(0).column(index_column_idx).statistics.min - global_max = total_metadata.row_group(0).column(index_column_idx).statistics.max - - for index in range(1, num_row_groups): - global_min = min(global_min, total_metadata.row_group(index).column(index_column_idx).statistics.min) - global_max = max(global_max, total_metadata.row_group(index).column(index_column_idx).statistics.max) + catalog = hats.read_hats(input_catalog_path) + all_stats = catalog.aggregate_column_statistics() + + global_min = all_stats.at[indexing_column, "min_value"] + global_max = all_stats.at[indexing_column, "max_value"] + num_partitions = len(catalog.get_healpix_pixels()) print("global min", global_min) print("global max", global_max) - increment = int((global_max-global_min)/num_row_groups) + increment = int((global_max-global_min)/num_partitions) divisions = np.append(np.arange(start = global_min, stop = global_max, step = increment), global_max) divisions = divisions.tolist() @@ -223,11 +210,6 @@ You must specify where you want your index table to be written, using ``output_path``. This path should be the base directory for your catalogs, as the full path for the index will take the form of ``output_path/output_artifact_name``. -If there is already catalog or index data in the indicated directory, you can -force new data to be written in the directory with the ``overwrite`` flag. It's -preferable to delete any existing contents, however, as this may cause -unexpected side effects. - If you're writing to cloud storage, or otherwise have some filesystem credential dict, initialize ``output_path`` using ``universal_pathlib``'s utilities. diff --git a/docs/guide/margin_cache.rst b/docs/guide/margin_cache.rst index 613df6cb..6539725e 100644 --- a/docs/guide/margin_cache.rst +++ b/docs/guide/margin_cache.rst @@ -135,11 +135,6 @@ You must specify where you want your margin data to be written, using ``output_path``. This path should be the base directory for your catalogs, as the full path for the margin will take the form of ``output_path/output_artifact_name``. -If there is already catalog or margin data in the indicated directory, you can -force new data to be written in the directory with the ``overwrite`` flag. It's -preferable to delete any existing contents, however, as this may cause -unexpected side effects. - If you're writing to cloud storage, or otherwise have some filesystem credential dict, initialize ``output_path`` using ``universal_pathlib``'s utilities.
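
Several of the pages touched above (catalog import, HiPSCat conversion, index table, margin cache) end with the same instruction: when writing to cloud storage, initialize ``output_path`` with ``universal_pathlib``. As a minimal sketch of what that can look like, assuming a hypothetical S3 bucket and fsspec-style storage options that are placeholders only:

.. code-block:: python

    from upath import UPath

    # Hypothetical bucket and storage options; the exact keyword arguments
    # depend on the fsspec backend for your cloud filesystem.
    output_path = UPath("s3://my-bucket/hats_catalogs/", anon=False)

The resulting ``UPath`` can then be passed as ``output_path`` in the arguments described in those pages.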
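
Similarly, the ``index_table.rst`` hunk above replaces the manual ``_metadata`` row-group scan with catalog-level statistics. Pulled out of the diff and made self-contained, the updated divisions computation looks roughly like the sketch below; the catalog path and ``target_id`` column are placeholders carried over from the original example, and a numeric indexing column is assumed:

.. code-block:: python

    import hats
    import numpy as np

    input_catalog_path = "/path/to/input_catalog"  # placeholder
    indexing_column = "target_id"                  # placeholder

    # Read the catalog and aggregate per-column statistics across all partitions.
    catalog = hats.read_hats(input_catalog_path)
    all_stats = catalog.aggregate_column_statistics()

    global_min = all_stats.at[indexing_column, "min_value"]
    global_max = all_stats.at[indexing_column, "max_value"]
    num_partitions = len(catalog.get_healpix_pixels())

    # Build evenly spaced divisions over the global value range, one step per
    # partition, making sure the global maximum is included as the last bound.
    increment = int((global_max - global_min) / num_partitions)
    divisions = np.append(
        np.arange(start=global_min, stop=global_max, step=increment), global_max
    )
    divisions = divisions.tolist()

As in the original guide, the resulting ``divisions`` list is then passed along with the index creation arguments.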