Documentation improvements #462

Merged 1 commit on Dec 20, 2024
9 changes: 2 additions & 7 deletions docs/catalogs/arguments.rst
@@ -86,7 +86,7 @@ When instantiating a pipeline, you can use the ``resume`` flag to indicate that
we can resume from an earlier execution of the pipeline. By default, if any resume
files are found, we will restore the pipeline's previous progress.
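
A minimal sketch of how the flag is passed (assuming the ``ImportArguments``
class described in this documentation; the other field names and values here are
placeholders, so check the arguments reference for your installed version):

.. code-block:: python

    from hats_import.catalog.arguments import ImportArguments

    args = ImportArguments(
        output_artifact_name="my_catalog",  # hypothetical catalog name
        input_path="/data/input_files",     # hypothetical input location
        file_reader="csv",                  # reader matching your input format
        output_path="/data/catalogs",
        resume=True,  # default: restore progress from any earlier run
    )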

If you want to start the pipeline from scratch you can simply set `resume=False`.
If you want to start the pipeline from scratch you can simply set ``resume=False``.
Alternatively, go to the temp directory you've specified and remove any intermediate
files created by the previous runs of the ``hats-import`` pipeline. You should also
remove the output directory if it has any content. The resume argument performs these
@@ -179,7 +179,7 @@ Benefits:
2. If the files are very small, batching them in this way allows the import
process to *combine* several small files into a single chunk for processing.
This will result in fewer intermediate files during the ``splitting`` stage.
3. If you have a parquet files over a slow networked file system, we support
3. If you have parquet files over a slow networked file system, we support
pyarrow's readahead protocol through indexed readers.

Warnings:
@@ -291,11 +291,6 @@ You must specify where you want your catalog data to be written, using
``output_path``. This path should be the base directory for your catalogs, as
the full path for the catalog will take the form of ``output_path/output_artifact_name``.

If there is already catalog data in the indicated directory, you can force a
new catalog to be written in the directory with the ``overwrite`` flag. It's
preferable to delete any existing contents, however, as this may cause
unexpected side effects.

If you're writing to cloud storage, or otherwise have some filesystem credential
dict, initialize ``output_path`` using ``universal_pathlib``'s utilities.
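
For example, a brief sketch with ``universal_pathlib`` (the bucket name and the
``anon`` credential option below are placeholders for your own storage settings):

.. code-block:: python

    from upath import UPath

    # Extra keyword arguments are forwarded to the underlying fsspec filesystem.
    output_path = UPath("s3://my-bucket/catalogs", anon=False)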

7 changes: 1 addition & 6 deletions docs/catalogs/temp_files.rst
@@ -107,12 +107,7 @@ Mapping stage

In this stage, we're reading each input file and building a map of how many objects are in
each high order pixel. For each input file, once finished, we will write a binary file with
the numpy array representing the number of objects in each pixel.

.. tip::
For ``highest_healpix_order=10``, this binary file is 96M. If you know your data will be
partitioned at a lower order (e.g. order 7), using the lower order in the arguments
can improve runtime and disk usage of the pipeline.
a sparse array representing the number of objects in each pixel.
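
Conceptually, the per-file counting looks something like the sketch below. This is
not the pipeline's internal code; it assumes ``healpy`` and ``scipy`` are available
and that the input file has ``ra``/``dec`` columns in degrees:

.. code-block:: python

    import healpy as hp
    import numpy as np
    import pandas as pd
    from scipy.sparse import csr_array

    highest_order = 7  # hypothetical; match your pipeline arguments
    nside = hp.order2nside(highest_order)

    # Find the nested-scheme HEALPix pixel for every row of one input file.
    chunk = pd.read_csv("input_file.csv")  # hypothetical input file
    pix = hp.ang2pix(
        nside, chunk["ra"].to_numpy(), chunk["dec"].to_numpy(), lonlat=True, nest=True
    )

    # Count objects per pixel; most pixels are empty, so keep the counts sparse.
    counts = np.bincount(pix, minlength=hp.nside2npix(nside))
    sparse_counts = csr_array(counts.reshape(1, -1))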

Binning stage
...............................................................................
29 changes: 27 additions & 2 deletions docs/guide/contact.rst
@@ -8,5 +8,30 @@ making our products better, or pretty much anything else, reach out!

* Open an issue in our github repo for hats-import
* https://github.com/astronomy-commons/hats-import/issues/new
* If you're on LSSTC slack, so are we!
`#lincc-frameworks-qa <https://lsstc.slack.com/archives/C062LG1AK1S>`_
* Start a new github discussion
* https://github.com/astronomy-commons/hats-import/discussions/new/choose
* If you're on LSSTC slack, so are we!
* `#lincc-frameworks-qa <https://lsstc.slack.com/archives/C062LG1AK1S>`_
* `#lincc-frameworks-lsdb <https://lsstc.slack.com/archives/C04610PQW9F>`_
* Join the working group, where we discuss HATS standardization and early results
* Google group: `[email protected] <https://groups.google.com/g/hipscat-wg>`_
* You can listen in to demo meetings, or ask questions during co-working sessions.
Events are published on a Google calendar, embedded below.
* Key:
* "HATS/LSDB Working Meeting" - Drop-in co-working/office hours.
* "HATS/LSDB Europe Working group" - Intended as European time zone friendly
discussion of HATS and LSDB. Generally open.
* "HATS/LSDB Monthly Demos" - A more structured telecon, with updates from
developers and collaborators, and HATS standardization planning.
* "LINCC Tech Talk" - Tech Talk series for software spanning LSST.
* "LINCC Frameworks Office Hours" - General office hours for anything
related to software maintained by LINCC Frameworks, during LINCC
incubators, or general software advice.

.. raw:: html

<iframe src="https://calendar.google.com/calendar/embed?height=600&wkst=1&ctz=America%2FNew_York&showPrint=0&src=Y180YTU1MTFiMDJiNjQ0OTlkNzIxNGE3Y2Y1NWY3NTE3NTY5YmE5NjQ1Y2FiMWM0YzA4YTdjYTQxYTIwNDE3YWQ1QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20&src=NWI3MDkyYTAxOTZlMjkwODQ4ODEwOGYzMTk2NjM3Yjg0MzU4ZWNlNjIwMzJkYTVhYzY4ZWRjMGIwNGM5ZWFkNUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t&color=%23F4511E&color=%23F09300" style="border:solid 1px #777" width="800" height="600" frameborder="0" scrolling="no"></iframe>

However you reach out, we want to make sure that any discourse is open and
inclusive, and we ask that everyone involved read and adhere to the
`LINCC Frameworks Code of Conduct <https://lsstdiscoveryalliance.org/programs/lincc-frameworks/code-conduct/>`_.
135 changes: 75 additions & 60 deletions docs/guide/contributing.rst
@@ -1,63 +1,69 @@
Contributing to hats-import
===============================================================================

Find (or make) a new GitHub issue
-------------------------------------------------------------------------------
HATS, hats-import, and LSDB are primarily written and maintained by LINCC Frameworks, but we
would love to turn them over to the open-source scientific community! We want to
make sure that any discourse is open and inclusive, and we ask that everyone
involved read and adhere to the
`LINCC Frameworks Code of Conduct <https://lsstdiscoveryalliance.org/programs/lincc-frameworks/code-conduct/>`_

Add yourself as the assignee on an existing issue so that we know who's working
on what. (If you're not actively working on an issue, unassign yourself).
Installation from Source
------------------------

If there isn't an issue for the work you want to do, please create one and include
a description.
To install the latest development version of hats-import, you will want to build it from source.
First, with your virtual environment activated, type in your terminal:

You can reach the team with bug reports, feature requests, and general inquiries
by creating a new GitHub issue.
.. code-block:: bash

.. tip::
Want to help?
git clone https://github.com/astronomy-commons/hats-import
cd hats-import/

Do you want to help out, but you're not sure how? :doc:`/guide/contact`
To install the package and its dependencies, you can run the ``setup_dev`` script, which installs
everything needed to set up a development environment.

Create a branch
-------------------------------------------------------------------------------
.. code-block:: bash

It is preferable that you create a new branch with a name like
``issue/##/<short-description>``. GitHub makes it pretty easy to associate
branches and tickets, but it's nice when it's in the name.
chmod +x .setup_dev.sh
./.setup_dev.sh

Setting up a development environment
-------------------------------------------------------------------------------
Finally, to check that your package has been correctly installed, run the package unit tests:

Before installing any dependencies or writing code, it's a great idea to create a
virtual environment. LINCC-Frameworks engineers primarily use `conda` to manage virtual
environments. If you have conda installed locally, you can run the following to
create and activate a new environment.
.. code-block:: bash

.. code-block:: console
python -m pytest

>> conda create env -n <env_name> python=3.10
>> conda activate <env_name>
Find (or make) a new GitHub issue
-------------------------------------------------------------------------------

Add yourself as the assignee on an existing issue so that we know who's working
on what. If you're not actively working on an issue, unassign yourself.

If there isn't an issue for the work you want to do, please create one and include
a description.

Once you have created a new environment, you can install this project for local
development using the following command:
You can reach the team with bug reports, feature requests, and general inquiries
by creating a new GitHub issue.

.. code-block:: console
Note that you may need to make changes in multiple repos to fully implement new
features or bug fixes! See related projects:

>> source .setup_dev.sh
* HATS (`on GitHub <https://github.com/astronomy-commons/hats>`_
and `on ReadTheDocs <https://hats.readthedocs.io/en/stable/>`_)
* LSDB (`on GitHub <https://github.com/astronomy-commons/lsdb>`_
and `on ReadTheDocs <https://docs.lsdb.io>`_)

Fork the repository
-------------------------------------------------------------------------------

Notes:
Contributing to hats-import requires you to `fork <https://github.com/astronomy-commons/hats-import/fork>`_
the GitHub repository. The next steps assume that branches and PRs are created from
your fork (see the sketch at the end of this section).

1) The single quotes around ``'[dev]'`` may not be required for your operating system.
2) ``pre-commit install`` will initialize pre-commit for this local repository, so
that a set of tests will be run prior to completing a local commit. For more
information, see the Python Project Template documentation on
`pre-commit <https://lincc-ppt.readthedocs.io/en/stable/practices/precommit.html>`_.
3) Installing ``pandoc`` allows you to verify that automatic rendering of Jupyter notebooks
into documentation for ReadTheDocs works as expected. For more information, see
the Python Project Template documentation on
`Sphinx and Python Notebooks <https://lincc-ppt.readthedocs.io/en/stable/practices/sphinx.html#python-notebooks>`_.
.. note::

If you are (or expect to be) a frequent contributor, you should consider requesting
access to the `hats-friends <https://github.com/orgs/astronomy-commons/teams/hats-friends>`_
working group. Members of this GitHub group should be able to create branches and PRs directly
on LSDB, HATS, and hats-import, without needing a fork.
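
If you do work from a fork, the usual setup looks roughly like this (substitute
your own GitHub username; the remote names are just conventions):

.. code-block:: bash

    # Clone your fork, then track the upstream repository for updates.
    git clone https://github.com/<your-username>/hats-import
    cd hats-import
    git remote add upstream https://github.com/astronomy-commons/hats-import
    git fetch upstream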

Testing
-------------------------------------------------------------------------------
@@ -72,43 +78,52 @@ paths. These are defined in ``conftest.py`` files. They're powerful and flexible
Please add or update unit tests for all changes made to the codebase. You can run
unit tests locally simply with:

.. code-block:: console
.. code-block:: bash

>> pytest
pytest

If you're making changes to the sphinx documentation (anything under ``docs``),
you can build the documentation locally with a command like:

.. code-block:: console
.. code-block:: bash

>> cd docs
>> make html
cd docs
make html

Create your PR
-------------------------------------------------------------------------------
We also have a handful of automated linters and checks that run via ``pre-commit``. You
can run them against all staged changes with the command:

.. code-block:: bash

Please use PR best practices, and get someone to review your code.
pre-commit

The LINCC Frameworks guidelines and philosophy on code reviews can be found on
`our wiki <https://github.com/lincc-frameworks/docs/wiki/Design-and-Code-Review-Policy>`_.
Create a branch
-------------------------------------------------------------------------------

We have a suite of continuous integration tests that run on PR creation. Please
follow the recommendations of the linter.
It is preferable that you create a new branch with a name like
``issue/##/<short-description>``. GitHub makes it pretty easy to associate
branches and tickets, but it's nice when it's in the name.
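
For example (the issue number and description here are hypothetical):

.. code-block:: bash

    git checkout -b issue/123/improve-docs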

Merge your PR
Create your PR
-------------------------------------------------------------------------------

The author of the PR is welcome to merge their own PR into the repository.
You will be required to get your code approved before merging into main.
If you're not sure who to send it to, you can use the round-robin assignment
to the ``astronomy-commons/lincc-frameworks`` group.

We have a suite of continuous integration checks that run on PR creation. Please
follow the code quality recommendations of the linter and formatter, and make sure
every pipeline passes before submitting it for review.

Optional - Release a new version
Merge your PR
-------------------------------------------------------------------------------

Once your PR is merged you can create a new release to make your changes available.
GitHub's `instructions <https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository>`_ for doing so are here.
Use your best judgement when incrementing the version. i.e. is this a major, minor, or patch fix.
When all the continuous integration checks have passed and upon receiving an
approving review, the author of the PR is welcome to merge it into the repository.

Be kind
Release new version
-------------------------------------------------------------------------------

You are expected to comply with the
`LINCC Frameworks Code of Conduct <https://lsstdiscoveryalliance.org/programs/lincc-frameworks/code-conduct/>`_`.
New versions are manually tagged and automatically released to PyPI. To request
a new release of the LSDB, HATS, and hats-import packages, create a
`release ticket <https://github.com/astronomy-commons/lsdb/issues/new?assignees=delucchi-cmu&labels=&projects=&template=4-release_tracker.md&title=Release%3A+>`_.
7 changes: 1 addition & 6 deletions docs/guide/hipscat_conversion.rst
@@ -112,11 +112,6 @@ You must specify where you want your HATS table to be written, using
``output_path``. This path should be the base directory for your catalogs, as
the full path for the HATS table will take the form of ``output_path/output_artifact_name``.

If there is already catalog data in the indicated directory, you can
force new data to be written in the directory with the ``overwrite`` flag. It's
preferable to delete any existing contents, however, as this may cause
unexpected side effects.

If you're writing to cloud storage, or otherwise have some filesystem credential
dict, initialize ``output_path`` using ``universal_pathlib``'s utilities.

@@ -134,4 +129,4 @@ What next?

You can validate that your new HATS catalog meets both the HATS/LSDB expectations,
as well as your own expectations of the data contents. You can follow along with the
`Manual catalog verification <https://docs.lsdb.io/en/stable/tutorials/manual_verification.html>`_.
`Manual catalog verification <https://docs.lsdb.io/en/stable/tutorials/pre_executed/manual_verification.html>`_.
32 changes: 7 additions & 25 deletions docs/guide/index_table.rst
@@ -157,30 +157,17 @@ list along to your ``ImportArguments``!
indexing_column="target_id"

## you might not need to change anything after that.
total_metadata = file_io.read_parquet_metadata(os.path.join(input_catalog_path, "_metadata"))

# This block just finds the indexing column within the _metadata file
first_row_group = total_metadata.row_group(0)
index_column_idx = -1
for i in range(0, first_row_group.num_columns):
column = first_row_group.column(i)
if column.path_in_schema == indexing_column:
index_column_idx = i

# Now loop through all of the partitions in the input data and find the
# overall bounds of the indexing_column.
num_row_groups = total_metadata.num_row_groups
global_min = total_metadata.row_group(0).column(index_column_idx).statistics.min
global_max = total_metadata.row_group(0).column(index_column_idx).statistics.max

for index in range(1, num_row_groups):
global_min = min(global_min, total_metadata.row_group(index).column(index_column_idx).statistics.min)
global_max = max(global_max, total_metadata.row_group(index).column(index_column_idx).statistics.max)
catalog = hats.read_hats(input_catalog_path)
all_stats = catalog.aggregate_column_statistics()

global_min = all_stats.at[indexing_column, "min_value"]
global_max = all_stats.at[indexing_column, "max_value"]
num_partitions = len(catalog.get_healpix_pixels())

print("global min", global_min)
print("global max", global_max)

increment = int((global_max-global_min)/num_row_groups)
increment = int((global_max-global_min)/num_partitions)

divisions = np.append(np.arange(start = global_min, stop = global_max, step = increment), global_max)
divisions = divisions.tolist()
@@ -223,11 +210,6 @@ You must specify where you want your index table to be written, using
``output_path``. This path should be the base directory for your catalogs, as
the full path for the index will take the form of ``output_path/output_artifact_name``.

If there is already catalog or index data in the indicated directory, you can
force new data to be written in the directory with the ``overwrite`` flag. It's
preferable to delete any existing contents, however, as this may cause
unexpected side effects.

If you're writing to cloud storage, or otherwise have some filesystem credential
dict, initialize ``output_path`` using ``universal_pathlib``'s utilities.

5 changes: 0 additions & 5 deletions docs/guide/margin_cache.rst
@@ -135,11 +135,6 @@ You must specify where you want your margin data to be written, using
``output_path``. This path should be the base directory for your catalogs, as
the full path for the margin will take the form of ``output_path/output_artifact_name``.

If there is already catalog or margin data in the indicated directory, you can
force new data to be written in the directory with the ``overwrite`` flag. It's
preferable to delete any existing contents, however, as this may cause
unexpected side effects.

If you're writing to cloud storage, or otherwise have some filesystem credential
dict, initialize ``output_path`` using ``universal_pathlib``'s utilities.
