Documentation improvements #462

Merged 1 commit on Dec 20, 2024
9 changes: 2 additions & 7 deletions docs/catalogs/arguments.rst
@@ -86,7 +86,7 @@ When instantiating a pipeline, you can use the ``resume`` flag to indicate that
we can resume from an earlier execution of the pipeline. By default, if any resume
files are found, we will restore the pipeline's previous progress.
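
A minimal sketch of how the flag is passed (assuming the ``ImportArguments``
class described in this documentation; the other field names and values here are
placeholders, so check the arguments reference for your installed version):

.. code-block:: python

    from hats_import.catalog.arguments import ImportArguments

    args = ImportArguments(
        output_artifact_name="my_catalog",  # hypothetical catalog name
        input_path="/data/input_files",     # hypothetical input location
        file_reader="csv",                  # reader matching your input format
        output_path="/data/catalogs",
        resume=True,  # default: restore progress from any earlier run
    )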

If you want to start the pipeline from scratch you can simply set `resume=False`.
If you want to start the pipeline from scratch you can simply set ``resume=False``.
Alternatively, go to the temp directory you've specified and remove any intermediate
files created by the previous runs of the ``hats-import`` pipeline. You should also
remove the output directory if it has any content. The resume argument performs these
@@ -179,7 +179,7 @@ Benefits:
2. If the files are very small, batching them in this way allows the import
process to *combine* several small files into a single chunk for processing.
This will result in fewer intermediate files during the ``splitting`` stage.
3. If you have a parquet files over a slow networked file system, we support
3. If you have parquet files over a slow networked file system, we support
pyarrow's readahead protocol through indexed readers.

Warnings:
@@ -291,11 +291,6 @@ You must specify where you want your catalog data to be written, using
``output_path``. This path should be the base directory for your catalogs, as
the full path for the catalog will take the form of ``output_path/output_artifact_name``.

If there is already catalog data in the indicated directory, you can force a
new catalog to be written in the directory with the ``overwrite`` flag. It's
preferable to delete any existing contents, however, as this may cause
unexpected side effects.

If you're writing to cloud storage, or otherwise have some filesystem credential
dict, initialize ``output_path`` using ``universal_pathlib``'s utilities.
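
For example, a brief sketch with ``universal_pathlib`` (the bucket name and the
``anon`` credential option below are placeholders for your own storage settings):

.. code-block:: python

    from upath import UPath

    # Extra keyword arguments are forwarded to the underlying fsspec filesystem.
    output_path = UPath("s3://my-bucket/catalogs", anon=False)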

7 changes: 1 addition & 6 deletions docs/catalogs/temp_files.rst
@@ -107,12 +107,7 @@ Mapping stage

In this stage, we're reading each input file and building a map of how many objects are in
each high order pixel. For each input file, once finished, we will write a binary file with
the numpy array representing the number of objects in each pixel.

.. tip::
For ``highest_healpix_order=10``, this binary file is 96M. If you know your data will be
partitioned at a lower order (e.g. order 7), using the lower order in the arguments
can improve runtime and disk usage of the pipeline.
a sparse array representing the number of objects in each pixel.
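
Conceptually, the per-file counting looks something like the sketch below. This is
not the pipeline's internal code; it assumes ``healpy`` and ``scipy`` are available
and that the input file has ``ra``/``dec`` columns in degrees:

.. code-block:: python

    import healpy as hp
    import numpy as np
    import pandas as pd
    from scipy.sparse import csr_array

    highest_order = 7  # hypothetical; match your pipeline arguments
    nside = hp.order2nside(highest_order)

    # Find the nested-scheme HEALPix pixel for every row of one input file.
    chunk = pd.read_csv("input_file.csv")  # hypothetical input file
    pix = hp.ang2pix(
        nside, chunk["ra"].to_numpy(), chunk["dec"].to_numpy(), lonlat=True, nest=True
    )

    # Count objects per pixel; most pixels are empty, so keep the counts sparse.
    counts = np.bincount(pix, minlength=hp.nside2npix(nside))
    sparse_counts = csr_array(counts.reshape(1, -1))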

Binning stage
...............................................................................
29 changes: 27 additions & 2 deletions docs/guide/contact.rst
@@ -8,5 +8,30 @@ making our products better, or pretty much anything else, reach out!

* Open an issue in our github repo for hats-import
* https://github.com/astronomy-commons/hats-import/issues/new
* If you're on LSSTC slack, so are we!
`#lincc-frameworks-qa <https://lsstc.slack.com/archives/C062LG1AK1S>`_
* Start a new github discussion
* https://github.com/astronomy-commons/hats-import/discussions/new/choose
* If you're on LSSTC slack, so are we!
* `#lincc-frameworks-qa <https://lsstc.slack.com/archives/C062LG1AK1S>`_
* `#lincc-frameworks-lsdb <https://lsstc.slack.com/archives/C04610PQW9F>`_
* Join the working group, where we discuss HATS standardization and early results
* Google group: `[email protected] <https://groups.google.com/g/hipscat-wg>`_
* You can listen in to demo meetings, or ask questions during co-working sessions.
Events are published on a Google calendar, embedded below.
* Key:
* "HATS/LSDB Working Meeting" - Drop-in co-working/office hours.
* "HATS/LSDB Europe Working group" - Intended as European time zone friendly
discussion of HATS and LSDB. Generally open.
* "HATS/LSDB Monthly Demos" - A more structured telecon, with updates from
developers and collaborators, and HATS standardization planning.
* "LINCC Tech Talk" - Tech Talk series for software spanning LSST.
* "LINCC Frameworks Office Hours" - General office hours for anything
related to software maintained by LINCC Frameworks, during LINCC
incubators, or general software advice.

.. raw:: html

<iframe src="https://calendar.google.com/calendar/embed?height=600&wkst=1&ctz=America%2FNew_York&showPrint=0&src=Y180YTU1MTFiMDJiNjQ0OTlkNzIxNGE3Y2Y1NWY3NTE3NTY5YmE5NjQ1Y2FiMWM0YzA4YTdjYTQxYTIwNDE3YWQ1QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20&src=NWI3MDkyYTAxOTZlMjkwODQ4ODEwOGYzMTk2NjM3Yjg0MzU4ZWNlNjIwMzJkYTVhYzY4ZWRjMGIwNGM5ZWFkNUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t&color=%23F4511E&color=%23F09300" style="border:solid 1px #777" width="800" height="600" frameborder="0" scrolling="no"></iframe>

However you reach out, we want to make sure that any discourse is open and
inclusive, and we ask that everyone involved read and adhere to the
`LINCC Frameworks Code of Conduct <https://lsstdiscoveryalliance.org/programs/lincc-frameworks/code-conduct/>`_.
135 changes: 75 additions & 60 deletions docs/guide/contributing.rst
@@ -1,63 +1,69 @@
Contributing to hats-import
===============================================================================

Find (or make) a new GitHub issue
-------------------------------------------------------------------------------
HATS, hats-import, and LSDB are primarily written and maintained by LINCC Frameworks, but we
would love to turn them over to the open-source scientific community! We want to
make sure that any discourse is open and inclusive, and we ask that everyone
involved read and adhere to the
`LINCC Frameworks Code of Conduct <https://lsstdiscoveryalliance.org/programs/lincc-frameworks/code-conduct/>`_

Add yourself as the assignee on an existing issue so that we know who's working
on what. (If you're not actively working on an issue, unassign yourself).
Installation from Source
------------------------

If there isn't an issue for the work you want to do, please create one and include
a description.
To install the latest development version of hats-import, you will want to build it from source.
First, with your virtual environment activated, type in your terminal:

You can reach the team with bug reports, feature requests, and general inquiries
by creating a new GitHub issue.
.. code-block:: bash

.. tip::
Want to help?
git clone https://github.com/astronomy-commons/hats-import
cd hats-import/

Do you want to help out, but you're not sure how? :doc:`/guide/contact`
To install the package and its dependencies, you can run the ``setup_dev`` script, which installs
everything needed to set up a development environment.

Create a branch
-------------------------------------------------------------------------------
.. code-block:: bash

It is preferable that you create a new branch with a name like
``issue/##/<short-description>``. GitHub makes it pretty easy to associate
branches and tickets, but it's nice when it's in the name.
chmod +x .setup_dev.sh
./.setup_dev.sh

Setting up a development environment
-------------------------------------------------------------------------------
Finally, to check that your package has been correctly installed, run the package unit tests:

Before installing any dependencies or writing code, it's a great idea to create a
virtual environment. LINCC-Frameworks engineers primarily use `conda` to manage virtual
environments. If you have conda installed locally, you can run the following to
create and activate a new environment.
.. code-block:: bash

.. code-block:: console
python -m pytest

>> conda create env -n <env_name> python=3.10
>> conda activate <env_name>
Find (or make) a new GitHub issue
-------------------------------------------------------------------------------

Add yourself as the assignee on an existing issue so that we know who's working
on what. If you're not actively working on an issue, unassign yourself.

If there isn't an issue for the work you want to do, please create one and include
a description.

Once you have created a new environment, you can install this project for local
development using the following command:
You can reach the team with bug reports, feature requests, and general inquiries
by creating a new GitHub issue.

.. code-block:: console
Note that you may need to make changes in multiple repos to fully implement new
features or bug fixes! See related projects:

>> source .setup_dev.sh
* HATS (`on GitHub <https://github.com/astronomy-commons/hats>`_
and `on ReadTheDocs <https://hats.readthedocs.io/en/stable/>`_)
* LSDB (`on GitHub <https://github.com/astronomy-commons/lsdb>`_
and `on ReadTheDocs <https://docs.lsdb.io>`_)

Fork the repository
-------------------------------------------------------------------------------

Notes:
Contributing to hats-import requires you to `fork <https://github.com/astronomy-commons/hats-import/fork>`_
the GitHub repository. The next steps assume that branches and PRs are created from
your fork (see the sketch at the end of this section).

1) The single quotes around ``'[dev]'`` may not be required for your operating system.
2) ``pre-commit install`` will initialize pre-commit for this local repository, so
that a set of tests will be run prior to completing a local commit. For more
information, see the Python Project Template documentation on
`pre-commit <https://lincc-ppt.readthedocs.io/en/stable/practices/precommit.html>`_.
3) Installing ``pandoc`` allows you to verify that automatic rendering of Jupyter notebooks
into documentation for ReadTheDocs works as expected. For more information, see
the Python Project Template documentation on
`Sphinx and Python Notebooks <https://lincc-ppt.readthedocs.io/en/stable/practices/sphinx.html#python-notebooks>`_.
.. note::

If you are (or expect to be) a frequent contributor, you should consider requesting
access to the `hats-friends <https://github.com/orgs/astronomy-commons/teams/hats-friends>`_
working group. Members of this GitHub group should be able to create branches and PRs directly
on LSDB, HATS, and hats-import, without needing a fork.
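
If you do work from a fork, the usual setup looks roughly like this (substitute
your own GitHub username; the remote names are just conventions):

.. code-block:: bash

    # Clone your fork, then track the upstream repository for updates.
    git clone https://github.com/<your-username>/hats-import
    cd hats-import
    git remote add upstream https://github.com/astronomy-commons/hats-import
    git fetch upstream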

Testing
-------------------------------------------------------------------------------
@@ -72,43 +78,52 @@ paths. These are defined in ``conftest.py`` files. They're powerful and flexible
Please add or update unit tests for all changes made to the codebase. You can run
unit tests locally simply with:

.. code-block:: console
.. code-block:: bash

>> pytest
pytest

If you're making changes to the sphinx documentation (anything under ``docs``),
you can build the documentation locally with a command like:

.. code-block:: console
.. code-block:: bash

>> cd docs
>> make html
cd docs
make html

Create your PR
-------------------------------------------------------------------------------
We also have a handful of automated linters and checks that run via ``pre-commit``. You
can run them against all staged changes with the command:

.. code-block:: bash

Please use PR best practices, and get someone to review your code.
pre-commit

The LINCC Frameworks guidelines and philosophy on code reviews can be found on
`our wiki <https://github.com/lincc-frameworks/docs/wiki/Design-and-Code-Review-Policy>`_.
Create a branch
-------------------------------------------------------------------------------

We have a suite of continuous integration tests that run on PR creation. Please
follow the recommendations of the linter.
It is preferable that you create a new branch with a name like
``issue/##/<short-description>``. GitHub makes it pretty easy to associate
branches and tickets, but it's nice when it's in the name.
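
For example (the issue number and description here are hypothetical):

.. code-block:: bash

    git checkout -b issue/123/improve-docs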

Merge your PR
Create your PR
-------------------------------------------------------------------------------

The author of the PR is welcome to merge their own PR into the repository.
You will be required to get your code approved before merging into main.
If you're not sure who to send it to, you can use the round-robin assignment
to the ``astronomy-commons/lincc-frameworks`` group.

We have a suite of continuous integration checks that run on PR creation. Please
follow the code quality recommendations of the linter and formatter, and make sure
every pipeline passes before submitting it for review.

Optional - Release a new version
Merge your PR
-------------------------------------------------------------------------------

Once your PR is merged you can create a new release to make your changes available.
GitHub's `instructions <https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository>`_ for doing so are here.
Use your best judgement when incrementing the version. i.e. is this a major, minor, or patch fix.
When all the continuous integration checks have passed and upon receiving an
approving review, the author of the PR is welcome to merge it into the repository.

Be kind
Release new version
-------------------------------------------------------------------------------

You are expected to comply with the
`LINCC Frameworks Code of Conduct <https://lsstdiscoveryalliance.org/programs/lincc-frameworks/code-conduct/>`_`.
New versions are manually tagged and automatically released to PyPI. To request
a new release of the LSDB, HATS, and hats-import packages, create a
`release ticket <https://github.com/astronomy-commons/lsdb/issues/new?assignees=delucchi-cmu&labels=&projects=&template=4-release_tracker.md&title=Release%3A+>`_.
7 changes: 1 addition & 6 deletions docs/guide/hipscat_conversion.rst
@@ -112,11 +112,6 @@ You must specify where you want your HATS table to be written, using
``output_path``. This path should be the base directory for your catalogs, as
the full path for the HATS table will take the form of ``output_path/output_artifact_name``.

If there is already catalog data in the indicated directory, you can
force new data to be written in the directory with the ``overwrite`` flag. It's
preferable to delete any existing contents, however, as this may cause
unexpected side effects.

If you're writing to cloud storage, or otherwise have some filesystem credential
dict, initialize ``output_path`` using ``universal_pathlib``'s utilities.

@@ -134,4 +129,4 @@ What next?

You can validate that your new HATS catalog meets both the HATS/LSDB expectations,
as well as your own expectations of the data contents. You can follow along with the
`Manual catalog verification <https://docs.lsdb.io/en/stable/tutorials/manual_verification.html>`_.
`Manual catalog verification <https://docs.lsdb.io/en/stable/tutorials/pre_executed/manual_verification.html>`_.
32 changes: 7 additions & 25 deletions docs/guide/index_table.rst
@@ -157,30 +157,17 @@ list along to your ``ImportArguments``!
indexing_column="target_id"

## you might not need to change anything after that.
total_metadata = file_io.read_parquet_metadata(os.path.join(input_catalog_path, "_metadata"))

# This block just finds the indexing column within the _metadata file
first_row_group = total_metadata.row_group(0)
index_column_idx = -1
for i in range(0, first_row_group.num_columns):
column = first_row_group.column(i)
if column.path_in_schema == indexing_column:
index_column_idx = i

# Now loop through all of the partitions in the input data and find the
# overall bounds of the indexing_column.
num_row_groups = total_metadata.num_row_groups
global_min = total_metadata.row_group(0).column(index_column_idx).statistics.min
global_max = total_metadata.row_group(0).column(index_column_idx).statistics.max

for index in range(1, num_row_groups):
global_min = min(global_min, total_metadata.row_group(index).column(index_column_idx).statistics.min)
global_max = max(global_max, total_metadata.row_group(index).column(index_column_idx).statistics.max)
catalog = hats.read_hats(input_catalog_path)
all_stats = catalog.aggregate_column_statistics()

global_min = all_stats.at[indexing_column, "min_value"]
global_max = all_stats.at[indexing_column, "max_value"]
num_partitions = len(catalog.get_healpix_pixels())

print("global min", global_min)
print("global max", global_max)

increment = int((global_max-global_min)/num_row_groups)
increment = int((global_max-global_min)/num_partitions)

divisions = np.append(np.arange(start = global_min, stop = global_max, step = increment), global_max)
divisions = divisions.tolist()
@@ -223,11 +210,6 @@ You must specify where you want your index table to be written, using
``output_path``. This path should be the base directory for your catalogs, as
the full path for the index will take the form of ``output_path/output_artifact_name``.

If there is already catalog or index data in the indicated directory, you can
force new data to be written in the directory with the ``overwrite`` flag. It's
preferable to delete any existing contents, however, as this may cause
unexpected side effects.

If you're writing to cloud storage, or otherwise have some filesystem credential
dict, initialize ``output_path`` using ``universal_pathlib``'s utilities.

5 changes: 0 additions & 5 deletions docs/guide/margin_cache.rst
@@ -135,11 +135,6 @@ You must specify where you want your margin data to be written, using
``output_path``. This path should be the base directory for your catalogs, as
the full path for the margin will take the form of ``output_path/output_artifact_name``.

If there is already catalog or margin data in the indicated directory, you can
force new data to be written in the directory with the ``overwrite`` flag. It's
preferable to delete any existing contents, however, as this may cause
unexpected side effects.

If you're writing to cloud storage, or otherwise have some filesystem credential
dict, initialize ``output_path`` using ``universal_pathlib``'s utilities.
