Merge pull request #151 from COINtoolbox/150/time_domain_snpcc
150/time domain snpcc
emilleishida authored Jul 29, 2023
2 parents 4e73067 + 6a0e843 commit b8f7cfd
Showing 10 changed files with 432 additions and 142 deletions.
18 changes: 4 additions & 14 deletions docs/index.rst
@@ -48,21 +48,12 @@ Next, clone this repository in another chosen location:
(resspect) >>> git clone https://github.com/COINtoolbox/resspect
Navigate to the repository folder and do

.. code-block:: bash
(resspect) >>> pip install -r requirements.txt
You can now install this package with:
Navigate to the repository folder and you can now install this package with:

.. code-block:: bash
(resspect) >>> pip install -e .
.. hint:: You may choose to create your virtual environment within the folder of the repository. If you choose to do this, you must remember to exclude the virtual environment directory from version control using e.g., ``.gitignore``.
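The hint above can be sketched as a short shell session. This is an illustration only: the environment name ``.venv`` is an example, not a project convention.

```shell
# create the virtual environment inside the repository folder
python3 -m venv .venv

# keep the environment out of version control
echo ".venv/" >> .gitignore
```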

Setting up a working directory
------------------------------
@@ -91,9 +82,6 @@ This data was provided by Rick Kessler, after the publication of results from th
It allows you to run tests and validate your installation.


Data for the RESSPECT project can be found in the COIN server. Check the minutes document for the module you are interested in for information about the exact location.


Analysis steps
==============

@@ -163,7 +151,9 @@ Acknowledgements

This work is part of the Recommendation System for Spectroscopic Followup (RESSPECT) project, governed by an inter-collaboration agreement signed between the `Cosmostatistics Initiative (COIN) <https://cosmostatistics-initiative.org/>`_ and the `LSST Dark Energy Science Collaboration (DESC) <https://lsstdesc.org/>`_.

The `COsmostatistics INitiative (COIN) <https://cosmostatistics-initiative.org>`_ receives financial support from `CNRS <http://www.cnrs.fr/>`_ as part of its MOMENTUM programme over the 2018-2020 period, under the project *Active Learning for Large Scale Sky Surveys*.
The `COsmostatistics INitiative (COIN) <https://cosmostatistics-initiative.org>`_ is an international network of researchers whose goal is to foster interdisciplinarity inspired by Astronomy.

COIN received financial support from `CNRS <http://www.cnrs.fr/>`_ for the development of this project, as part of its MOMENTUM programme over the 2018-2020 period, under the project *Active Learning for Large Scale Sky Surveys*.

This work would not have been possible without intensive consultation of online platforms and
discussion forums. Although it is not possible to provide a complete list of the open source
139 changes: 115 additions & 24 deletions docs/learn_loop.rst
@@ -16,7 +16,7 @@ To start, we can load the feature information:
>>> from resspect import DataBase
>>> path_to_features_file = 'results/Bazin.dat'
>>> path_to_features_file = 'results/Bazin.csv'
>>> data = DataBase()
>>> data.load_features(path_to_features_file, method='Bazin', screen=True)
@@ -87,9 +87,9 @@ In interactive mode, you must define the required variables and use the :py:mod:
>>> method = 'Bazin' # only option in v1.0
>>> ml = 'RandomForest' # classifier
>>> strategy = 'RandomSampling' # learning strategy
>>> input_file = 'results/Bazin.dat' # input features file
>>> metric = 'results/metrics.dat' # output metrics file
>>> queried = 'results/queried.dat' # output query file
>>> input_file = 'results/Bazin.csv' # input features file
>>> metric = 'results/metrics.csv' # output metrics file
>>> queried = 'results/queried.csv' # output query file
>>> train = 'original' # initial training
>>> batch = 1 # size of batch
@@ -101,7 +101,7 @@ Alternatively you can also run everything from the command line:

.. code-block:: bash
>>> run_loop.py -i <input features file> -b <batch size> -n <number of loops>
>>> run_loop -i <input features file> -b <batch size> -n <number of loops>
>>> -m <output metrics file> -q <output queried sample file>
>>> -s <learning strategy> -t <choice of initial training>
@@ -110,51 +110,67 @@ Active Learning loop in time domain
===================================

Considering that you have previously prepared the time domain data, you can run the active learning loop
following the same algorithm described in `Ishida et al., 2019 <https://cosmostatistics-initiative.org/portfolio-item/active-learning-for-sn-classification/>`_ by using the :py:mod:`resspect.time_domain_loop` module:
following the same algorithm described in `Ishida et al., 2019 <https://cosmostatistics-initiative.org/portfolio-item/active-learning-for-sn-classification/>`_ by using the :py:mod:`resspect.time_domain_loop` module.

.. note:: The code below requires a file with features extracted from full light curves from which the initial sample will be drawn.

.. code-block:: python
:linenos:
>>> from resspect import time_domain_loop
>>> days = [20, 180] # first and last day of the survey to be considered
>>> training = 'original' # if int take int number of objects for initial training, 50% being Ia
>>> days = [20, 180] # first and last day of the survey
>>> training = 'original' # if int take int number of objs
# for initial training, 50% being Ia
>>> strategy = 'UncSampling' # learning strategy
>>> batch = 1 # if int, ignore cost per observation, if None find optimal batch size
>>> sep_files = False # if True, expects train, test and validation samples in separate files
>>> batch = 1 # if int, ignore cost per observation,
# if None find optimal batch size
>>> sep_files = False # if True, expects train, test and
# validation samples in separate files
>>> path_to_features_dir = 'results/time_domain/' # folder where the files for each day are stored
>>> # output results for metrics
>>> output_metrics_file = 'results/metrics_' + strategy + '_' + str(training) + \
'_batch' + str(batch) + '.dat' # output results for metrics
'_batch' + str(batch) + '.csv'
>>> # output query sample
>>> output_query_file = 'results/queried_' + strategy + '_' + str(training) + \
'_batch' + str(batch) + '.dat' # output query sample
'_batch' + str(batch) + '.csv'
>>> path_to_ini_files = {}
>>> path_to_ini_files['train'] = 'results/Bazin.dat' # features from full light curves for initial training sample
>>> # features from full light curves for initial training sample
>>> path_to_ini_files['train'] = 'results/Bazin.csv'
>>> survey = 'DES'
>>> classifier = 'RandomForest'
>>> n_estimators = 1000 # number of trees in the forest
>>> feature_method = 'Bazin'
>>> screen = False # if True will print many intermediate steps for debugging
>>> fname_pattern = ['day_', '.dat'] # pattern on filename where different days of the survey are stored
>>> queryable= True # if True, check brightness before considering an object queryable
>>> screen = False # if True will print many things for debugging
>>> fname_pattern = ['day_', '.csv'] # pattern on filename where different days
# are stored
>>> queryable = True # if True, check brightness before considering
# an object queryable
>>> # run time domain loop
>>> time_domain_loop(days=days, output_metrics_file=output_metrics_file,
>>> output_queried_file=output_query_file, path_to_ini_files=path_to_ini_files,
>>> output_queried_file=output_query_file,
>>> path_to_ini_files=path_to_ini_files,
>>> path_to_features_dir=path_to_features_dir,
>>> strategy=strategy, fname_pattern=fname_pattern, batch=batch, classifier=classifier,
>>> sep_files=sep_files,
>>> strategy=strategy, fname_pattern=fname_pattern, batch=batch,
>>> classifier=classifier,
>>> sep_files=sep_files,
>>> screen=screen, initial_training=training,
>>> survey=survey, queryable=queryable, n_estimators=n_estimators)
Make sure you check the full documentation of the module to understand which variables are required depending
on the case you wish to run.
Make sure you check the full documentation of the module to understand which variables are required depending on the case you wish to run.

More details can be found in the corresponding `docstring <https://github.com/COINtoolbox/resspect/blob/master/resspect/scripts/run_time_domain.py>`_.

@@ -176,4 +192,79 @@ The result will be something like the plot below (accounting for variations due
Separate samples and Telescope resources
----------------------------------------

Beyond the simple learning loop described above, `resspect` also handles a few batch strategies which take into account the available telescope time for spectroscopic follow-up... TBC
In a realistic situation, you might like to consider a more complex experiment design. For example, using a fixed validation sample and taking into account the time evolution of the transient and available resources for spectroscopic follow-up.

The RESSPECT team reported an extensive study which takes into account many of the caveats related to realistic astronomical observations. The full report can be found at `Kennamer et al., 2020 <https://cosmostatistics-initiative.org/portfolio-item/resspect1/>`_.

Following the procedure described in `Kennamer et al., 2020 <https://cosmostatistics-initiative.org/portfolio-item/resspect1/>`_, the first step is to separate objects into `train`, `test`, `validation` and `query` samples.

.. code-block:: python
:linenos:
>>> import pandas as pd
>>> from resspect import sep_samples
>>> from resspect import read_features_fullLC_samples
>>> # user input
>>> path_to_features = 'results/Bazin.csv'
>>> output_dir = 'results/' # output directory where files will be saved
>>> n_test_val = 1000 # number of objects in each sample: test and validation
>>> n_train = 1500 # number of objects to be separated for training
>>> # this should be big enough to allow tests according
>>> # to multiple initial conditions
>>> # read data and separate samples
>>> all_data = pd.read_csv(path_to_features, index_col=False)
>>> samples = sep_samples(all_data['id'].values, n_test_val=n_test_val,
>>> n_train=n_train)
>>> # read features and save them to separate files
>>> for sample in samples.keys():
>>>     output_fname = output_dir + sample + '_bazin.csv'
>>>     read_features_fullLC_samples(samples[sample], output_fname,
>>>                                  path_to_features)
This will save samples to individual files. From these, only the `query` sample needs to be prepared for time domain, following instructions in :ref:`Prepare data for Time Domain <prepare_time_domain>`. Once that is done, there are only a few inputs that need to be changed in the last call of the `time_domain_loop` function.

.. code-block:: python
:linenos:
>>> sep_files = True
>>> batch = None # use telescope time budgets instead of
# fixed number of queries per loop
>>> budgets = (6. * 3600, 6. * 3600) # budget of 6 hours per night of observation
>>> path_to_features_dir = 'results/time_domain/' # this is the path to the directory
# where the pool sample
# processed for time domain is stored
>>> path_to_ini_files = {}
>>> path_to_ini_files['train'] = 'results/train_bazin.csv'
>>> path_to_ini_files['test'] = 'results/test_bazin.csv'
>>> path_to_ini_files['validation'] = 'results/val_bazin.csv'
>>> # run time domain loop
>>> time_domain_loop(days=days, output_metrics_file=output_metrics_file,
>>> output_queried_file=output_query_file,
>>> path_to_ini_files=path_to_ini_files,
>>> path_to_features_dir=path_to_features_dir,
>>> strategy=strategy, fname_pattern=fname_pattern,
>>> batch=batch, classifier=classifier,
>>> sep_files=sep_files, budgets=budgets,
>>> screen=screen, initial_training=training,
>>> survey=survey, queryable=queryable, n_estimators=n_estimators)
The same result can be achieved from the command line using the `run_time_domain` script:

.. code-block:: bash
:linenos:
>>> run_time_domain -d <first day of survey> <last day of survey>
>>> -m <output metrics file> -q <output queried file> -f <features pool sample directory>
>>> -s <learning strategy> -t <training choice>
>>> -fl <path to initial training> -pv <path to validation> -pt <path to test>
.. warning:: Make sure you check the values of the optional variables as well!
12 changes: 10 additions & 2 deletions docs/prepare_time_domain.rst
@@ -47,22 +47,30 @@ You can perform the entire analysis for one day of the survey using the `SNPCCPh
>>> day = 20
>>> queryable_criteria = 2
>>> get_cost = True
>>> feature_extractor = 'bazin'
>>> tel_sizes=[4, 8]
>>> tel_names = ['4m', '8m']
>>> spec_SNR = 10
>>> number_of_processors = 5
>>> data = SNPCCPhotometry()
>>> data.create_daily_file(output_dir=output_dir, day=day,
>>> get_cost=get_cost)
>>> data.build_one_epoch(raw_data_dir=path_to_data,
>>> day_of_survey=day, time_domain_dir=output_dir,
>>> feature_extractor=feature_extractor,
>>> queryable_criteria=queryable_criteria,
>>> get_cost=get_cost)
>>> get_cost=get_cost, tel_sizes=tel_sizes,
>>> tel_names=tel_names, spec_SNR=spec_SNR,
>>> number_of_processors=number_of_processors)
Alternatively you can use the command line to prepare a sequence of days in one batch:

.. code-block:: bash
>>> build_time_domain_snpcc.py -d 20 21 22 23 -p <path to raw data dir>
>>> -o <path to output time domain dir> -q 2 -c True
>>> -o <path to output time domain dir> -q 2 -c True -nc 5
For PLASTiCC
^^^^^^^^^^^^
10 changes: 9 additions & 1 deletion resspect/__init__.py
@@ -25,8 +25,9 @@
from .fit_lightcurves import *
from .learn_loop import *
from .metrics import *
from .query_strategies import *
from .plot_results import *
from .query_strategies import *
from .samples_utils import *
from .snana_fits_to_pd import *
from .scripts.build_canonical import build_canonical as build_canonical
from .scripts.build_time_domain_snpcc import build_time_domain_snpcc as build_time_domain_snpcc
@@ -43,6 +44,11 @@
from .query_budget_strategies import *
from .bump import *

import importlib.metadata

__version__ = importlib.metadata.version("resspect")
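The `__version__` assignment above reads the version string from the installed package metadata. A minimal sketch of the same pattern, with a fallback for environments where the distribution is not installed (the helper name and fallback string are illustrative assumptions, not part of the source):

```python
import importlib.metadata


def resolve_version(distribution: str) -> str:
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        # hypothetical fallback: the package is not installed in this environment
        return "0+unknown"


print(resolve_version("pip"))
```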


__all__ = ['accuracy',
'assign_cosmo',
'bazin',
@@ -88,9 +94,11 @@
'purity',
'random_forest',
'random_sampling',
'read_features_fullLC_samples',
'read_fits',
'run_loop',
'run_time_domain',
'sep_samples',
'SNPCCPhotometry',
'svm',
'time_domain_loop',