NumpyToSQLite

Writing Numpy Arrays to SQLite databases (NumpyToSQLite/CreateDatabasev2.py)

In CreateDatabasev2.py you specify which pulse information in the numpy arrays you want written to a database file. The conversion is done by writing multiple temporary databases to disk in parallel, which are then merged into one large database at the end. The pulse information is transformed with sklearn.preprocessing.RobustScaler before being saved to the .db file; this step can be removed from the code or replaced with your own transforms. The code assigns an event number to each event, which facilitates extraction from the database. Event numbers range from 0 up to the number of events in the numpy array. The database contains two tables, truth and features: truth holds the target information from 'MCInIcePrimary', and features holds the associated pulse information.
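
As a rough sketch (not the actual script), the scaling and writing step for one small chunk looks something like the following; the column names and values are made up for illustration:

import sqlite3

import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical pulse information: one row per pulse, keyed by event_no.
features = pd.DataFrame({
    "event_no":   [0, 0, 1, 2, 2],
    "dom_time":   [10.1, 12.3, 9.8, 11.0, 14.2],
    "dom_charge": [0.8, 1.2, 0.5, 2.1, 0.9],
})

# Hypothetical target information: one row per event (e.g. from MCInIcePrimary).
truth = pd.DataFrame({
    "event_no": [0, 1, 2],
    "energy":   [1.2e3, 4.5e2, 8.9e3],
})

# Scale the pulse columns; event_no is left untouched.
cols = ["dom_time", "dom_charge"]
features[cols] = RobustScaler().fit_transform(features[cols])

with sqlite3.connect("example.db") as con:
    truth.to_sql("truth", con, if_exists="append", index=False)
    features.to_sql("features", con, if_exists="append", index=False)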

Please note that writing .db files is memory intensive. You can decrease memory usage by decreasing df_size in CreateDatabasev2.py (set to 100,000 by default), but this will also increase the run time. For reference, it takes around 2 hours to write 4.4 million events to a .db file with n_workers = 4.
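
The merging of the temporary databases is handled by the script itself; conceptually it follows the standard SQLite ATTACH/INSERT pattern, sketched here with placeholder file names:

import sqlite3

temp_dbs = ["tmp_0.db", "tmp_1.db", "tmp_2.db"]  # placeholder names

with sqlite3.connect("merged.db") as con:
    for i, path in enumerate(temp_dbs):
        con.execute("ATTACH DATABASE ? AS tmp", (path,))
        if i == 0:
            # The first temporary database defines the schema of the merged tables.
            con.execute("CREATE TABLE truth AS SELECT * FROM tmp.truth")
            con.execute("CREATE TABLE features AS SELECT * FROM tmp.features")
        else:
            con.execute("INSERT INTO truth SELECT * FROM tmp.truth")
            con.execute("INSERT INTO features SELECT * FROM tmp.features")
        con.commit()
        con.execute("DETACH DATABASE tmp")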

CreateDatabasev2.py takes the following arguments:
--array_path: The path to the numpy arrays from the I3-to-Numpy pipeline I3Cols. E.g.: /home/my_awesome_arrays

--key: The field/key of the pulse information you want to add to the database. Multiple keys are not supported. E.g.: 'SplitInIcePulses'

--db_name: The name of your database. E.g.: 'myfirstdatabase'

--gcd_path: The path to the gcd.pkl file containing spatial information. This file can be produced via I3ToNumpy/create_geo_array.py if you don't have it.

--outdir: The location in which you wish to save the database and the transformers. The script saves the database in yourpath/data and the pickled transformers in yourpath/meta. The transformers can be read with pandas.read_pickle() (see the sketch after the extraction example below).

--n_workers: The number of workers used to write the temporary databases in parallel.

Example:

python CreateDatabasev2.py --array_path ~/numpy_arrays --key 'SplitInIcePulses' --db_name 'ADataBase' --gcd_path ~/gcd --outdir ~/MyDatabases --n_workers 4

Suppose we now want to extract events 0, 1, 2, 3, and 4. One could do so as follows:

import os
import sqlite3

import pandas as pd

# sqlite3 does not expand "~", so expand the user path explicitly.
db_file = os.path.expanduser("~/MyDatabases/data/ADataBase.db")  # path from the example above

with sqlite3.connect(db_file) as con:
    truth_query   = 'SELECT * FROM truth WHERE event_no IN (0,1,2,3,4)'
    truth         = pd.read_sql(truth_query, con)

    feature_query = 'SELECT * FROM features WHERE event_no IN (0,1,2,3,4)'
    features      = pd.read_sql(feature_query, con)
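
Because the stored pulse features are RobustScaler-transformed, the pickled transformer from the meta directory can be used to map them back to physical values. A sketch, assuming the pickle holds the fitted scaler and that the column list matches what it was fitted on (the file name and columns below are placeholders):

import os

import pandas as pd

# Placeholder path; the transformers are written to <outdir>/meta.
scaler = pd.read_pickle(os.path.expanduser("~/MyDatabases/meta/transformers.pkl"))

# Placeholder column list; must match the columns the scaler was fitted on.
cols = ["dom_time", "dom_charge"]
features[cols] = scaler.inverse_transform(features[cols])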

Notes:
This is effectively a Lite version of https://github.com/ehrhorn/cubedb, a more feature-rich pipeline.

Writing I3-Files to Numpy Arrays

Run the scripts in I3ToNumpy in the following order:
 ./load_cvmfs.sh

Among other things, this loads IceTray, the IceCube software required to read I3-files. Now you can write your I3-files to numpy arrays using I3Cols:

 ./makearray.sh

In I3ToNumpy/makearray.sh you can change the input path and the keys you wish to extract from the I3-files. To create the gcd.pkl file, you can then run:

 ./create_geo_array.py

Notes:
I3ToNumpy/create_geo_array.py was NOT made by me (source: https://github.com/IceCubeOpenSource/retro/blob/master/retro/i3info/extract_gcd.py).
If your cvmfs environment doesn't contain I3Cols or other external packages, you can install them at the user level using

 pip install --user yourpackage