Pure-python port of J. Helly's read_eagle.
If you're used to using the original read_eagle
, all you should need to do is change import statements from import read_eagle
to import pyread_eagle as read_eagle
, or similar (from pyread_eagle import EagleSnapshot
, etc.). Existing scripts should work seamlessly with this change.
Install from pip with:
- 'pip install pyread_eagle'
Install from github:
- Download via web UI, or 'git clone https://github.com/kyleaoman/pyread_eagle.git'
- Install dependencies if necessary (see 'setup.py'), some may be found in other repositories by kyleaoman.
- Global install (Linux):
- cd to directory with 'setup.py'
- run 'sudo pip install -e .' (-e installs via symlink, so pulling repository will do a 'live' update of the installation)
- User install (Linux):
- cd to directory with 'setup.py'
- ensure '~/lib/python3.7/site-packages' or similar is on your PYTHONPATH (e.g. 'echo $PYTHONPATH'), if not, add it (perhaps in .bash_profile or similar)
- run 'pip install --prefix ~ -e .' (-e installs via symlink, so pulling repository will do a 'live' update of the installation)
- cd to a directory outside the module and launch python; you should be able to do 'from pyread_eagle import *'
Alternately, pip can retrieve and install the version on github directly:
- 'pip3 install git+https://github.com/kyleaoman/pyread_eagle.git'
The documentation below is adapted from the original read_eagle readme file.
These routines can be used if Gadget was run with the -DEAGLE_SORT_OUTPUT
flag enabled. They provide a quick way to read in spatial regions
without having to read all the data or even open all of the files.
It works by splitting the simulation volume into a grid and ensuring that particles in the same grid cell are stored consecutively in the snapshot files. Some extra datasets are added to the snapshot which specify which grid cells are stored in which files and the location of the start of each cell in the file.
The procedure to read in data is as follows:
- Open the snapshot by specifying the name of one file
- Flag the grid cells to be read in by calling
select_region
one or more times. - Call
read_dataset
once for each quantity to be read, specifying which particle type to read and the hdf5 dataset name.
You can then either close/delete the snapshot object or call clear_selection
to
read in a different region.
Any of the datasets in the PartTypeX
groups can be read in. The code
makes no assumptions about what datasets are present in the snapshot.
Unlike the original read_eagle
, this version is written directly in python
and therefore does not depend on having a properly configured C compiler
or hdf5 libraries, all that is needed is python, numpy and h5py.
Python2 is not supported. I think pyread_eagle
will run in python2.7,
but since python2 is deprecated as of Jan 2020, I have not and have no intention
to test this.
from pyread_eagle import EagleSnapshot
snap = EagleSnapshot("./snap_020.0.hdf5")
snap.select_region(4.0, 5.0, 2.0, 3.0, 3.0, 4.0)
xyz = snap.read_dataset(0, "Coordinates")
ids = snap.read_dataset(0, "ParticleIDs")
del snap
pyread_eagle
is comparable in speed to read_eagle
. The actual read_dataset calls
vary depending on the region (how contiguous it is in the dataset) between very roughly 2x faster
and 5x slower than the C version. Given the limited flexibility of the h5py
API, I believe it
will be difficult to speed this up further.
If a routine fails it will raise an exception. Because of differences in
the structure of the code relative to read_eagle
, not all the same sanity
checks are made, not all the same errors can arise, and therefore not all
the same error messages may appear. I have also chosen to be more lax and
not try to catch every edge case by hand, rather letting failures happen
and be reported organically by the underlying python libraries. Please
report any errors whose messages are obscure, and of course any bugs.
Note that the output of pyread_eagle will differ from the original version when the sampling rate is set < 1.0. In the original version, a subsample was drawn by comparing a pseudo-random number to the sampling rate for each particle. This resulted in a subsample of approximately the requested fraction of particles. Since the seed was set explicitly, the results were reproducible. Since pyread_eagle uses the numpy PRNG, the output cannot match what would be obtained using the C PRNG. Instead, a random sample of consistent length is drawn (the seed is still set explicitly, so results remain reproducible run-to-run), equal to floor(number of particle * sampling rate), evaluated file-by-file. So, the overall total number of particles times the sampling rate will not always be equal to the length of the output array, due to rounding errors (the largest possible difference is the number of files in the snapshot).
snap = pyread_eagle.EagleSnapshot(fname, verbose=False)
This opens the snapshot which contains the specified file. Closing explicitly is not necessary: in python the snapshot will be automatically deallocated once there are no more references to it.
Parameters
fname
: name of any one file in the snapshotverbose
: ifTrue
, print extra messages during function calls (default: False)
Return value
An instance of the EagleSnapshot
class
del snap
snap.close()
Deallocates memory associated with the snap object.
Not strictly necessary - will happen automatically when the snap variable goes
out of scope. Attempting to call methods of a closed snapshot raises an
EagleSnapshotClosedException
.
snap.select_region(xmin, xmax, ymin, ymax, zmin, zmax)
All grid cells overlapping the specified region are flagged to be read
in by subsequent read_dataset calls
. You can call select_region
multiple
times to make oddly shaped or disjoint selections.
If selected regions overlap or the same region is selected multiple times particles in these regions will still only be read in once.
Parameters
xmin
: the minimum x coordinate of the region to readxmax
: the maximum x coordinate of the region to readymin
: the minimum y coordinate of the region to readymax
: the maximum y coordinate of the region to readzmin
: the minimum z coordinate of the region to readzmax
: the maximum z coordinate of the region to read
snap.select_grid_cells(ixmin, ixmax, iymin, iymax, izmin, izmax)
All grid cells in the specified range of grid coordinates are flagged to
be read in by subsequent read_dataset
calls. You can call select_grid_cells
multiple times to make oddly shaped or disjoint selections.
The coordinates ixmin, ixmax etc are integer coordinates in the hash grid,
starting from zero. The maximum coordinate is (2**hashbits)-1
. The value of
hashbits is stored in snap.hashbits
.
If selected regions overlap or the same region is selected multiple times particles in these regions will still only be read in once.
Parameters
ixmin
: the minimum x coordinate of the region to readixmax
: the maximum x coordinate of the region to readiymin
: the minimum y coordinate of the region to readiymax
: the maximum y coordinate of the region to readizmin
: the minimum z coordinate of the region to readizmax
: the maximum z coordinate of the region to read
snap.select_rotated_region(centre, xvec, yvec, zvec, length)
Selects a box-shaped region rotated relative to the simulation box axes.
The [xyz]vec
parameters should be a set of mutually orthogonal unit vectors
defining the axes of the box to be selected.
Parameters
centre
: coordinate centre of the region to select, shape (3, )xvec
: selection box 'x' unit vector in simulation box coordinates, shape (3, )yvec
: selection box 'y' unit vector in simulation box coordinates, shape (3, )zvec
: selection box 'z' unit vector in simulation box coordinates, shape (3, )length
: selection box lengths (not half-length) along each axis, shape (3, )
n = snap.count_particles(itype)
This returns the number of particles of the specified type which will be
read by the next read_dataset
call. Note that only whole grid cells can
be read so some particles outside the selected region may be read in.
These are included in the count.
In python this routine is not usually needed, but it is included for compatibility with codes formerly using the read_eagle python wrapper.
Parameters
itype
: which particle type to count (integer,0-5
)
Return value The number of particles to be read in
file_index, file_offset = snap.get_particle_locations(itype)
This returns two arrays which each have one element for each selected
particle. file_index
contains the index of the file each particle is in.
file_offset
contains the position in the file, numbering from zero.
Parameters
itype
: which particle type to count (integer,0-5
)
Return value
file_index
- integer array with index of the file containing each particle
file_offset
- integer array with position of each particle in its file
data = snap.read_dataset(itype, name)
This reads in the specified dataset for all particles of type itype in the selected region(s). Use repeated calls to read in multiple datasets.
The type of array you get back reflects the type of the dataset in the file.
Parameters
itype
: which particle type to read (integer,0-5
)name
: the HDF5 name of the dataset, relative to thePartTypeX
group
Return value The contents of the dataset for the selected particles
data = snap.read_extra_dataset(itype, name, basename)
This is the same as read_dataset
, except that it reads from a set of
"auxiliary files" comprised of the same number of parts as the snapshot and
which contain additional datasets with the same sorting, distribution across
files, etc. as the datasets in the main snapshot files.
Parameters
itype
: which particle type to read (integer,0-5
)name
: the HDF5 name of the dataset, relative to thePartTypeX
groupbasename
: name (including path, if necessary) of the auxiliary files, omitting the.X.hdf5
portion
snap.clear_selection()
This clears the flags which specify which grid cells should be read in on the next read_dataset() call. If you've already read in a region and you want to read a different region you should call clear_selection() before calling select_region() again.
snap.split_selection(ThisTask, NTask)
This is for use in MPI programs to allow parallel I/O. When running under MPI all read_eagle routines are collective - they must be called on all processes.
The procedure to read a region in parallel is as follows:
- All processes open the snapshot
- All processes select the SAME region
- Call
split_selection
. This causes each processor to select a subset of the originally selected particles. - Call
read_dataset()
on all processors to read the required quantities
This results in all of the particles in the specified region being read in exactly once with the data spread across the MPI processes.
I have not made any attempt to use pyread_eagle with e.g. mpi4py or other MPI-ish implementations in python. If you succeed in this, please get in touch!
Parameters
ThisTask
: rank of this process in an MPI programNTask
: number of processes in an MPI program
snap.set_sampling_rate(rate)
If rate is set < 1 before reading datasets, a random subsample of particles will be read, with the fraction of particles read set by the rate. See the note on random subsamples (above).
Parameters
rate
: fraction of particles to be read (rate >= 1 for all particles)
dsets = snap.datasets(itype)
Returns a list of the available datasets for a given particle type. This can be iterated to read all datasets, for instance.
Parameters
itype
: which particle type to read (integer,0-5
)