
Improved README and added instructions in benchmark file
RSKothari committed Apr 7, 2021
1 parent c8f2b12 commit c3d590b
Showing 6 changed files with 109 additions and 25 deletions.
3 changes: 3 additions & 0 deletions .vscode/settings.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
{
"python.pythonPath": "C:\\ProgramData\\Anaconda3\\python.exe"
}
56 changes: 47 additions & 9 deletions README.md
@@ -1,16 +1,27 @@

# Data2H5

This tool rapidly converts loose files scattered within any folder into a consolidated H5 file. This allows for faster read operations with lower memory requirements.
**Reasoning**: H5 files consolidate data into contiguous sectors on disk.

To learn more about how H5 files work, I refer the reader to this [fantastic article](https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/ch04.html).
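As a quick illustration of how H5 keys work (the file location and key names below are made up, not from this repository), data is written and read back by key:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location; any writable path works
path_h5 = os.path.join(tempfile.gettempdir(), 'example.h5')

# Write two small arrays under nested keys (keys act like paths inside the file)
with h5py.File(path_h5, 'w') as h5_obj:
    h5_obj['foo/image_0001'] = np.zeros((8, 8), dtype=np.uint8)
    h5_obj['foo/image_0002'] = np.ones((8, 8), dtype=np.uint8)

# Read one dataset back by its key
with h5py.File(path_h5, 'r') as h5_obj:
    datum = h5_obj['foo/image_0002'][:]  # [:] copies the values into memory

print(datum.shape)  # (8, 8)
```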

Features:
* H5 files speed up training
* Point and go - just give the paths!
* Utilizes all your cores to read data into H5
* Easy steps to incorporate H5 file into data loader
* Allows reading custom data formats by providing your own file reading function
* Allows complex file pruning by supplying your own extension matching routine

## Requirements

`conda install -c anaconda h5py`

`conda install -c conda-forge imageio`

## Commands

Suppose your loose files are spread across a folder. You can use this utility as follows:

`python converter.py --path_images=<PATH_TO_FOLDER> --path_output=<PATH_ENDING_WITH_H5_FILE> --ext=jpg`
@@ -21,7 +32,7 @@ For example:

## Operation

This script finds all files with the user specified extension within a folder. It uses the `os.walk` utility to find and read all valid data files. These data files can be extracted using their **relative path** from the folder they were extracted.
This script finds all files (images or otherwise) within a folder that match the user-specified extension, using the `os.walk` utility. These data files are then consolidated into a single H5 file. Each file can then be read directly from the H5 file using its **relative path**.
For example:

```
@@ -37,24 +48,51 @@ h5_obj = h5py.File(path_h5, mode='r')
data = h5_obj['foo/boo/goo/image_0001.jpg'][:]
```

Note that the trailing `[:]` reads `data` by value; the expression before `[:]` only returns a reference to the dataset on disk.
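A minimal sketch of the difference (hypothetical file name; assumes `h5py` and `numpy` are installed):

```python
import os
import tempfile

import h5py
import numpy as np

path_h5 = os.path.join(tempfile.gettempdir(), 'ref_vs_value.h5')
with h5py.File(path_h5, 'w') as h5_obj:
    h5_obj['image'] = np.arange(4)

h5_obj = h5py.File(path_h5, mode='r')
ref = h5_obj['image']      # a lazy handle into the file; nothing is read yet
data = h5_obj['image'][:]  # a NumPy array copied into memory

print(type(ref).__name__)   # Dataset
print(type(data).__name__)  # ndarray
h5_obj.close()
```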

## Advantages

Reading speed increases, lower memory consumption
* Easy to manage
* H5 files improve speed of reading operation
* Lowers memory consumption by leveraging lossless compression
* Partially loads data instead of holding the full dataset in RAM - convenient for large datasets
* Utilizes caching to further improve reading speeds when the same samples are read repeatedly
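The partial-loading point can be sketched as follows (dataset name and sizes are arbitrary): slicing an H5 dataset pulls only the requested rows off disk instead of the whole array:

```python
import os
import tempfile

import h5py
import numpy as np

path_h5 = os.path.join(tempfile.gettempdir(), 'partial.h5')
with h5py.File(path_h5, 'w') as h5_obj:
    # gzip is one of HDF5's built-in lossless compression filters
    h5_obj.create_dataset('big', data=np.zeros((10000, 64)),
                          compression='gzip', chunks=True)

with h5py.File(path_h5, 'r') as h5_obj:
    window = h5_obj['big'][:32]  # reads only 32 rows, not all 10000

print(window.shape)  # (32, 64)
```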

## Custom data types

You can specify a custom data format by specifying `--ext=fancy_ext`. For example:
You can specify a custom file extension by specifying `--ext=fancy_ext`. For example:

`python converter.py --path_images=<PATH_TO_FOLDER> --path_output=<PATH_ENDING_WITH_H5_FILE> --ext=json --custom_read_func`

You may then add your own custom reading logic in `my_functions.py` in the function `my_read`. To ensure the program reads your custom logic function, please be sure to add a flag `--custom_read_func` which indicates the same.
You may then add your own custom reading logic in `my_functions.py` in the function `my_read`. To make the program use your custom read function, add the flag `--custom_read_func`, which tells the script to skip the default reader.
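As one hypothetical example (JSON handling is not part of this repository), a `my_read` implementation could mirror the default reader's `(relative_path, data)` return contract:

```python
import json
import os
import tempfile


def my_read(path_sample):
    # Hypothetical custom reader: parse a JSON file and return the
    # path together with its contents, mirroring the default reader
    with open(path_sample, 'r') as fid:
        datum = json.load(fid)
    return path_sample, datum


# Quick demonstration with a throwaway JSON file
path_json = os.path.join(tempfile.gettempdir(), 'sample.json')
with open(path_json, 'w') as fid:
    json.dump({'label': 3}, fid)

key, datum = my_read(path_json)
print(datum)  # {'label': 3}
```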

## Data loader tricks
## Custom extension pruning!

Coming soon!
You can provide your own file extension matching function in `my_prune`; it receives the template string passed via the `--ext` flag. For example, to match complex file extensions such as `.FoO0345` against a template extension string `foo`, you can supply the following code as your custom prune function.

```
def my_prune(filename_str, ext_str):
    # Logic to verify if the extension type is present
    # within the filename
    return ext_str in filename_str.lower()
```
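For instance, reproducing that matching rule as a standalone sketch (illustrative file names), the `.FoO0345` case from above resolves as expected:

```python
def my_prune(filename_str, ext_str):
    # Case-insensitive check that the template extension appears
    # somewhere in the filename
    return ext_str in filename_str.lower()


print(my_prune('scan_01.FoO0345', 'foo'))  # True
print(my_prune('scan_01.jpg', 'foo'))      # False
```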

## Data loader setup

To use H5 files in your training data loader, please refer to `benchmark.py`. There are three easy steps to follow:

* Step 1. Generate a list of all files used during training in the `__init__` function.
```
with h5py.File(path_h5, 'r') as h5_obj:
    self.file_list = list(h5_obj.keys())  # Each key is the relative path to a file
```
* Step 2. Open the `H5` reader object **within** the `__getitem__` call. This creates a separate reader object for each individual worker.
```
if not hasattr(self, 'h5_obj'):
    self.h5_obj = h5py.File(self.path_h5, mode='r', swmr=True)
```
* Step 3. Add a safe closing operation for the H5 file.
```
def __del__(self):
    self.h5_obj.close()
```

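Put together, the three steps above can be sketched as a minimal map-style dataset (the class name and demo file are illustrative; in a real pipeline it would subclass `torch.utils.data.Dataset`):

```python
import os
import tempfile

import h5py
import numpy as np


class H5Dataset:
    def __init__(self, path_h5):
        self.path_h5 = path_h5
        # Step 1: list every key once, without keeping the file open
        with h5py.File(path_h5, 'r') as h5_obj:
            self.file_list = list(h5_obj.keys())

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        # Step 2: lazily open one reader per worker process
        if not hasattr(self, 'h5_obj'):
            self.h5_obj = h5py.File(self.path_h5, mode='r', swmr=True)
        return self.h5_obj[self.file_list[idx]][:]

    def __del__(self):
        # Step 3: close the file safely when the dataset is collected
        if hasattr(self, 'h5_obj'):
            self.h5_obj.close()


# Throwaway file to exercise the class; libver='latest' enables SWMR reads
path_h5 = os.path.join(tempfile.gettempdir(), 'loader_demo.h5')
with h5py.File(path_h5, 'w', libver='latest') as h5_obj:
    h5_obj['image_0001.jpg'] = np.zeros((4, 4))

dataset = H5Dataset(path_h5)
print(len(dataset), dataset[0].shape)  # 1 (4, 4)
```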
## Benchmarks

Coming soon!
10 changes: 4 additions & 6 deletions args_maker.py
@@ -22,12 +22,14 @@ def make_args():
                        help='specify png or jpg to select ext. default: jpg')
    parser.add_argument('--custom_read_func', action='store_true',
                        help='use custom read function?')
    parser.add_argument('--custom_prune_func', action='store_true',
                        help='use custom extension matching function')

    required_args = parser.add_argument_group('required named arguments')
    required_args.add_argument('--path_images', required=True,
    required_args.add_argument('--path_images', required=False,
                               help='abs path to image directory',
                               default='D:/Datasets/Gaze360/imgs')
    required_args.add_argument('--path_output', required=True,
    required_args.add_argument('--path_output', required=False,
                               help='abs path to output H5 file',
                               default='D:/exp.h5')
    args = parser.parse_args()
@@ -37,8 +39,4 @@ def make_args():
    args.path_images = os.path.abspath(args.path_images)
    args.path_output = os.path.abspath(args.path_output)

    # Delete and create a new H5 file
    h5_obj = h5py.File(args.path_output, 'w')
    h5_obj.close()

    return args
19 changes: 15 additions & 4 deletions benchmark.py
@@ -16,6 +16,7 @@


def join_path(path_left, path_right):
    # Join two paths
    return os.path.join(path_left, path_right)


@@ -26,11 +27,9 @@ def __init__(self, path_h5, path_folder, mode='H5'):
        self.path_folder = path_folder
        self.read_from_H5 = True if mode == 'H5' else False

        # Step #1: Generate a list of all images
        with h5py.File(path_h5, 'r') as h5_obj:
            self.file_list = list(h5_obj.keys())

    def __del__(self, ):
        self.h5_obj.close()

    def __len__(self, ):
        return len(self.file_list)
@@ -40,16 +39,27 @@ def __getitem__(self, idx):
        entry_str = self.file_list[idx]

        if self.read_from_H5:
            # Step #2: Create a H5 object within the __getitem__ call
            # This creates a H5 reader object for each worker.
            if not hasattr(self, 'h5_obj'):
                self.h5_obj = h5py.File(self.path_h5, mode='r', swmr=True)

            # Reading a datum from the H5 file
            datum = self.h5_obj[entry_str][:]
        else:
            # Read a datum from the physical location
            path_file = join_path(self.path_folder, entry_str)
            datum = cv2.imread(os.path.abspath(path_file))

        # An operation to facilitate benchmarking
        datum = cv2.resize(datum,
                           (224, 224),
                           interpolation=cv2.INTER_LANCZOS4)
        return datum

    # Step #3: Add a safe closing operation for when the loader is torn down
    def __del__(self, ):
        if hasattr(self, 'h5_obj'):
            self.h5_obj.close()


if __name__ == '__main__':
@@ -58,8 +68,9 @@ def __getitem__(self, idx):

    bench_obj = benchmark(args['path_output'], args['path_images'])
    loader = torch.utils.data.DataLoader(bench_obj,
                                         shuffle=True,
                                         batch_size=48,
                                         num_workers=4)
                                         num_workers=0)

    for epoch in range(3):
        time_elapsed = []
40 changes: 34 additions & 6 deletions converter.py
@@ -19,25 +19,35 @@ def join_path(path_left, path_right):


class capture_within_H5():
    '''
    A class to find all files with the given extension and store them
    into a H5 file.
    '''
    def __init__(self, args):
        self.args_dict = vars(args)
        # Initialize the converter and generate a list of valid files
        self.args_dict = args
        root_dir_file_list = self.create_tree()
        list_of_valid_datum = self.prune_tree(root_dir_file_list)
        self.list_of_valid_datum = self.generate_full_path(list_of_valid_datum)

    def generate_full_path(self, paths):
        # Given a list of os.walk tuples, append the root path
        # to all individual relative paths
        full_paths = []
        for path in paths:
            vals = [join_path(path[0], ele) for ele in path[1]]
            full_paths.append(vals)
        return list(itertools.chain(*full_paths))

    def create_tree(self):
        # Generate a tree of all files in a given directory
        return list(os.walk(self.args_dict['path_images']))

    def prune_tree(self, tree):
        # Prune a tree by only accepting files which match the
        # provided file extension
        list_of_valid_datum = []
        for (dir_, dirs, files) in tree:
        for (dir_, _, files) in tree:
            valid_files = self.prune_files(files)
            if any(valid_files):
                rel_root = os.path.relpath(dir_, self.args_dict['path_images'])
@@ -46,7 +56,12 @@ def prune_tree(self, tree):

    def prune_files(self, files):
        if any(files):
            return [fi for fi in files if fi.endswith(self.args_dict['ext'])]
            if self.args_dict['custom_prune_func']:
                return [fi for fi in files
                        if my_prune(fi, self.args_dict['ext'])]
            else:
                return [fi for fi in files
                        if self.default_prune(fi, self.args_dict['ext'])]
        else:
            return []

@@ -93,7 +108,6 @@ def read_write(self, ):
    def read_function(self, path_sample):

        if self.args_dict['custom_read_func']:
            from my_functions import my_read
            data = my_read(path_sample)
        else:
            data = self.default_reader(path_sample)
@@ -106,18 +120,32 @@ def read_function(self, path_sample):

            return True
        else:
            return data

    def default_reader(self, path_sample):
        # Read operation for jpg or png images
        path_image = join_path(self.args_dict['path_images'], path_sample)
        image = imageio.imread(path_image)
        return path_sample, image

    def default_prune(self, filename_str, ext_str):
        # Given an input file name, return True if it ends with the
        # given extension, else False
        return filename_str.endswith(ext_str)


if __name__ == '__main__':
    args = make_args()
    args = vars(make_args())
    obj = capture_within_H5(args)

    if args['custom_read_func']:
        from my_functions import my_read

    if args['custom_prune_func']:
        from my_functions import my_prune

    # %% Delete and create a new H5 file
    h5_obj = h5py.File(args['path_output'], 'w')
    h5_obj.close()

    # %% Begin reading and writing to H5
    obj.read_write()
6 changes: 6 additions & 0 deletions my_functions.py
@@ -12,5 +12,11 @@ def my_read(path_sample):
    return path_sample, datum


def my_prune(filename_str, ext_str):
    # Logic to verify if the extension type is present
    # within the filename
    return True


if __name__ == '__main__':
    print('Entry point script is converter.py. Follow instructions in README.')
