- feat: support pyspark < 3 when distributing image-to-dataset job
- feat: support min image size + max aspect ratio (@borisdayma)
- feat: allow encoding in different formats (thanks @borisdayma)
- Fix error message for incorrect input format
- Bug fix: shard id was incorrect when resuming (thanks @lxj616)
- Implement shard retrying
- Validate input and output format
- Implement incremental mode
- use pyarrow in the reader to make it much faster
- use fsspec 2022.1.0 for python3.6
- fix fsspec version
- fix fsspec version
- add gcsfs to pex
- buffered writer fix: release ram more often
- feat: accept numpy arrays (thanks @borisdayma)
- add tfrecord output format (thanks @borisdayma)
- fix an interaction between md5 and exif option
- fix dependency ranges
- use exifread-nocycle to avoid cycle in exifread
- retry whole sharding if it fails
- retry writing the shard in reader in case of error
- small fix for logger and continuing
- use time instead of perf_counter to measure shard duration
- make metadata writer much faster by building the schema in the downloader instead of guessing it
- add a new option to disable re-encoding
- hide opencv warning
- force one thread for opencv
- make total logger start time the minimum of workers start time
- add s3fs into the released pex for convenience
- make sharding faster on high latency fs by using a thread pool
- fix logger on s3: do not use listing caching in logger
- add tutorial on how to set up a spark cluster and use it for distributed img2dataset
- better aws s3 support:
  - initialize logger fs in subprocess to avoid moving fs over a fork()
  - use spawn instead of fork method
- make total logging more intuitive and convenient by logging every worker return
- fix release regex
- fix fsspec support by using tmp_dir in main.py
- fix pex creation
- add option not to write
- add try/except around json.load in the logger
- prevent error if logger sync is called when no call has been done
- Add a build-pex target in Makefile and CI
- decrease default log interval to 5s
- add option to retry http download
- add original_width by default for a consistent schema
- fix relative path handling
- Add multi distributor support : multiprocessing and pyspark
- make the reader emit file paths instead of samples
- use a logger process to make logging distribution friendly, also save json stat files next to folder/tar files
- Use fsspec to support all filesystems
- implement md5 of images feature
- fix null convert in writer
- add parquet writer
- make reader memory efficient by using feather files
- large refactoring of the whole code in submodules
- Enhance image resize processing (especially for downscaling) (@rwightman)
- handle transparency (thanks @borisdayma)
- add json input file support
- Add support for .tsv.gz files (thanks @robvanvolt)
- raise clean exception on image decoding error
- remove the \n in urls for txt inputs
- save the error message when resizing fails in metadata
- add type hints to download function
- use semaphores to decrease memory usage
- fix an issue with resize_mode "no"
- bring back the file-listing optimization: sorted() is eager, so consuming the iterator returned by iglob is safe
- revert the last commit: it could double-iterate an iterator, which can cause surprising behavior
- optimize listing files (thanks @Skylion)
- fix a bug affecting downloading multiple files
- ensure sharded_images_to_dl is removed from memory at the end of downloading a file
- solve the stemming issue: make keys unique
- Save an empty caption when the caption is None instead of omitting the caption file
- fix for the new logging feature when cleaning the status dict
- wandb support is back
- support for python 3.6
- convert caption to str before writing
- add back timeout properly
- fixes
- revert wandb for now, code is too complex and there are issues
- feat: custom timeout (thanks @borisdayma)
- feat: support wandb (thanks @borisdayma)
- use albumentations for resizing (thanks @borisdayma)
- depend on pyyaml to be able to use the last webdataset
- feat: handle tsv + center crop (thanks @borisdayma)
- increase stability by closing the pool and tarwriter explicitly
- improve memory usage
- glob only input files of the right ext
- add a save_additional_columns option
- Multiple file support
- Status dataframe
- Uses a resizing method less prone to aliasing (thanks @skylion)
- multi processing + multi threading
- add webdataset support and benchmarks
- supports reading as parquet and csv
- fix cli
- add image resizing mode
- fixes
- it works
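Several entries above mention retrying (http download retries, shard retrying, retrying shard writes). A minimal sketch of such a retry-with-backoff helper, assuming a caller-supplied fetch function; `download_with_retry`, `fetch`, and `backoff` are illustrative names, not the project's actual API:

```python
import time


def download_with_retry(fetch, url, retries=3, backoff=0.1):
    """Call fetch(url), retrying up to `retries` extra times on any exception.

    Sleeps backoff * 2**attempt between attempts (exponential backoff),
    and re-raises the last error if every attempt fails.
    """
    last_err = None
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as err:  # retry on any failure, remember the last one
            last_err = err
            if attempt < retries:
                time.sleep(backoff * 2 ** attempt)
    raise last_err
```

The same pattern applies whether the retried unit is a single http download or a whole shard.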
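The "md5 of images" feature above presumably hashes the raw encoded bytes so downstream users can deduplicate or verify samples. A minimal sketch; `image_md5` is a hypothetical helper name, not the project's function:

```python
import hashlib


def image_md5(image_bytes):
    # Hash the encoded image bytes as-is (no decoding), so the digest is
    # stable across runs and usable as a deduplication key in metadata.
    return hashlib.md5(image_bytes).hexdigest()
```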
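The "use semaphores to decrease memory usage" entry refers to bounding how much in-flight work can queue up at once. One common pattern, sketched here with illustrative names (`bounded_map` and `max_in_flight` are not the project's API):

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def bounded_map(fn, items, max_in_flight=4):
    """Apply fn over items while keeping at most max_in_flight tasks queued.

    The semaphore blocks the producer loop until a worker slot frees up,
    so pending inputs never accumulate unboundedly in memory.
    """
    sem = threading.Semaphore(max_in_flight)
    futures = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for item in items:
            sem.acquire()  # wait for a free slot before queueing more work
            future = pool.submit(fn, item)
            future.add_done_callback(lambda _f: sem.release())
            futures.append(future)
        return [f.result() for f in futures]
```

Without the semaphore, a fast producer reading urls would submit every task immediately and hold all of them in memory at once.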