Writing Time - Feather, Parquet & HDF5 have the fastest write times
File Size - Parquet is the smallest uncompressed file; Parquet & HDF5 are the smallest compressed
Reading Time - Pickle & HDF5 have the fastest reading times
Pickle and Feather are intended for short-term storage. Pickle saves Python objects between work sessions and is therefore only supported by Python; Feather is meant for exchanging data between Python and R.
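These numbers can be checked on your own data with a rough timing loop; a minimal sketch, assuming pyarrow and PyTables are installed (the DataFrame and file names here are placeholders):
import os
import time

import numpy as np
import pandas as pd

# toy DataFrame as a stand-in; replace with your own data
df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=list("abcde"))

writers = {
    "data.csv": lambda p: df.to_csv(p, index=False),
    "data.pkl": lambda p: df.to_pickle(p),
    "data.parquet": lambda p: df.to_parquet(p, engine="pyarrow"),
    "data.feather": lambda p: df.to_feather(p),
    "data.h5": lambda p: df.to_hdf(p, key="data", format="fixed"),
}

for path, write in writers.items():
    start = time.perf_counter()
    write(path)                               # time the write
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(path) / 1e6     # resulting file size
    print(f"{path}: wrote in {elapsed:.2f}s, {size_mb:.1f} MB")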
- store categorical features as "category" instead of "object"/"string" (can drop tens of MB)
data["team"] = data["team"].astype("category")
data.info()
- downcast numerical attributes to smaller datatypes. uint8 & float32 use far fewer bytes, at the cost of range and precision (can drop hundreds of MB)
data["age"] = data["age"].astype("uint8")
data["win_prob"] = data["win_prob"].astype("float32")
- store binary attributes as boolean
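For example, assuming a 0/1 column such as "won_before" used in the read example below:
data["won_before"] = data["won_before"].astype("bool")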
- provide schema (column name & datatype) during input
sample = pd.read_csv("XYZ.csv", usecols=["team", "age", "won_before", "win_prob"])
dtypes = {
    "id1": "object",
    "id2": "object",
    "id3": "object",
    "id4": "object",
    "id5": "object",
    "id6": "object",
    "v1": "object",
    "v2": "object",
    "v3": "object",
}
data = pd.read_csv("path/file.csv", dtype=dtypes)
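To confirm how many MB these dtype changes actually save, check memory usage before and after (deep=True also counts the contents of object/string columns):
print(data.memory_usage(deep=True).sum() / 1e6, "MB")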
data.csv plain text, widely supported
# Reading
df = pd.read_csv(file_name,
                 dtype={...})
# Writing
df.to_csv(file_name,
          index=False,
          compression=...)  # None or "gzip"
data.pkl binary serialization of Python objects
# Reading
df = pd.read_pickle(file_name)
# Writing
df.to_pickle(file_name,
             compression=...)  # None or "gzip"
data.parquet columnar storage, uses the pyarrow engine
# Reading
df = pd.read_parquet(file_name)
# Writing
df.to_parquet(file_name,
              engine="pyarrow",
              compression=...)  # None or "gzip"
data.feather Arrow-based format for dataframes, interoperable between Python and R
# Reading
df = pd.read_feather(file_name)
# Writing
df.to_feather(file_name,
              compression=...)  # None or "zstd"
data.h5 HDF5 storage, supports a wide range of datatypes, efficient I/O
two format options:
- "fixed" - fast writes
- "table" - slower, but supports searching and subsetting
to use, install the PyTables package (tables)
conda install -c conda-forge pytables
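A minimal read/write sketch, assuming a DataFrame df with the "age" column from the earlier examples; data_columns makes "age" queryable, and where= filtering only works with format="table":
# Writing
df.to_hdf(file_name,
          key="data",             # name of the dataset inside the file
          format="table",         # or "fixed"
          data_columns=["age"])   # allow where= filtering on "age"
# Reading
df = pd.read_hdf(file_name, key="data")
# with format="table", rows can be filtered while reading
subset = pd.read_hdf(file_name, key="data", where="age > 30")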