Going over all events once without a cache - speed issue #1633
Without seeing your file, I can't give a detailed explanation as to why this happens. But, roughly: When you pass Setting

What you probably want to do here is read a full row group and handle the laziness manually, e.g.

    import awkward as ak

    def iter_rows_lazy(file, **kwargs):
        # Open lazily only to find out how many row groups (partitions) there are
        array = ak.from_parquet(file, lazy=True, **kwargs)
        n_row_groups = len(ak.partitions(array))
        for i in range(n_row_groups):
            # Read one row group eagerly, then yield its rows one at a time
            yield from ak.from_parquet(file, row_groups=i, **kwargs)

    for row in iter_rows_lazy("Documents/Data"):
        do_something(row)  # placeholder for your per-event work

I've not tested this yet, so it might need some tweaks.
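The pattern in the snippet above does not depend on lazy arrays at all: it only needs the number of chunks and a way to load one chunk. A minimal, library-free sketch of that idea follows; `iter_rows_by_chunk` and `load_chunk` are hypothetical names introduced here for illustration, and the in-memory `fake_groups` list merely stands in for row groups on disk.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Row = TypeVar("Row")


def iter_rows_by_chunk(
    n_chunks: int,
    load_chunk: Callable[[int], Iterable[Row]],
) -> Iterator[Row]:
    """Yield rows one by one, loading a single chunk (e.g. one row group) at a time."""
    for i in range(n_chunks):
        chunk = load_chunk(i)  # only this chunk is referenced while its rows are yielded
        yield from chunk
        del chunk  # drop our reference before the next chunk is loaded


# Stand-in data for illustration; a real loader would read row group i from the file.
fake_groups = [[1, 2], [3], [4, 5, 6]]
loaded = []  # records which chunks have been requested so far


def load_fake(i: int) -> list:
    loaded.append(i)
    return fake_groups[i]


rows = iter_rows_by_chunk(len(fake_groups), load_fake)
first = next(rows)
loaded_after_first_row = list(loaded)  # only chunk 0 has been requested at this point
remaining = list(rows)
```

With Awkward Array, `load_chunk` would presumably wrap the `ak.from_parquet(file, row_groups=i, ...)` call from the snippet above. Memory use is then bounded by the largest row group rather than by a cache that keeps growing.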
Thanks for answering this, @agoose77! The one point I want to add is that Awkward 1.x's lazy arrays are being removed: this is the biggest change from 1.x to 2.x. The v2 Also, dask-awkward runs on v2, so you'd be replacing any workflows like, "Load lazy arrays and try to keep them from accidentally reading the whole file while performing computations on a subset of the file," with workflows like, "Load Dask arrays, write computation, then call
Hey!
I'm having a problem with the speed of reading events from a Parquet dataset.
Essentially, I have a large dataset in Parquet files that I need to loop over so that each event can be saved individually for PyTorch.
Example data point:
ak.Record of :
The data is too large to be loaded all at once, so I read it lazily:
However, with this, the cache keeps growing until the process crashes, so instead I try:
But now, even though I'm still only reading each datapoint once, it runs far slower than without setting lazy_cache=None. Is there any reason for this to be so much slower even though I am only reading each data point once?
Does anyone have any advice on how I should proceed?
I'm guessing there might be a way to limit the amount that is cached, so that it still runs quickly without using more and more RAM. However, I'm still a bit confused as to why the speed should differ between the two scenarios.
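For reference, "limiting the amount that is cached" can be pictured as a mapping that evicts its least recently used entries once it exceeds a fixed size. The sketch below uses only the standard library; `BoundedCache` is a hypothetical name, and it bounds the number of entries rather than the number of bytes. Whether such an object can be passed as `lazy_cache` depends on that argument accepting a `MutableMapping`, which is an assumption not tested here.

```python
from collections import OrderedDict
from collections.abc import MutableMapping


class BoundedCache(MutableMapping):
    """Hypothetical LRU mapping that keeps at most `max_items` entries."""

    def __init__(self, max_items):
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self._max_items = max_items
        self._data = OrderedDict()

    def __getitem__(self, key):
        value = self._data[key]      # raises KeyError on a cache miss
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def __setitem__(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max_items:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_items=2)
cache["a"] = 1
cache["b"] = 2
_ = cache["a"]  # touching "a" makes "b" the least recently used entry
cache["c"] = 3  # exceeds the limit, so "b" is evicted
```

A cache like this trades memory for repeated reads: anything evicted has to be loaded again if it is touched later, so the size limit needs to cover at least the data that one pass over a chunk actually uses.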
Any help would be extremely appreciated!
Thanks!