get_all_indices does not do what it says it does #734

Open
eliotgenton opened this issue Jul 24, 2024 · 1 comment

Comments

@eliotgenton

def _get_all_indices(self) -> List[int]:

I don't think this function does what its name says: it just returns the number of parquet files in a folder.

@RasmusOrsoe
Collaborator

@eliotgenton thanks. The selection argument in the ParquetDataset refers to chunked, pre-shuffled batches of data used for training. This is different from the SQLite dataset, where the argument specifies individual events: that data format provides fast random access to individual rows, making it possible to shuffle on the fly. As a result, the _get_all_indices methods differ. As you point out, the function in ParquetDataset returns the total number of batches (files) available in the directory specified by the user.

I think this is indeed the intended usage of the method, but we could add statements to make this distinction clearer.
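
To make the distinction concrete, here is a rough sketch of the two behaviours. It is not the actual implementation: `parquet_batch_indices` and `sqlite_event_indices` are hypothetical helper names, and the flat `*.parquet` directory layout and the table and column arguments are assumptions made only for illustration.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import List


def parquet_batch_indices(directory: str) -> List[int]:
    """Hypothetical helper: one index per pre-shuffled batch file.

    Assumes every batch is a single ``*.parquet`` file directly in
    ``directory``; the real on-disk layout may differ.
    """
    n_batches = len(list(Path(directory).glob("*.parquet")))
    return list(range(n_batches))


def sqlite_event_indices(db_path: str, table: str, index_column: str) -> List[int]:
    """Hypothetical helper: one index per individual event (row).

    ``table`` and ``index_column`` are illustrative arguments and are
    interpolated directly, so they must come from trusted input.
    """
    query = f'SELECT "{index_column}" FROM "{table}" ORDER BY "{index_column}"'
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(query).fetchall()
    return [row[0] for row in rows]
```

With three batch files the first helper returns three indices no matter how many events each file holds, while the second returns one entry per row of the table.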
