Loading large datasets that do not fit into RAM #143
Hello @nicolasj92, I do think this would be an interesting feature, especially since deep learning easily allows batch training. One of my concerns is to keep the library easy to use, so I'm not sure that directly exposing the dataloaders would be the best way, but we could easily provide support for common dataset types (like parquet files, HDF5 files, or some others?) just by providing the path. What is your current need?
I agree this library has been very quick & easy to get started with for me - but I don't think expanding the API to accept either array-like or generator-like inputs need have that much effect on usability, right? E.g. it would be a familiar pattern to those who've worked with Keras before...

Trying to handle the end-to-end file loading within the library could raise all sorts of edge cases like "I have a CSV file but it's in ANSI/windows-1255, rather than UTF-8" - whereas just accepting loaders/generators keeps that complexity out of the library and keeps the interface simple to understand but powerful.

One way I'd like to be able to use the library, if possible, is with Amazon SageMaker's Pipe Mode to speed up my training job start-up time... It's almost like local files, except the file can only be read sequentially through exactly once - and whenever you need to read through the data again (e.g. another epoch) you move on to the next copy, e.g. starting with ...

I would for sure not blame you for not wanting to add either this kind of file handling complexity or a SageMaker-specific extension to the API... But if ...
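As a rough illustration of the loader/generator idea above (this is not something pytorch-tabnet currently accepts, and the file name, column name, and chunk size are made up), a plain Python generator yielding numpy batches from a chunked CSV could look like this:

```python
import numpy as np
import pandas as pd

def batch_generator(csv_path, target_col, chunksize=20_000):
    """Yield (X, y) numpy batches from a CSV too large to load in one go."""
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        y = chunk.pop(target_col).values
        X = chunk.values.astype(np.float32)
        yield X, y

# A Keras-style fit() that accepted generators could then consume:
# for X_batch, y_batch in batch_generator("big_dataset.csv", "target"):
#     ...  # feed one batch at a time to the model
```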
Hi Optimox, thanks for the quick reply! I agree that ease-of-use should be a major factor. A solution could be to provide a set of standard dataloaders (e.g. np.array, HDF5 ...) but also to allow the API user to code custom dataloaders that inherit from a default dataloader class. What do you think?
I have to take a closer look at the problem; currently, with very few changes, the code could accept ... instead of .... I will also take a closer look to see how easily we could expose the dataloaders.
@nicolasj92 @athewsey is anyone willing to discuss more in depth how this could be done?
@Optimox thanks for your great plan! Would you please release a data loader example for batch training once this request is done?
Sure, I would be interested and willing to contribute. Sorry for the late reply.
+1 for this. I would also be happy with any hack that makes this possible.
Hello @Optimox, I'm sorry to bring this topic back. Similar to @nicolasj92's problem, I need to train the model batch by batch. To do this, I wrote the following code and started the training phase. However, I've noticed that when the current batch iteration is completed, instead of the loss decreasing, the model tends to start all over again, as also shown in the attachment.
I've also tried to save and reload the model on every batch iteration, but it gave me the same results. The distribution over the dataset and its batches is very balanced and homogeneous, and it has been tested with different methods using the same training approach. Do you have any suggestions to cope with this problem? Thanks in advance, Fırat
@bytesandwines what's your on-disk data format? What is the size of your dataset? How many rows and columns? About the loss going back up, it might be the learning rate starting from a high value again at every new call to `fit`. I can start working on a way to train directly from a parquet or hdf5 file, but I'm not sure it will answer everyone's needs.
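If the restarting learning rate is the culprit, one possible workaround (a minimal sketch, assuming the standard `TabNetClassifier` constructor arguments `optimizer_fn` / `optimizer_params`; the value is only illustrative) is to construct the model with a small fixed learning rate so that each call to `fit` starts from it:

```python
import torch
from pytorch_tabnet.tab_model import TabNetClassifier

# Build the model once with a deliberately small learning rate so that every
# subsequent fit() call on a new chunk starts from the same modest LR.
clf = TabNetClassifier(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-3),  # illustrative value, tune for your data
)
```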
Hi @Optimox, thank you for your quick response. My code is running on a portable SSD with an exFAT disk format. The iterator code is given below; it only processes one CSV file batch by batch, where the dataset consists of:
I've tested it with the following LR values: 0.001, 0.01 and 0.05, but nothing changed in the behaviour. Moreover, with learning rate 0.05 the loss jumps from 0.046 to 0.25 almost every time I call the fit method with a new batch. Do you think any pre-check or setup method could cause the model weights to be reset (not all the weights, but the higher layers; I read about that kind of limitation on a different classification task when I was working on another project)?
Can you please update your code above by inserting it between two triple backquotes like this ```? It makes things easier to read. I think your problem is different from the one of this thread, which is: how do you train a single model with a large dataset that can't fit into RAM? It seems that you are trying to train the same model by batch on different datasets and not on parts of a single large dataset. This is not feasible: you can't train the same model with 200 features and then 400 features, they must be different models. I think your code doesn't reuse the previous model and simply creates a new TabNet model which is trained on a new dataset each time. There is no solution to your problem to my knowledge, you can't simply add features to an existing model and retrain it as if nothing changed.
Hi @Optimox, my previous code sample has been edited according to your suggestion, thank you for that information. Sorry for misleading you, I am facing the same problem (training with a large dataset that can't fit into RAM). The column sizes with different values belong to different experiments (let's assume that we always have 10k). In my experiment, I iterate through 800k samples by taking a 20k batch in each step, then use the fit method to train my model. I always have the same TabNet model: I only create an instance before entering the loop and do not call any other methods until the loop ends. (Note that I also edited my first comment and the code sample in it; the whole iteration code is given there.)
@bytesandwines ok so it seems that you are trying to do a proper training by batch. I just gave it a try with the census example notebook by replacing the single `fit` call with a loop over chunks.
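Not the exact code from that comment, but a minimal sketch of such a loop, assuming the census example's `X_train`, `y_train`, `X_valid`, `y_valid` arrays are already in memory and that repeated `fit` calls continue training the same weights:

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

clf = TabNetClassifier()

# Split the in-memory training set into chunks and call fit() once per chunk,
# so the same model keeps training on each successive piece of data.
n_chunks = 4  # illustrative value
for X_chunk, y_chunk in zip(np.array_split(X_train, n_chunks),
                            np.array_split(y_train, n_chunks)):
    clf.fit(
        X_chunk, y_chunk,
        eval_set=[(X_valid, y_valid)],
        max_epochs=10,  # a few epochs per chunk, illustrative
        # warm_start=True,  # recent pytorch-tabnet versions expose this to keep weights across fit() calls
    )
```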
And here are the scores I get:
So everything seems to be running as expected. Could you share your training loop?
Hello @Optimox, I'm still testing my code with your suggestion, sorry for the late response. At first, I just copied your fit example and only changed the loop part to use my train_iterator, trained the model, and it seemed to work. Then I started to comment out parameters to detect which parameter might produce this problem. I will let you know if I find a clue or confirm that it fixes my problem. I do appreciate your help.
@Optimox I have a large dataset (100 chunks in parquet) - each chunk fits in memory but not the entire dataset. What would be the best way to train a TabNet model on such a dataset? From your example above, it seems like we can call `fit` multiple times on successive chunks. For the census example, this seems to be working:
You can indeed train with large chunks that will fit into your memory. As you can see this is not the most elegant solution, and you'll probably need to decay the learning rate by hand in your first for loop, but I think it should work. Reading directly from a pointer to a large parquet file would be better, but it's currently not available.
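A minimal sketch of that chunk-by-chunk approach for a folder of parquet files, with the hand-rolled learning-rate decay mentioned above; the file layout, the `target` column name, the decay factor, and the per-chunk epoch count are all assumptions:

```python
import glob
import pandas as pd
from pytorch_tabnet.tab_model import TabNetClassifier

chunk_paths = sorted(glob.glob("data/train_chunk_*.parquet"))  # hypothetical file layout

lr = 2e-2
clf = TabNetClassifier(optimizer_params=dict(lr=lr))

for sweep in range(3):                    # outer passes over all chunks
    for path in chunk_paths:
        df = pd.read_parquet(path)        # one chunk fits in memory
        y = df.pop("target").values       # "target" is a placeholder column name
        X = df.values
        clf.fit(X, y, max_epochs=5)       # a few epochs per chunk, illustrative
        del df, X, y                      # free memory before loading the next chunk
    # hand-rolled decay: lower the LR that fit() uses when it rebuilds the optimizer
    # on the next call (exact behaviour may depend on the pytorch-tabnet version)
    lr *= 0.9
    clf.optimizer_params = dict(lr=lr)
```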
[1] Supporting a custom Dataset and DataLoader would fix this issue easily. But it seems that this isn't supported at the moment.
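To make the custom Dataset / DataLoader idea concrete, here is a hedged sketch (this is not part of pytorch-tabnet's current API) of a Dataset that reads rows lazily from an HDF5 file instead of loading everything into RAM; the file name and dataset keys are made up:

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class LazyHDF5Dataset(Dataset):
    """Reads individual rows from an HDF5 file on demand."""

    def __init__(self, h5_path, x_key="features", y_key="labels"):
        self.h5_path = h5_path
        self.x_key = x_key
        self.y_key = y_key
        with h5py.File(h5_path, "r") as f:
            self.length = f[x_key].shape[0]
        self._file = None  # opened lazily so each DataLoader worker gets its own handle

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.h5_path, "r")
        x = torch.tensor(self._file[self.x_key][idx], dtype=torch.float32)
        y = torch.tensor(self._file[self.y_key][idx], dtype=torch.long)
        return x, y

# loader = DataLoader(LazyHDF5Dataset("big_dataset.h5"), batch_size=1024, shuffle=True)
```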
Feature request
What is the expected behavior?
More flexibility in supplying data to the fit() function. How can I use this on a dataset that does not fit into the RAM of my PC? Can I supply my own dataloader function?
What is the motivation or use case for adding/changing the behavior?
The dataset I want to use TabNet on does not fit into my RAM.
How should this be implemented in your opinion?
E.g. as a modular dataloader
Are you willing to work on this yourself?
yes