
Generated Data Shape always 0 #9

Open
iamamiramine opened this issue Jul 23, 2024 · 8 comments

Comments

@iamamiramine

I am facing an issue when generating data using Tabula.

I trained Tabula on the following datasets:

  1. Census
  2. Fake Hotel Guests
  3. Adult
  4. Health
  5. News

However, when generating, the generation loop gets stuck because the generated data shape is always 0 (num_samples is always greater than gen_data.shape[0]).

I tried re-training and changing the max_length parameter in the sampling function, but neither helped.

Can you please help me figure out how to fix this issue?
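For context, the sampling loop behaves roughly like the sketch below (my own simplification of what I understand the generation to do, not the library's actual code): rows are generated and decoded, invalid ones are dropped, and if every row in a batch is dropped the loop never makes progress.

import pandas as pd

def sample_until_enough(generate_batch, num_samples):
    # generate_batch() stands in for one decode pass that returns a DataFrame of candidate rows
    gen_data = pd.DataFrame()
    while gen_data.shape[0] < num_samples:  # stuck here forever if every batch comes back empty
        batch = generate_batch()
        batch = batch.dropna()  # rows that could not be parsed or validated are dropped
        gen_data = pd.concat([gen_data, batch], ignore_index=True)
    return gen_data.head(num_samples)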

@zhao-zilong
Owner

Hi @iamamiramine, sorry that I only just saw your message. Did you solve it? One likely reason is that your max_length is too small, so the generation cannot produce one complete row of data.
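As a quick check, raise max_length when sampling. A minimal sketch (the import path and argument names follow the README example; the values are only illustrative):

from tabula import Tabula
import pandas as pd

data = pd.read_csv("train.csv")  # your training table
model = Tabula(llm="distilgpt2", batch_size=32, epochs=400)  # illustrative settings
model.fit(data)
synthetic = model.sample(n_samples=100, max_length=400)  # large enough for one fully encoded row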

@iamamiramine
Author

iamamiramine commented Oct 6, 2024

Hello, I tried changing the max_length parameter and it did not work.
Another thing to note is that the Fake Hotel Guests dataset consists of only 9 columns, so one row from this dataset is relatively short.

@omaralvarez

omaralvarez commented Nov 9, 2024

I am also having problems with this. I am using max_length=1024, the maximum (if I use more I get a CUDA error), and on this dataset I cannot get a single sample:

from imblearn.datasets import fetch_datasets

# Download imbalanced-learn's benchmark datasets and pick the 'sick' dataset
sick = fetch_datasets()['sick']
sick.data.shape
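(For completeness: the Bunch has to be turned into a DataFrame before fitting. A sketch of one way to do that, with made-up column names since the Bunch carries none:)

import pandas as pd

# fetch_datasets() returns Bunch objects holding plain numpy arrays without column names
df = pd.DataFrame(sick.data, columns=[f"f{i}" for i in range(sick.data.shape[1])])
df["target"] = sick.target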

@zhao-zilong
Owner

Hi @omaralvarez @iamamiramine

You do not need to set max_length as big as 1024. You can uncomment this part of the code to see the length of your encoded row:

# Use following print to observe encoded token sequence length
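A rough way to check the same thing outside the library (a sketch only: distilgpt2 is assumed as the underlying tokenizer, Tabula's exact textual encoding of a row may differ, and data stands for your training DataFrame, so treat the number as a ballpark):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
# Textualise one training row in the "column is value" style used by LLM-based tabular generators
row_text = ", ".join(f"{col} is {val}" for col, val in data.iloc[0].items())
print(len(tokenizer(row_text)["input_ids"]))  # max_length should comfortably exceed this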

Let me know if that helps.

@omaralvarez

Yes, I don't think it has to do with max_length. The issue in this case is that some numbers are always outside the requested ranges in the predicted dataframe, so they are always filtered out. I have tried switching temperature, k, and training epochs, to no avail.
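For reference, this is the kind of comparison that shows it (a sketch: raw_batch stands for whatever decoded dataframe you can capture before the filtering step, and df is the training data; both names are placeholders):

import pandas as pd

num_cols = df.select_dtypes("number").columns
report = pd.DataFrame({
    "train_min": df[num_cols].min(),
    "train_max": df[num_cols].max(),
    "gen_min": raw_batch[num_cols].min(),
    "gen_max": raw_batch[num_cols].max(),
})
print(report)  # generated values outside the training range are the ones being filtered out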

@xxxx-lzw

[quotes @iamamiramine's original issue description above]

May I ask if your problem has been solved? I'm experiencing this problem as well.

@tmacleod

This problem also occurs if you train with too few epochs. Train with more epochs and sampling speed improves.

@omaralvarez

In my case, it wouldn't work no matter how many epochs I used (I trained for a week on an 80GB A100).
