[Question] How to produce dataset with millions of data in batches #677

Open
Q-Bug4 opened this issue Apr 3, 2024 · 4 comments

Comments

Q-Bug4 commented Apr 3, 2024

Hi, we love using Hollow; it works very well for us.

Is there a proper way to produce data in batches? For example, I have 10 million objects to produce, and I want to split them into 10 parts and produce 1 million objects at a time. I need to produce data in batches because my VM does not have enough memory to hold 10 million objects.

I am using Incremental with withNumStatesBetweenSnapshots so that a snapshot is published only at the beginning and at the end, which makes it behave like batched production. But I ran into a problem: sometimes the Incremental did not publish the dataset, because some batches did not change the dataset.
I have forked hollow-reference-implementation and written 2 test cases to show what we are looking for. You can check my test cases: ProducerTest
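
The batching itself, independent of Hollow, can be sketched as below. This is only an illustration under stated assumptions: `BatchFeeder` and `forEachBatch` are hypothetical names, the source is assumed to be a lazy iterator (for example a database cursor) so that only one batch is held in memory at a time, and the `sink` is where one incremental cycle per batch would presumably run.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.LongStream;

// Hypothetical helper, not part of Hollow.
final class BatchFeeder {
    private BatchFeeder() {}

    /**
     * Drains source in chunks of at most batchSize and hands each chunk to sink.
     * Returns the number of chunks delivered.
     */
    static <T> int forEachBatch(Iterator<T> source, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);
                batches++;
                // Start a fresh list so the previous chunk can be garbage collected.
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            // Deliver the final, possibly smaller, chunk.
            sink.accept(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in source: 25 ids generated lazily instead of 10 million real objects.
        Iterator<Long> ids = LongStream.range(0, 25).boxed().iterator();
        int n = forEachBatch(ids, 10, b -> System.out.println("batch of " + b.size()));
        System.out.println(n + " batches");
    }
}
```

Handing the sink a fresh list for every chunk means the sink may keep or discard it freely; memory stays bounded by the batch size only if the sink does not retain earlier chunks.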

@prasoon16081994

@Q-Bug4 Were you able to find a solution?

shyam4u commented Jul 23, 2024

@prasoon16081994 @Q-Bug4 Were you able to find a solution for this?

Q-Bug4 commented Jul 23, 2024

@prasoon16081994 @shyam4u Not yet.
For now we are still using Incremental to approximate it. The "dataset not changed" case can be detected by checking the version returned by each produce call. Example below:

List<Data> datas = getMillionDatas();
Incremental incremental = getIncremental();

long lastVersion = 0;

for (...) {
    // produce(...) stands for one incremental cycle over the current batch.
    long version = incremental.produce(...);
    // If the version did not change, this batch did not change the dataset.
    if (version == lastVersion) {
        throw new RuntimeException("dataset not changed!");
    }
    lastVersion = version;
}
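
A self-contained variant of the same check, as a rough sketch: instead of throwing, it records which batches produced no new version. `NoChangeDetector` and `unchangedBatches` are hypothetical names, and `runCycle` is a stand-in for whatever call runs one incremental cycle and returns the resulting version. It relies on the behavior described above, namely that an unchanged dataset yields the same version as the previous cycle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical helper, not part of Hollow.
final class NoChangeDetector {
    private NoChangeDetector() {}

    /**
     * Runs one cycle per batch via runCycle and returns the indexes of the
     * batches whose cycle returned the same version as the previous cycle.
     */
    static <B> List<Integer> unchangedBatches(List<B> batches, ToLongFunction<B> runCycle) {
        List<Integer> unchanged = new ArrayList<>();
        Long lastVersion = null; // null means no cycle has run yet
        for (int i = 0; i < batches.size(); i++) {
            long version = runCycle.applyAsLong(batches.get(i));
            if (lastVersion != null && version == lastVersion) {
                // Same version as before: this batch did not change the dataset.
                unchanged.add(i);
            }
            lastVersion = version;
        }
        return unchanged;
    }
}
```

Using `null` rather than `0` for the initial value avoids assuming that `0` can never be a real version.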

@prasoon16081994

@Q-Bug4
Hey, I finally developed a halfway decent understanding of Hollow.
Isn't it the correct behavior to not write a delta if the dataset didn't change?

Perhaps, if each of the records is needed, a field that is unique across all records could be added to the record and marked as the primary key.
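
To illustrate why a unique key matters, here is a toy model (not Hollow code) of a store that addresses records by primary key. It assumes, as the suggestion above implies, that a record re-added under an existing key with identical content leaves the state unchanged. `KeyedStore` and `addOrModify` are hypothetical names chosen for this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Toy model of a key-addressed dataset; an assumption-laden illustration only.
final class KeyedStore {
    private final Map<Long, String> byKey = new HashMap<>();

    /** Adds or replaces the record for key; returns true only if the stored state changed. */
    boolean addOrModify(long key, String payload) {
        Objects.requireNonNull(payload, "payload");
        String previous = byKey.put(key, payload);
        return !payload.equals(previous);
    }

    int size() {
        return byKey.size();
    }

    public static void main(String[] args) {
        KeyedStore store = new KeyedStore();
        System.out.println(store.addOrModify(1L, "x")); // new key: state changes
        System.out.println(store.addOrModify(1L, "x")); // same key, same content: no change
        System.out.println(store.addOrModify(2L, "x")); // unique key: state changes again
    }
}
```

In this model a batch whose records repeat existing keys with identical content changes nothing, while records carrying unique keys always add to the state.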
