Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export has duplicates #47

Open
uchajk opened this issue Feb 5, 2023 · 11 comments
Open

Export has duplicates #47

uchajk opened this issue Feb 5, 2023 · 11 comments

Comments

@uchajk
Copy link
Contributor

uchajk commented Feb 5, 2023

Export yields unexpected results. We see multiple duplicates in the exported excel file.
The problem stems from this line:

$query = $dataTable->getFilteredQuery()->lazy($chunkSize);

Notice that if the filtered query is not ordered by id (or any unique column), the order of the rows might differ between the calls.
I would suggest using lazyById, but then we would have to pass the name of the column we want to use for lazyById.

@yajra
Copy link
Owner

yajra commented Feb 6, 2023

Maybe we can use the value set in the ->setRowId() call?

However, I think doing so will be a change in behavior. Some of our users expect what they see on the screen would be in the same order when it was exported.

I am in favor of using lazyById but it should not be the default behavior. We can put it config I guess?

@uchajk
Copy link
Contributor Author

uchajk commented Feb 6, 2023

I was actually thinking the opposite...
When using chunking or lazy, you have to guarantee the order of rows by an unique column. If not, you take a risk of getting different order of results between the execution of the query, and therefore missing some rows and duplicating some.
In fact, getFilteredQuery() could be ordered by any column chosen by the user. I was going to suggest the following solution:

$query = $dataTable->getFilteredQuery()->reorder();
$query = $query->orderBy($someUniqueColumn)->lazy($chunkSize);

The $someUniqueColumn can come from the $dataTable object, or passed down from the job.
I understand that users expect order to be preserved when exporting, but I think accurate result takes precedence. Unless someone can come up with a better solution, I don't see how you can offer both.

@yajra
Copy link
Owner

yajra commented Feb 7, 2023

Can you provide snippets to reproduce the duplicates? I asked around my users and they haven't encountered any duplicates yet. Largest export so far is ~200k.

@uchajk
Copy link
Contributor Author

uchajk commented Feb 7, 2023

I'm not allowed to share any code snippets, but I've created a demo:
https://github.com/uchajk/duplicates

Be sure to run UserSeeder.
We are using pgsql.

if you try to export users, you will actually see the duplicates in the excel spreadsheet. Obviously, if you lower chunk size, you will see more duplicates.
Then if you try to sort by id and export again, in this case you don't see duplicates. This happens because when ordering by id, order is guaranteed across all executions of the filtered query when using lazy method. When sorting by "Updated At" which has the same values for all entries, order is different across the executions of the query.

@yajra
Copy link
Owner

yajra commented Feb 8, 2023

I tested your demo app using SQLite. I was able to filter 28 records and export yields to the same result. Filtered 1 and all 1k records, and got the expected results.

Might be pgsql specific issue? Will try to set up a pgsql later.

@uchajk
Copy link
Contributor Author

uchajk commented Feb 8, 2023

@yajra $chunkSize has to be smaller than total number of records in filteredQuery.
It's the different order between the calls of the query that results in duplicates.
If chunk size is 1k, exporting 1k entries will only execute the query 1 time and there is no possibility of duplicates at all.
If you were trying to export 6,752 entries with $chunkSize 1k, the query will get executed 7 times, each of the execution possibly having different order. Therefore, the entries that you processed in the first execution (0 to 999) might appear in the second execution (1000 to 1999) and you would be exporting these entries second time.
Again, this could only happen if the filteredQuery is not ordered by an unique column (e.g. Updated At).

I suggest running php artisan db:seed --class=UserSeeder

@yajra
Copy link
Owner

yajra commented Feb 8, 2023

The seeder does not work on SQLite and always hits a unique constraint issue.

I tried to export 7275 records and got the expected output.

    public function query(): Builder
    {
        return DB::table('users')
                 ->where('name', 'like', '%a%');
    }

UI

image

Excel

image

@uchajk
Copy link
Contributor Author

uchajk commented Feb 8, 2023

@github-actions
Copy link

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 11, 2023
@yajra
Copy link
Owner

yajra commented Mar 14, 2023

@uchajk maybe orderFixed option would help in this case? Not tested but maybe you can add a post order defaulted to ID.

@yajra yajra added for review and removed stale labels Mar 14, 2023
@uchajk
Copy link
Contributor Author

uchajk commented Mar 15, 2023

@uchajk maybe orderFixed option would help in this case? Not tested but maybe you can add a post order defaulted to ID.

Yeah sounds like it might. I'll need to find some time to test this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants