[BUG] metric numDeletedRows missing in Delta log when DELETING complete partition #1423

Closed
1 of 3 tasks
keen85 opened this issue Oct 8, 2022 · 7 comments
keen85 commented Oct 8, 2022

Describe the problem

When a DELETE operation runs on a Delta table, operational metrics are added to the Delta log / table history (attribute operationMetrics), such as the number of rows deleted (numDeletedRows) and the number of files added or removed (numAddedFiles, numRemovedFiles).
See: https://docs.delta.io/latest/delta-utility.html#operation-metrics-keys

However, I noticed that when a complete partition of a partitioned table is deleted via its partition key, some of those metrics are missing, including the central metric numDeletedRows, which reports how many rows were deleted.

Steps to reproduce

CREATE OR REPLACE TABLE TestDeletePartitioned (
  id bigint,
  part string
)
USING DELTA
PARTITIONED BY (part)
;

INSERT INTO TestDeletePartitioned (id, part) values (1,'a'),(2,'a'),(3,'b'),(4,'b'),(5,'c'),(6,'c'),(7,'d'),(8,'d');

DELETE FROM TestDeletePartitioned WHERE id = 1;                 /* only one row is deleted from partition part=a; one row remains in part=a */
DELETE FROM TestDeletePartitioned WHERE id IN (3, 4);           /* two rows are deleted from partition part=b, which effectively deletes the whole partition part=b */
DELETE FROM TestDeletePartitioned WHERE part = 'c';             /* complete partition part=c is deleted */
DELETE FROM TestDeletePartitioned WHERE id = 7 AND part = 'd';  /* only one row is deleted from partition part=d; one row remains in part=d */

DESC HISTORY TestDeletePartitioned;

Observed results

When only the partition key is used in the DELETE condition, the resulting entry in the Delta log does not contain all the operational metrics.
[Image: screenshot of the observed results (not preserved)]

Expected results

I'd like to see the numDeletedRows metric in the log also when partitions are deleted.

Environment information

  • Delta Lake version: 1.2.1.4
  • Spark version: 3.2.2.5.0
  • Scala version: 2.12.15

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
  • No. I cannot contribute a bug fix at this time.
@keen85 keen85 added the bug Something isn't working label Oct 8, 2022
@sherlockbeard
Contributor

// Adjust for deletes at partition boundaries. Deletes at partition boundaries is a metadata
// operation, therefore we don't actually have any information around how many rows were deleted
// While this info may exist in the file statistics, it's not guaranteed that we have these
// statistics. To avoid any performance regressions, we currently just return a -1 in such cases
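For anyone reading the table history programmatically, the row count is best treated as optional. The following is a minimal sketch, not part of Delta Lake: the helper name `deleted_rows` is hypothetical, and it assumes that `operationMetrics` arrives as a string-to-string map and that both a missing key and a negative sentinel such as -1 mean "unknown".

```python
from typing import Mapping, Optional


def deleted_rows(operation_metrics: Mapping[str, str]) -> Optional[int]:
    """Return numDeletedRows, or None when the count is unknown.

    Hypothetical helper: assumes metric values are strings, and treats both
    a missing key and a negative sentinel such as -1 as "unknown".
    """
    raw = operation_metrics.get("numDeletedRows")
    if raw is None:
        return None
    count = int(raw)
    return count if count >= 0 else None


# Row-level delete: the metric is present.
print(deleted_rows({"numRemovedFiles": "1", "numDeletedRows": "1"}))  # prints 1
# Partition-level delete: the metric is absent.
print(deleted_rows({"numRemovedFiles": "1"}))  # prints None
```

Returning `None` rather than 0 keeps "nothing was deleted" distinguishable from "the count was not recorded".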

@keen85
Author

keen85 commented Oct 9, 2022

Thanks @sherlockbeard! That makes sense. I'll propose a change in the documentation so this is reflected there as well.

So, if I want to know the number of deleted rows, I'll have to perform a count() with the same predicate right before the delete.
Or is there a more efficient approach?
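The count-then-delete workaround could look roughly like this. It is a sketch under assumptions: `count_then_delete` is a hypothetical helper, `spark` is assumed to be a SparkSession with Delta Lake configured, and the table name and predicate are interpolated into the SQL verbatim, so they must come from trusted input.

```python
def count_then_delete(spark, table: str, predicate: str) -> int:
    """Count the rows matching `predicate`, delete them, and return the count.

    Hypothetical helper. The count and the delete run as two separate
    statements, so a concurrent write between them can make the returned
    number differ from what was actually deleted.
    """
    matched = spark.sql(
        f"SELECT COUNT(*) AS n FROM {table} WHERE {predicate}"
    ).first()["n"]
    spark.sql(f"DELETE FROM {table} WHERE {predicate}")
    return matched
```

For the reproduction above, the call would be `count_then_delete(spark, "TestDeletePartitioned", "part = 'c'")`.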

EDIT: I just realized that the documentation is not on GitHub yet (#1307). Would you happen to know how changes to the documentation can be made?

@sherlockbeard
Contributor

sherlockbeard commented Oct 9, 2022

I am also not sure about it, but maybe https://github.com/delta-io/website?

@keen85
Author

keen85 commented Oct 10, 2022

@sherlockbeard I did not find the source for the relevant page (https://docs.delta.io/latest/delta-utility.html#operation-metrics-keys) on https://github.com/delta-io/website.

@allisonport-db or @zsxwing, can you help out here? Is the source for https://docs.delta.io/latest/delta-utility.html#operation-metrics-keys already in Git? I think it is not yet, but it is planned (#1307).

In the meantime, how can we propose changes to the documentation?

@zsxwing
Member

zsxwing commented Oct 10, 2022

@keen85 we are working on migrating our docs to https://github.com/delta-io/website. We will post an update here when it's done.

@rahulsmahadev rahulsmahadev self-assigned this Oct 10, 2022
@rahulsmahadev
Collaborator

This should be solved by 2118e64 if the table has stats.
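Why the stats matter can be illustrated with a small sketch. This shows the idea only and is not the code in 2118e64: it assumes each removed file carries the optional per-file `stats` JSON string with a `numRecords` field, and that the total is unknown as soon as one file lacks it. The name `rows_in_files` is hypothetical.

```python
import json
from typing import Iterable, Optional


def rows_in_files(stats_per_file: Iterable[Optional[str]]) -> Optional[int]:
    """Sum numRecords over the per-file stats JSON strings.

    Returns None if any file has no stats or no numRecords field, because
    the total cannot be known without reading the data.
    """
    total = 0
    for stats in stats_per_file:
        if not stats:
            return None
        num_records = json.loads(stats).get("numRecords")
        if num_records is None:
            return None
        total += num_records
    return total


print(rows_in_files(['{"numRecords": 1}', '{"numRecords": 1}']))  # prints 2
print(rows_in_files(['{"numRecords": 1}', None]))  # prints None
```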

@keen85
Author

keen85 commented Dec 7, 2023

Thanks @rahulsmahadev, I can confirm that it was fixed by 2118e64.

@keen85 keen85 closed this as completed Dec 7, 2023