
[HUDI-8564] Removing WriteStatus references in Hoodie Metadata writer flow #12321

Conversation

nsivabalan (Contributor)

Change Logs

Removing WriteStatus references in the Hoodie Metadata writer flow.
This is stacked on top of #12313.

Impact

Fixes MDT (metadata table) writes so that they no longer rely on the RDD of WriteStatus.
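
For context, a minimal self-contained sketch of the direction of the change, using hypothetical stand-in names (WriteStatSketch, MetadataWriterSketch) rather than the actual Hudi classes: metadata index updates are derived from the small per-file write stats carried in commit metadata, and the metadata writer re-reads the listed data files itself, instead of carrying the engine-specific HoodieData<WriteStatus> collection into the metadata writer flow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-file write statistic: a small POJO carried in commit
// metadata, unlike WriteStatus, which holds per-record state inside an engine RDD.
class WriteStatSketch {
  final String partitionPath;
  final String filePath;

  WriteStatSketch(String partitionPath, String filePath) {
    this.partitionPath = partitionPath;
    this.filePath = filePath;
  }
}

// Hypothetical metadata-writer entry point: it only needs the per-file stats and then
// re-reads the listed files itself, so no HoodieData<WriteStatus> has to be passed around.
class MetadataWriterSketch {
  static List<String> filesToIndex(List<WriteStatSketch> allWriteStats) {
    List<String> files = new ArrayList<>();
    for (WriteStatSketch stat : allWriteStats) {
      files.add(stat.partitionPath + "/" + stat.filePath);
    }
    return files;
  }

  public static void main(String[] args) {
    List<WriteStatSketch> stats = List.of(
        new WriteStatSketch("city=houston/state=texas", "f1_0-0-0_001.parquet"),
        new WriteStatSketch("city=austin/state=texas", "f2_0-0-0_002.parquet"));
    System.out.println(filesToIndex(stats)); // files the metadata writer will read keys from
  }
}
```

The design choice this is meant to illustrate: per-record state stays on the data write path, and the metadata path works from file-level stats only.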

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan force-pushed the rli_dag_fix_mdt_si_simplified_fix_remove_Write_status branch 2 times, most recently from b4871da to 0ed2635 Compare November 23, 2024 05:26
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Nov 23, 2024
@vinothchandar (Member) left a comment:

Thanks for the PR. Changes look localized. Made a first pass on testing, code structure and other things. The high-level flow seems good.

I am going to take on the code structure changes, while going deeper on correctness and the different cases.

@@ -171,6 +171,7 @@ private void init(String fileId, String partitionPath, HoodieBaseFile baseFileTo
    try {
      String latestValidFilePath = baseFileToMerge.getFileName();
      writeStatus.getStat().setPrevCommit(baseFileToMerge.getCommitTime());
      writeStatus.getStat().setPrevBaseFile(latestValidFilePath);
Member:

Why just store the base file? For example, if the merge handle was called during compaction, don't we need the entire file slice?

Member:

And do other handles set this? Yes? No? Why?

Contributor Author:

At the moment, we only support SI for the overwrite-with-latest payload, so we don't need to embed the entire file slice here. HUDI-8518 will be taken up to fix this for arbitrary payloads, at which point we might require the entire file slice to be set here. By the way, AppendHandle already adds all log files from the current file slice to HoodieDeltaWriteStat.
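
To make this concrete, here is a minimal sketch (stand-in names only; not the actual Hudi handle or utility classes) of why recording just the previous base file on the write stat suffices for the current overwrite-with-latest SI flow, and why arbitrary payloads (HUDI-8518) would also need the previous slice's log files:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch with stand-in names; the real flow works off HoodieWriteStat and the
// data file readers, and this is not the actual Hudi code.
final class PrevBaseFileSketch {

  // Hypothetical key reader; in reality record keys are parsed out of the data file.
  static Set<String> readRecordKeys(String dataFilePath) {
    return new HashSet<>(); // placeholder
  }

  // With an overwrite-with-latest style payload, knowing only (prevBaseFile, newBaseFile)
  // from the write stat is enough to classify keys:
  //   inserts = after minus before, deletes = before minus after,
  //   updated keys are a subset of the intersection.
  // The "before" image the secondary index needs is readable from prevBaseFile alone, so
  // the full previous file slice (base + log files) is not required yet (see HUDI-8518).
  static void classify(String prevBaseFile, String newBaseFile) {
    Set<String> before = readRecordKeys(prevBaseFile);
    Set<String> after = readRecordKeys(newBaseFile);

    Set<String> inserts = new HashSet<>(after);
    inserts.removeAll(before);

    Set<String> deletes = new HashSet<>(before);
    deletes.removeAll(after);

    System.out.println("inserts=" + inserts + ", deletes=" + deletes);
  }
}
```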

*
* @param writeStatuses {@code WriteStatus} from the write operation
*/
private HoodieData<HoodieRecord> getRecordIndexUpserts(HoodieData<WriteStatus> writeStatuses) {
Member:

Need to confirm this was all just moved somewhere else.

Contributor Author:

Nope, we don't need it. With this patch, we directly poll the data files to fetch this info.

Contributor Author:

Here is the new logic: https://github.com/apache/hudi/pull/12321/files#r1857717084

We read directly from the data files, so we know the new inserts and new deletes directly.
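
For illustration, a rough sketch (hypothetical names such as RliEntrySketch; not Hudi classes) of what reading the data files directly buys: once the inserted and deleted keys per file are known, as in the earlier sketch, the record-level index updates follow immediately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Stand-in entry type; the real flow builds HoodieRecords carrying a metadata payload.
record RliEntrySketch(String recordKey, String partition, String fileId, boolean isDelete) {}

final class RliUpdateSketch {

  // Once inserted and deleted keys are known per file (by reading the data files named in
  // the write stats), the record-level index updates follow directly: a location upsert
  // for every new insert and a tombstone for every delete. No per-record WriteStatus
  // objects are needed to produce these.
  static List<RliEntrySketch> toRliEntries(Set<String> newInserts, Set<String> deletedKeys,
                                           String partition, String fileId) {
    List<RliEntrySketch> out = new ArrayList<>();
    for (String key : newInserts) {
      out.add(new RliEntrySketch(key, partition, fileId, false)); // key -> (partition, fileId)
    }
    for (String key : deletedKeys) {
      out.add(new RliEntrySketch(key, partition, fileId, true)); // drop key from the index
    }
    return out;
  }
}
```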

+ '\'' + ", numWrites=" + numWrites + ", numDeletes=" + numDeletes + ", numUpdateWrites=" + numUpdateWrites
+ ", totalWriteBytes=" + totalWriteBytes + ", totalWriteErrors=" + totalWriteErrors + ", tempPath='" + tempPath
+ '\'' + ", cdcStats='" + JsonUtils.toString(cdcStats)
+ '\'' + ", prevBaseFile=" + prevBaseFile + '\'' + ", numWrites=" + numWrites + ", numDeletes=" + numDeletes
Member:

Will reformat / clean this up. Looks too busy.

int parallelism = Math.max(Math.min(allWriteStats.size(), metadataConfig.getRecordIndexMaxParallelism()), 1);
String basePath = dataTableMetaClient.getBasePath().toString();
// we might need to set some additional variables if we need to process log files.
boolean anyLogFiles = allWriteStats.stream().anyMatch(writeStat -> {
Member:

Code duplication.

Contributor Author:

Yes. In one of them we just need the record keys (for the SI flow), while in the other we generate the HoodieRecords directly, so I could not dedup much. But I have fixed the duplication in BaseFileRecordKeyParsingUtils.
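
For illustration only, a minimal sketch (hypothetical names; not the PR's actual refactor) of the kind of sharing that remains possible when one caller needs raw record keys and the other needs mapped records, by parameterizing the per-key mapping:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class SharedKeyExtraction {

  // Shared parsing step: both index flows need the record keys of a written data file.
  static List<String> extractKeys(String dataFilePath) {
    return List.of(); // placeholder for the common parsing logic
  }

  // SI flow: only the raw record keys are needed.
  static List<String> forSecondaryIndex(String dataFilePath) {
    return extractKeys(dataFilePath);
  }

  // RLI flow: each key is mapped to an index record right away.
  static <R> List<R> forRecordIndex(String dataFilePath, Function<String, R> toIndexRecord) {
    return extractKeys(dataFilePath).stream().map(toIndexRecord).collect(Collectors.toList());
  }
}
```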

String fileName = FSUtils.getFileName(writeStat.getPath(), writeStat.getPartitionPath());
return FSUtils.isLogFile(fileName);
});
Option<Schema> writerSchemaOpt = Option.empty();
Member:

Same (code duplication).

if (writeStat.getPath().endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
return BaseFileRecordParsingUtils.getRecordKeysDeletedOrUpdated(basePath, writeStat, storage).iterator();
} else {
// for logs, every entry is either an update or a delete
Member:

Do we throw errors if we find numInserts > 0 for logs?

Contributor Author:

As of now, no. But I agree that we should throw an exception.
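
A minimal sketch of such a guard (not part of this PR; class and method names are hypothetical):

```java
final class LogFileStatValidation {
  // Sketch of the agreed-upon follow-up guard: for log-file write stats, every entry
  // should be an update or a delete, so any reported inserts indicate a bug upstream.
  static void validate(String logFilePath, long numInserts) {
    if (numInserts > 0) {
      throw new IllegalStateException("Log file " + logFilePath + " reports " + numInserts
          + " inserts; log entries are expected to be updates or deletes only");
    }
  }
}
```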

public static Set<String> getRecordKeys(String filePath, HoodieTableMetaClient datasetMetaClient,
Option<Schema> writerSchemaOpt, int maxBufferSize,
String latestCommitTimestamp) throws IOException {
if (writerSchemaOpt.isPresent()) {
Member:

Code duplication.

}
}

public static List<String> getRecordKeysDeletedOrUpdated(String basePath,
Member:

There is a method with the same name, getRecordKeysDeletedOrUpdated, in HoodieTableMetadataUtil.

Contributor Author:

Yes, this one is just for one HoodieWriteStat; the one in HoodieTableMetadataUtil is for an entire HoodieCommitMetadata.
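
To visualize the relationship with stand-in types (hypothetical names, not the real signatures): the commit-level variant is essentially the per-stat variant flat-mapped over every write stat carried by the commit metadata.

```java
import java.util.List;
import java.util.stream.Collectors;

// Stand-in types, only to show how the two same-named methods relate:
// one works on a single write stat, the other aggregates over a whole commit.
final class NamingSketch {

  static final class WriteStatStub { String path; }
  static final class CommitMetadataStub { List<WriteStatStub> writeStats; }

  // Per-write-stat variant (analogue of the method discussed here).
  static List<String> recordKeysDeletedOrUpdated(WriteStatStub writeStat) {
    return List.of(); // placeholder for the single-file logic
  }

  // Commit-level variant (analogue of the one in HoodieTableMetadataUtil): flat-maps the
  // per-stat variant over every write stat in the commit metadata.
  static List<String> recordKeysDeletedOrUpdated(CommitMetadataStub commitMetadata) {
    return commitMetadata.writeStats.stream()
        .flatMap(stat -> recordKeysDeletedOrUpdated(stat).stream())
        .collect(Collectors.toList());
  }
}
```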

@vinothchandar (Member):

@nsivabalan I'll push some changes and do a final review. You can take on the testing-related gaps after that.

@vinothchandar vinothchandar force-pushed the rli_dag_fix_mdt_si_simplified_fix_remove_Write_status branch from 048c2a8 to 9d8bc7f Compare November 24, 2024 22:31
@vinothchandar (Member) commented Nov 26, 2024:

TestSecondaryIndexPruning#testUpdatesReInsertsDeletes

trip4 is not found after the second update.

A few observations/questions:

  • Things seem to be consistent until the delete (all traces of trip4 are removed from RLI and SI).
  • Even the re-insert goes to a different, new file id as expected; SI has two pointers back to trip4 from the right reverse index values.
  • But trip4 goes missing after an update to the rider value. SI seems intact. It feels as though a delete happened in RLI; can't explain why.

After insert + index builds:

0 : [trip1,5,city=san_francisco/state=california,-6380283593510729419,-4773949543163463974,0,,1732582749312,0,null,null]
0 : [trip2,5,city=sunnyvale/state=california,1772447978757638219,-5721388430672818475,0,,1732582749312,0,null,null]
0 : [trip3,5,city=austin/state=texas,-5153898977397095021,-5284273978040951062,0,,1732582749312,0,null,null]
0 : [trip4,5,city=houston/state=texas,-7405699635221806163,-9058970397732851524,0,,1732582749312,0,null,null]
0 : [rider-F$trip4,7,null,null,null,null,null,null,null,null,false]
0 : [rider-C$trip2,7,null,null,null,null,null,null,null,null,false]
0 : [rider-A$trip1,7,null,null,null,null,null,null,null,null,false]
0 : [rider-E$trip3,7,null,null,null,null,null,null,null,null,false]
0 : [driver-O$trip3,7,null,null,null,null,null,null,null,null,false]
0 : [driver-P$trip4,7,null,null,null,null,null,null,null,null,false]
0 : [driver-K$trip1,7,null,null,null,null,null,null,null,null,false]
0 : [driver-M$trip2,7,null,null,null,null,null,null,null,null,false]
0 : [__all_partitions__,1,null,null,null,null,null,null,null,null,null]
0 : [city=houston/state=texas,2,null,null,null,null,null,null,null,null,null]
0 : [city=austin/state=texas,2,null,null,null,null,null,null,null,null,null]
0 : [city=sunnyvale/state=california,2,null,null,null,null,null,null,null,null,null]
0 : [city=san_francisco/state=california,2,null,null,null,null,null,null,null,null,null]

After update:

1 : [trip1,5,city=san_francisco/state=california,-6380283593510729419,-4773949543163463974,0,,1732582749312,0,null,null]
1 : [trip2,5,city=sunnyvale/state=california,1772447978757638219,-5721388430672818475,0,,1732582749312,0,null,null]
1 : [trip3,5,city=austin/state=texas,-5153898977397095021,-5284273978040951062,0,,1732582749312,0,null,null]
1 : [trip4,5,city=houston/state=texas,-7405699635221806163,-9058970397732851524,0,,1732582749312,0,null,null]
1 : [rider-E$trip4,7,null,null,null,null,null,null,null,null,false]
1 : [rider-C$trip2,7,null,null,null,null,null,null,null,null,false]
1 : [rider-A$trip1,7,null,null,null,null,null,null,null,null,false]
1 : [rider-E$trip3,7,null,null,null,null,null,null,null,null,false]
1 : [driver-O$trip3,7,null,null,null,null,null,null,null,null,false]
1 : [driver-P$trip4,7,null,null,null,null,null,null,null,null,false]
1 : [driver-K$trip1,7,null,null,null,null,null,null,null,null,false]
1 : [driver-M$trip2,7,null,null,null,null,null,null,null,null,false]
1 : [__all_partitions__,1,null,null,null,null,null,null,null,null,null]
1 : [city=houston/state=texas,2,null,null,null,null,null,null,null,null,null]
1 : [city=austin/state=texas,2,null,null,null,null,null,null,null,null,null]
1 : [city=sunnyvale/state=california,2,null,null,null,null,null,null,null,null,null]
1 : [city=san_francisco/state=california,2,null,null,null,null,null,null,null,null,null]

After delete:

2 : [trip1,5,city=san_francisco/state=california,-6380283593510729419,-4773949543163463974,0,,1732582749312,0,null,null]
2 : [trip2,5,city=sunnyvale/state=california,1772447978757638219,-5721388430672818475,0,,1732582749312,0,null,null]
2 : [trip3,5,city=austin/state=texas,-5153898977397095021,-5284273978040951062,0,,1732582749312,0,null,null]
2 : [rider-C$trip2,7,null,null,null,null,null,null,null,null,false]
2 : [rider-A$trip1,7,null,null,null,null,null,null,null,null,false]
2 : [rider-E$trip3,7,null,null,null,null,null,null,null,null,false]
2 : [driver-O$trip3,7,null,null,null,null,null,null,null,null,false]
2 : [driver-K$trip1,7,null,null,null,null,null,null,null,null,false]
2 : [driver-M$trip2,7,null,null,null,null,null,null,null,null,false]
2 : [__all_partitions__,1,null,null,null,null,null,null,null,null,null]
2 : [city=houston/state=texas,2,null,null,null,null,null,null,null,null,null]
2 : [city=austin/state=texas,2,null,null,null,null,null,null,null,null,null]
2 : [city=sunnyvale/state=california,2,null,null,null,null,null,null,null,null,null]
2 : [city=san_francisco/state=california,2,null,null,null,null,null,null,null,null,null]

After reinsert:

3 : [trip1,5,city=san_francisco/state=california,-6380283593510729419,-4773949543163463974,0,,1732582749312,0,null,null]
3 : [trip2,5,city=sunnyvale/state=california,1772447978757638219,-5721388430672818475,0,,1732582749312,0,null,null]
3 : [trip3,5,city=austin/state=texas,-5153898977397095021,-5284273978040951062,0,,1732582749312,0,null,null]
3 : [trip4,5,city=houston/state=texas,-8294613005162561361,-6714765547006433090,0,,1732582830669,0,null,null]
3 : [rider-C$trip2,7,null,null,null,null,null,null,null,null,false]
3 : [rider-G$trip4,7,null,null,null,null,null,null,null,null,false]
3 : [rider-A$trip1,7,null,null,null,null,null,null,null,null,false]
3 : [rider-E$trip3,7,null,null,null,null,null,null,null,null,false]
3 : [driver-O$trip3,7,null,null,null,null,null,null,null,null,false]
3 : [driver-Q$trip4,7,null,null,null,null,null,null,null,null,false]
3 : [driver-K$trip1,7,null,null,null,null,null,null,null,null,false]
3 : [driver-M$trip2,7,null,null,null,null,null,null,null,null,false]
3 : [__all_partitions__,1,null,null,null,null,null,null,null,null,null]
3 : [city=houston/state=texas,2,null,null,null,null,null,null,null,null,null]
3 : [city=austin/state=texas,2,null,null,null,null,null,null,null,null,null]
3 : [city=sunnyvale/state=california,2,null,null,null,null,null,null,null,null,null]
3 : [city=san_francisco/state=california,2,null,null,null,null,null,null,null,null,null]

After second update:

4 : [trip1,5,city=san_francisco/state=california,-6380283593510729419,-4773949543163463974,0,,1732582749312,0,null,null]
4 : [trip2,5,city=sunnyvale/state=california,1772447978757638219,-5721388430672818475,0,,1732582749312,0,null,null]
4 : [trip3,5,city=austin/state=texas,-5153898977397095021,-5284273978040951062,0,,1732582749312,0,null,null]
4 : [rider-E$trip4,7,null,null,null,null,null,null,null,null,false]
4 : [rider-E$trip2,7,null,null,null,null,null,null,null,null,false]
4 : [rider-A$trip1,7,null,null,null,null,null,null,null,null,false]
4 : [rider-E$trip3,7,null,null,null,null,null,null,null,null,false]
4 : [driver-O$trip3,7,null,null,null,null,null,null,null,null,false]
4 : [driver-Q$trip4,7,null,null,null,null,null,null,null,null,false]
4 : [driver-K$trip1,7,null,null,null,null,null,null,null,null,false]
4 : [driver-M$trip2,7,null,null,null,null,null,null,null,null,false]
4 : [__all_partitions__,1,null,null,null,null,null,null,null,null,null]
4 : [city=houston/state=texas,2,null,null,null,null,null,null,null,null,null]
4 : [city=austin/state=texas,2,null,null,null,null,null,null,null,null,null]
4 : [city=sunnyvale/state=california,2,null,null,null,null,null,null,null,null,null]
4 : [city=san_francisco/state=california,2,null,null,null,null,null,null,null,null,null]

@nsivabalan (Contributor Author) left a comment:

Responded to a few comments.

@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Nov 27, 2024
@nsivabalan nsivabalan force-pushed the rli_dag_fix_mdt_si_simplified_fix_remove_Write_status branch from 66775ee to f15241b Compare November 27, 2024 20:44
@hudi-bot

CI report:

Bot commands (@hudi-bot supports the following):
  • @hudi-bot run azure : re-run the last Azure build

@nsivabalan (Contributor Author):

Addressed all feedback. Follow-up Jira: https://issues.apache.org/jira/browse/HUDI-8597

@nsivabalan nsivabalan merged commit 3018c49 into apache:master Nov 28, 2024
45 checks passed