[HUDI-8564] Removing WriteStatus references in Hoodie Metadata writer flow #12321
Conversation
Force-pushed from b4871da to 0ed2635.
Thanks for the PR. Changes look localized. Made a first pass on testing, code structure, and other things. High-level flow seems good.
I am going to take on code structure changes while going deeper on correctness and the different cases.
@@ -171,6 +171,7 @@ private void init(String fileId, String partitionPath, HoodieBaseFile baseFileTo
     try {
       String latestValidFilePath = baseFileToMerge.getFileName();
       writeStatus.getStat().setPrevCommit(baseFileToMerge.getCommitTime());
+     writeStatus.getStat().setPrevBaseFile(latestValidFilePath);
Why store just the base file? E.g., if the merge handle was called during compaction, don't we need the entire slice?
And do other handles set this? Yes? No? Why?
At the moment, we only support SI for the overwrite-with-latest payload, so we don't need to embed the entire file slice here.
HUDI-8518 will be taken up to fix it for any payload, at which point we might require the entire file slice to be set here.
Btw, AppendHandle already adds all log files from the current file slice to HoodieDeltaWriteStat.
...di-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java (resolved)
...di-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java (resolved)
 *
 * @param writeStatuses {@code WriteStatus} from the write operation
 */
private HoodieData<HoodieRecord> getRecordIndexUpserts(HoodieData<WriteStatus> writeStatuses) {
Need to confirm this was all just moved somewhere else.
Nope, we don't need it. We are directly polling the data files to fetch this info with this patch.
https://github.com/apache/hudi/pull/12321/files#r1857717084
Here is the new logic: we directly read from the data files, and so we know the new inserts and new deletes directly.
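The reply above describes deriving record-index updates straight from the written data files rather than from WriteStatus. A minimal sketch of that idea (hypothetical names and set-diff logic, not the actual Hudi implementation): inserts are keys present only in the new base file, deletes are keys that vanished relative to the previous base file.

```java
import java.util.HashSet;
import java.util.Set;

public class RliDiffSketch {
    // Keys present in the new file but not the previous one -> new inserts.
    static Set<String> newInserts(Set<String> prevKeys, Set<String> newKeys) {
        Set<String> result = new HashSet<>(newKeys);
        result.removeAll(prevKeys);
        return result;
    }

    // Keys present previously but missing now -> deletes.
    static Set<String> newDeletes(Set<String> prevKeys, Set<String> newKeys) {
        Set<String> result = new HashSet<>(prevKeys);
        result.removeAll(newKeys);
        return result;
    }

    public static void main(String[] args) {
        Set<String> prev = Set.of("k1", "k2", "k3");
        Set<String> cur = Set.of("k2", "k3", "k4");
        System.out.println(newInserts(prev, cur)); // prints [k4]
        System.out.println(newDeletes(prev, cur)); // prints [k1]
    }
}
```

This is why the previous base file name is recorded on the write stat: with it, both sides of the diff can be re-read from storage after the fact.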
+ '\'' + ", numWrites=" + numWrites + ", numDeletes=" + numDeletes + ", numUpdateWrites=" + numUpdateWrites | ||
+ ", totalWriteBytes=" + totalWriteBytes + ", totalWriteErrors=" + totalWriteErrors + ", tempPath='" + tempPath | ||
+ '\'' + ", cdcStats='" + JsonUtils.toString(cdcStats) | ||
+ '\'' + ", prevBaseFile=" + prevBaseFile + '\'' + ", numWrites=" + numWrites + ", numDeletes=" + numDeletes |
Will reformat / clean this up. Looks too busy.
int parallelism = Math.max(Math.min(allWriteStats.size(), metadataConfig.getRecordIndexMaxParallelism()), 1);
String basePath = dataTableMetaClient.getBasePath().toString();
// we might need to set some additional variables if we need to process log files.
boolean anyLogFiles = allWriteStats.stream().anyMatch(writeStat -> {
code duplication
Yes. In one of them we just need the record keys (for the SI flow), while in the other we generate the HoodieRecords directly, so I could not dedup much.
But I have fixed the duplication in BaseFileRecordKeyParsingUtils.
  String fileName = FSUtils.getFileName(writeStat.getPath(), writeStat.getPartitionPath());
  return FSUtils.isLogFile(fileName);
});
Option<Schema> writerSchemaOpt = Option.empty();
Same.
if (writeStat.getPath().endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
  return BaseFileRecordParsingUtils.getRecordKeysDeletedOrUpdated(basePath, writeStat, storage).iterator();
} else {
  // for logs, every entry is either an update or a delete
Do we throw errors if we find numInserts > 0 for logs?
As of now, no. But I agree that we should throw an exception.
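The check agreed on above could look roughly like this (a simplified stand-in, not Hudi's actual code): fail fast if a log file's write stat claims any inserts, since log entries should only ever be updates or deletes.

```java
public class LogStatCheck {
    // Throws if a log file's stats report inserts, which should be
    // impossible: log files carry only updates and deletes.
    static void validateNoInsertsInLog(String logFilePath, long numInserts) {
        if (numInserts > 0) {
            throw new IllegalStateException("Log file " + logFilePath
                + " unexpectedly reports " + numInserts + " inserts");
        }
    }

    public static void main(String[] args) {
        validateNoInsertsInLog(".f1_20231101_1.log.1", 0); // fine, no output
        try {
            validateNoInsertsInLog(".f2_20231101_1.log.1", 5);
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Surfacing the violation as an exception (rather than silently treating the entries as updates) makes a corrupted or mis-attributed write stat visible at indexing time instead of as a wrong index later.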
public static Set<String> getRecordKeys(String filePath, HoodieTableMetaClient datasetMetaClient,
                                        Option<Schema> writerSchemaOpt, int maxBufferSize,
                                        String latestCommitTimestamp) throws IOException {
  if (writerSchemaOpt.isPresent()) {
code duplication
  }
}

public static List<String> getRecordKeysDeletedOrUpdated(String basePath,
There is a method with the same name in HoodieTableMetadataUtil: getRecordKeysDeletedOrUpdated.
Yes, this one is just for one HoodieWriteStat; the one in HoodieTableMetadataUtil is for an entire HoodieCommitMetadata.
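The distinction in the reply above can be sketched as follows (simplified stand-in types and bodies, not Hudi's real classes): the commit-level flavor is just the per-write-stat flavor aggregated over every stat in the commit metadata.

```java
import java.util.List;
import java.util.stream.Collectors;

public class KeyExtractionSketch {
    record WriteStat(List<String> deletedOrUpdatedKeys) {}
    record CommitMetadata(List<WriteStat> writeStats) {}

    // Per-HoodieWriteStat flavor (as in BaseFileRecordParsingUtils).
    static List<String> keysDeletedOrUpdated(WriteStat stat) {
        return stat.deletedOrUpdatedKeys();
    }

    // Commit-level flavor (as in HoodieTableMetadataUtil):
    // the union of the per-stat results across the whole commit.
    static List<String> keysDeletedOrUpdated(CommitMetadata meta) {
        return meta.writeStats().stream()
                .flatMap(s -> keysDeletedOrUpdated(s).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        CommitMetadata meta = new CommitMetadata(List.of(
                new WriteStat(List.of("k1")),
                new WriteStat(List.of("k2", "k3"))));
        System.out.println(keysDeletedOrUpdated(meta)); // prints [k1, k2, k3]
    }
}
```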
@nsivabalan I'll push some changes and do a final review. You can take on the testing-related gaps after that.
Force-pushed from 048c2a8 to 9d8bc7f.
TestSecondaryIndexPruning#testUpdatesReInsertsDeletes
A few observations/questions:
After insert + index builds:
After update:
After delete:
After reinsert:
After second update:
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java (outdated, resolved)
Responded to a few comments.
...nt/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java (resolved)
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java (resolved)
...di-spark-client/src/test/java/org/apache/hudi/client/functional/TestRLIRecordGeneration.java (outdated, resolved)
...di-spark-client/src/test/java/org/apache/hudi/client/functional/TestRLIRecordGeneration.java (outdated, resolved)
hudi-common/src/main/java/org/apache/hudi/metadata/BaseFileRecordParsingUtils.java (resolved)
Force-pushed from 66775ee to f15241b.
Addressed all feedback. Follow-up JIRA: https://issues.apache.org/jira/browse/HUDI-8597
Change Logs
Removing WriteStatus references in Hoodie Metadata writer flow.
This is stacked on top of #12313
Impact
Fixing MDT writes to not rely on RDD
Risk level (write none, low medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
ticket number here and follow the instruction to make changes to the website.
Contributor's checklist