Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel][Metrics][PR#2] Add SnapshotReport for reporting snapshot construction #3903

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

allisonport-db
Copy link
Collaborator

@allisonport-db allisonport-db commented Nov 26, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Adds a SnapshotReport for reporting snapshot construction.

We record a SnapshotReport after successfully constructing a snapshot or if an exception is thrown during construction. We use SnapshotContext to propagate and update information about the snapshot construction. For example, we update the "version" as soon as it's resolved for time-travel by timestamp or load latest snapshot queries. This means in the case of the exception we include as much information as is available.

How was this patch tested?

Adds a test suite MetricsReportSuite with unit tests.

Does this PR introduce any user-facing changes?

No.

@allisonport-db allisonport-db force-pushed the metrics-2 branch 2 times, most recently from e1dbe30 to ceb5a56 Compare November 27, 2024 00:36
@allisonport-db allisonport-db changed the title Metrics 2 [Kernel][Metrics][PR#2] Add SnapshotReport for reporting snapshot construction Nov 27, 2024
Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Reviewed the production code, not the tests yet

@@ -105,8 +107,11 @@ public Transaction build(Engine engine) {
// Table doesn't exist yet. Create an initial snapshot with the new schema.
Metadata metadata = getInitialMetadata();
Protocol protocol = getInitialProtocol();
LogReplay logReplay = getEmptyLogReplay(engine, metadata, protocol);
snapshot = new InitialSnapshot(table.getDataPath(), logReplay, metadata, protocol);
SnapshotContext snapshotContext = SnapshotContext.forVersionSnapshot(tablePath, 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the semantics of the SnapshotContext version? i.e. is it the read version? the attempted write version? Is there a differnce or does it matter?

For example, with this code that uses forVersionSnapshot(..., 0) ... we get a snapshotContext that is identical to the case where a table has 3 commits (0, 1, 2) and you time travel to read the table at version 0.

Curious what you think about this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this, my mistake this should've been -1 (which matches the version returned by the log segment & log replay).

In general it is the intended read version. For this specific case, we treat a non-existent table as read version -1. Note, we don't generate a snapshot report for this case since there is no real snapshot to load!

snapshotContext);

// Push snapshot report to engine
engine.getMetricsReporters().forEach(reporter -> reporter.report(snapshot.getSnapshotReport()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: in general, when should we be pushing to the engine?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as soon as the given operation is completed. In this case, as soon as a snapshot is successfully constructed. I viewed this to essentially mean as soon as the metadata/protocol is loaded.

Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just posting what we discussed in a sync! Looks great!

  1. Would be great to use factory methods to create SnapshotReportImpls. One for success/normal cases, one for error case. This should take in the SnapshotQueryContext.

  2. Ensure you have the same exact method / value names between SnapshotMetrics and SnapshotMetricsResult. Also, add docs to both that explain the relationship. i.e. SnapshotMetrics creates SnapshotMetricsResult; SnapshotMetricsResult is created by SnapshotMetrics

path,
expectException = false,
expectedVersion = Optional.of(0),
expectedProvidedTimestamp = Optional.empty(), // No time travel
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we time travel to version 0? Would // No time travel by timestamp be a more accurate comment? And I think this would apply to all the other such instances of this comment, right?

}
}

test("SnapshotReport valid queries") {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also test: try and time travel by a bad version (e.g. version 55) or by a bad timestamp (e.g. the year 1985) but on a valid table?

e.g. you could rename this test to SnapshotReport queries on valid table

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So originally I did not add these tests because my main goal with the exception cases was to cover the 3 possible places snapshot construction could fail (outlined here in design doc) instead of trying to cover all possible errors/exceptions during snapshot construction.

  • Error resolving timestamp --> version (only applicable for time-travel by timestamp)
  • Error building the log segment
  • Error constructing the snapshot (loading table protocol and metadata)

But I'm happy to add these two cases anyways. I can just add an additional test for them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So originally I did not add these tests because my main goal with the exception cases was to cover the 3 possible places snapshot construction could fail (outlined here in design doc) instead of trying to cover all possible errors/exceptions during snapshot construction.

Sorry, what's unclear to me is if you didn't plan to include those tests in this PR or if there's no plan to write them in the near future

class MetricsReportSuite extends AnyFunSuite with TestUtils {

///////////////////////////
// SnapshotReport tests //
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many different types of reports do you anticipate we will have? If test + helper code for just SnapshotReport is ~300 LOC, I wonder if we should split this up into different files?

e.g. SnapshotMetricsReportSuite extends MetricsReportTestUtils

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants