Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpointing POC #4

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Checkpointing POC #4

wants to merge 2 commits into from

Conversation

shirshanka
Copy link
Owner

No description provided.

watermarkToCommit = writerWatermark;
} else {
watermarkToCommit =
(watermarkToCommit.compareTo(writerWatermark) < 0) ? watermarkToCommit : writerWatermark;
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be inverted, the compare logic is incorrect.

*
* @return The lowest watermark out of all pending write requests
*/
Optional<ComparableWatermark> getUnacknowledgedWatermark();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to bit more crisp about the definition / functionality of this method and rename: getUcknowledgedWatermark to getLowestUcknowledgedWatermark

*/
public interface WatermarkStorage {

void commitWatermark(Watermark watermark)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This possibly needs to be thread safe

* A DataWriter that is WatermarkAware. Required for implementing writers that
* can operate in continuous ingestion mode.
*/
public interface WatermarkAwareWriter<D> extends DataWriter<D> {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very specific to this extractor, but it is worthwhile illustrating that for streaming true watermark management is being done by the Task framework via WatermarkManager rather than Job (Source) running in the driver JVM.

Or else, it will get confusing and messy for people trying to mix the watermarks at Job (Source) and Task level

I wonder if we should rather use another name for this? checkpoints?

}

@Override
public Optional<ComparableWatermark> getCommittableWatermark() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment getCommittableWatermark is only considering least of unacknowledged and committable from individual writers.

However, if fault tolerance is turned on, and we are ok with 10% failure, ie. ignore 1 record per 10 records, and your least unacknowledged write is at x whereas you have x+1 to x+10 all returned successfully; shouldn't your committable watermark be x+10?

There would be more such scenarios with failed and timed-out records combined with failure tolerance window which we should hash out (one argument could be that its upto the writer to handle such scenarios and return right committable / unacknowledged watermark)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that it would be up to the writer to handle such scenarios, but we do need to flesh out the expected behavior. Also, I don't expect there to be many WatermarkAwareWriters. Most writers will just be AsyncDataWriters / SyncDataWriters and just be responsible for writing records and flushing pending writes, so they won't care about this stuff.
There will possibly be a single WatermarkAwareWriter that wraps them. My current plan to have AsyncWriterManager be that class.

public interface WatermarkStorage {

void commitWatermark(Watermark watermark)
throws IOException;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this fail, or fail with certain tolerance ie. only fail if it cant persist for last 1 minute, because if some one turns checkpointing on for every millisecond and there are transient network / connection failures, it would fail the Task - which is not desirable?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the current code is written (in WatermarkManager), failures in committing watermarks shouldn't fail the Task. At least that was my intention :)

+ " is not an instance of WatermarkStorage but the task is configured to run in continuous mode");
}
long commitIntervalMillis = 1000; // TODO: Configure
this.watermarkManager = Optional.of(this.closer.register
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting here since github isn't letting me comment on commit() method since you haven't touched it :) :

  • We should tweak commit() to consolidate handle watermark update for last of writes
  • Revisit publishTaskData() being invoked from commit()

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I haven't handled the "orderly shutdown" case at all. We should discuss the design before I write the code for it.

}

@Override
public void close()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't seem to be invoked anywhere, I was expecting this to be invoked via some sort of finishJob() / finishTask() and the last write of writers consolidated to update final watermark

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I haven't handled the "orderly shutdown" case at all. We should discuss the design before I write the code for it.

Copy link

@abti abti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// The parent task has already done pulling records so no new record means this fork is done
if (this.parentTaskDone) {
return;
} else {
this.logger.error("Found a {} record but parent task is not done. Aborting this fork.", record);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per what I see .. it should never get in else clause? did I miss some condition?

currentIteration++;
if (!externalWatermarkStorage.isEmpty()) {
for (CheckpointableWatermark watermark : externalWatermarkStorage.values()) {
System.out.println(watermark);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a bit of context in system.out would be helpful for debug

@shirshanka shirshanka force-pushed the checkpointing branch 3 times, most recently from 9ea782d to 9bb0e71 Compare January 8, 2017 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants