retry command for GCS poll way of running (#668)
aksharauke authored Oct 20, 2023
1 parent 386defc commit 1d29dd1
Showing 1 changed file with 27 additions and 3 deletions: docs/troubleshoot/minimal.md
@@ -36,7 +36,7 @@ The following error scenarios are possible currently when doing low downtime migrations
1. Other SpannerExceptions - which are marked for retry
1. In addition, there is a possibility of severe errors that require manual intervention, for example, an error during transformation.

Points 1 to 4 above are retryable errors - the Dataflow job automatically retries them at 10-minute intervals, up to 500 times. In most cases this is enough for the retryable records to succeed; however, if records still fail after all retries are exhausted, they are moved to the ‘severe’ error category. Such ‘severe’ errors can be retried later with the ‘retryDLQ’ mode of the Dataflow job (discussed below in the ‘Retry command’ section).
Points 1 to 4 above are retryable errors - the Dataflow job automatically retries them at 10-minute intervals, up to 500 times. In most cases this is enough for the retryable records to succeed; however, if records still fail after all retries are exhausted, they are moved to the ‘severe’ error category. Such ‘severe’ errors can be retried later with the ‘retryDLQ’ mode of the Dataflow job (discussed [below](#to-re-run-for-reprocessing-dlq-directory)).
The following scenarios result in records being skipped; they are not really errors:

1. Invalid structure of records read from Datastream output
@@ -76,15 +76,39 @@ Migration progress can be tracked by monitoring the Dataflow job and following c…

It can happen that in retryDLQ mode there are still permanent errors. To confirm that all the retryable errors have been processed and only permanent errors remain, look at the ‘Successful events’ count: it remains constant after every retry iteration, while the ‘elementsReconsumedFromDeadLetterQueue’ counter increments on each iteration.
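
For convenience — a sketch not present in the original doc — these counters can also be polled from the command line with `gcloud dataflow metrics list`; the job ID and region are placeholders, and the exact output format may differ:

```sh
# Sketch: list the user-defined counters of the retryDLQ job and
# filter for the two counters discussed above. <job-id> and
# <region-name> are placeholders for your job.
gcloud dataflow metrics list <job-id> \
  --region=<region-name> \
  --source=user \
  --format="table(name.name, scalar)" \
  | grep -E "Successful events|elementsReconsumedFromDeadLetterQueue"
```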

### Retry command
### Re-run commands

#### To re-run the regular flow

To re-run the regular flow, fire the same command as the original job. Note: this only works when not using Pub/Sub subscriptions for GCS files (that is, when the job polls GCS). Processing starts all over again, meaning the same Datastream outputs get reprocessed.

```sh
gcloud dataflow flex-template run <jobName> \
--project=<project-name> --region=<region-name> \
--template-file-gcs-location=gs://dataflow-templates-southamerica-west1/2023-09-12-00_RC00/flex/Cloud_Datastream_to_Spanner \
--num-workers 1 --max-workers 50 \
--enable-streaming-engine \
--parameters databaseId=<database id>,deadLetterQueueDirectory=<GCS location of the DLQ directory>,inputFilePattern=<gcs location of the datastream output>,instanceId=<spanner-instance-id>,sessionFilePath=<GCS location of the session json>,streamName=<data stream name>,transformationContextFilePath=<path to transformation context json>
```

These job parameters can be taken from the original job.
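
If the original command is no longer at hand, one way to recover the configuration — an editor's sketch, not part of the original doc — is to describe the original job and read the parameters out of its JSON description:

```sh
# Sketch: dump the original job's description; the flex-template
# parameters appear in the job's JSON output. <original-job-id> and
# <region-name> are placeholders.
gcloud dataflow jobs describe <original-job-id> \
  --region=<region-name> \
  --format=json
```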

#### To re-run for reprocessing DLQ directory

This reprocesses the records marked as ‘severe’ errors from the DLQ.
Before running this Dataflow job, check whether the main Dataflow job still has a non-zero retryable error count. If there are referential error records, check that the dependent table data has been populated completely from the source database.
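
As a quick spot check — a sketch with a hypothetical table name, not part of the original doc — row counts of a dependent table can be compared between Spanner and the source database:

```sh
# Sketch: count rows of a dependent table on Spanner; compare the
# result against the same count on the source database. `Singers`
# is a hypothetical table name.
gcloud spanner databases execute-sql <database id> \
  --instance=<spanner-instance-id> \
  --sql="SELECT COUNT(*) FROM Singers"
```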

A sample command to run the Dataflow job in retryDLQ mode:

```sh
# Before (the single-line form removed by this commit):
gcloud beta dataflow flex-template run <jobname> --region=<the region where the dataflow job must run> --template-file-gcs-location=gs://dataflow-templates/latest/flex/Cloud_Datastream_to_Spanner --additional-experiments=use_runner_v2 --parameters inputFilePattern=<GCS location of the input file pattern>,streamName=<Datastream name>,instanceId=<Spanner Instance Id>,databaseId=<Spanner Database Id>,sessionFilePath=<GCS path to session file>,deadLetterQueueDirectory=<GCS path to the DLQ>,runMode=retryDLQ
# After (the multi-line form added by this commit; note that the
# --parameters value must stay on a single line, since spaces inside
# it would split the argument):
gcloud dataflow flex-template run <jobname> \
  --region=<the region where the dataflow job must run> \
  --template-file-gcs-location=gs://dataflow-templates/latest/flex/Cloud_Datastream_to_Spanner \
  --additional-experiments=use_runner_v2 \
  --parameters inputFilePattern=<GCS location of the input file pattern>,streamName=<Datastream name>,instanceId=<Spanner Instance Id>,databaseId=<Spanner Database Id>,sessionFilePath=<GCS path to session file>,deadLetterQueueDirectory=<GCS path to the DLQ>,runMode=retryDLQ
```

The following parameters can be taken from the regular forward migration Dataflow job: …
