Commit

start state transition

terrywbrady committed Feb 15, 2024
1 parent 89bc1d1 commit 812629d
Showing 3 changed files with 295 additions and 179 deletions.
1 change: 1 addition & 0 deletions design/queue-2023/README.md
@@ -13,6 +13,7 @@
## Design Details
- [Queue State Transitions](states.md)
- [Queue Entry Data Storage](data.md)
- [State Transition Details](transition.md)
- [Underlying Queue Service](service.md)
- [Queue Manager](manager.md)
- [Batch and Queue State Enums](https://github.com/CDLUC3/merritt-tinker/tree/main/state-transition)
179 changes: 0 additions & 179 deletions design/queue-2023/states.md
@@ -43,79 +43,6 @@ At least one job FAILED
Determine if any previously FAILED jobs are not complete. If so, notify the depositor by email.

---



## Batch Queue State Transitions

### Start --> Pending
- generate batch_id
- create batch folder
- write payload to batch folder
- TODO: we should re-evaluate the maximum payload size without a manifest (currently 30G)
- set profile_name
- set submitter
- determine manifest_type (see the sketch after this list)
- set file_name
- examine the payload
- single - 1 job batch
- object manifest - 1 job batch
- manifest of manifest - N jobs
- manifest of zips - N jobs
- manifest of jobs - N jobs
- future json manifest (inline manifest of detailed object manifests)
- status = Pending
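
A minimal sketch of the manifest_type decision above, in Python; the enum values and helper are illustrative only, not the actual Batch and Queue State Enums linked in the README:

```python
from enum import Enum

class ManifestType(Enum):
    """Hypothetical enum mirroring the payload types listed above."""
    SINGLE = "single"                                 # 1 job batch
    OBJECT_MANIFEST = "object-manifest"               # 1 job batch
    MANIFEST_OF_MANIFESTS = "manifest-of-manifests"   # N jobs
    MANIFEST_OF_ZIPS = "manifest-of-zips"             # N jobs
    MANIFEST_OF_JOBS = "manifest-of-jobs"             # N jobs

def is_single_job_batch(manifest_type: ManifestType) -> bool:
    """Single files and object manifests each yield a 1 job batch;
    every other manifest type fans out to N jobs."""
    return manifest_type in (ManifestType.SINGLE, ManifestType.OBJECT_MANIFEST)
```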
### Pending --> Held
- check if collection is held
- status = Held
### Pending --> Processing
- status = Processing
### Held --> Processing (admin function)
- status = Processing
### Processing --> Failed
- status = Failed
- set error_message
### Processing --> Reporting
- based on the payload
- single - we start a 1 job batch
- object manifest - we start a 1 job batch
- manifest of manifest - create N job entries and create the array of jobids in the batch object
- manifest of zips - create N job entries and create the array of jobids in the batch object
- construct JOB object(s)
- construct job folder(s)
- folder creation could be deferred to the job step
- create jobs in job queue
- create the status array (see the fan-out sketch after this list)
- status = Reporting
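
A sketch of the fan-out described above, assuming a plain dict for the batch record; `create_job_entry` is a hypothetical stand-in for the real job-queue call:

```python
import uuid

def create_job_entry(batch_id: str, job_id: str, entry: dict) -> None:
    """Stub: construct the JOB object and folder, then enqueue the job."""
    ...

def start_jobs(batch: dict, manifest_entries: list[dict]) -> None:
    """Create one job per manifest entry (a single entry for 'single' and
    'object manifest' payloads), recording the job-id array and status
    array in the batch object before moving it to Reporting."""
    batch["job_ids"] = []
    batch["job_status"] = {}
    for entry in manifest_entries:
        job_id = f"jid-{uuid.uuid4()}"
        create_job_entry(batch["batch_id"], job_id, entry)
        batch["job_ids"].append(job_id)
        batch["job_status"][job_id] = "Pending"
    batch["status"] = "Reporting"
```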
### Reporting --> Completed
- send summary email
- status = Completed
### Reporting --> Failed
- this occurs when at least one job has failed
- status = Failed
- or is a batch simply done after it reports?
- if jobs are re-run, do they report on their own?
- do we create a "re-run batch"?
- or is this a question for the end users?
### Failed --> UpdateReporting
- manually triggered if some or all of the jobs have been re-run
- status = UpdateReporting
### UpdateReporting --> Completed
- detect any updated statuses and report them
- status = Completed
### UpdateReporting --> Failed
- detect any updated statuses and report them
- status = Failed
### Failed --> Deleted (admin function)
- delete any running jobs (and folders)
- delete batch folder
- status = Deleted
### Held --> Deleted (admin function)
- delete any running jobs (and folders)
- delete batch folder
- status = Deleted
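
Taken together, the batch transitions above form a small table; a minimal guard in Python (the table is my reading of this section, not generated from the actual enums linked in the README):

```python
# Allowed batch state transitions, as described by the headings above.
BATCH_TRANSITIONS: dict[str, set[str]] = {
    "Start":           {"Pending"},
    "Pending":         {"Held", "Processing"},
    "Held":            {"Processing", "Deleted"},   # both admin functions
    "Processing":      {"Failed", "Reporting"},
    "Reporting":       {"Completed", "Failed"},
    "Failed":          {"UpdateReporting", "Deleted"},
    "UpdateReporting": {"Completed", "Failed"},
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not permitted."""
    if target not in BATCH_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal batch transition {current} --> {target}")
    return target
```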

---
## Job Queue

### Job Queue State Diagram
@@ -179,112 +106,6 @@ The queue will track the last successful step so that the job can be resumed at
---


## Job Queue State Transitions

### START --> Pending
- if payload is a single file and the depositor supplied a digest, perform checksum validation
- profile_name - constructor
- status = Pending
- batch_id - constructor
- job_id - generated
- working_directory - derived from batch & job (eventually more flexible options)
- retry_count = 0
- priority - derived from
- profile
- size of the batch (constructor)
- payload_type - constructor
- payload_url - constructor
- submitter - constructor
- update_status - constructor
- digest_type - constructor (optional)
- digest_value - constructor (optional)
- space_needed = 0
- resource_to_provision - constructor
- local_id - constructor (read from ERC, from form parameter, or from manifest)
- ark - constructor (if supplied at ingest time, otherwise it will be minted)
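
The constructor list above implies a job record roughly like the following Python dataclass; only the field names come from this section, the types and defaults are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobEntry:
    """Job queue entry; fields from the START --> Pending list above."""
    profile_name: str                      # constructor
    batch_id: str                          # constructor
    payload_type: str                      # constructor
    payload_url: str                       # constructor
    submitter: str                         # constructor
    update_status: str                     # constructor
    resource_to_provision: str             # constructor
    digest_type: Optional[str] = None      # constructor (optional)
    digest_value: Optional[str] = None     # constructor (optional)
    local_id: Optional[str] = None         # from ERC, form parameter, or manifest
    ark: Optional[str] = None              # minted later if not supplied
    job_id: str = ""                       # generated
    working_directory: str = ""            # derived from batch & job
    status: str = "Pending"
    retry_count: int = 0
    priority: int = 0                      # derived from profile + batch size
    space_needed: int = 0
    last_successful_state: Optional[str] = None
```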

### START --> Failed
- if payload digest does not match depositor digest
- if manifest is corrupt
- status = Failed (no recovery is possible)
### Pending --> Held
- evaluate if a collection hold is in place
- status = Held
### Pending --> Estimating
- status = Estimating
### Held --> Estimating (admin function)
- evaluate if collection hold has been removed
- status = Estimating
### Estimating --> Provisioning
- HEAD request on every download that is needed (multi-threaded; see the sketch after this list)
- sum the returned sizes into space_needed
- last_successful_state = Estimating
- status = Provisioning
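
A sketch of the estimate using the Python standard library for the multi-threaded HEAD requests; the worker count and timeout are arbitrary choices, not values from this design:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

def head_size(url: str) -> int:
    """HEAD request; returns Content-Length, or 0 if the server omits it."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=30) as resp:
        return int(resp.headers.get("Content-Length") or 0)

def estimate_space_needed(urls: list[str]) -> int:
    """Sum the reported size of every needed download (multi-threaded).
    Servers that omit Content-Length contribute 0, so the estimate can
    undercount -- which is why Downloading recalculates space_needed."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return sum(pool.map(head_size, urls))
```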
### Provisioning --> Downloading
- if last_successful_state is not Estimating, total may be inaccurate
- determine if file system is available
- determine if there is adequate storage to proceed (throttle at 70% full disk)
- if space is sufficient, status = Downloading (see the capacity-check sketch below)
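
A sketch of the capacity check, assuming `shutil.disk_usage` against the working directory's filesystem; the 70% threshold comes from the throttle noted above:

```python
import shutil

DISK_THROTTLE = 0.70  # throttle at 70% full disk, per the step above

def can_provision(working_root: str, space_needed: int) -> bool:
    """True if the filesystem is available and adding space_needed bytes
    would not push disk usage past the throttle threshold."""
    try:
        usage = shutil.disk_usage(working_root)
    except OSError:
        return False  # filesystem is not available
    return (usage.used + space_needed) / usage.total <= DISK_THROTTLE
```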
### Downloading --> Processing
- GET request on every download (multi-threaded), with a finite number of retries
- save files to working folder
- recalculate space_needed (in case estimate was inaccurate)
- perform digest validation (if user-supplied in manifest)
- last_successful_state = Downloading
- status = Processing
### Downloading --> Failed (downloading)
- status = Failed
- last_successful_state remains Estimating
- error_message = details of the file that could not be downloaded
### Processing --> Recording
- local_id lookup
- Mint ark using EZID if needed
- if local_id does not match user-supplied ark, fail
- Set ark
- Question: should we break minting into a separate state?
- small risk of wasting an ark if the minting process is re-run (only applicable if no local_id is provided)
- Write ERC file
- Write dublin_core file
- Check digest for each file if needed (HandlerDigest)
- Create storage manifest (HandlerDigest)
- Request storage worker for handling request (very low risk of failure)
- Call storage endpoint to pass the storage manifest
- Check return status from storage
- last_successful_state = Processing
- status = Recording
### Processing --> Failed (processing)
- due to minting failure or storage failure
- update error_message
- status = Failed
### Recording --> Notify
- Inventory will read and update THIS queue
- Save data to INV database
- status = Notify
- last_successful_state = Recording
### Recording --> Failed (recording)
- update error_message
- status = Failed
### Notify --> Completed
- Invoke callback (if defined)
- Notify batch queue that job is complete
- status = Completed
- last_successful_state = Notify
- delete job folder
### Notify --> Failed
- status = Failed
### Failed --> Downloading
- reset status
### Failed --> Processing
- reset status
### Failed --> Recording
- reset status
### Failed --> DELETED
- delete job folder
### Held --> DELETED
- delete job folder
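
One way to implement the three Failed --> retry transitions above is to key the resume state off last_successful_state; a sketch in Python (the mapping is my reading of this section):

```python
from typing import Optional

# Resume target for a re-run Failed job, keyed by last_successful_state.
# Mirrors the Failed --> Downloading/Processing/Recording transitions above;
# no Failed --> Notify transition is listed in this section.
RESUME_AFTER: dict[str, str] = {
    "Estimating": "Downloading",
    "Downloading": "Processing",
    "Processing": "Recording",
}

def resume_state(last_successful_state: Optional[str]) -> str:
    """Reset a Failed job to the step after its last successful one.
    A job that failed in START has no last successful state and, per the
    section above, cannot be recovered."""
    if last_successful_state is None:
        raise ValueError("no recovery is possible for a job that failed in START")
    return RESUME_AFTER[last_successful_state]
```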

---

## Design Questions

- Should we have separate states for "Active Provisioning" vs "Capacity Checks"?