diff --git a/.github/workflows/mega-linter.yml b/.github/workflows/mega-linter.yml
new file mode 100644
index 0000000..a8e3ec3
--- /dev/null
+++ b/.github/workflows/mega-linter.yml
@@ -0,0 +1,52 @@
+# MegaLinter GitHub Action configuration file
+# More info at https://megalinter.io
+---
+name: MegaLinter
+
+# Trigger MegaLinter on every pull request to main. The linter output will
+# also be visible from the pull request checks
+on:
+  pull_request:
+    branches:
+      - main
+
+env:
+  APPLY_FIXES: none
+  APPLY_FIXES_EVENT: pull_request
+
+concurrency:
+  group: ${{ github.ref }}-${{ github.workflow }}
+  cancel-in-progress: true
+
+jobs:
+  megalinter:
+    name: MegaLinter
+    runs-on: ubuntu-latest
+
+    # Give the default GITHUB_TOKEN write permission to post the MegaLinter output as pull request comments
+    # and read permission to the codebase
+    permissions:
+      contents: read
+      pull-requests: write
+
+    steps:
+      # Git Checkout
+      - name: Checkout Code
+        uses: actions/checkout@v4
+        with:
+          token: ${{ secrets.PAT || secrets.GITHUB_TOKEN }}
+
+      # MegaLinter
+      - name: MegaLinter
+
+        # You can override the MegaLinter flavor used for faster performance
+        # More info at https://megalinter.io/latest/flavors/
+        uses: oxsecurity/megalinter@v8
+
+        id: ml
+
+        # All available variables are described in documentation
+        # https://megalinter.io/latest/config-file/
+        env:
+          VALIDATE_ALL_CODEBASE: true
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index 890d3f8..68f1437 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,4 +2,6 @@
 *.iml
 target/*
 .java-version
-.DS_Store
\ No newline at end of file
+.DS_Store
+# MegaLinter reports
+megalinter-reports/
diff --git a/.markdownlint.yaml b/.markdownlint.yaml
new file mode 100644
index 0000000..4f8c675
--- /dev/null
+++ b/.markdownlint.yaml
@@ -0,0 +1,6 @@
+# markdownlint YAML configuration
+
+# MD024/no-duplicate-heading/no-duplicate-header - Multiple headings with the same content
+MD024:
+  # Only check sibling headings
+  siblings_only: true
\ No newline at end of file
diff --git a/.mega-linter.yml b/.mega-linter.yml
new file mode 100644
index 0000000..a1a7d81
--- /dev/null
+++ b/.mega-linter.yml
@@ -0,0 +1,15 @@
+# Configuration file for MegaLinter
+#
+# See all available variables at https://megalinter.io/latest/config-file/ and in
+# linters documentation
+
+# all, none, or list of linter keys
+APPLY_FIXES: none
+
+# If you use ENABLE_LINTERS variable, all other linters will be disabled by
+# default
+ENABLE_LINTERS:
+  - MARKDOWN_MARKDOWNLINT
+
+# Ignore the pull request template
+FILTER_REGEX_EXCLUDE: .github/PULL_REQUEST_TEMPLATE.md
\ No newline at end of file
diff --git a/Makefile b/Makefile
index 2466d43..2b2df8d 100644
--- a/Makefile
+++ b/Makefile
@@ -22,3 +22,21 @@ test:
 
 docker-build:
 	docker build . -t europe-west2-docker.pkg.dev/ssdc-rm-ci/docker/ssdc-rm-job-processor:latest
+
+megalint: ## Run the mega-linter.
+	docker run --platform linux/amd64 --rm \
+		-v /var/run/docker.sock:/var/run/docker.sock:rw \
+		-v $(shell pwd):/tmp/lint:rw \
+		oxsecurity/megalinter:v8
+
+megalint-fix: ## Run the mega-linter and attempt to auto-fix any issues.
+	docker run --platform linux/amd64 --rm \
+		-v /var/run/docker.sock:/var/run/docker.sock:rw \
+		-v $(shell pwd):/tmp/lint:rw \
+		-e APPLY_FIXES=all \
+		oxsecurity/megalinter:v8
+
+clean_megalint: ## Clean the temporary files.
+	rm -rf megalinter-reports
+
+lint_check: clean_megalint megalint
diff --git a/README.md b/README.md
index 9cf8932..1df3fb3 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ The Job Processor handles the validation and processing of batches, for sample l
 
 The Job Processor takes data loaded in CSV files via the Support Tool or Response Operations UI, validates and transforms that data, before shuttling it onto Google Pub/Sub topics, for processing in the Case Processor.
 
-# How Does It Work?
+## How Does It Work?
 
 The Job Processor is designed as a single-instance application, which polls the database looking for Jobs to process. Jobs can be either staged data which needs validating, or validated data which needs processing (transforming and sending to the Case Processor).
 
@@ -12,13 +12,14 @@ The reason it's a single instance is because there's no need to scale horizontal
 
 There might be a desire in future to make the Job Processor better at handling concurrent Jobs, but for now, we expect that most Jobs are so small that users won't notice that there's an element of queueing happening.
 
-# How Is It Structured?
+## How Is It Structured?
 
 The `StagedJobValidator` polls the database for Jobs which are in `VALIDATION_IN_PROGRESS` status, meaning that file staging is complete and the rows are in the database and ready to be validated.
 
 When the `StagedJobValidator` finds a job, it checks to see if there are any JobRows with `STAGED`, meaning that the rows have been staged but not yet validated.
 
-Having seen that there are some JobRows to validate, the `StagedJobValidator` asks the `RowChunkValidator` to validate a 'chunk' of rows... the chunking is done for performance and to keep transaction size down to something reasonable for huge batches. If any JobRow has a validation problem, the Job will be marked as `VALIDATED_WITH_ERRORS` so that the user can see that there are problems with the CSV file.
+Having seen that there are some JobRows to validate, the `StagedJobValidator` asks the `RowChunkValidator` to validate a 'chunk' of rows... the chunking is done for performance and to keep transaction size down to something reasonable for huge batches.
+If any JobRow has a validation problem, the Job will be marked as `VALIDATED_WITH_ERRORS` so that the user can see that there are problems with the CSV file.
 
 Finally, once all the JobRows have been validated, the `StagedJobValidator` will mark the Job as `VALIDATED_OK` if there were no validation failures with any of the JobRows. There's a small bug here, where a restart of the Job Processor could theoretically cause a Job to be marked as OK when in reality it's actually got errors on some of the rows.
 
@@ -28,14 +29,16 @@ When the `ValidatedJobProcessor` finds a job, it loops through processing chunks
 
 The whole application is designed to be very extensible (see the `Transformer` interface and `JobTypeHelper`) so don't hack it too much if you're adding anything new!
 
-# A Final Word on Performance
+## A Final Word on Performance
 
-The batch sizes have been chosen quite aritrarily, but the application should be able to process files of 30 million in under 8 hours, when given some decent hardware... start with beefing up the database to get the max IOPS, increase the RAM and CPU of the database... then throw some more resources at the Job Processor if it's still not as fast as you want.
+The batch sizes have been chosen quite arbitrarily, but the application should be able to process files of 30 million rows in under 8 hours, when given some decent hardware... start with beefing up the database to get the max IOPS, increase the RAM and CPU of the database... then throw some more resources at the Job Processor if it's still not as fast as you want.
 
 Although it scales vertically not horizontally, the limiting factor is always going to be the number of transactions per second that the database can handle, so start by tuning that first.
 
 There are a lot of configuration options available, so the author discourages code changes - start with the infrastructure, and then try config changes. It would seem highly unlikely that the design of the code would ever be the bottleneck, even for the biggest surveys that the ONS does (e.g. Census).
 
-People might say "but I can load a CSV file in milliseconds"... yes, but not a very large one with robust protection against the process crashing part-way through. For sure, you can process small files in a single transaction very quickly, but once you get to large long-lived transactions, the problem is much harder.
+People might say "but I can load a CSV file in milliseconds"... yes, but not a very large one with robust protection against the process crashing part-way through. For sure, you can process small files in a single transaction very quickly, but once you get to large long-lived transactions, the problem is much harder.
+
+That is especially true if you need a robust and reliable production-strength solution, one which you would entrust to do duties like Census.
 
 Beware short-cuts which 'appear' to improve performance, but at the price of being flakey and unable to withstand a hardware failure or unexpected shutdown.
\ No newline at end of file
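
With the new Makefile targets above, a local lint pass might look roughly like this (a sketch, assuming Docker is installed and the commands are run from the repository root):

```shell
# Run the full lint pass from a clean slate: clean_megalint removes any old
# megalinter-reports/ output, then megalint runs MegaLinter in Docker
make lint_check

# Or let MegaLinter attempt automatic fixes (APPLY_FIXES=all), then review what changed
make megalint-fix
git diff
```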