Add Megalinter and fix any findings #94

Merged · 2 commits · Sep 11, 2024
52 changes: 52 additions & 0 deletions .github/workflows/mega-linter.yml
@@ -0,0 +1,52 @@
# MegaLinter GitHub Action configuration file
# More info at https://megalinter.io
---
name: MegaLinter

# Trigger MegaLinter on pull requests to main. Results will also be
# visible from the pull request's checks
on:
  pull_request:
    branches:
      - main

env:
  APPLY_FIXES: none
  APPLY_FIXES_EVENT: pull_request

concurrency:
  group: ${{ github.ref }}-${{ github.workflow }}
  cancel-in-progress: true

jobs:
  megalinter:
    name: MegaLinter
    runs-on: ubuntu-latest

    # Give the default GITHUB_TOKEN write permission to comment the MegaLinter output onto pull requests
    # and read permission to the codebase
    permissions:
      contents: read
      pull-requests: write

    steps:
      # Git Checkout
      - name: Checkout Code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.PAT || secrets.GITHUB_TOKEN }}

      # MegaLinter
      - name: MegaLinter

        # You can override the MegaLinter flavor used to get faster performance
        # More info at https://megalinter.io/latest/flavors/
        uses: oxsecurity/megalinter@v8

        id: ml

        # All available variables are described in the documentation
        # https://megalinter.io/latest/config-file/
        env:
          VALIDATE_ALL_CODEBASE: true
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,4 +2,6 @@
*.iml
target/*
.java-version
.DS_Store
.DS_Store
# Megalinter reports
megalinter-reports/
6 changes: 6 additions & 0 deletions .markdownlint.yaml
@@ -0,0 +1,6 @@
# markdownlint YAML configuration

# MD024/no-duplicate-heading/no-duplicate-header - Multiple headings with the same content
MD024:
  # Only check sibling headings
  siblings_only: true
15 changes: 15 additions & 0 deletions .mega-linter.yml
@@ -0,0 +1,15 @@
# Configuration file for MegaLinter
#
# See all available variables at https://megalinter.io/latest/config-file/ and in
# linters documentation

# all, none, or list of linter keys
APPLY_FIXES: none

# If you use ENABLE_LINTERS variable, all other linters will be disabled by
# default
ENABLE_LINTERS:
  - MARKDOWN_MARKDOWNLINT

# Ignore the pull request template
FILTER_REGEX_EXCLUDE: .github/PULL_REQUEST_TEMPLATE.md
18 changes: 18 additions & 0 deletions Makefile
@@ -22,3 +22,21 @@ test:

docker-build:
	docker build . -t europe-west2-docker.pkg.dev/ssdc-rm-ci/docker/ssdc-rm-job-processor:latest

megalint: ## Run the mega-linter.
	docker run --platform linux/amd64 --rm \
		-v /var/run/docker.sock:/var/run/docker.sock:rw \
		-v $(shell pwd):/tmp/lint:rw \
		oxsecurity/megalinter:v8

megalint-fix: ## Run the mega-linter and attempt to auto fix any issues.
	docker run --platform linux/amd64 --rm \
		-v /var/run/docker.sock:/var/run/docker.sock:rw \
		-v $(shell pwd):/tmp/lint:rw \
		-e APPLY_FIXES=all \
		oxsecurity/megalinter:v8

clean_megalint: ## Clean the temporary files.
	rm -rf megalinter-reports

lint_check: clean_megalint megalint
15 changes: 9 additions & 6 deletions README.md
@@ -4,21 +4,22 @@ The Job Processor handles the validation and processing of batches, for sample l

The Job Processor takes data loaded in CSV files via the Support Tool or Response Operations UI, validates and transforms that data, before shuttling it onto Google Pub/Sub topics, for processing in the Case Processor.

# How Does It Work?
## How Does It Work?

The Job Processor is designed as a single-instance application, which polls the database looking for Jobs to process. Jobs can be either staged data which needs validating, or validated data which needs processing (transforming and sending to the Case Processor).

It's a single instance because there's no need to scale horizontally: the biggest files we would ever need to handle are Census sample files (roughly 30 million rows), and a single instance is more than capable of processing that amount of data in a reasonable timeframe.

There might be a desire in future to make the Job Processor better at handling concurrent Jobs, but for now, we expect that most Jobs are so small that users won't notice that there's an element of queueing happening.
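
For illustration, here is a minimal sketch of that polling pattern. The `Job`, `JobStatus` and `JobRepository` names below are placeholders rather than the real classes, and the `READY_TO_PROCESS` status is invented for the example:

```java
import java.util.List;

public class PollingSketch {

    // Hypothetical stand-ins for the real entities and repository.
    enum JobStatus { VALIDATION_IN_PROGRESS, READY_TO_PROCESS }

    record Job(String id, JobStatus status) { }

    interface JobRepository {
        List<Job> findByStatus(JobStatus status);
    }

    private final JobRepository jobRepository;

    PollingSketch(JobRepository jobRepository) {
        this.jobRepository = jobRepository;
    }

    // Called on a fixed schedule; a single instance means no distributed locking is needed.
    void pollOnce() {
        for (Job job : jobRepository.findByStatus(JobStatus.VALIDATION_IN_PROGRESS)) {
            validate(job); // staged data that still needs validating
        }
        for (Job job : jobRepository.findByStatus(JobStatus.READY_TO_PROCESS)) {
            process(job); // validated data to transform and send to the Case Processor
        }
    }

    private void validate(Job job) { /* delegate to the validation component */ }

    private void process(Job job) { /* delegate to the processing component */ }
}
```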

# How Is It Structured?
## How Is It Structured?

The `StagedJobValidator` polls the database for Jobs which are in `VALIDATION_IN_PROGRESS` status, meaning that file staging is complete and the rows are in the database and ready to be validated.

When the `StagedJobValidator` finds a job, it checks whether there are any JobRows in `STAGED` status, meaning that the rows have been staged but not yet validated.

Having seen that there are some JobRows to validate, the `StagedJobValidator` asks the `RowChunkValidator` to validate a 'chunk' of rows... the chunking is done for performance and to keep transaction size down to something reasonable for huge batches. If any JobRow has a validation problem, the Job will be marked as `VALIDATED_WITH_ERRORS` so that the user can see that there are problems with the CSV file.
Having seen that there are some JobRows to validate, the `StagedJobValidator` asks the `RowChunkValidator` to validate a 'chunk' of rows... the chunking is done for performance and to keep transaction size down to something reasonable for huge batches.
If any JobRow has a validation problem, the Job will be marked as `VALIDATED_WITH_ERRORS` so that the user can see that there are problems with the CSV file.

Finally, once all the JobRows have been validated, the `StagedJobValidator` will mark the Job as `VALIDATED_OK` if there were no validation failures with any of the JobRows. There's a small bug here, where a restart of the Job Processor could theoretically cause a Job to be marked as OK when in reality it's actually got errors on some of the rows.
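
As a rough illustration of that chunk-by-chunk validation, here is a sketch in which the chunk size, row store and method names are invented; only the status names come from the description above:

```java
import java.util.List;

public class ChunkedValidationSketch {

    static final int CHUNK_SIZE = 500; // arbitrary, for illustration only

    // Hypothetical row store: fetches not-yet-validated (STAGED) rows and records results.
    interface RowStore {
        List<String> fetchStagedRows(String jobId, int limit);
        void markRowValidated(String row, boolean valid);
    }

    interface RowValidator {
        boolean isValid(String row);
    }

    private final RowStore rowStore;
    private final RowValidator rowValidator;

    ChunkedValidationSketch(RowStore rowStore, RowValidator rowValidator) {
        this.rowStore = rowStore;
        this.rowValidator = rowValidator;
    }

    // Returns the final job status name: VALIDATED_OK or VALIDATED_WITH_ERRORS.
    String validateJob(String jobId) {
        boolean anyErrors = false;
        List<String> chunk = rowStore.fetchStagedRows(jobId, CHUNK_SIZE);
        while (!chunk.isEmpty()) {
            // In the real service each chunk would be validated in its own transaction,
            // keeping transaction size reasonable for huge batches.
            for (String row : chunk) {
                boolean valid = rowValidator.isValid(row);
                rowStore.markRowValidated(row, valid); // takes the row out of STAGED
                anyErrors |= !valid;
            }
            chunk = rowStore.fetchStagedRows(jobId, CHUNK_SIZE);
        }
        return anyErrors ? "VALIDATED_WITH_ERRORS" : "VALIDATED_OK";
    }
}
```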

Expand All @@ -28,14 +29,16 @@ When the `ValidatedJobProcessor` finds a job, it loops through processing chunks

The whole application is designed to be very extensible (see the `Transformer` interface and `JobTypeHelper`) so don't hack it too much if you're adding anything new!
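
A loose sketch of that kind of extension point is below; the actual `Transformer` interface and `JobTypeHelper` in this codebase will differ, and the config record, topic name and JSON shape here are made up:

```java
import java.util.Map;

public class TransformerSketch {

    // A transformer turns one validated row into an outbound message payload.
    interface Transformer {
        String transform(Map<String, String> validatedRow);
    }

    // Hypothetical per-job-type wiring: each job type gets its own transformer and topic,
    // so new job types can be added without touching the core processing loop.
    record JobTypeConfig(Transformer transformer, String pubSubTopic) { }

    static JobTypeConfig sampleLoadConfig() {
        Transformer toJson =
                row -> "{\"sampleUnitRef\": \"" + row.getOrDefault("sampleUnitRef", "") + "\"}";
        return new JobTypeConfig(toJson, "sample-load-topic");
    }

    public static void main(String[] args) {
        JobTypeConfig config = sampleLoadConfig();
        System.out.println(config.transformer().transform(Map.of("sampleUnitRef", "12345")));
    }
}
```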

# A Final Word on Performance
## A Final Word on Performance

The batch sizes have been chosen quite aritrarily, but the application should be able to process files of 30 million in under 8 hours, when given some decent hardware... start with beefing up the database to get the max IOPS, increase the RAM and CPU of the database... then throw some more resources at the Job Processor if it's still not as fast as you want.
The batch sizes have been chosen quite arbitrarily, but the application should be able to process files of 30 million in under 8 hours, when given some decent hardware... start with beefing up the database to get the max IOPS, increase the RAM and CPU of the database... then throw some more resources at the Job Processor if it's still not as fast as you want.

Although it scales vertically not horizontally, the limiting factor is always going to be the number of transactions per second that the database can handle, so start by tuning that first.

There are a lot of configuration options available, so the author discourages code changes - start with the infrastructure, and then try config changes. It would seem highly unlikely that the design of the code would ever be the bottleneck, even for the biggest surveys that the ONS does (e.g. Census).

People might say "but I can load a CSV file in milliseconds"... yes, but not a very large one with robust protection against the process crashing part-way through. For sure, you can process small files in a single transaction very quickly, but once you get to large long-lived transactions, the problem is much harder, if you need a robust and reliable production-strength solution, which you would entrust to do duties like Census.
People might say "but I can load a CSV file in milliseconds"... yes, but not a very large one with robust protection against the process crashing part-way through. For sure, you can process small files in a single transaction very quickly, but once you get to large long-lived transactions, the problem is much harder.

If you need a robust and reliable production-strength solution, one you would entrust with duties like Census, that extra difficulty is unavoidable.

Beware short-cuts which 'appear' to improve performance, but at the price of being flaky and unable to withstand a hardware failure or unexpected shutdown.