Django-structured ingestion architecture, see #HEA-159 #86
Overview here:
https://docs.google.com/document/d/1zgOcpqhmzCtUD1NfJneC8EHyrv-AiWnxhRAeJB_KmXA/edit?usp=sharing
Django-based Ingestion Architecture
Management Command
load_from_bss [livelihoodzone_baseline_id]
If no ID is provided, it imports the most recently modified one.
Decorator
@register
Registers model importers, one importer class per model
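A minimal sketch of how such a decorator could work, assuming a module-level registry keyed by model. The name IMPORTER_REGISTRY and the duplicate check are illustrative assumptions, not taken from the PR:

```python
# Hypothetical registry mapping each model to its single importer class.
IMPORTER_REGISTRY = {}


def register(importer_class):
    """Class decorator that records one importer class per model."""
    model = importer_class.Meta.model
    if model in IMPORTER_REGISTRY:
        raise ValueError(f"An importer is already registered for {model!r}")
    IMPORTER_REGISTRY[model] = importer_class
    return importer_class


@register
class CommunityImporter:
    class Meta:
        model = "Community"  # a string stands in for the Django model class
```

Keeping the registry keyed by model is one way to enforce the "one importer class per model" rule at import time.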
Exceptions
ImportException
LookupException
ReferenceDataException
The exception hierarchy enables customized behaviour depending on the type of issue.
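One possible shape for this hierarchy, assuming ImportException is the common base class and the other two derive from it. The inheritance, the docstrings and the handle function are assumptions made for illustration:

```python
class ImportException(Exception):
    """Assumed base class for any problem raised during an import."""


class LookupException(ImportException):
    """Assumed meaning: a source value could not be matched."""


class ReferenceDataException(ImportException):
    """Assumed meaning: required reference data is missing."""


def handle(exc):
    """Hypothetical dispatcher: choose a reaction from the exception type."""
    if isinstance(exc, ReferenceDataException):
        return "abort"
    if isinstance(exc, LookupException):
        return "log-and-continue"
    if isinstance(exc, ImportException):
        return "skip-instance"
    raise exc
```

Because the subclasses share a base, callers can catch ImportException broadly or a specific subclass when they need a narrower reaction.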
Base Importer Class (abstract base class)
To define:
Meta.model
The model whose instances we are generating, eg, Community
Meta.fields
The fields we are scraping from the BSS
Meta.parent_model_fields
The fields that point to a parent instance (that has already been ingested).
Meta.dependent_model_fields
The next models to import after this one.
def get_<field_name>
Iterates over the cells, passing the cell addresses to Importer.attempt_load_from_cell
Other overrides are possible for niche use cases.
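A sketch of what an importer subclass might look like under these conventions. The Importer class here is a simplified stand-in for the real base class, and the field names, cell addresses and the signature of attempt_load_from_cell are hypothetical:

```python
class Importer:
    """Simplified stand-in for the abstract base class described above."""

    def __init__(self, sheet):
        self.sheet = sheet  # dict of cell address -> raw value
        self.loaded = []

    def attempt_load_from_cell(self, field, address):
        # The real method does much more; this stub only keeps non-empty cells.
        value = self.sheet.get(address)
        if value not in (None, ""):
            self.loaded.append((field, address, value))


class CommunityImporter(Importer):
    class Meta:
        model = "Community"  # stands in for the Django model class
        fields = ["name"]
        parent_model_fields = ["livelihood_zone_baseline"]  # hypothetical
        dependent_model_fields = ["wealth_groups"]  # hypothetical

    def get_name(self):
        # Hypothetical layout: community names sit in row 4, columns C to E.
        for column in "CDE":
            self.attempt_load_from_cell("name", f"{column}4")
```

For example, with a sheet of `{"C4": "Village A", "D4": "", "E4": "Village B"}`, calling `get_name()` on this stub records the two non-empty cells and skips the blank one.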
Provided functionality
Importer.attempt_load_from_cell
This method:
Processes the source value, eg, strips and cleans it.
Passes source value to mapper, described below.
If a match is found, saves the spreadsheet location, source value, mapped/parsed value and instance this value will be saved on.
Logs successful and failed scans.
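The four steps above can be sketched as a standalone function. This is an illustration only: the real method lives on the Importer class, and the argument names, the dict used for the saved location and the use of a plain callable as the mapper are all assumptions:

```python
import logging

logger = logging.getLogger("ingestion_sketch")


def attempt_load_from_cell(sheet, address, field, instance_key, mapper, locations):
    """Try to load one field value from one cell; return True on a match."""
    raw = sheet.get(address)
    # Step 1: process the source value (here: just strip whitespace).
    source_value = "" if raw is None else str(raw).strip()
    # Step 2: pass the source value to the mapper.
    try:
        mapped_value = mapper(source_value)
    except (KeyError, ValueError):
        # Step 4 (failure): log the failed scan.
        logger.info("Failed scan of %s for %s: %r", address, field, source_value)
        return False
    # Step 3: save the location, source value, mapped value and target instance.
    locations.append(
        {
            "address": address,
            "field": field,
            "instance": instance_key,
            "source_value": source_value,
            "mapped_value": mapped_value,
        }
    )
    # Step 4 (success): log the successful scan.
    logger.info("Loaded %s for %s: %r", address, field, mapped_value)
    return True
```

Here `int` can serve as a trivial mapper: a cell containing " 12 " is stored with a mapped value of 12, while a cell containing "n/a" is logged as a failed scan.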
The base Importer class then compiles these locations into Django instances, adds the parent IDs, attempts to run a method on the model that displays validation warnings, and saves the instances. It then runs the dependent models’ importers, passing them the parent instances.
If the BSS is imported successfully, the logs of failed scans and lookups are then deleted.
Mapper classes (and factory)
The mapper classes convert the raw value from the spreadsheet into a processed value ready to be saved on the target model instance, eg, a foreign key, an integer, a choice code, etc. The mapper knows how to treat the value based on the Django field definition.
Mappers are instantiated using the factory pattern, so that lookups can be loaded once for the full ingestion run, then cleanly disposed of on completion.
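A rough sketch of the factory idea, assuming a context manager that caches one mapper per field and clears the cache when the run ends. The class names, the field_type strings and the load_aliases callable are hypothetical, and a real implementation would inspect the Django field definition instead:

```python
class IntegerMapper:
    def map(self, value):
        return int(value)


class ChoiceMapper:
    def __init__(self, aliases):
        self.aliases = aliases  # lookup loaded once, eg, {"poor": "P"}

    def map(self, value):
        return self.aliases[value.strip().lower()]


class MapperFactory:
    """Builds each mapper once per ingestion run and disposes of them at the end."""

    def __init__(self, load_aliases):
        self._load_aliases = load_aliases  # callable: field name -> alias dict
        self._cache = {}

    def get_mapper(self, field_name, field_type):
        if field_name not in self._cache:
            if field_type == "integer":
                self._cache[field_name] = IntegerMapper()
            elif field_type == "choice":
                aliases = self._load_aliases(field_name)
                self._cache[field_name] = ChoiceMapper(aliases)
            else:
                raise ValueError(f"No mapper for field type {field_type!r}")
        return self._cache[field_name]

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._cache.clear()  # drop the loaded lookups when the run completes
        return False
```

Used as `with MapperFactory(load_aliases) as factory:`, repeated requests for the same field reuse the cached mapper, so the alias lookup is loaded only once per run.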
Django Model Classes
BssValueExtractor
Records regexes for each field. If no regex is defined for a field, the whole value from the spreadsheet is used.
Encapsulates the logic of applying the regexes to match an alias.
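The regex step could be sketched as follows. The function name extract, its signature and the example pattern are assumptions for illustration and not the model's actual API:

```python
import re


def extract(source_value, pattern=None):
    """Return the part of source_value to match against aliases.

    With no pattern, the whole value is used. With a pattern, the first
    capture group is returned if the pattern has one, otherwise the whole
    match. None means the pattern did not match.
    """
    if pattern is None:
        return source_value
    match = re.search(pattern, source_value)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)
```

For instance, a hypothetical pattern such as `r"^(.*?)\s*\(kg\)$"` applied to "Maize (kg)" isolates "Maize", which can then be looked up as an alias.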
SpreadsheetLocation
This is really the core of this architecture, as it untangles and enables the scanning and ingestion of individual field values in whichever sequence and structure makes things easiest.
It links a value that has been loaded, parsed, formatted and normalized from a spreadsheet to the model, field and instance it is to be saved on. (It also stores the regex, source value, alias, etc, that produced the mapping.)
ChoiceAlias and FieldNameAlias
Aliases for choice fields and field names, eg, quantity_produced.
ImportLog
Logs are intercepted and stored here together with the context attached to the log message: cell, field, instance, parent instances to date, successfully instantiated instances to date, fields and mappings so far, regex, alias, failed mappings and scans, source value, mapped value, etc.
ScanLog
Logs every attempt to extract a value for a field from a cell and map it, along with the context listed above.