Django-structured ingestion architecture, see #HEA-159 #86

Open: wants to merge 2 commits into base: main

@chrispreee chrispreee commented Apr 8, 2024

Overview here:

https://docs.google.com/document/d/1zgOcpqhmzCtUD1NfJneC8EHyrv-AiWnxhRAeJB_KmXA/edit?usp=sharing

Django-based Ingestion Architecture

Management Command
load_from_bss [livelihoodzone_baseline_id]
If no ID is provided, it imports the most recently modified baseline.

Decorator
@register
Registers model importers, one importer class per model
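
A minimal sketch of how a registration decorator like this might work. The registry name, the duplicate check and the stand-in `Community` class are assumptions for illustration, not the code in this PR:

```python
from typing import Dict

# Hypothetical registry: maps a model name to its single importer class.
IMPORTER_REGISTRY: Dict[str, type] = {}


def register(importer_class: type) -> type:
    """Class decorator that records one importer class per model."""
    model = importer_class.Meta.model
    key = model if isinstance(model, str) else model.__name__
    if key in IMPORTER_REGISTRY:
        # Enforce "one importer class per model".
        raise ValueError(f"An importer is already registered for {key}")
    IMPORTER_REGISTRY[key] = importer_class
    return importer_class


class Community:
    """Stand-in for the Django model; not the real class."""


@register
class CommunityImporter:
    class Meta:
        model = Community
```

Keying the registry on the model lets the management command look up the importer for any model without importing importer classes by name.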

Exceptions
ImportException
LookupException
ReferenceDataException
Exception hierarchy enables customized behaviour depending on the type of issue.
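
To illustrate the idea, here is a sketch that assumes `LookupException` and `ReferenceDataException` both subclass `ImportException`. The exact hierarchy and the reaction to each type are assumptions, not taken from the PR:

```python
class ImportException(Exception):
    """Assumed root of the ingestion exception hierarchy."""


class LookupException(ImportException):
    """A source value could not be mapped to a known value."""


class ReferenceDataException(ImportException):
    """Reference data needed for a lookup is missing or inconsistent."""


def run_step(step):
    """Run one ingestion step and react according to the exception type."""
    try:
        step()
    except LookupException as e:
        # Hypothetical policy: record the failed lookup and carry on.
        return f"logged lookup failure: {e}"
    except ReferenceDataException as e:
        # Hypothetical policy: stop so the reference data can be fixed.
        return f"stopped, reference data needs fixing: {e}"
    except ImportException as e:
        # Fallback for any other ingestion error.
        return f"import failed: {e}"
    return "ok"
```

The more specific handlers come first, so the generic `ImportException` handler only catches whatever the subclasses do not.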

Base Importer Class (abstract base class)
To define:
Meta.model
The model whose instances we are generating, eg, Community
Meta.fields
The fields we are scraping from the BSS
Meta.parent_model_fields
The fields that point to a parent instance (that has already been ingested).
Meta.dependent_model_fields
The next models to import after this one.

def get_<field_name>
Iterates over the cells, passing the cell addresses to Importer.attempt_load_from_cell

Other overrides are possible for niche use cases.
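
As a rough, Django-free sketch of what defining an importer could look like: the base class below is a simplified stand-in, and the cell addresses, the `ingest` method and the parent field name are made up for illustration:

```python
class Importer:
    """Simplified stand-in for the abstract base importer."""

    class Meta:
        model = None
        fields = []
        parent_model_fields = []
        dependent_model_fields = []

    def __init__(self, sheet):
        self.sheet = sheet  # dict of cell address -> raw value
        self.loaded = []    # (field_name, address, cleaned value)

    def attempt_load_from_cell(self, field_name, address):
        value = self.sheet.get(address)
        if value is None or not str(value).strip():
            return False
        self.loaded.append((field_name, address, str(value).strip()))
        return True

    def ingest(self):
        # Call get_<field_name> for every field listed in Meta.fields.
        for field_name in self.Meta.fields:
            getattr(self, f"get_{field_name}")()
        return self.loaded


class CommunityImporter(Importer):
    class Meta:
        model = "Community"
        fields = ["name"]
        parent_model_fields = ["livelihood_zone_baseline"]  # hypothetical
        dependent_model_fields = []

    def get_name(self):
        # Hypothetical layout: community names sit in row 5, columns C to E.
        for column in "CDE":
            self.attempt_load_from_cell("name", f"{column}5")
```

A subclass only declares what to scrape and where to look; the base class owns the iteration and bookkeeping.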

Provided functionality

Importer.attempt_load_from_cell
This method:
Processes source value, eg, strips, cleans
Passes source value to mapper, described below.
If a match is found, saves the spreadsheet location, source value, mapped/parsed value and instance this value will be saved on.
Logs successful and failed scans.
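
The four steps above can be sketched roughly as follows. `Location`, `Scan`, `map_integer` and the function signature are illustrative assumptions; the real method lives on the importer and writes to Django models:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Location:
    address: str
    field_name: str
    source_value: str
    mapped_value: Any
    instance_number: int


@dataclass
class Scan:
    address: str
    field_name: str
    source_value: str
    success: bool


def map_integer(value: str) -> Optional[int]:
    """Hypothetical mapper: returns None when the value cannot be mapped."""
    try:
        return int(value.replace(",", ""))
    except ValueError:
        return None


def attempt_load_from_cell(
    sheet: Dict[str, Any],
    address: str,
    field_name: str,
    mapper: Callable[[str], Any],
    instance_number: int,
    locations: List[Location],
    scans: List[Scan],
) -> bool:
    # 1. Process the source value: strip and collapse whitespace.
    raw = sheet.get(address)
    source_value = "" if raw is None else " ".join(str(raw).split())
    # 2. Pass the cleaned value to the mapper.
    mapped_value = mapper(source_value) if source_value else None
    success = mapped_value is not None
    # 3. On a match, remember where the value came from and where it goes.
    if success:
        locations.append(
            Location(address, field_name, source_value, mapped_value, instance_number)
        )
    # 4. Log every scan, successful or not.
    scans.append(Scan(address, field_name, source_value, success))
    return success


sheet = {"B10": " 1,200 ", "C10": "n/a"}
locations: List[Location] = []
scans: List[Scan] = []
for number, address in enumerate(["B10", "C10"]):
    attempt_load_from_cell(
        sheet, address, "quantity_produced", map_integer, number, locations, scans
    )
```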

The base Importer class then compiles these locations into Django instances, adds in the parent IDs, tries to run a method on the model to display validation warnings, and saves the instances. It then runs the dependent models’ importers, passing them the parent instances.
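
A simplified sketch of the compile step, using plain dicts in place of Django instances. The tuple layout and the `livelihood_zone_baseline_id` key are assumptions for illustration:

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def compile_instances(
    locations: List[Tuple[int, str, Any]],
    parent_ids: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Group (instance_number, field_name, mapped_value) rows into instances."""
    grouped: Dict[int, Dict[str, Any]] = defaultdict(dict)
    for instance_number, field_name, mapped_value in locations:
        grouped[instance_number][field_name] = mapped_value
    # Add the parent IDs to every instance, in instance order.
    return [{**grouped[number], **parent_ids} for number in sorted(grouped)]


rows = [(0, "name", "A"), (1, "name", "B"), (0, "code", "a1")]
instances = compile_instances(rows, {"livelihood_zone_baseline_id": 7})
```

Because each location carries its own instance number, the cells can be scanned in any order and still end up on the right instance.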

If the BSS is imported successfully, the logs of failed scans and lookups are then deleted.

Mapper classes (and factory)

The mapper classes convert the raw value from the spreadsheet into a processed value ready to be saved on the target model instance, eg, a foreign key, an integer, a choice code, etc. The mapper knows how to treat the value based on the Django field definition.

Mappers are instantiated using the factory pattern, so that lookups can be loaded once for the full ingestion run, then cleanly disposed of on completion.
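
A rough sketch of the factory idea, with a plain `FieldSpec` standing in for a Django field definition. All class names, the context-manager design and the example choice code are assumptions, not the PR's implementation:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """Stand-in for a Django field definition."""
    name: str
    kind: str  # "integer" or "choice"
    choices: Tuple[Tuple[str, str], ...] = ()


class IntegerMapper:
    def map(self, value: str) -> Optional[int]:
        try:
            return int(value)
        except ValueError:
            return None


class ChoiceMapper:
    def __init__(self, choices):
        # Build the lookup once: both codes and labels resolve to the code.
        self.lookup: Dict[str, str] = {}
        for code, label in choices:
            self.lookup[code.lower()] = code
            self.lookup[label.lower()] = code

    def map(self, value: str) -> Optional[str]:
        return self.lookup.get(value.strip().lower())


class MapperFactory:
    """Creates each mapper once per run and drops them all on exit."""

    def __init__(self):
        self._cache: Dict[str, object] = {}

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._cache.clear()  # dispose of the loaded lookups

    def get_mapper(self, field: FieldSpec):
        if field.name not in self._cache:
            if field.kind == "choice":
                self._cache[field.name] = ChoiceMapper(field.choices)
            else:
                self._cache[field.name] = IntegerMapper()
        return self._cache[field.name]


# Hypothetical choice field with one code/label pair.
strategy_type = FieldSpec(
    "strategy_type", "choice", (("MilkProduction", "Milk Production"),)
)
with MapperFactory() as factory:
    mapper = factory.get_mapper(strategy_type)
    code = mapper.map(" milk production ")
```

The factory picks the mapper from the field definition, so importers never need to know which mapper class handles which field.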

Django Model Classes

BssValueExtractor
Records regexes for each field. If none are defined for a field, the whole value from the spreadsheet is used.
Encapsulates the logic of applying the regexes to match an alias.
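
A sketch of that extraction logic, assuming each regex exposes the alias through a named group called `alias`. The function name, the group name and the example pattern are hypothetical:

```python
import re
from typing import Dict, List, Optional


def extract_alias(
    field_name: str,
    source_value: str,
    regexes_by_field: Dict[str, List[str]],
) -> Optional[str]:
    """Return the alias text to look up, or None if no regex matches."""
    patterns = regexes_by_field.get(field_name, [])
    if not patterns:
        # No extractor defined: use the whole spreadsheet value.
        return source_value.strip()
    for pattern in patterns:
        match = re.search(pattern, source_value, flags=re.IGNORECASE)
        if match:
            return match.group("alias").strip()
    return None


# Hypothetical extractor for a label such as "Maize: kg produced".
regexes = {"product": [r"^(?P<alias>.+?)\s*:\s*kg produced$"]}
alias = extract_alias("product", "Maize: kg produced", regexes)
```

Falling back to the whole value when no regex is defined keeps the simple case, a label used verbatim as an alias, working without any extractor.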

SpreadsheetLocation
This is really the core of this architecture, as it untangles and enables the scanning and ingestion of individual field values in whichever sequence and structure makes things easiest.
It links a value loaded, parsed, formatted and normalized from a spreadsheet, to the model, field and instance it is to be saved on. (It also stores the regex, source value, alias, etc, that realized the mapping.)

ChoiceAlias and FieldNameAlias
Aliases for choice fields and field names, eg, quantity_produced.

ImportLog
Logs are intercepted and stored here, along with context, such as cell, field, instance, parent instances to date, successfully instantiated instances to date, fields and mappings so far, regex, alias, failed mappings and scans, source value, mapped value, etc, attached to the log message.

ScanLog
Logs every attempt to extract a value for a field from a cell and map it, along with the context listed above.

@chrispreee chrispreee force-pushed the HEA-159/django-structured-ingestion-architecture branch 7 times, most recently from aec2041 to 642aff4 Compare April 8, 2024 14:37
@chrispreee chrispreee changed the title Django structured ingestion architecture, see #HEA-159 Django-structured ingestion architecture, see #HEA-159 Apr 8, 2024
@chrispreee chrispreee force-pushed the HEA-159/django-structured-ingestion-architecture branch from 642aff4 to 14fa8a3 Compare April 8, 2024 14:41
@chrispreee chrispreee requested a review from rhunwicks April 10, 2024 12:37
…nces cardinality assumption bug, implement ImportRun, validate successful_mappings cardinality, fix exception traceback (to add stack traceback to log/s), fix simple value mapper logging, update help text, see #HEA-159
chrispreee commented May 31, 2024

@rhunwicks the mappers automatically use Choice codes and labels for lookups, so this can now map strategy types without any reference data. I've been using MWWRM_30Sep15.xlsx, but any English BSS should run to completion.

To see the LivelihoodActivity importer running, uncomment baseline.importers.LivelihoodStrategyImporter._save_instances to save a factory LivelihoodStrategy that LivelihoodActivityImporter can join to.

(Alternatively, as I was doing, comment out "livelihood_strategy" from LivelihoodActivityImporter.parent_model_fields so that it doesn't try to join to the parent. We could make ingestion depth-first, ie, import child models before the rest of the parent models, to make this clearer, but I thought the logs would be harder to follow.)

Only the fields that have ingest_ methods defined are mapped. Before implementing them all, I'd extract the aliases from the RefData GSheet. To start with, I would use the whole label text as the alias, ie, no BssValueExtractors. This would be trivial to extract from the RefData GSheet into a RefData import file, and would be functionally equivalent to your approach.

We can then iteratively generalize only those labels that generalize best to other cells and BSSes by creating BssValueExtractor instances with regexes.
