Dataform package containing commonly used SQL functions and table definitions, for use with event data streamed to BigQuery using DfE-Digital/dfe-analytics: https://github.com/DFE-Digital/dfe-analytics.
- Set up your Dataform project - see Google documentation.
- Ensure that it is connected to your BigQuery project.
- Ensure that it is synchronised with its own dedicated Github repository.
- Set up a production release configuration - see Google documentation.
- Add the following line within the dependencies block of the package.json file in your Dataform project, replacing the Xs with the latest version number of this package:
"dfe-analytics-dataform": "https://github.com/DFE-Digital/dfe-analytics-dataform/archive/refs/tags/vX.X.X.tar.gz"
It should now look something like:
{
"dependencies": {
"@dataform/core": "X.X.X",
"dfe-analytics-dataform": "https://github.com/DFE-Digital/dfe-analytics-dataform/archive/refs/tags/vX.X.X.tar.gz"
}
}
- Click the 'Install Packages' button on the right hand side of the package.json screen. This will also update package-lock.json automatically.
- Create a file called `includes/data_functions.js` containing the following line:
module.exports = require("dfe-analytics-dataform/includes/data_functions");
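This exposes the package's SQL helper functions under the `data_functions` name in your own SQLX files. As a minimal sketch (the view name, field name and source table here are hypothetical), the `stringToIntegerArray` helper exercised in the tests at the end of this README could be used like this:
-- definitions/example_steps.sqlx
config { type: "view" }
SELECT
  id,
  -- data_functions.stringToIntegerArray generates SQL that parses a string
  -- like '[3,75,2,1]' into an ARRAY<INT64>
  ${data_functions.stringToIntegerArray("step_list_string")} AS steps
FROM
  ${ref("events_foo")}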
- Create a second file called `definitions/dfe_analytics_dataform.js` that looks like the following:
const dfeAnalyticsDataform = require("dfe-analytics-dataform");
// Repeat the lines below for each events table you want dfe-analytics-dataform to process in your Dataform project, distinguishing between them by giving each one a different eventSourceName. The eventSourceName is included in the names of all the tables dfe-analytics-dataform produces automatically, to allow users to tell them apart.
dfeAnalyticsDataform({
eventSourceName: "Short name for your event source here - this might be a short name for your service, for example",
bqEventsTableName: "Your BigQuery events table name here - usually just 'events'",
urlRegex: "www.yourdomainname.gov.uk", // re2-formatted regular expression used to identify whether a URL is this service's own URL or an external one. If your service only has one domain name, set this to 'www.yourdomainname.gov.uk' (without the protocol). If you have more than one, use something like '(?i)(www.domain1.gov.uk|www.domain2.gov.uk|www.domain3.gov.uk)'
expirationDays: false, // replace with the integer number of days you want to delete data in your BigQuery project after, or leave as false if you don't want to enforce a project-wide data retention schedule
dataSchema: [{
entityTableName: "Your entity table name here from your production database analytics.yml",
description: "Description of this entity to include in metadata of denormalised tables produced for this entity.",
keys: [{
keyName: "Your string field name here",
dataType: "string",
description: "Description of this field to include in metadata here."
}, {
keyName: "Your boolean field name here",
dataType: "boolean",
description: "Description of this field to include in metadata here.",
alias: "Name to give this field in queries"
}, {
keyName: "Your date field name here",
dataType: "date",
description: "Description of this field to include in metadata here."
}]
}]
});
- Replace the values of the `eventSourceName`, `bqEventsTableName` and `urlRegex` parameters in this file with the values you need. At this stage compilation errors are normal because you haven't yet run the pipeline for the first time, which creates the output dataset. You will do this later.
- Execute the `events_{eventSourceName}` table from your development workspace with the 'Full Refresh' option enabled. This may take some time to complete. See Google documentation for more information.
- If you are using `dfe-analytics` (not `dfe-analytics-dotnet`), do the same for the `dataSchema` parameter - a JSON value which specifies the schema of the tables, field names and types in your database which are being streamed to BigQuery. Don't include the `id`, `created_at` and `updated_at` fields - they are included automatically. If you're starting from scratch, don't type this out. Instead, to save time, generate a blank `dataSchema` JSON to paste into this file by running the `...data_schema_latest` query in Dataform (in the same way that you did in the previous step for the `events` table). You can do this by copying and pasting the output from the preview of the table this produces in BigQuery. This will automatically attempt to work out table names, field names and data types for each field using data streamed the previous day and today - although you might need to manually add any tables that didn't have entity events streamed yesterday or today, or tweak some data types.
- If you are using `dfe-analytics-dotnet`, replace the entire `dataSchema: [ ... ],` parameter with `transformEntityEvents: false,` - since the .NET port of `dfe-analytics` does not yet stream entity events into BigQuery.
- In the unlikely event that your `events` table is stored in a different BigQuery project to the project named in the `defaultDatabase` parameter in your GCP Dataform release configuration, add the line `bqProjectName: 'name_of_the_bigquery_project_your_events_table_is_in',` just before the line which sets the `bqDatasetName` parameter in `dfe_analytics_dataform.js`.
- If your `events` table is stored in a different dataset to the dataset configured in your GCP Dataform release configuration, add the line `bqDatasetName: "name_of_the_bigquery_dataset_your_events_table_is_in",` just before the line which sets the `bqEventsTableName` parameter in `dfe_analytics_dataform.js`.
- In the unlikely event that one or more of the tables in your database being streamed by `dfe-analytics` has a primary key that is not `id`, add the line `primaryKey: 'name_of_primary_key_for_this_table'` to the element of the `dataSchema` that represents this table, alongside the `entityTableName`.
- If your `dfe-analytics` implementation uses the `namespace` field to distinguish between multiple interfaces or applications that result in data streamed to the same `events` table, add the line `bqEventsTableNameSpace: 'your_namespace_here'` after the line that sets the `bqEventsTableName` parameter. To use `dfe-analytics-dataform` with more than one `bqEventsTableNameSpace`, call `dfeAnalyticsDataform()` once per value of `namespace` - this allows configuration options to differ between namespaces (see the sketch after these steps).
- Commit your changes and merge to the `main`/`master` branch of the Github repository linked to your Dataform repository.
- Leave your development workspace. Run a one-off manual compilation of your production release configuration.
- Run a 'full refresh' execution of your entire pipeline using this release configuration, and resolve any configuration errors this flags (e.g. omissions made when specifying a `dataSchema`).
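Where you have more than one `namespace` (see the step above), a minimal sketch of what `definitions/dfe_analytics_dataform.js` might look like - the event source names, namespaces, domains and schemas here are placeholders:
const dfeAnalyticsDataform = require("dfe-analytics-dataform");
// One call per namespace, so that configuration options can differ between them
dfeAnalyticsDataform({
  eventSourceName: "interface_one",
  bqEventsTableName: "events",
  bqEventsTableNameSpace: "interface_one",
  urlRegex: "www.interface-one.gov.uk",
  dataSchema: [/* schema for the tables streamed by interface one */]
});
dfeAnalyticsDataform({
  eventSourceName: "interface_two",
  bqEventsTableName: "events",
  bqEventsTableNameSpace: "interface_two",
  urlRegex: "www.interface-two.gov.uk",
  dataSchema: [/* schema for the tables streamed by interface two */]
});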
In addition to the parameters set during the setup steps above, you may wish to configure the following options by adding them to the JSON passed to the `dfeAnalyticsDataform()` JavaScript function.
- `bqProjectName` - name of the BigQuery project that your events table is located in. Defaults to the BigQuery project configured in your GCP Dataform release configuration.
- `bqDatasetName` - name of the BigQuery dataset that your events table is located in. Defaults to the BigQuery dataset configured in your GCP Dataform release configuration.
- `bqEventsTableNameSpace` - value of the `namespace` field in the `events` table to filter all data being transformed by before it enters the pipeline. Use this if your `dfe-analytics` implementation uses the `namespace` field to distinguish between multiple interfaces or applications that result in data streamed to the same `events` table in BigQuery. `null` by default. To use `dfe-analytics-dataform` with more than one `bqEventsTableNameSpace`, call `dfeAnalyticsDataform()` once per possible value of `namespace` - this allows configuration options to differ between namespaces.
- `transformEntityEvents` - whether to generate queries that transform entity CRUD events into flattened tables. Boolean (`true` or `false`, without quotes). Defaults to `true` if not specified.
- `customEventSchema` - formatted similarly to `dataSchema`. Used to configure transformation of `dfe-analytics` custom events. See the "Updating your `dataSchema` and `customEventSchema` configurations" section below for details.
- `hiddenPolicyTagLocation` - a string, see the "Hidden fields" section below.
- `enableSessionTables` - whether to generate the `sessions` and `pageviews_with_funnels` tables. Boolean (`true` or `false`, without quotes). Defaults to `true` if not specified.
- `enableSessionDetailsTable` - whether to generate the `session_details` table. Boolean (`true` or `false`, without quotes). Defaults to `true` if not specified.
- `enableMonitoring` - whether to send summary monitoring data to the `monitoring.pipeline_snapshots` table in the cross-service GCP project. Boolean (`true` or `false`, without quotes). Defaults to `true` if not specified.
- `eventsDataFreshnessDays` - number of days after which, if no new events have been received, the `events_data_not_fresh` assertion will fail to alert you to this. Defaults to `1` if not specified.
- `eventsDataFreshnessDisableDuringRange` - if set to `true`, disables the `events_data_not_fresh` assertion if today's date is currently within one of the ranges in `assertionDisableDuringDateRanges`.
- `assertionDisableDuringDateRanges` - an array of day or date ranges within which some assertions will be disabled if other parameters are set to disable them. Each range is a hash containing either the integer values `fromDay`, `fromMonth`, `toDay` and `toMonth`, or the date values `fromDate` and `toDate`. Defaults to an approximation of school holidays each year: `[{fromMonth: 7, fromDay: 25, toMonth: 9, toDay: 1}, {fromMonth: 3, fromDay: 29, toMonth: 4, toDay: 14}, {fromMonth: 12, fromDay: 22, toMonth: 1, toDay: 7}]`.
- `compareChecksums` - `true` or `false`. Defaults to `false`. Enables a currently experimental assertion which fails if checksums and row counts in `entity_table_check` events do not match checksums and row counts in BigQuery.
- `funnelDepth` - number of steps forwards/backwards to analyse in pageview funnels - higher allows deeper analysis, lower reduces CPU usage and cost. Defaults to `10` if not specified.
- `requestPathGroupingRegex` - re2-formatted regular expression to replace with the string 'UID' when grouping web request paths in funnel analysis. Defaults to `'[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*'` if not specified (i.e. replaces unbroken strings of alphanumeric characters that include one or more numeric characters with 'UID').
- `dependencies` - array of strings listing the names of Dataform datasets generated outside `dfe-analytics-dataform` which should be materialised before the datasets generated by `dfe-analytics-dataform`. Defaults to `[]` (an empty array) if not specified.
- `attributionParameters` - list of parameters to extract from the request query as individual fields in funnel analysis (for funnels which are new traffic to the site only). Defaults to `['utm_source', 'utm_campaign', 'utm_medium', 'utm_content', 'gclid', 'gcsrc', 'fbclid', 'dclid', 'gad_source', 'ds_rl', 'gbraid', 'msclkid']` if not specified.
- `attributionDomainExclusionRegex` - re2-formatted regular expression used to detect domain names which should be excluded from attribution modelling - for example, the domain name of an authentication service which never constitutes an external referral to your service. Defaults to `"(?i)(signin.education.gov.uk)"` if not specified.
- `socialRefererDomainRegex` - re2-formatted regular expression used to work out whether an HTTP referer's domain name is a social media site. Defaults to `'(?i)(facebook|twitter|^t.co|linkedin|youtube|pinterest|quora|wechat|weibo|whatsapp|tumblr|reddit|x.com|instagram|tiktok|messenger|telegram|douyin|qq|qzone|josh|teams|skype|tieba|threads|viber|imo|xiaohongshu|line|picsart|bluesky)'` if not specified.
- `searchEngineRefererDomainRegex` - re2-formatted regular expression used to work out whether an HTTP referer's domain name is a search engine (regardless of whether paid or organic). Defaults to `'(?i)(google|bing|yahoo|aol|ask.co|baidu|duckduckgo|dogpile|ecosia|entireweb|exalead|gigablast|hotbot|info.com|lycos|metacrawler|mojeek|qwant|searx|startpage|swisscows|webcrawler|yandex|yippy)'` if not specified.
- `disabled` - `true` or `false`. Defaults to `false`. If set to `true` then calling the package will not do anything.
- `webRequestEventExpirationDays` - integer number of days after which `dfe-analytics-dataform` will delete web request events from your events tables, pageview table and sessions table.
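For illustration, a sketch of a call combining several of these optional parameters with the mandatory ones (the values here are placeholders, not recommendations):
dfeAnalyticsDataform({
  eventSourceName: "foo",
  bqEventsTableName: "events",
  urlRegex: "www.your-service.gov.uk",
  // Optional parameters from the list above
  eventsDataFreshnessDays: 3, // tolerate up to 3 days without new events
  funnelDepth: 5, // shallower funnels to reduce query cost
  enableSessionTables: false, // skip the sessions and pageviews_with_funnels tables
  dependencies: ["some_upstream_dataset"], // materialise this Dataform dataset first
  transformEntityEvents: true,
  dataSchema: [/* ... */]
});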
Users are notified through internal channels when a new version of `dfe-analytics-dataform` is released. To update:
- In your Dataform project, modify your `package.json` file to change the version number in this line from the version you are currently using to the version you wish to update to:
"dfe-analytics-dataform": "https://github.com/DFE-Digital/dfe-analytics-dataform/archive/refs/tags/vX.Y.Z.tar.gz"
- Click on 'Install packages' on the right hand side of this screen in Dataform.
- Commit and merge your changes to your main/master branch.
- Read the release notes provided for each new release between the release you were using and the release you are updating to. These may include additional instructions to follow to update to this version - for example, running a full refresh on particular tables. Follow these instructions for all these releases, in the order that they were released.
- If you haven't already run the whole pipeline in the previous step, run the whole pipeline now in Dataform. A full refresh should not be required unless specified in the release notes.
- Ensure you - and your colleagues - pull the latest version of your code from Github and click 'Install packages' in every Dataform development workspace.
Update your `dataSchema` configuration whenever you wish to change the form that data is transformed into in any of the flattened entity tables generated by `dfe-analytics-dataform`.
Update your `customEventSchema` configuration whenever you wish to transform `dfe-analytics` custom events into a flattened table for each custom event type.
Once you have updated your `dataSchema`, commit your changes, merge to main/master and then re-run your pipeline in Dataform.
If you update your `customEventSchema`, this pipeline run must 'full refresh' the relevant incremental tables in Dataform. See Dataform documentation for more information.
`dataSchema` is a JSON array of objects: `dataSchema: [{}, {}, {}...]`. Each of these objects represents a table in your schema. It has the following attributes:
- `entityTableName`* - the name of the table in your database; a string; mandatory
- `description` - a meta description of the table, which can be a blank string: `''`; mandatory
- `keys` - an array of objects which determines how `dfe-analytics-dataform` will transform each of the fields in the table. Each table listed within `dataSchema` has its own `keys`, i.e.: `dataSchema: [{entityTableName: '', description: '', keys: []}, {entityTableName: '', description: '', keys: []}, {entityTableName: '', description: '', keys: []}...]`
- `primaryKey`* - optional; if the table has a primary key that is not `id`, this should contain the name of the field that is the primary key
- `hidePrimaryKey`* - boolean, see the "Hidden fields" section below
- `dataFreshnessDays` - optional; if set, creates an assertion which fails if no Create, Update or Delete events have been received in the last `dataFreshnessDays` days for this entity
- `dataFreshnessDisableDuringRange` - optional; if set to `true`, disables this assertion if today's date is currently within one of the ranges in `assertionDisableDuringDateRanges`
- `materialisation`* - optional; may be `'view'` or `'table'`. Defaults to `'table'` if not set. Determines whether the `entity_version`, `entity_latest` and `entity_field_updates` tables for this entity will be materialised by Dataform as views or tables. Recommended usage is to set this to `'table'` if these tables will be used more than once a day, or `'view'` if not, to save query costs.
- `expirationDays` - optional; integer number of days after which entity events for this table, or custom events with this type, will be deleted
Each object within each table's set of `keys` determines how `dfe-analytics-dataform` will transform a field within a table in your schema. It has the following attributes:
- `keyName` - name of the field in your database; mandatory
- `dataType` - determines the output data type for this field; may be `'string'`, `'boolean'`, `'integer'`, `'float'`, `'date'`, `'timestamp'` or `'integer_array'`. Defaults to `'string'` if not present.
- `isArray` - `true` or `false`; defaults to `false`. If `true` then the data type will be a `REPEATED` field of data type `dataType`. May not be used with `dataType: 'integer_array'`.
- `description` - a meta description of the field
- `hidden` - a boolean, see the "Hidden fields" section below
- `hiddenPolicyTagLocation` - a string, see the "Hidden fields" section below
- `alias` - a name to give the field in outputs instead of `keyName`
- `pastKeyNames` - an array of strings, see the "Retaining access to data in renamed fields" section below
- `historic` - a boolean, see the "Retaining access to historic fields" section below
- `foreignKeyName` - a string, see the "Primary and foreign key constraints" section below
- `foreignKeyTable` - a string, see the "Primary and foreign key constraints" section below
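Putting the table-level and key-level attributes together, a sketch of a single `dataSchema` entry (all table and field names here are illustrative):
dataSchema: [{
  entityTableName: "candidates",
  description: "Candidates who have registered with the service.",
  primaryKey: "candidate_code", // only needed because the primary key is not 'id'
  materialisation: "view", // materialise as a view because this table is rarely queried
  dataFreshnessDays: 7, // assertion fails if no CRUD events received in the last 7 days
  keys: [{
    keyName: "email_address",
    dataType: "string",
    description: "The candidate's email address.",
    alias: "email" // output this field as 'email' rather than 'email_address'
  }, {
    keyName: "subject_ids",
    dataType: "integer",
    isArray: true, // output as a REPEATED INT64 field
    description: "IDs of subjects this candidate is interested in."
  }]
}]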
An example of a `dataSchema` is included in the installation instructions above and in example.js.
`customEventSchema` shares the same format, except that parameters marked with an asterisk (*) above are not supported for custom events, and the `eventType` parameter must also be specified - this is the `event_type` of the custom events configured in `dfe-analytics`.
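For example, a minimal sketch of a `customEventSchema` entry for a hypothetical custom event type:
customEventSchema: [{
  eventType: "search_performed", // the event_type configured in dfe-analytics
  description: "A user performed a search.",
  keys: [{
    keyName: "search_term",
    dataType: "string",
    description: "The term the user searched for."
  }]
}]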
You must update `dataSchema` whenever a field or table is added to or removed from your `dfe-analytics` `analytics.yml` file (often because it has been added to or removed from your database), changes data type (for example, from `timestamp` to `date`), or changes name.
dfe-analytics-dataform contains two assertions which will cause your scheduled pipeline runs in Dataform to fail if:
- Data has been received about an entity in the database that is missing a field it was expecting. This failure will generate an error and prevent further queries in your pipeline from running.
- Data has been received about an entity in the database that contains a field (or entire table) it was not expecting. This failure will generate an error but will not prevent further queries in your pipeline from running. However, the new field(s) or table will not be included in `dfe-analytics-dataform` output until you update your configuration, and the error will continue to recur.
The output from the assertions in the run logs for the failed run in Dataform will list which field(s) and/or tables are missing or new. You may also wish to check with the rest of your team that the fields you are adding or removing are fields that they were expecting to start or stop streaming, rather than being unexpected behaviour.
In either case, you should update your `dataSchema` configuration in `dfe_analytics_dataform.js` in Dataform to add or remove configuration for that field, following the JSON format above.
You should not usually need to run a full refresh in this scenario. The only exception is if you have added, removed or updated the `primaryKey` attribute for a table in the `dataSchema`. If you have done this then you will need to run a full refresh on the `entity_version` table in Dataform.
`dfe-analytics-dataform` will not require you to update `customEventSchema`. However, if your custom event schema changes, you may wish to update `customEventSchema` to match.
If a field in your database or custom event has been renamed one or more times, and the data type of that field has not changed, you may merge data from that field stored under its previous names by adding the configuration `pastKeyNames: ['past_key_name_1', 'past_key_name_2']` to the key configuration for that field. The value of that field will be handled as if it were called `keyName` if a field with that name is present. If it is not present then the value will be taken from the first entry in `pastKeyNames`; if no field with that name is present either, the value will be taken from the second entry in `pastKeyNames`, and so on. (Behind the scenes, this functions as a SQL `COALESCE()`.)
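For example, a sketch of a key configuration for a hypothetical field that has been renamed twice:
{
  keyName: "family_name", // current name of the field
  dataType: "string",
  description: "The user's family name.",
  // Falls back to these previous names, in order, when family_name is absent -
  // equivalent to COALESCE(family_name, last_name, surname) in SQL
  pastKeyNames: ["last_name", "surname"]
}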
If a field used to be included in custom event data or streamed entity event data for a particular table, but is no longer streamed, you may retain access to that data in historic versions of entities/events in that table by adding the configuration `historic: true` to the key configuration for that field.
This may be useful if a field has been deleted from a table in the database or custom event, but you wish to continue to analyse past versions of that table/event which do contain that field, or changes between past versions of that table/event. For entities, the field will contain a `NULL` value for the latest version of the table.
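A sketch of a key configuration for a hypothetical field that is no longer streamed:
{
  keyName: "legacy_reference", // field removed from the application database
  dataType: "string",
  description: "Reference number used before 2023; no longer streamed.",
  historic: true // keep this field in past versions; NULL in the latest version
}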
On rare occasions an entity in the application database may be created, updated or deleted, but no event streamed to your events table. This will result in missing or out-of-date data in tables generated by `dfe-analytics`.
To detect incomplete or out-of-date tables, follow the `dfe-analytics` documentation to enable nightly checksums to be streamed to your events table, and set the `compareChecksums: true` parameter in the JSON passed to `dfeAnalyticsDataform()`. If events are missing from the entity event stream in BigQuery, the checksum in the event will not match the checksum calculated from the entity event stream in BigQuery. This will cause the `foo_entity_ids_do_not_match` assertion generated by `dfe-analytics-dataform` to fail.
To monitor whether tables have been loaded accurately, query the `entity_table_check_scheduled` table in your output dataset. To monitor whether imports have completed accurately, query the `entity_table_check_import` table in your output dataset.
To correct incomplete or out-of-date tables, ask your developers to follow the `dfe-analytics` documentation to import the latest table contents to BigQuery. It is recommended that you limit this import to tables where checksums do not match.
An import will also cause `dfe-analytics-dataform` to correct occasions when there is no entity deletion event present in the events table for entities which have in fact been deleted from the application database. Entities with IDs which were not included in the import but which still exist as latest versions of the entity in the event stream will be assumed to have been deleted at the time that the checksum for the earliest complete import that did not contain the entity was calculated.
Unless `enableMonitoring: false` is set, `dfe-analytics-dataform` will automatically send a small packet of summary monitoring data to the `monitoring.pipeline_snapshots` table in the cross-service GCP project. This will include details of how many tables in your project have matching checksums, and how many rows are missing/extra in BigQuery when compared to the database.
Similar monitoring for custom event data is not supported because there is no source database to which this data can be compared.
Hidden fields
`dfe-analytics-dataform` is intended for use with the `dfe-analytics` Ruby gem. When used together they have the capability to use BigQuery column-level access control to restrict access to certain 'hidden' fields. `dfe-analytics` is responsible for ensuring that hidden fields are streamed into the `hidden_data` field in your events table, while `dfe-analytics-dataform` is responsible for ensuring that data in that field remains hidden once it has been transformed into the other tables generated by `dfe-analytics-dataform`. If your Dataform pipeline transforms hidden data into fields in other tables which you also wish to hide, you are responsible for attaching policy tags to these fields. Google provides documentation describing how to do this.
To use hidden field functionality in `dfe-analytics-dataform` effectively, you must first:
- Set up IAM roles and policy tags in your GCP project, and attach the policy tags to the correct field in your events table, following the instructions documented here.
- Configure `dfe-analytics` to treat the desired fields as hidden by listing them in your application's `config/analytics_hidden_pii.yml` file as described here.
You must then configure the following `dfe-analytics-dataform` parameters as appropriate:
- `hiddenPolicyTagLocation` - string; optional; top level or key level. The location of a policy tag that you wish to attach to hidden fields, for example `projects/your-project-name/locations/europe-west2/taxonomies/your-taxonomy-id/policyTags/your-policy-tag-id`. You must set this as a top-level parameter when you call `dfeAnalyticsDataform()`. You may in addition set this at key level for individual fields to override the top-level value - for example, if you want to set specific data masking rules on particular fields. You may not override this parameter at key level for your primary key, `created_at` or `updated_at`. If not set then no fields will be hidden.
- `hidePrimaryKey` - boolean; optional; table level. If `true`, the primary key for this table will have the policy tag at `hiddenPolicyTagLocation` attached to it. The primary key is `id` by default, but may be configured via the table-level `primaryKey` parameter. Not supported for custom events as they do not have primary keys.
- `hidden` - boolean; optional; key level. Set this to `true` for an individual key within a table or custom event to ensure that this field has a policy tag attached to it.
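A sketch showing how these three parameters fit together, with a hypothetical policy tag location and field names:
dfeAnalyticsDataform({
  eventSourceName: "foo",
  bqEventsTableName: "events",
  urlRegex: "www.your-service.gov.uk",
  // Top-level policy tag attached to all hidden fields by default
  hiddenPolicyTagLocation: "projects/your-project-name/locations/europe-west2/taxonomies/your-taxonomy-id/policyTags/your-policy-tag-id",
  dataSchema: [{
    entityTableName: "candidates",
    description: "Candidates registered with the service.",
    hidePrimaryKey: true, // attach the policy tag to this table's primary key
    keys: [{
      keyName: "email_address",
      dataType: "string",
      hidden: true // streamed into hidden_data; policy tag attached downstream
    }, {
      keyName: "date_of_birth",
      dataType: "date",
      hidden: true,
      // Key-level override, e.g. to apply different data masking rules
      hiddenPolicyTagLocation: "projects/your-project-name/locations/europe-west2/taxonomies/your-taxonomy-id/policyTags/another-policy-tag-id"
    }]
  }]
});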
dfe-analytics-dataform also contains two assertions which will fail and alert you when the data being streamed into the `data` and `hidden_data` fields in the `events` table does not match the configuration specified in the parameters above:
- `hidden_pii_configuration_does_not_match_entity_events_streamed_yesterday` will fail when events were streamed since the beginning of yesterday which did not match your hidden field configuration. To prevent this failure, change your `dfe-analytics` configuration (the `analytics_hidden_pii.yml` file) and/or your `dfe-analytics-dataform` configuration (the parameters above) so that they match.
- `hidden_pii_configuration_does_not_match_sample_of_historic_entity_events_streamed` will fail when some events in a sample representing 1% of historic events did not match your hidden field configuration. To prevent this failure, update these events to migrate the fields in `data` into `hidden_data` using the `your-eventSourceName-here_migrate_historic_events_to_current_hidden_pii_configuration` stored procedure which `dfe-analytics-dataform` will create in your Dataform output dataset.
If either of these assertions fails, most tables generated by `dfe-analytics-dataform` will not be updated. This is essential to ensure that data in hidden fields is not copied into non-hidden fields in other tables generated by `dfe-analytics-dataform`.
`dfe-analytics-dataform` is only capable of hiding fields in tables that it generates. If data is present in other tables in your BigQuery project, you must ensure that the correct policy tag is attached to these fields.
Google provides documentation describing how to do this for tables generated by Dataform (but not by `dfe-analytics-dataform`).
To locate fields in your GCP project which have names which indicate that they may be human-readable Personally Identifiable Information (PII) or GDPR special category data, you may wish to execute this shell script in the GCP cloud console, from your local machine or within a Github codespace. This will identify all field names anywhere within your GCP project (except datasets beginning `dataform` other than the `dataform` dataset itself, because these are likely Dataform development or assertion datasets) which contain the patterns `email`, `name`, `dob`, `birth`, `national`, `insurance`, `trn`, `phone`, `postcode`, `address`, `mobile`, `passport`, `driv`, `ethni`, `religio`, `union`, `sex`, `gender`, `orientation` or `disabilit` (case insensitive), but do not have a policy tag attached.
`dfe-analytics-dataform` automatically applies BigQuery primary key constraints to either the `id` field or the field set as the primary key with the table-level `primaryKey` parameter.
Optionally, it also allows configuration of BigQuery foreign key constraints.
These constraints may accelerate `JOIN` performance. See here for more information.
To configure these, add parameters to the configuration for that key in your `dataSchema` so that it looks like:
{
keyName: "other_thing_id",
dataType: "string",
description: "Description of this field to include in metadata here.",
foreignKeyTable: "other_things", // The name of the table the other_thing_id foreign key in this table relates *to* in your application database
foreignKeyName: "other_thing_id" // The name of the primary key in the foreignKeyTable table the other_thing_id foreign key relates to.
}
If `foreignKeyName` is omitted, it defaults to `id`.
To configure a data retention schedule for data processed by `dfe-analytics-dataform`, add the `expirationDays` and/or `webRequestEventExpirationDays` parameter(s) to your `dfeAnalyticsDataform()` call in Dataform, as in the following example:
dfeAnalyticsDataform({
eventSourceName: "patent-a-biscuit-type",
bqEventsTableName: "events",
urlRegex: "www.patent-a-biscuit-type.gov.uk",
expirationDays: 2555, // set to the integer number of days you want to delete data in your BigQuery project after, or leave as false if you don't want to enforce a project-wide data retention schedule
webRequestEventExpirationDays: 365, // set to the integer number of days you want to delete web requests in your events table after
dataSchema: [{
entityTableName: "biscuit_types",
expirationDays: 365, // set to the integer number of days you want to delete entity events for this entityTableName after
keys: [{
keyName: "name",
dataType: "string"
}, {
keyName: "contains_chocolate",
dataType: "boolean"
}, {
keyName: "inventor_email_address",
dataType: "date"
}]
}],
customEventSchema: [{
eventType: "view_picture",
expirationDays: 365, // set to the integer number of days you want to delete custom events for this eventType after
keys: [{
keyName: "biscuit_type_id",
dataType: "string"
}]
}]
});
If a data retention schedule has been configured via the top-level, custom event-level or entity-level `expirationDays` parameter and/or the `webRequestEventExpirationDays` parameter, `dfe-analytics-dataform` will delete data in tables generated by `dfe-analytics-dataform` to the extent that it is possible to prove that the data was last updated the appropriate number of days ago. This includes deleting:
- Events in the source events table or output events table which are older than `expirationDays`; entity create/update/delete/import events which are older than the `expirationDays` for that entity; custom events which are older than the `expirationDays` for that custom event type; and web request events which are older than `webRequestEventExpirationDays`
- Versions of entities in the `entity_version` table which are older than either the top-level or entity-level `expirationDays` parameter (whichever is more recent)
- Updates to fields and versions of entities in the flattened entity field updates, version and latest tables which are older than either the top-level or entity-level `expirationDays` parameter (whichever is more recent)
- Custom events in flattened custom events tables which are older than either the top-level or custom event-level `expirationDays` parameter (whichever is more recent)
- Pageviews and sessions in the `pageview_with_funnels`, `sessions` and `session_details` tables which are older than `expirationDays` or `webRequestEventExpirationDays` (whichever is more recent)
- Checksum calculations in the `entity_table_check_import` and `entity_table_check_scheduled` tables which are older than `expirationDays`
Where possible, deletion occurs by means of partition or table expiry to ensure that data deletion occurs even if the Dataform pipeline does not run. If this is not possible then in many cases tables are recreated from source each night, which has the effect of deleting data where it has been deleted from source. In the remaining cases `dfe-analytics-dataform` will run `DELETE` DML statements to achieve this.
For each occurrence of `dfeAnalyticsDataform()` in `definitions/dfe_analytics_dataform.js`, this package will create the following automatically in your Dataform project. You can view and manage these within the Dataform UI by opening `definitions/dfe_analytics_dataform.js`.
The names of these will vary depending on the `eventSourceName` you have specified. For example, if your `eventSourceName` was `foo` then the following will be created:
- An incremental table called `events_foo`, which you can access within a Dataform declaration using something like `SELECT * FROM ${ref("events_foo")}`. This table will include all the events `dfe-analytics` streamed into the raw events table, filtered by namespace (if you configured this). It will also include details of the browser and operating system of the user who caused these events to be streamed, and will attach web request data (like the request path) to all events in the table, not just the web requests.
- An assertion called `foo_events_data_not_fresh` which will fail if no new events have been streamed in the last day (or the number of days specified in your `eventsDataFreshnessDays` parameter).
- An incremental table called `foo_entity_version`, containing each version of every entity in the database over time, with a `valid_from` and `valid_to` timestamp.
- A table called `foo_data_schema_json_latest`, which is a default `dataSchema` JSON you could use to get started specifying this in `dfe_analytics_dataform.js`.
- For each `entityTableName` you specified in `dataSchema`, like `bar`, tables or views (depending on the `materialisation` set) called something like `bar_version_foo` and `bar_latest_foo`. `bar_version_foo` is a denormalised ('flattened') version of `foo_entity_version`, flattened according to the schema for `bar` you specified in `dataSchema`. `bar_latest_foo` is the same as `bar_version_foo` except that it only includes the latest version of each entity (i.e. with `valid_to IS NULL`). Both tables and the fields within them will have metadata set to match the descriptions set in `dataSchema`.
- For each `eventType` you specified in `customEventSchema`, like `bar`, an incremental table called something like `bar_foo`. `bar_foo` is a denormalised ('flattened') version of `events_foo`, flattened according to the schema for `bar` you specified in `customEventSchema`. Metadata will be set to match the descriptions set in `customEventSchema`.
- Table-valued functions called something like `bar_at_foo` which can be used to generate a past version of a table at a specified timestamp (see the usage sketch after this list).
- If you have configured an entity-level data freshness assertion by including an entity-level `dataFreshnessDays` parameter in your `dataSchema`, an assertion called something like `bar_data_not_fresh_foo` which fails if no new Create, Update or Delete events have been received in the last `dataFreshnessDays` days.
- If you have set `compareChecksums: true`, an assertion called something like `foo_entity_ids_do_not_match` which fails if a checksum or a row count in an `entity_table_check` event streamed yesterday does not match the corresponding checksum and/or row count in the corresponding data in BigQuery.
- An assertion called something like `foo_entity_import_ids_do_not_match` which fails if a checksum or a row count in an `import_entity_table_check` event streamed yesterday does not match the corresponding checksum and/or row count in the corresponding batch of import events.
- Tables called `entity_table_check_import_foo` and `entity_table_check_scheduled_foo` which allow you to monitor data load completeness by comparing checksums and row counts sent from `dfe-analytics` to checksums and row counts calculated from your events table.
- Assertions to help spot when your `dataSchema` has become out of date or has a problem. These will tell you if `foo_entities_are_missing_expected_fields` or if `foo_unhandled_field_or_entity_is_being_streamed`. The former will halt your pipeline from executing, while the latter will just alert you to the assertion failure.
- A table called `foo_entity_field_updates`, which contains one row for each time a field was updated for any entity that is streamed as events from the database, setting out the name of the field, the previous value of the field and the new value of the field. Entity deletions and updates to any `updated_at` fields are not included, but `NULL` values are.
- For each `entityTableName` you specified in `dataSchema`, like `bar`, a table called something like `bar_field_updates_foo`. `bar_field_updates_foo` is a denormalised ('flattened') version of `foo_entity_field_updates`, filtered down to the entity `bar`, and with the new and previous values of that entity flattened according to the schema for `bar` you specified in `dataSchema`. Fields will have metadata set to match the descriptions set in `dataSchema`.
- An incremental table called `pageview_with_funnels_foo`, which contains pageview events from the events table, along with two ARRAYs of STRUCTs containing a number of pageviews in either direction for use in funnel analysis. This number of pageviews is determined by the `funnelDepth` parameter you may optionally call `dfeAnalyticsDataform()` with. By default `funnelDepth` is 10. Disabled if `enableSessionTables` is `false`.
- A table called `sessions_foo`, which contains rows representing user sessions with attribution fields (e.g. medium, referer_domain) for each session. Includes the `session_started_at` and `next_session_started_at` timestamps to allow attribution modelling of a goal conversion that occurred between those timestamps. Disabled if `enableSessionTables` is `false`.
- A table called `session_details_foo`, which contains rows representing user sessions with attribution fields (e.g. medium, referer_domain) for each session and an array of page views that occurred within the session. This table uses the Google Analytics definition of a session: a session is a group of user interactions with the website that occur continuously without a break of more than 30 minutes of inactivity, or until the user navigates away from the site. Disabled if `enableSessionDetailsTable` is `false`.
- A table called `dfe_analytics_configuration_foo` which contains details of the configuration of `dfe-analytics` for `foo` over time, with `valid_from` and `valid_to` fields.
- A stored procedure called `migrate_foo_historic_events_to_current_hidden_pii_configuration` in your output Dataform dataset. You can invoke this to migrate past entity CRUD/import events to ensure that fields in the `data` and `hidden_data` arrays in your source events table and `events_foo` table are in the array field that matches the hidden field configuration in your `dataSchema`.
- Assertions called `foo_hidden_pii_configuration_does_not_match_entity_events_streamed_yesterday` and `foo_hidden_pii_configuration_does_not_match_sample_of_historic_entity_events_streamed`. See the "Hidden fields" section above for more information.
- An assertion called `foo_entities_have_not_been_backfilled` which fails if no `import_entity_table_check` or `import_entity` events exist in the source events table for a particular `entity_table_name` configured in `dataSchema`. Likely causes of this are not running an import as part of `dfe-analytics` installation, not running an import on a newly created table in the application database, or deletion of the latest import of the table due to enforcement of a data retention schedule.
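As referenced in the list above, the table-valued functions can be queried directly in BigQuery. A sketch, assuming the function's single argument is the timestamp of interest (the output dataset name and timestamp here are placeholders):
SELECT
  *
FROM
  -- Reconstruct the bar table as it stood at midnight on 1 January 2024
  your_dataform_output_dataset.bar_at_foo(TIMESTAMP "2024-01-01 00:00:00 UTC")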
Tests using the Jest framework are filed in the `tests` folder.
Tests are run locally using `npm run test` to run all tests and `npm run test <file>` to run a single test file.
The tests were written retrospectively; going forward, all new functions should have a matching test.
The tests follow a basic pattern. Most of the functions return a SQL query, accepting variable arguments and occasionally a boolean flag.
// require or mock functions
const dataFunctions = require('../includes/data_functions');
// describe the function to be tested
describe('stringToIntegerArray', () => {
it('should generate the correct SQL query for a valid input string', () => {
// Provide arguments
const input = '[3,75,2,1]';
// Provide the expected result, this can be deduced from either the actual
// function under test, or by running the test file itself which calls the
// function as below, then has an expect(result).toBe(expected) call.
// If the result does not match, the test output will provide the SQL for
// comparison.
const expectedSQL = `ARRAY(
SELECT
step_int
FROM
(
SELECT
SAFE_CAST(step AS INT64) AS step_int
FROM
UNNEST(
SPLIT(
TRIM(
${input},
"[]"
),
","
)
) AS step
)
WHERE
step_int IS NOT NULL
)`;
// call the function
const result = dataFunctions.stringToIntegerArray(input);
// check the expectation is true
expect(result).toBe(expectedSQL);
  });
});
Full documentation can be found on the Jest website.