diff --git a/docs/airflow/dags-maintenance.md b/docs/airflow/dags-maintenance.md
index e8d87c90a6..baf64173ff 100644
--- a/docs/airflow/dags-maintenance.md
+++ b/docs/airflow/dags-maintenance.md
@@ -44,22 +44,13 @@ Failures can be cleared (re-run) via the Airflow user interface ([accessible via
 [This Airflow guide](https://airflow.apache.org/docs/apache-airflow/stable/ui.html) can help you use and interpret the Airflow UI.

-### Deprecated DAGs
-
-The following DAGs may still be listed in the Airflow UI even though they are **deprecated or indefinitely paused**. They never need to be re-run. (They show up in the UI because the Airflow database has historical DAG/task entries even though the code has been deleted.)
-
-- `amplitude_benefits`
-- `check_data_freshness`
-- `load-sentry-rtfetchexception-events`
-- `unzip_and_validate_gtfs_schedule`
-
 ## `PodOperators`

-When restarting a failed `PodOperator` run, check the logs before restarting. If the logs show any indication that the prior run's pod was not killed (for example, if the logs cut off abruptly without showing an explicit task failure), you should check that the pod associated with the failed run task has in fact been killed before clearing or restarting the Airflow task. If you don't know how to check a pod status, please ask in the `#data-infra` channel on Slack before proceeding.
+Before restarting a failed run of a DAG that uses a `PodOperator`, check the logs. If they show any indication that the prior run's pod was not killed (for example, if the logs cut off abruptly without showing an explicit task failure), confirm that the [Kubernetes pod](https://kubernetes.io/docs/concepts/workloads/pods/) associated with the failed task run has in fact been killed before clearing or restarting the Airflow task. Users with access to Kubernetes Engine in Google Cloud can check for any live workloads that correspond to the pod referenced in the failed Airflow task's run logs.
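+
+A minimal sketch of this kind of check is below. It assumes the [`kubernetes` Python client](https://github.com/kubernetes-client/python) is installed and that your kubeconfig already points at the cluster (for example via `gcloud container clusters get-credentials`); the namespace and name prefix are hypothetical placeholders, not actual Cal-ITP values.
+
+```python
+# Sketch: list pods in a namespace and report the status of any that match
+# the pod name shown in the failed Airflow task's run logs.
+from kubernetes import client, config
+
+config.load_kube_config()  # uses your local kubeconfig/credentials
+core = client.CoreV1Api()
+
+POD_NAME_PREFIX = "my-failed-task"  # hypothetical; copy from the task logs
+for pod in core.list_namespaced_pod(namespace="airflow-jobs").items:
+    if pod.metadata.name.startswith(POD_NAME_PREFIX):
+        print(pod.metadata.name, pod.status.phase)  # e.g. Running, Succeeded, Failed
+```
+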

 ## Backfilling from the command line

-From time-to-time some DAGs may need to be re-ran in order to populate new data.
+From time to time, some DAGs may need to be re-run in order to populate new data.

 Subject to the considerations outlined above, backfilling can be performed by clearing historical runs in the web interface, or via the CLI:

diff --git a/docs/architecture/architecture_overview.md b/docs/architecture/architecture_overview.md
index c126365618..2c4b3335ea 100644
--- a/docs/architecture/architecture_overview.md
+++ b/docs/architecture/architecture_overview.md
@@ -53,6 +53,10 @@ This documentation outlines two ways to think of this system and its components
 - [Services](services) that are deployed and maintained (ex. Metabase, JupyterHub, etc.)
 - [Data pipelines](data) to ingest specific types of data (ex. GTFS Schedule, Payments, etc.)

+Outside of this documentation, several READMEs cover initial development environment setup for new users. The [/warehouse README](https://github.com/cal-itp/data-infra/blob/main/warehouse) and the [/airflow README](https://github.com/cal-itp/data-infra/blob/main/airflow) in the Cal-ITP data-infra GitHub repository are both essential starting points for getting up and running as a contributor to the Cal-ITP code base. The [repository-level README](https://github.com/cal-itp/data-infra) covers some important configuration steps and social practices for contributors.
+
+NOTE: sections of the /warehouse README discussing installation and use of JupyterHub are likely to be less relevant to infrastructure, pipeline, package, image, and service development than they are to analysts who work primarily with tables in the warehouse. Most contributors doing development work on Cal-ITP tools and infrastructure use a locally installed IDE like VS Code rather than the hosted JupyterHub environment, which is tailored to analysis tasks and is somewhat limited for development and testing. Some documentation on this site and in the repository is written for a shared audience of developers and analysts, so expect occasional references to JupyterHub even when it is not a core requirement for the type of work being discussed.
+
 ## Environments

 Across both data and services, we often have a "production" (live, end-user-facing) environment and some type of testing, staging, or development environment.
diff --git a/docs/architecture/data.md b/docs/architecture/data.md
index 47b5dd3f5e..da946d79ae 100644
--- a/docs/architecture/data.md
+++ b/docs/architecture/data.md
@@ -2,12 +2,18 @@

 # Data pipelines

-In general, our data ingest follows versions of the pattern diagrammed below. For an example PR that ingests a brand new data source from scratch, see [data infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376).
+In general, our data ingest follows customized versions of a consistent pattern:

-Some of the key attributes of our approach:
+1. Sync raw data into Google Cloud Storage (GCS), and parse it into a BigQuery-readable form
+2. Create [external tables](https://cloud.google.com/bigquery/docs/external-tables) in BigQuery that read the parsed data from GCS
+3. Model and transform the resulting tables using [dbt](https://docs.getdbt.com/docs/introduction)

-- We generate an [`outcomes`](https://github.com/cal-itp/data-infra/blob/main/packages/calitp-data-infra/calitp_data_infra/storage.py#L418) file describing whether scrape, parse, or validate operations were successful. This makes operation outcomes visible in BigQuery, so they can be analyzed (for example: how long has the download operation for X feed been failing?)
-- We try to limit the amount of manipulation in Airflow tasks to the bare minimum to make the data legible to BigQuery (for example, replace illegal column names that would break the external tables.) We use gzipped JSONL files in GCS as our default parsed data format.
+That pattern is diagrammed and discussed in more detail below. For an example PR that ingests a brand new data source from scratch, see [data infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376).
+
+Some of the key attributes of our approach, shared across data sources:
+
+- We generate an [`outcomes`](https://github.com/cal-itp/data-infra/blob/main/packages/calitp-data-infra/calitp_data_infra/storage.py#L372) file at each ingestion step describing whether scrape, parse, or validate operations were successful. This makes operation outcomes visible in BigQuery, so they can be analyzed (for example: how long has the download operation for X feed been failing?) A sketch of this kind of record appears below.
+- We try to limit the amount of data manipulation in Airflow tasks to the bare minimum required to make the data legible to BigQuery (for example, replacing illegal column names that would break the external tables). We use gzipped JSONL files in GCS as our default parsed data format. Data transformation is generally handled downstream via dbt, rather than as part of the initial pipeline.
 - [External tables](https://cloud.google.com/bigquery/docs/external-data-sources#external_tables) provide the interface between ingested data and BigQuery modeling/transformations.
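+
+The exact schema for these outcome records is defined in [`storage.py`](https://github.com/cal-itp/data-infra/blob/main/packages/calitp-data-infra/calitp_data_infra/storage.py); purely as an illustration of the idea (the field names here are hypothetical, not the real schema), an outcome record is just a small JSON document written alongside the data it describes:
+
+```python
+# Hypothetical illustration of an operation-outcome record; the real models
+# live in calitp_data_infra/storage.py and differ in their exact fields.
+import gzip
+import json
+from datetime import datetime, timezone
+
+outcome = {
+    "step": "parse",  # e.g. scrape / parse / validate
+    "success": False,
+    "exception": "HTTP 500 from upstream feed",
+    "ts": datetime.now(timezone.utc).isoformat(),
+}
+
+# Written as gzipped JSONL so an external table can expose it to BigQuery.
+with gzip.open("outcomes.jsonl.gz", "wt") as f:
+    f.write(json.dumps(outcome) + "\n")
+```
+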

 While many of the key elements of our architecture are common to most of our data sources, each data source has some unique aspects as well. [This spreadsheet](https://docs.google.com/spreadsheets/d/1bv1K5lZMnq1eCSZRy3sPd3MgbdyghrMl4u8HvjNjWPw/edit#gid=0) details overviews by data source, outlining the specific code/resources that correspond to each step in the general data flow shown below.
@@ -119,40 +125,40 @@ class gcs_label1,gcs_label2,gcs_label3,bq_label1,bq_label2 group_labelstyle
 Adding a new data source based on the architecture described above involves several steps, outlined below.

 ```{note}
-If you're bringing in data that is similar to existing data (for example, a new subset of an existing dataset like a new Airtable or Littlepay table), you should follow the existing pattern for that dataset. [This spreadsheet](https://docs.google.com/spreadsheets/d/1bv1K5lZMnq1eCSZRy3sPd3MgbdyghrMl4u8HvjNjWPw/edit#gid=0) gives overviews of some prominent existing data sources, outlining the specific code/resources that correspond to each step in the [general data flow](data-ingest-diagram) for that data source.
+If you're bringing in data that is similar to existing data (for example, a new subset of an existing dataset like a new Airtable or Littlepay table), you should follow the existing patterns for that dataset. [This spreadsheet](https://docs.google.com/spreadsheets/d/1bv1K5lZMnq1eCSZRy3sPd3MgbdyghrMl4u8HvjNjWPw/edit#gid=0) gives overviews of some prominent existing data sources, outlining the specific code/resources that correspond to each step in the [general data flow](data-ingest-diagram) for that data source.
 ```

-### Determine upstream source type
+### 0. Decide on an approach

-To determine the best storage location for your raw data (especially if it requires manual curation), consult the [Data Collection and Storage Guidance within the Cal-ITP Data Pipeline Google Doc](https://docs.google.com/document/d/1-l6c99UUZ0o3Ln9S_CAt7iitGHvriewWhKDftESE2Dw/edit).
+To determine the most appropriate ingest approach and storage location for your raw data (especially if that data requires manual curation), consult the [Data Collection and Storage Guidance within the Cal-ITP Data Pipeline Google Doc](https://docs.google.com/document/d/1-l6c99UUZ0o3Ln9S_CAt7iitGHvriewWhKDftESE2Dw/edit).

 The [Should it be a dbt model?](tool_choice) docs section also has some guidance about when a data pipeline should be created.

-### Bring data into Google Cloud Storage
+### 1. Bring data into Google Cloud Storage

-We store our raw, un-transformed data in Google Cloud Storage, usually in perpetuity, to ensure that we can always recover the raw data if needed.
+We store our raw, un-transformed data in Google Cloud Storage, usually in perpetuity, to ensure that we can always recover the raw data if needed. This allows us to fully re-process and re-transform historical data rather than relying solely on old dashboards and reports, which is a powerful way to create new uses for old data over time.

-We store data in [hive-partitioned buckets](https://cloud.google.com/bigquery/docs/hive-partitioned-queries#supported_data_layouts) so that data is clearly labeled and partitioned for better performance. We use UTC dates and timestamps in hive paths (for example, for the timestamp of the data extract) for consistency.
+We store data in [hive-partitioned buckets](https://cloud.google.com/bigquery/docs/hive-partitioned-queries#supported_data_layouts) so that data is clearly labeled and partitioned for better performance, as discussed more in the next section. We use UTC dates and timestamps in hive paths (for example, for the timestamp of the data extract) for consistency across all data sources in the ecosystem.
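+
+As a purely illustrative example (the bucket, table, and partition names below are made-up placeholders, not real Cal-ITP paths), a hive-partitioned path for a single extract might be constructed like this:
+
+```python
+# Illustration of a hive-partitioned GCS path with UTC date/timestamp partitions.
+# Bucket and dataset names are hypothetical placeholders.
+from datetime import datetime, timezone
+
+extract_ts = datetime.now(timezone.utc)
+path = (
+    "gs://calitp-example-data-raw/my_feed/"
+    f"dt={extract_ts.date().isoformat()}/"
+    f"ts={extract_ts.isoformat()}/"
+    "my_feed.jsonl.gz"
+)
+print(path)
+# e.g. gs://calitp-example-data-raw/my_feed/dt=2024-05-01/ts=2024-05-01T14:03:52.123456+00:00/my_feed.jsonl.gz
+```
+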

 You will need to set up a way to bring your raw data into the Cal-ITP Google Cloud Storage environment. Most commonly, we use [Airflow](https://airflow.apache.org/) for this.

-The [Airflow README in the data-infra repo](https://github.com/cal-itp/data-infra/tree/main/airflow#readme) has information about how to set up Airflow locally for testing and how the Airflow project is structured.
+The [Airflow README in the data-infra repo](https://github.com/cal-itp/data-infra/tree/main/airflow#readme) has information about how to set up Airflow locally for testing and how the Cal-ITP Airflow project is structured.

 We often bring data into our environment in two steps, created as two separate Airflow [DAGs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html):

 - **Sync the fully-raw data in its original format:** See for example the changes in the `airflow/dags/sync_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files) (note: this example is typical in terms of its overall structure and use of Cal-ITP storage classes and methods, but the specifics of how to access and request the upstream data source will vary). We do this to preserve the raw data in its original form. This data might be saved in a `calitp--raw` bucket.
-- **Convert the saved raw data into a BigQuery-readable gzipped JSONL file:** See for example the changes in the `airflow/dags/parse_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files). This prepares the data is to be read into BigQuery. **Conversion here should be limited to the bare minimum needed to make the data BigQuery-compatible, for example converting column names that would be invalid in BigQuery and changing the file type to gzipped JSONL.** This data might be saved in a `calitp--parsed` bucket.
+- **Convert the saved raw data into a BigQuery-readable gzipped JSONL file:** See for example the changes in the `airflow/dags/parse_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files). This prepares the data to be read into BigQuery. **Conversion here should be limited to the bare minimum needed to make the data BigQuery-compatible, such as converting column names that would be invalid in BigQuery and changing the file type to gzipped JSONL.** Conversion to JSONL is widespread across Cal-ITP pipelines because that format is easy for BigQuery external tables to read in the next step of the ingest process, while also supporting complex or nested data structures. This data might be saved in a `calitp--parsed` bucket. A sketch of this kind of conversion appears after the note below.

 ```{note}
 When you merge a pull request creating a new Airflow DAG, that DAG will be paused by default. To start the DAG, someone will need to log into [the Airflow UI (requires Composer access in Cal-ITP Google Cloud Platform instance)](https://b2062ffca77d44a28b4e05f8f5bf4996-dot-us-west2.composer.googleusercontent.com/home) and unpause the DAG.
 ```
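+
+As a rough sketch of the "parse" step (this is not the actual Cal-ITP parsing code; the file names and column-cleanup rule are simplified assumptions, and real parse DAGs use shared Cal-ITP storage classes), converting a raw CSV into a gzipped JSONL file with BigQuery-safe column names might look like:
+
+```python
+# Simplified sketch: read a raw CSV and write gzipped JSONL with sanitized
+# column names so the file can back a BigQuery external table.
+import csv
+import gzip
+import json
+import re
+
+
+def sanitize(name: str) -> str:
+    # Replace characters BigQuery won't accept in column names.
+    return re.sub(r"[^a-zA-Z0-9_]", "_", name)
+
+
+with open("raw_extract.csv", newline="") as raw, gzip.open(
+    "parsed_extract.jsonl.gz", "wt"
+) as parsed:
+    for row in csv.DictReader(raw):
+        record = {sanitize(key): value for key, value in row.items()}
+        parsed.write(json.dumps(record) + "\n")
+```
+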

-### Create external tables
+### 2. Create external tables

-We use [external tables](https://cloud.google.com/bigquery/docs/external-data-sources#external_tables) to allow BigQuery to query data stored in Google Cloud Storage. External tables do not move data into BigQuery; they simply define the data schema which BigQuery can then use to access the data still stored in Google Cloud Storage.
+We use [external tables](https://cloud.google.com/bigquery/docs/external-data-sources#external_tables) to allow BigQuery to query data stored in Google Cloud Storage. External tables do not move data into BigQuery; they simply define the data schema which BigQuery can then use to access the data still stored in Google Cloud Storage. Because we used hive-partitioned file naming conventions in the previous step, BigQuery can save significant resources when querying external tables: when a query filters on a partition field (like `dt`), BigQuery only scans the matching simulated subfolders in the GCS bucket.

 External tables are created by the [`create_external_tables` Airflow DAG](https://github.com/cal-itp/data-infra/tree/main/airflow/dags/create_external_tables) using the [ExternalTable custom operator](https://github.com/cal-itp/data-infra/blob/main/airflow/plugins/operators/external_table.py). Testing guidance and example YAML for how to create your external table is provided in the [Airflow DAG documentation](https://github.com/cal-itp/data-infra/tree/main/airflow/dags/create_external_tables#create_external_tables).

-### dbt modeling
+### 3. dbt modeling

-Considerations for dbt modeling are outlined on the [Developing models in dbt](developing-dbt-models) page.
+Once the parsed raw files are readable through external tables, those external tables are generally not queried directly by analysts or dashboards. Instead, they are made useful through modeling and transformations managed by dbt, a tool that enhances SQL-based workflows with concepts from software engineering, like version control. Guidance for dbt modeling is outlined on the [Developing models in dbt](developing-dbt-models) page of this docs site.
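+
+To make the partition-pruning point in step 2 concrete, here is a minimal example of a partition-filtered query against an external table using the `google-cloud-bigquery` Python client. The project, dataset, and table names are hypothetical placeholders, and it assumes `dt` is a DATE partition column:
+
+```python
+# Minimal example of a partition-filtered query against an external table.
+# Project, dataset, and table names below are hypothetical placeholders.
+from google.cloud import bigquery
+
+client = bigquery.Client(project="my-gcp-project")
+
+query = """
+    SELECT COUNT(*) AS n
+    FROM `my-gcp-project.external_my_feed.records`
+    WHERE dt = DATE '2024-05-01'  -- partition filter limits which GCS files are scanned
+"""
+for row in client.query(query).result():
+    print(row["n"])
+```
+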
diff --git a/docs/architecture/images_and_packages.md b/docs/architecture/images_and_packages.md
index 1c345b0382..758e470405 100644
--- a/docs/architecture/images_and_packages.md
+++ b/docs/architecture/images_and_packages.md
@@ -4,7 +4,7 @@ Within Cal-ITP, we publish several Python packages and Docker images that are th
 Some images and packages manage dependencies via traditional requirements.txt files, and some manage dependencies via [Poetry `pyproject.toml` files](https://python-poetry.org/docs/pyproject/). Please refer to Poetry documentation for successful management of pyproject dependencies.

-READMEs describing the individual testing and publication process for each image and package are linked in the below table.
+READMEs describing the individual testing and publication process for each image and package are linked in the below table. A detailed guide for updating the calitp-data-analysis package is available [here](https://docs.calitp.org/data-infra/analytics_tools/python_libraries.html#updating-calitp-data-analysis), written for an analyst audience.

 | Name | Function | Source Code | README | Publication URL | Type |
 | ----------------------- | ------------------------------------------------ | ------------------------------------- | ----------------------------- | ------------------------- | -------------- |
diff --git a/docs/architecture/services.md b/docs/architecture/services.md
index bd96778e29..3f4c92ba3c 100644
--- a/docs/architecture/services.md
+++ b/docs/architecture/services.md
@@ -4,7 +4,7 @@ Many services and websites are deployed as part of the Cal-ITP ecosystem, mainta
 With the exception of Airflow, which is [managed via Google Cloud Composer](https://github.com/cal-itp/data-infra/tree/main/airflow#upgrading-airflow-itself), changes to the services discussed here are deployed via CI/CD processes that run automatically when new code is merged to the relevant Cal-ITP repository. These CI/CD processes are not all identical - different services have different testing steps that run when a pull request is opened against the services's code. Some services undergo a full test deployment when a PR is opened, some report the changes that a subject [Helm chart](https://helm.sh/docs/topics/charts/) will undergo upon merge, and some just perform basic linting.

-READMEs describing the individual testing and deployment process for each service are linked in the below table, and [the CI README](https://github.com/cal-itp/data-infra/tree/main/ci/README.md) provides some more general context for Kubernetes-based deployments.
+READMEs describing the individual testing and deployment process for each service are linked in the below table, and [the CI README](https://github.com/cal-itp/data-infra/tree/main/ci/README.md) provides some more general context for Kubernetes-based deployments. Many services are monitored via Sentry, discussed in [a later section](#error-monitoring-through-sentry).

 | Name | Function | URL | Source code and README (if present) | K8s namespace | Development/test environment? | Service Type |
 | ----------------- | ------------------------------------------------ | ----------------------------- | ----------------------------------- | ------------------ | -------------------------------- | ------------------------------ |
@@ -80,3 +80,19 @@ classDef default fill:white, color:black, stroke:black, stroke-width:1px
 classDef group_labelstyle fill:#cde6ef, color:black, stroke-width:0px
 class repos_label,kubernetes_label,netlify_label,github_pages_label group_labelstyle
 ```
+
+## Monitoring running services
+
+(error-monitoring-through-sentry)=
+
+### Error monitoring through Sentry
+
+A subset of our services and sites send error information to the Cal-ITP Sentry instance, which groups errors based on criteria we define in order to identify and track new errors, regressions, and intermittent service issues. A runbook discussing daily triage of the events logged in Sentry is available [here](https://github.com/cal-itp/data-infra/blob/main/runbooks/workflow/sentry-triage.md), and general documentation for self-hosted Sentry instances like ours is available [here](https://develop.sentry.dev/self-hosted/).
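+
+For services that use the Python SDK, wiring a service up to Sentry is typically a small amount of code. The snippet below is a generic illustration (the DSN is a placeholder; each monitored service has its own project DSN and configuration), not the configuration of any specific Cal-ITP service:
+
+```python
+# Generic illustration of reporting errors to a self-hosted Sentry instance.
+# The DSN below is a placeholder; real services each use their own project DSN.
+import sentry_sdk
+
+sentry_sdk.init(
+    dsn="https://examplekey@sentry.calitp.org/1",
+    environment="production",
+)
+
+try:
+    1 / 0  # stand-in for real work that might fail
+except ZeroDivisionError as err:
+    sentry_sdk.capture_exception(err)  # becomes an event on a Sentry issue
+```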
+
+### Cost and performance monitoring
+
+In addition to standard Google Cloud tooling for monitoring specific services like Kubernetes Engine, Composer, and BigQuery, we maintain a couple of dashboards that make it easier to glean at-a-glance insights and report topline information to stakeholders who don't have the time, access, or knowledge to dig into GCP monitoring directly.
+
+The [BigQuery overview dashboard](https://dashboards.calitp.org/dashboard/76-bigquery-overview-dashboard?principal_email_substring=&time_window=past7days) gives a detailed view of daily and monthly BigQuery costs, and helps identify tables and dbt models that are producing greater build costs, reference costs, and query costs than others.
+
+In a prior staffing configuration, a person was assigned to review the [Cal-ITP System Performance and Outcomes Monitoring dashboard](https://dashboards.calitp.org/dashboard/138-cal-itp-system-performance-and-outcomes-monitoring?single_date=2023-06-22) each Friday and use it to populate the metrics in the spreadsheet linked within the dashboard. Checks were recommended to occur on Fridays because, for one of the metrics, Google auto-bins 7-day periods to Friday-Thursday. This dashboard and spreadsheet could be re-activated if desired.
diff --git a/docs/requirements.txt b/docs/requirements.txt
index 154db0e522..55d970032a 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -1,3 +1,4 @@
+sqlalchemy-bigquery==1.10.0
 calitp-data-analysis==2024.3.27
 jupyter-book==1.0.0
 sphinxcontrib-mermaid==0.8.1
diff --git a/docs/transit_database/transitdatabase.md b/docs/transit_database/transitdatabase.md
index b4cfd7e43c..4364cc3509 100644
--- a/docs/transit_database/transitdatabase.md
+++ b/docs/transit_database/transitdatabase.md
@@ -82,8 +82,6 @@ The following entity relationship diagrams were last updated in 2022 but are pre
 [editable source](https://mermaid-js.github.io/mermaid-live-editor/edit/#pako:eNqdk7tuwzAMRX9F0JzH7jXp0ClF09ELITGyAFs0KClFG-ffS7_6SNu0iEbp3MtLSjppQxZ1oZG3HhxDUwYla8cOgn-F5ClEde6WSzqpPfLRGyxUqRsI4DCW-kecBnxDITGY1PP0HP4PV1Tb6_QDk80jHLGuB3jEp4wbaloKGJIoVqvuS3YfVY5oVSLV1hDWYy9rapEh4Vz3N6NPpcUoSlQF8bqoU-8bk0wa9cH9JbwYi-hapqO3ffaKKbvqo-9HrMcRVb79ZtZ1Q4rL_d50x975AEOcOJ4rMwNzulvNn4Adpht9pwlsIcHeVNhA73gfxSUENEmGkKOkLrVe6Aa5AW_lHZ9651InEchV9hKLB8j1UPMsaG6t3PKd9YlYFweoIy405ET7l2B0kTjjDE0_YqLOb7JEHuQ)

-## Dashboards
-
 ## DAGs Maintenance

-You can find further information on DAGs maintenance for Transit Database data [on this page](dags-maintenance).
+You can find further information on how to maintain the DAGs for Transit Database data [on this page](dags-maintenance), which covers general Airflow maintenance and troubleshooting patterns.
diff --git a/runbooks/workflow/sentry-triage.md b/runbooks/workflow/sentry-triage.md
index 127dd166fd..d76f9545f1 100644
--- a/runbooks/workflow/sentry-triage.md
+++ b/runbooks/workflow/sentry-triage.md
@@ -1,12 +1,12 @@
 # Sentry triage

-> Important Sentry concepts:
->
+Sentry is a powerful tool for monitoring application errors and other targeted events in deployed services and sites. We self-host an instance of Sentry on Cal-ITP infrastructure, available [here](https://sentry.calitp.org/). To make use of the tool to its fullest extent, please read up on some important Sentry concepts:
+
 > - [Issue states](https://docs.sentry.io/product/issues/states-triage/)
 > - [Fingerprinting and grouping](https://docs.sentry.io/product/sentry-basics/grouping-and-fingerprints/)
 > - [Merging issues](https://docs.sentry.io/product/data-management-settings/event-grouping/merging-issues/)

-Once a day, the person responsible for triage should check Sentry for new and current issues. There are two separate things to check:
+When a new error occurs in an application monitored by Sentry, an alert is generated in the #alerts-data-infra channel in the Cal-ITP Slack group. In addition to checking on those errors, it is valuable to regularly check Sentry for new and current issues, ideally daily. There are two separate things to check:

 - All **new issues** from the past 24 hours. An issue is a top-level error/failure/warning, and a new issue represents something we have't seen before (as opposed to a new event instance of an issue that's been occurring for a while). These should be top priority to investigate since they represent net-new problems.

@@ -23,11 +23,11 @@ Categorize the issues/events identified and perform relevant steps if the issue
 ## GitHub issues

-When creating GitHub issues from Sentry:
+Sentry includes push-button functionality to generate GitHub issues directly from a Sentry issue page. When creating GitHub issues from Sentry:

 - Verify that no secrets or other sensitive information is contained in the generated issue body. Sentry's data masking is not perfect (and we may make a configuration mistake), so it's good to double-check.

-- Clean up the issue so that someone looking at it later will understand what the error actually is. The auto-generated issues will only contain the exception text and a link back to Sentry; making a more human-friendly issue title and description is helpful.
+- Clean up the issue so that someone looking at it later will understand what the error actually is. The auto-generated issues produced by Sentry will only contain the exception text and a link back to Sentry; making a more human-friendly issue title and description is helpful.

 ## Issue types

@@ -61,7 +61,3 @@ This category primarily includes unhandled data processing exceptions (e.g. RTFe
 1. Create a GitHub issue to update the fingerprint, usually adding additional values to the fingerprint to distinguish between different errors.
 2. For example, you may want to split up an issue by feed URL, which would mean adding the feed URL to the fingerprint.
 3. When the new fingerprint has been deployed, _resolve_ the existing issue since it should no longer appear.
-
-## Additional Triage Task: Friday Performance Check
-
-Each Friday, the person assigned to Sentry triage should use the [Cal-ITP System Performance and Outcomes Monitoring dashboard](https://dashboards.calitp.org/dashboard/138-cal-itp-system-performance-and-outcomes-monitoring?single_date=2023-06-22) to populate the metrics in the spreadsheet linked within the dashboard. (Checks recommended to occur on Fridays because for one of the metrics, Google auto-bins 7 day periods to Friday-Thursday.)
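+
+Relating to the fingerprint guidance above: in the Python SDK, one common way to add a value like a feed URL to the fingerprint is a `before_send` hook. The snippet below is a generic, hypothetical illustration (the DSN is a placeholder and the `feed_url` extra is an assumed field), not code from an actual Cal-ITP service:
+
+```python
+# Hypothetical illustration: extend Sentry's default grouping with a feed URL
+# so that errors from different feeds become separate issues.
+import sentry_sdk
+
+
+def add_feed_url_to_fingerprint(event, hint):
+    feed_url = (event.get("extra") or {}).get("feed_url")
+    if feed_url:
+        # "{{ default }}" keeps Sentry's normal grouping and appends the URL.
+        event["fingerprint"] = ["{{ default }}", feed_url]
+    return event
+
+
+sentry_sdk.init(
+    dsn="https://examplekey@sentry.calitp.org/2",  # placeholder DSN
+    before_send=add_feed_url_to_fingerprint,
+)
+```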