Commit 3060446

Merge pull request #3554 from cal-itp/curriculum_docs_update

Edit and Update: Enhancements to Cal-ITP Data Services Documentation

shweta487 authored Nov 27, 2024
2 parents 0de0734 + cc13683 commit 3060446

Showing 10 changed files with 100 additions and 39 deletions.
9 changes: 6 additions & 3 deletions docs/analytics_onboarding/overview.md
@@ -32,16 +32,19 @@

**Python Libraries:**

- [ ] **calitp-data-analysis** - Cal-ITP's internal Python library for analysis | ([Docs](calitp-data-analysis))
- [ ] **siuba** - Recommended data analysis library | ([Docs](siuba))
- [ ] [**shared_utils**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) and [**here**](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) - A shared utilities library for the analytics team | ([Docs](shared-utils))
- [ ] [**calitp-data-analysis**](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) - Cal-ITP's internal Python library for analysis | ([Docs](calitp-data-analysis))
- [ ] [**siuba**](https://siuba.org/) - Recommended data analysis library | ([Docs](siuba))
- [ ] [**shared_utils**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) - A shared utilities library for the analytics team | ([Docs](shared-utils))

**Caltrans Employee Resources:**

- [ ] [**Organizational Chart**](https://pmp.onramp.dot.ca.gov/organizational-chart) - Data and Digital Services Organizational Chart
- [ ] [**OnRamp**](https://onramp.dot.ca.gov/) - Caltrans employee intranet
- [ ] [**Service Now (SNOW)**](https://cdotprod.service-now.com/sp) - Caltrans IT Service Management Portal for IT issues and requesting specific software
- [ ] [**Cal Employee Connect**](https://connect.sco.ca.gov/) - State Controller's Office site for paystubs and tax information
- [ ] [**Geospatial Enterprise Engagement Platform - GIS Account Request Form**](https://sv03tmcpo.ct.dot.ca.gov/portal/apps/sites/#/geep/pages/account-request) (optional) - User request form for ArcGIS Online and ArcGIS Portal accounts
- [ ] [**Planning Handbook**](https://transportationplanning.onramp.dot.ca.gov/caltrans-transportation-planning-handbook) - Caltrans Transportation Planning Handbook
- [ ] [**California Public Employees Retirement System**](https://www.calpers.ca.gov/) - System that manages pension and health benefits

 
(get-help)=
33 changes: 25 additions & 8 deletions docs/analytics_tools/jupyterhub.md
@@ -14,14 +14,15 @@ Analyses on JupyterHub are accomplished using notebooks, which allow users to mi

01. [Using JupyterHub](#using-jupyterhub)
02. [Logging in to JupyterHub](#logging-in-to-jupyterhub)
03. [Connecting to the Warehouse](#connecting-to-the-warehouse)
04. [Increasing the Query Limit](#increasing-the-query-limit)
05. [Increase the User Storage Limit](#increasing-the-storage-limit)
06. [Querying with SQL in JupyterHub](querying-sql-jupyterhub)
07. [Saving Code to Github](saving-code-jupyter)
08. [Environment Variables](#environment-variables)
09. [Jupyter Notebook Best Practices](notebook-shortcuts)
10. [Developing warehouse models in Jupyter](jupyterhub-warehouse)
03. [Default vs Power User](#default-user-vs-power-user)
04. [Connecting to the Warehouse](#connecting-to-the-warehouse)
05. [Increasing the Query Limit](#increasing-the-query-limit)
06. [Increasing the User Storage Limit](#increasing-the-storage-limit)
07. [Querying with SQL in JupyterHub](querying-sql-jupyterhub)
08. [Saving Code to GitHub](saving-code-jupyter)
09. [Environment Variables](#environment-variables)
10. [Jupyter Notebook Best Practices](notebook-shortcuts)
11. [Developing warehouse models in Jupyter](jupyterhub-warehouse)

(using-jupyterhub)=

@@ -39,6 +40,22 @@ JupyterHub currently lives at [notebooks.calitp.org](https://notebooks.calitp.or

Note: you will need to have been added to the Cal-ITP organization on GitHub to obtain access. If you have yet to be added to the organization and need to be, ask in the `#services-team` channel in Slack.

(default-user-vs-power-user)=

### Default User vs Power User

#### Default User

The Default User profile is designed for general use and is ideal for less resource-intensive tasks. It's a good starting point for most users who don't expect to run very large, memory-hungry jobs.

The Default User profile offers quick availability: because it requests less memory, it can be scheduled on a smaller node, letting you start tasks faster. However, if your task's memory usage grows over time, it may exceed the node's capacity, and the system may terminate your job. This makes the Default profile best for small to medium-sized tasks that don't require a lot of memory; workloads that exceed these limits may become unstable or crash.

#### Power User

The Power User profile is intended for more demanding, memory-intensive tasks that require more resources upfront. It is suitable for workloads with higher memory requirements or that are expected to grow during execution.

The Power User profile allocates a full node, or a significant portion of one, to ensure your job has enough memory and computational power to avoid crashes or delays. This comes with a longer wait time, since the system needs to provision a new node for you. Once the node is ready, you'll have the resources needed for memory-intensive work such as large datasets or simulations. The Power User profile is ideal for jobs that might be unstable or crash on the Default profile due to higher resource demands. It also scales: if your task requires more resources than the initial node can provide, the system will automatically spin up additional nodes to meet the demand.

(connecting-to-the-warehouse)=

### Connecting to the Warehouse
26 changes: 24 additions & 2 deletions docs/analytics_tools/knowledge_sharing.md
@@ -2,7 +2,7 @@

# Helpful Links

Here are some resources data analysts have collected and referenced, that will hopefully help you out in your work. Have something you want to share? Create a new markdown file, add it [to the example report folder](https://github.com/cal-itp/data-analyses/tree/main/example_report), and [message Amanda.](https://app.slack.com/client/T014965JTHA/C013N8GELLF/user_profile/U02PCTPSZ8A)
Here are some resources data analysts have collected and referenced that will hopefully help you in your work.

- [Data Analysis](#data-analysis)
- [Python](#python)
@@ -11,12 +11,14 @@ Here are some resources data analysts have collected and referenced, that will h
- [Merging](#merging)
- [Dates](#dates)
- [Monetary Values](#monetary-values)
- [Tidy Data](#tidy-data)
- [Visualizations](#visualization)
- [Charts](#charts)
- [Maps](#maps)
- [DataFrames](#dataframes)
- [Ipywidgets](#ipywidgets)
- [Markdown](#markdown)
- [ReviewNB](#reviewNB)

(data-analysis)=

@@ -128,6 +130,20 @@ def adjust_prices(df):
return df
```

(tidy-data)=

### Tidy Data

Tidy data follows a set of principles that make datasets easy to work with. The primary rules of tidy data are:

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.

Tidy data ensures consistency, making it easier to work with tools like pandas, matplotlib, and seaborn. It also simplifies data manipulation: functions like `groupby()`, `pivot()`, and `melt()` work more intuitively when the data is structured properly. Additionally, tidy data enables vectorized operations in pandas, allowing efficient analysis across entire columns or rows at once.

Learn more in the [Tidy Data paper.](https://vita.had.co.nz/papers/tidy-data.pdf)
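As a small illustration (the column and value names below are made up), `melt()` reshapes a "messy" table, where one variable is spread across several columns, into tidy form:

```python
import pandas as pd

# "Messy" layout: the year variable is spread across two columns.
messy = pd.DataFrame({
    "route": ["A", "B"],
    "riders_2022": [1200, 800],
    "riders_2023": [1350, 950],
})

# Tidy layout: one row per (route, year) observation, one value per cell.
tidy = messy.melt(id_vars="route", var_name="year", value_name="riders")
tidy["year"] = tidy["year"].str.removeprefix("riders_").astype(int)

# tidy now has 4 rows and the columns route / year / riders,
# so groupby(), pivot(), and plotting all work on it directly.
```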

(visualization)=

## Visualization
@@ -159,7 +175,6 @@ def add_tooltip(chart, tooltip1, tooltip2):

### Maps

- [Examples of folium, branca, and color maps.](https://nbviewer.org/github/python-visualization/folium/blob/v0.2.0/examples/Colormaps.ipynb)
- [Quick interactive maps with Geopandas.gdf.explore()](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html)

(dataframes)=
@@ -188,3 +203,10 @@ def add_tooltip(chart, tooltip1, tooltip2):
- [Add a table of content that links to headers throughout a markdown file.](https://stackoverflow.com/questions/2822089/how-to-link-to-part-of-the-same-document-in-markdown)
- [Add links to local files.](https://stackoverflow.com/questions/32563078/how-link-to-any-local-file-with-markdown-syntax?rq=1)
- [Direct embed an image.](https://datascienceparichay.com/article/insert-image-in-a-jupyter-notebook/)

(reviewNB)=

### ReviewNB on GitHub

- [A tool that facilitates collaborative review of Jupyter Notebooks on GitHub](https://www.reviewnb.com/)
- [Shows side-by-side diffs of Jupyter Notebooks, including changes to both code and markdown cells, and lets reviewers comment on specific cells](https://www.reviewnb.com/#faq)
16 changes: 16 additions & 0 deletions docs/analytics_tools/saving_code.md
@@ -9,11 +9,16 @@ Doing work locally and pushing directly from the command line is a similar workf
## Table of Contents

1. What's a typical [project workflow](#project-workflow)?

2. Someone is collaborating on my branch, how do we [stay in sync](#pulling-and-pushing-changes)?

- The `main` branch is ahead, and I want to [sync my branch with `main`](#rebase-and-merge)
- [Rebase](#rebase) or [merge](#merge)
- Options to [Resolve Merge Conflicts](#resolve-merge-conflicts)
- [Other Common Issues](#other-common-github-issues-encountered-during-saving-codes)

3. [Other Common GitHub Commands](#other-common-github-commands)

- [External Git Resources](#external-git-resources)
- [Committing in the Github User Interface](#pushing-drag-drop)

@@ -111,6 +116,17 @@ If you discover merge conflicts and they are within a single notebook that only
`git checkout --theirs path/to/notebook.ipynb`
- From here, just add the file and commit with a message as you normally would and the conflict should be fixed in your Pull Request.

(other-common-github-issues-encountered-during-saving-codes)=

### Other Common Issues

- Untracked Files:
  Sometimes files are created or modified locally but not added to Git before committing, so they are never tracked or pushed to GitHub. Use `git add <filename>` to track files before committing.
- Incorrect Branches:
  Committing to the wrong branch (e.g., `main` instead of a feature branch) can cause problems, especially if the changes are not meant to be merged into the main codebase. Check your current branch with `git branch`, and switch with `git switch <branch-name>` (or create a new branch with `git switch -c <branch-name>`) before committing.
- Merge Conflicts from Overlapping Work:
  When multiple analysts work on the same files or sections of code, merge conflicts can occur. Creating feature branches and pulling regularly to stay up to date with `main` helps avoid these conflicts.

(other-common-github-commands)=

## Other Common GitHub Commands
2 changes: 1 addition & 1 deletion docs/analytics_tools/tools_quick_links.md
@@ -7,7 +7,7 @@
| Tool | Purpose |
| -------------------------------------------------------------------------------------------------- | --------------------------------------- |
| [**Analytics Repo**](https://github.com/cal-itp/data-analyses) | Analytics team code repository. |
| [**Analytics Project Board**](https://github.com/cal-itp/data-analyses/projects/1) | Analytics team work management. |
| [**Analytics Project Board**](https://github.com/cal-itp/data-analyses/projects/1) | Analytics team list of active issues. |
| [**notebooks.calitp.org**](https://notebooks.calitp.org/) | JupyterHub cloud-based notebooks |
| [**dashboards.calitp.org**](https://dashboards.calitp.org/) | Metabase dashboards & Business Insights |
| [**dbt-docs.calitp.org**](https://dbt-docs.calitp.org/) | DBT warehouse documentation |
20 changes: 0 additions & 20 deletions docs/analytics_welcome/how_we_work.md
@@ -27,26 +27,6 @@ The section below outlines our team's primary meetings and their purposes, as we
| #**data-office-hours** | Discussion | A place to bring questions, issues, and observations for team discussion. |
| #**data-warehouse-devs** | Discussion | For people building dbt models - focused on data warehouse performance considerations, etc. |

## Collaboration Tools

(analytics-project-board)=

### GitHub Analytics Project Board

**You can access The Analytics Project Board [using this link](https://github.com/cal-itp/data-analyses/projects/1)**.

#### How We Track Work

##### Screencast - Navigating the Board

The screencast below introduces:

- Creating new GitHub issues to track your work
- Adding your issues to our analytics project board
- Viewing all of your issues on the board (e.g. clicking your avatar to filter)

<div style="position: relative; padding-bottom: 62.5%; height: 0;"><iframe src="https://www.loom.com/embed/a7332ee2e1c040edbf2d11da70b4c3ea" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>

(analytics-repo)=

### GitHub Analytics Repo
4 changes: 4 additions & 0 deletions docs/analytics_welcome/overview.md
@@ -15,6 +15,10 @@ After you've read through this section, continue reading through the remaining s

______________________________________________________________________

- [Data and Digital Services Organizational Chart](https://pmp.onramp.dot.ca.gov/downloads/pmp/files/Splash%20Page/org-charts-10-2024/DDS_OrgChart_October2024-signed.pdf)

______________________________________________________________________

**Other Analytics Sections**:

- [Technical Onboarding](technical-onboarding)
11 changes: 10 additions & 1 deletion docs/publishing/sections/5_analytics_portfolio_site.md
@@ -11,6 +11,7 @@ Netlify is the platform that turns our Jupyter Notebooks uploaded to GitHub into a fu
To setup your netlify key:

- Ask in Slack/Teams for a Netlify key if you don't have one yet.
- If you already have a Netlify key set up, you can find it by running `cat ~/.bash_profile` in a terminal.
- Install netlify: `npm install -g netlify-cli`
- Navigate to your main directory
- Edit your bash profile using Nano:
@@ -47,14 +48,16 @@ Create a `README.md` file in the repo where your work lies. This also forms the

Each `.yml` file creates a new site on the [Portfolio's Index Page](https://analysis.calitp.org/), so every project needs its own file. DLA Grant Analysis, SB125 Route Illustrations, and Active Transportation Program all have their own `.yml` file.

All the `.yml` files live here at [data-analyses/portfolio/sites](https://github.com/cal-itp/data-analyses/tree/main/portfolio/sites).
All the `.yml` files live at [data-analyses/portfolio/sites](https://github.com/cal-itp/data-analyses/tree/main/portfolio/sites). Navigate to this folder to create your `.yml` file.

Here's how to create a `yml` file:

- Include the directory to the notebook(s) you want to publish.

- Name your `.yml` file. For now we will use `my_report.yml` as an example.

- The `.yml` file should contain the title, directory, `README.md` path, and notebook path.

- The structure of your `.yml` file depends on the type of your analysis:

- If you have one parameterized notebook with **one parameter**:
@@ -206,3 +209,9 @@ build_my_reports:
git add portfolio/my_report/district_*/ portfolio/my_report/*.yml portfolio/my_report/*.md
git add portfolio/sites/my_report.yml
```
### Delete Portfolio / Refresh Index Page

When redeploying your portfolio with new content, old files from a previous version may remain on your portfolio site or in your local environment; clean these up before adding the new content. Run `python portfolio/portfolio.py clean my_report` before deploying your report.
4 changes: 4 additions & 0 deletions docs/publishing/sections/6_metabase.md
@@ -9,3 +9,7 @@ An [Airflow DAG](https://github.com/cal-itp/data-infra/tree/main/airflow/dags) n
Any tweaks to the data processing steps are easily done in scripts and notebooks, and it ensures that the visualizations in the dashboard remain updated with little friction.

Ex: [Payments Dashboard](https://dashboards.calitp.org/dashboard/3-payments-performance-dashboard?transit_provider=mst)

## Metabase Training Guide 2024

Please see the [Cal-ITP Metabase Training Guide](https://docs.google.com/document/d/1ag9qmSDWF9d30lGyKcvAAjILt1sCIJhK7wuUYkfAals/edit?tab=t.0#heading=h.xdjzmfck1e7) to learn how to use the data warehouse to create meaningful and effective visuals and analyses.
14 changes: 10 additions & 4 deletions docs/publishing/sections/7_gcs.md
@@ -2,8 +2,14 @@

# GCS

NOTE: If you are planning on publishing to [CKAN](publishing-ckan) and you are
using the dbt exposure publishing framework, your data will already be saved in
GCS as part of the upload process.
### Public Data Access in GCS

TBD.
Some data stored in Cloud Storage is configured to be publicly accessible, meaning anyone on the internet can read it at any time. In Google Cloud Storage, you can make data public either at the bucket level or the object level: at the bucket level, you grant public access to all objects within the bucket by modifying the bucket policy; alternatively, you can grant public access to specific objects.

Notes:

- Always ensure that sensitive information is not exposed when configuring public access in Google Cloud Storage. Publicly accessible data should be carefully reviewed to prevent the accidental sharing of confidential or private information.
- External users can't browse the public bucket on the web, only download individual files. If you have many files to share, it's best to use the [Command Line Interface.](https://cloud.google.com/storage/docs/access-public-data#command-line)
- There is a [function](https://github.com/cal-itp/data-analyses/blob/f62b150768fb1547c6b604cb53d122531104d099/_shared_utils/shared_utils/publish_utils.py#L16) in `shared_utils` that handles writing files to the public bucket, regardless of file type (e.g., Parquet, GeoJSON).
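As a rough sketch (the bucket and object names below are hypothetical, not real Cal-ITP resources), any object in a public bucket can be downloaded over plain HTTPS at `https://storage.googleapis.com/<bucket>/<object>`, with no credentials required:

```python
from urllib.parse import quote

def public_gcs_url(bucket: str, blob: str) -> str:
    """Build the unauthenticated download URL for an object in a public GCS bucket."""
    # quote() percent-encodes special characters (like spaces) but leaves "/" intact,
    # so nested object paths are preserved.
    return f"https://storage.googleapis.com/{bucket}/{quote(blob)}"

# Hypothetical names for illustration only:
url = public_gcs_url("my-public-bucket", "reports/2024 summary.parquet")
# url can then be fetched with requests, pandas.read_parquet, curl, etc.
```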

NOTE: If you are planning on publishing to [CKAN](publishing-ckan) and you are using the dbt exposure publishing framework, your data will already be saved in GCS as part of the upload process.
