version queries and add clients_last_seen and mau28_by_dimensions #1
Conversation
cc @mreid-moz @jklukas requesting your review because you said in a meeting recently you would be working on BigQuery ETL related stuff
I'm looking forward to seeing this come together and the particular jobs in this PR are directly relevant to the growth dashboard work I'm doing. It looks like clients_last_seen is exactly what I need to provide efficient dashboarding by dimension.
- Should name sql files like `sql/destination_table_with_version.sql` e.g.
  `sql/clients_daily_v6.sql`
- Should not specify a project or dataset in table names to simplify testing
Do we know at this point what the hierarchy of projects, datasets, and tables is going to look like? Will these derived tables live in the same project and dataset as the source data?
With GCP ingestion so far, we're splitting tables to different datasets based on document namespace. We would need to change that practice to meet this requirement.
There are implications for permissions, testing, etc. that I haven't fully thought through yet.
i don't know, and dataset per document namespace seems good to me. this has lots of implications, but if we can avoid depending on a static dataset name, then we only need unique datasets per test, instead of unique projects, in order to run tests in parallel.
i think this is fine for queries that only read one input table (hence "should", not "must"), because the output dataset can be specified separately from the default dataset. for queries that need to read multiple tables from multiple datasets, i think for now we can just assume they're either run in series or require multiple projects. the first time we need that, we can consider solutions like templating dataset names for testing and adding a recommendation here to follow the chosen solution.
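As a hedged illustration of the convention discussed above (the file and table names come from the diff; the query body is hypothetical), a query file following the guideline might look like:

```sql
-- sql/clients_daily_v6.sql: file name matches destination table plus version.
-- The table reference is deliberately bare (no project or dataset prefix),
-- so the dataset can be supplied at run time as the job's default dataset,
-- letting tests point the same SQL at a per-test dataset.
SELECT
  client_id,
  submission_date_s3
FROM
  clients_daily_v6
```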
ARRAY_AGG(input
  ORDER BY submission_date_s3
  DESC LIMIT 1
)[OFFSET(0)].* EXCEPT (submission_date_s3)
This is fascinating. I like this better than having to use a ROW_NUMBER window function and then select n = 1.
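For comparison, the ROW_NUMBER approach mentioned here might look like the following sketch (the table name `clients_daily_v6` is an assumption; the `client_id` and `submission_date_s3` columns are taken from the diff above):

```sql
-- Sketch of the window-function alternative: number each client's rows by
-- recency, then keep only the newest row per client.
SELECT
  * EXCEPT (rn, submission_date_s3)
FROM (
  SELECT
    input.*,
    ROW_NUMBER() OVER (
      PARTITION BY client_id
      ORDER BY submission_date_s3 DESC
    ) AS rn
  FROM
    clients_daily_v6 AS input
)
WHERE
  rn = 1
```

The ARRAY_AGG form avoids the nested subquery and the extra `rn` column, which is why it reads more cleanly for this use case.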
we could alternately use ANY_VALUE(LAST_VALUE(input) OVER (PARTITION BY client_id ORDER BY submission_date_s3)), but i don't know the performance implications of that
> but i don't know the performance implications of that
I decided to check and it's not as simple as above, but using a window is so much faster it hurts (runs in ~1/6th of the time and uses ~1/8th of the compute)
sql/clients_last_seen_v1.sql (Outdated)
* EXCEPT (submission_date,
  generated_time)
FROM
  analysis.last_seen_v1
Is the dataset prefix here intended?
whoops no
i take that back, this one is needed because it won't match the dataset on line 6 above. i will figure out how to make this better as i test it.
Remove bigquery-etl resolution
No description provided.