Conversation

@relud (Collaborator) commented Feb 14, 2019

No description provided.

@relud relud requested a review from jklukas February 14, 2019 00:36
@relud (Collaborator, Author) commented Feb 14, 2019

cc @mreid-moz

@jklukas requesting your review because you said in a meeting recently that you would be working on BigQuery-ETL-related stuff

@jklukas (Contributor) left a comment

I'm looking forward to seeing this come together and the particular jobs in this PR are directly relevant to the growth dashboard work I'm doing. It looks like clients_last_seen is exactly what I need to provide efficient dashboarding by dimension.


- Should name sql files like `sql/destination_table_with_version.sql` e.g.
`sql/clients_daily_v6.sql`
- Should not specify a project or dataset in table names to simplify testing
@jklukas (Contributor) commented:

Do we know at this point what the hierarchy of projects, datasets, and tables is going to look like? Will these derived tables live in the same project and dataset as the source data?

With GCP ingestion so far, we're splitting tables to different datasets based on document namespace. We would need to change that practice to meet this requirement.

There are implications for permissions, testing, etc. that I haven't fully thought through yet.

@relud (Collaborator, Author) replied:

I don't know, and a dataset per document namespace seems good to me. This has lots of implications, but if we can avoid depending on a static dataset name, then we only need unique datasets per test, instead of unique projects, in order to run tests in parallel.

I think this is fine for queries that only read one input table (hence "should", not "must"), because the output dataset can be specified separately from the default dataset. For queries that need to read multiple tables from multiple datasets, I think for now we can just assume they're either run in series or require multiple projects. The first time we need that, we can consider solutions like templating dataset names for testing, and add a recommendation here to follow the chosen solution.

```sql
ARRAY_AGG(input
  ORDER BY submission_date_s3 DESC
  LIMIT 1
)[OFFSET(0)].* EXCEPT (submission_date_s3)
```
@jklukas (Contributor) commented:

This is fascinating. I like this better than having to use a ROW_NUMBER window function and then select n = 1.
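For reference, the `ROW_NUMBER` version of the same pick-the-latest-row-per-client pattern would look roughly like this sketch (the source table name is illustrative, not the actual table in this PR):

```sql
-- Sketch of the ROW_NUMBER alternative: rank each client's rows by
-- recency in a subquery, then keep only the top-ranked row.
SELECT
  * EXCEPT (rn, submission_date_s3)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY client_id
      ORDER BY submission_date_s3 DESC
    ) AS rn
  FROM
    telemetry.clients_daily  -- illustrative source table
)
WHERE
  rn = 1
```

The `ARRAY_AGG(... ORDER BY ... LIMIT 1)[OFFSET(0)]` form avoids the extra subquery and the throwaway rank column.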

@relud (Collaborator, Author) commented Feb 15, 2019

We could alternatively use `ANY_VALUE(LAST_VALUE(input) OVER (PARTITION BY client_id ORDER BY submission_date_s3))`, but I don't know the performance implications of that.

@relud (Collaborator, Author) commented Feb 27, 2019

> but I don't know the performance implications of that

I decided to check, and it's not as simple as the snippet above, but using a window function is so much faster it hurts (it runs in ~1/6th of the time and uses ~1/8th of the compute).
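A guess at what the working window-function variant looks like; this is a sketch, not the benchmarked query, and the table and column names are illustrative. The two complications relative to the one-liner above are that `LAST_VALUE` defaults to a frame ending at the current row, and that an analytic function can't be nested directly inside a grouped aggregate, so the window goes in a subquery:

```sql
-- Hypothetical sketch: latest row per client via LAST_VALUE.
SELECT
  ANY_VALUE(latest).* EXCEPT (submission_date_s3)
FROM (
  SELECT
    input.client_id,
    -- Widen the frame explicitly so LAST_VALUE sees the whole partition,
    -- not just rows up to the current one.
    LAST_VALUE(input) OVER (
      PARTITION BY client_id
      ORDER BY submission_date_s3
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS latest
  FROM
    telemetry.clients_daily AS input  -- illustrative source table
)
GROUP BY
  client_id
```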

```sql
* EXCEPT (submission_date, generated_time)
FROM
  analysis.last_seen_v1
```
@jklukas (Contributor) commented:

Is the dataset prefix here intended?

@relud (Collaborator, Author) replied:

Whoops, no.

@relud (Collaborator, Author) commented Feb 15, 2019

I take that back; this one is needed because it won't match the dataset on line 6 above. I will figure out how to make this better as I test it.
