Refactor ETL for auditability and efficiency #148

Open
adborden opened this issue Aug 29, 2018 · 1 comment

@adborden
Member

I'm looking at IL Campaign Finance's ETL as a nice example of how to structure the Makefile so that it's a little more intuitive and only rebuilds what's necessary (instead of downloading or importing everything from scratch).
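
As a rough sketch of the "only rebuild what's necessary" idea, here is the timestamp comparison that make performs between a target and its prerequisites, written out in Python. The function name `needs_rebuild` and the file names in the usage note are hypothetical and are not taken from this repo or from the IL Campaign Finance project.

```python
from pathlib import Path
from typing import Iterable


def needs_rebuild(target: Path, sources: Iterable[Path]) -> bool:
    """Return True if `target` is missing or older than any of `sources`.

    This mirrors make's basic rule: a target is stale when it does not
    exist or when a prerequisite has a newer modification time.
    """
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    # Any source modified after the target was built makes the target stale.
    return any(src.stat().st_mtime > target_mtime for src in sources)
```

For example, a download step could be skipped whenever `needs_rebuild(Path("raw/contributions.csv"), [])` is false, and an import step could run only when its cleaned CSV is older than the raw file it depends on. A Makefile gives us this behavior for free, which is part of the appeal.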

DataMade also has guidelines on how to structure ETL pipelines, with stated principles that we should follow (I think we're already most of the way there, but it's good to be explicit).

  1. Never destroy data - treat source data as immutable, and show your work when you modify it
  2. Be able to deterministically produce the final data with one command
  3. Write as little custom code as possible
  4. Use standard tools whenever possible
  5. Keep source data under version control

I think that if we're more explicit about how we extract and transform the data, it might make it easier to spot bugs, or introduce tests/invariants along the way.
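
To make the tests/invariants idea concrete, here is a minimal sketch of two checks that could run between pipeline steps: one that the source file is byte-for-byte unchanged (principle 1), and one that a transform did not silently drop or duplicate rows. The function names and the choice of invariants are assumptions for illustration, not something the pipeline does today.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_source_unchanged(path: Path, expected_digest: str) -> None:
    """Raise ValueError if the source file no longer matches its recorded digest."""
    actual = sha256_of(path)
    if actual != expected_digest:
        raise ValueError(f"{path} changed: expected {expected_digest}, got {actual}")


def check_row_count(rows_in: int, rows_out: int) -> None:
    """Raise ValueError if a transform changed the number of rows."""
    if rows_in != rows_out:
        raise ValueError(f"row count changed: {rows_in} in, {rows_out} out")
```

The digest would be recorded once, right after download, and checked again before each later step. The row-count check only fits transforms that are meant to be one-to-one; steps that filter or deduplicate would need a different invariant.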

@adborden
Member Author

adborden commented Sep 1, 2018

Another principle I would want for our site is that the data should be easy to get at. E.g., at minimum, we should provide a link to the data source within the UI. In the ideal case, we could provide a download link to the data after transformations and cleaning.
