More flexible data access #726

Open
matschaffer opened this issue Aug 19, 2020 · 4 comments
Comments

@matschaffer
Contributor

matschaffer commented Aug 19, 2020

We get occasional requests for data formatted differently than our bulk exports. For example:

This has also uncovered some lingering data quality issues:

My original thought was to have more people use Elasticsearch directly (https://github.com/Safecast/safecastapi/wiki/Data-Sets#kibana--elasticsearch-access).

But the CSV export support there is not great, and folks asking for the data seem to be much more familiar with postgres/postgis.

I had also hoped to do more with S3 & Athena in this space, but as far as I can tell it has no support for linear distance queries (distance in meters over the earth's surface), only cartesian distance on the raw coordinates. A query for radiation within 100 units would cover a different number of meters depending on how far north/south you are.
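To make the distance concern concrete, here is a minimal Python sketch comparing a great-circle (haversine) distance with a naive cartesian distance on raw degrees. The function names and the spherical-earth radius are assumptions made for illustration; nothing here comes from the Safecast codebase or from Athena itself.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cartesian_deg(lat1, lon1, lat2, lon2):
    """Naive planar distance computed directly on degrees (unit-less)."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


if __name__ == "__main__":
    # The same one-degree step in longitude, taken at three latitudes.
    for lat in (0.0, 37.4, 60.0):
        meters = haversine_m(lat, 140.0, lat, 141.0)
        units = cartesian_deg(lat, 140.0, lat, 141.0)
        print(f"lat {lat:5.1f}: cartesian = {units:.1f} deg, great-circle = {meters:,.0f} m")
```

The cartesian distance is 1.0 at every latitude, while the great-circle distance is roughly 111 km at the equator and about half that at 60 degrees north, which is why a fixed "within N units" filter means different things in different places.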

And finally there was hope that postgres replicas could help us here, but (0) they don't support temp tables, (1) they can't be made public, and (2) hard queries cause replication lag and ultimately fail.

Opening this to brainstorm ideas about how we could more easily provide a clean data set in a flexible format people are generally familiar with.

Some ideas:

  • A nightly job that copies/packages data into a public RDS snapshot (or a public RDS instance)
  • Some data cleanup (or at least labeling) for quality issues stemming from known bugs
  • A public postgres proxy of some kind
  • Working out how people can easily access it from https://jupyter.org/ or https://www.r-project.org/
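As a rough idea of what Jupyter access to a public postgres/postgis copy could look like, the sketch below builds a parameterized radius query in meters. The table name `measurements` and the column name `location` are hypothetical placeholders, not confirmed schema; `ST_DWithin`, `ST_MakePoint` and `ST_SetSRID` are standard PostGIS functions, and casting to `geography` makes the radius be interpreted in meters.

```python
def radius_query(lat, lon, radius_m, table="measurements", geog_column="location"):
    """Build (sql, params) selecting rows within radius_m meters of a point.

    The table and column defaults are hypothetical; adjust to the real schema.
    """
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("lat/lon out of range")
    if radius_m <= 0:
        raise ValueError("radius_m must be positive")
    # Identifiers cannot be bound as parameters, so only accept plain names.
    for name in (table, geog_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")

    sql = (
        f"SELECT * FROM {table} "
        f"WHERE ST_DWithin({geog_column}::geography, "
        "ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography, "
        "%(radius_m)s)"
    )
    # Note the PostGIS argument order: ST_MakePoint takes (x, y) = (lon, lat).
    params = {"lat": lat, "lon": lon, "radius_m": radius_m}
    return sql, params


if __name__ == "__main__":
    sql, params = radius_query(37.42, 141.03, 100)
    print(sql)
    print(params)
```

In a notebook, the returned pair could be passed to something like `pandas.read_sql_query(sql, conn, params=params)` with a psycopg2 connection, assuming such a public endpoint existed.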

@matschaffer
Contributor Author

cc recent requesters for feedback: @sakshamg94 @shmcminn @julovi @dobrych

@matschaffer
Contributor Author

Also just noting that this came up in our sync meeting today https://s3-us-west-2.amazonaws.com/safecastdata-us-west-2/meetings/api/2020-08-19-api-sync.mp4 (~26:50 mark) in light of @jamoross mentioning that the 10-year anniversary is approaching in ~6 months.

@matschaffer
Contributor Author

Noting that OpenAQ seems to lead with S3+Athena: https://docs.openaq.org/

Happy to see I'm not the only one excited about this avenue for cheap data access :)

@matschaffer
Contributor Author

https://github.com/openaq/oh-snap might help if we want to do a public RDS snapshot (though it sounds like no one is really using the one openaq provides).

https://github.com/openaq/fetches-optimizer/pulls also has info on how they're building their parquet tables which might be useful for us as well.
