Future data provided for California in timeseries.csv #360
Comments
Hi there @jingjtang, thanks for the issue! We recently converted to a new report, and this is a new bug. I'm not sure yet where it comes from, but I'm going to try to solve it now, as a few people have noted it. Thank you! jz
Hm, I just downloaded timeseries-byLocation.json from https://covidatlas.com/data (which links to a file on s3, https://liproduction-reportsbucket-bhk8fnhv1s76.s3-us-west-1.amazonaws.com/v1/latest/timeseries-byLocation.json), and its last date appears to be
timeseries.csv linked on that same page does have the date you mentioned though:
There are a few things to diagnose here; checking.
baseData.json, which is the base data source for all reports, has future data:
The dateSources in that report shows
Checking that source to see what's up. |
merc news uses the following source: https://docs.google.com/spreadsheets/d/1CwZA4RPNf_hUrwzNLyGGNHRlh1cwl8vDHwIoae51Hac/gviz/tq?tqx=out:csv&sheet=timeseries However, that source currently has 07-29 as its latest date:
These return no records: I recently updated merc news, so I will check the old implementation to see if it messes up the dates.
The old code handled the data badly. E.g., with the current data from the site, running scrape gives the following: That still doesn't explain 2020-07-31 showing up; still looking.
Running It currently is Friday July 31 in a few areas of the world -- Tokyo, for example -- but honestly I'd be surprised if our main running timezone was ahead of us that much! Will check prod log. |
The 2020-07-31 data was updated at 2020-07-30T12:38:47.347Z in DynamoDB. That is still 07-30 though, so I can't see why another date would have been recorded.
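As a sanity check on the timezone idea, here is a small sketch (not project code) that works out which fixed UTC offsets would already have been on 2020-07-31 at the update timestamp quoted above. The helper name `local_date` is made up for illustration, and only whole-hour offsets are considered.

```python
from datetime import datetime, timedelta, timezone

# Update timestamp quoted in the comment above, in UTC.
updated_at = datetime(2020, 7, 30, 12, 38, 47, 347000, tzinfo=timezone.utc)

def local_date(instant, utc_offset_hours):
    """Return the calendar date (ISO string) of `instant` at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return instant.astimezone(tz).date().isoformat()

# Whole-hour offsets whose local date was already 2020-07-31 at that instant.
ahead = [h for h in range(-12, 15) if local_date(updated_at, h) == "2020-07-31"]

print(local_date(updated_at, 9))  # UTC+9 (Tokyo): 2020-07-30
print(ahead)                      # [12, 13, 14]
```

So at that instant only runtimes at UTC+12 or later would have produced 07-31 from their local clock; a UTC+9 runtime would still have been on 07-30.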
I'm not sure what in the code is causing this, which doesn't fill me with confidence! My only thought is that the lambda doing the scraping is running in a different timezone and so assigning a different date, though I can't see how it could be in such an advanced timezone. Unfortunately our logging is inadequate at the moment, so I can't see how this was set to the future date.

Regardless, a fix that I implemented recently should result in the data having the actual date specified in the data files. I'll keep this issue open until we see the change in effect.

@jingjtang - I'll assign this to you as well so you can check in a couple of days. I'll check too if I can, though I'm spread thin these days. I'll try clearing out the 07-31 data points for mercury-news, though that's a slow operation. :-)

Thanks again @jingjtang for the issue.
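To illustrate the difference between the two approaches discussed here, a minimal sketch, assuming a row that carries its own `date` field. This is not the project's scraper code; the function names, the row shape, and the values in it are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

def date_from_clock(now, utc_offset_hours):
    """Forward-dating risk: the result depends on the runtime's UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return now.astimezone(tz).date().isoformat()

def date_from_row(row, field="date"):
    """Use the date carried by the source row itself (field name is an assumption)."""
    return datetime.strptime(row[field], "%Y-%m-%d").date().isoformat()

# Hypothetical row, shaped like one line of a timeseries sheet.
row = {"date": "2020-07-29", "cases": "100"}
run_time = datetime(2020, 7, 30, 12, 38, 47, tzinfo=timezone.utc)

print(date_from_clock(run_time, 0))   # 2020-07-30
print(date_from_clock(run_time, 12))  # 2020-07-31, a day ahead of UTC
print(date_from_row(row))             # 2020-07-29, whatever the runtime's clock says
```

Taking the date from the row removes the dependence on where and when the scraper runs, which is the point of the fix described above.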
I've also pushed #365 to staging and prod, which had the same forward-dating bug. I believe that this will fix the issue. It may take a couple of days for us to know. |
Dear friends in the Corona Data Scraper group, thank you so much for providing this data source. I am using your data (mostly timeseries.zip) for COVID-19 related research. I found that the file contains future data for California, which confuses me. For example, today is 07-30, but there are case numbers for California for 07-31. Is there a mismatch between the cases/deaths/tested values and the dates?
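For anyone wanting to check a downloaded file for the same symptom, a rough sketch of a future-date check follows. The column names (`name`, `date`, `cases`) and the sample values are assumptions made up for illustration, not the actual layout of timeseries.csv.

```python
import csv
import io
from datetime import date

def future_rows(csv_text, today, date_field="date"):
    """Return the rows whose date is later than `today`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [r for r in reader if date.fromisoformat(r[date_field]) > today]

# Made-up sample data with an assumed column layout.
sample = (
    "name,date,cases\n"
    "California,2020-07-30,10\n"
    "California,2020-07-31,12\n"
)

for r in future_rows(sample, today=date(2020, 7, 30)):
    print(r["name"], r["date"])  # California 2020-07-31
```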