pipeline for Quidel flu test #181
base: main
Conversation
After switching to James's new mapping file, 31 zip codes still have no mapping information. Only 7,583 tests out of 7,519,726 are related to those zip codes through 2020-08-03.
@jsharpna helped check those zip codes. They are not valid zip codes according to https://tools.usps.com/zip-code-lookup.htm?citybyzipcode. Will ask Quidel about them.
Will email Quidel with all problems: bad zips, non-unique regions per device. Fixing some of these requires merging with or otherwise depending on #137, but that package doesn't include the home-state mappings for HRRs and MSAs that are used to fill in for insufficient sample size. Hold off on finishing this until we can get the home-state mappings into the geo package.
overall_total.drop(labels="FluA", axis="columns", inplace=True)

# Compute numUniqueDevices
numUniqueDevices = df.groupby(
snake case var names
possibly auto-fixable by linter
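A minimal sketch of the snake_case rename, using a toy frame (the column names here are illustrative, not taken from the pipeline):

```python
import pandas as pd

# Toy frame standing in for the Quidel test records; column names are assumed.
df = pd.DataFrame({
    "timestamp": ["2020-07-01", "2020-07-01", "2020-07-02"],
    "device_id": ["A1", "A2", "A1"],
})

# snake_case version of numUniqueDevices: distinct devices reporting per day
num_unique_devices = df.groupby("timestamp")["device_id"].nunique()
```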
def raw_tests_per_device(devices, tests, min_obs):
    '''
double quotes
@@ -0,0 +1,39 @@
# -*- coding: utf-8 -*-
"""Function to export the dataset in the format expected of the API.
Super nitpicky, but standardizing docstrings and doing general linting can be nice for organization and readability if you want to go one step further.
I've mainly used flake8, but it looks like pylint is common in this repo. I imagine they're comparable.
run through black, probably
zipcode = int(float(zipcode))
zipcode5.append(zipcode)
df['zip'] = zipcode5
# print('Fixing %.2f %% of the data' % (fixnum * 100 / len(zipcode5)))
is this debugging? do the fixnum lines need to exist still?
This is used for checking only. For now I still want it to be there, since Quidel might change their raw data.
👍
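If the check needs to stay, one option (a sketch, not the pipeline's code) is to route it through logging, so the message only appears when DEBUG is enabled:

```python
import logging

logger = logging.getLogger(__name__)

def fix_rate(fixnum, total):
    """Share of rows whose zip code needed fixing, in percent."""
    return fixnum * 100 / total if total else 0.0

# Same message as the commented-out print, but only emitted at DEBUG level.
logger.debug("Fixing %.2f %% of the data", fix_rate(7, 1000))
```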
zipcode5 = []
fixnum = 0
for zipcode in df['ZipCode'].values:
    if isinstance(zipcode, str) and '-' in zipcode:
Do mixed types get read into the DF, which is why this if/else exists? If so, is it worth reading everything in as str? If not, and the else isn't for NaNs, I'm unsure why the isinstance check exists.
Also, I think there might be a way to do this quicker with zfill, like str(zipcode).split("-")[0].zfill(5), though I'm not sure without knowing exactly what the raw input looks like.
Yes. Both ints and "XXXXX-XXXX" strings exist for "ZipCode" in the raw data from Quidel. The reason I don't read it in as str is that we won't report the data at the zip code level; zip codes are only used for geo mapping. It is easier to read them as int and then merge the data with map_df, which also stores zip codes as int.
👍
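For reference, the idea above can also be done vectorized; a sketch assuming the raw column mixes ints, ZIP+4 strings, and float-like strings (the zfill padding is dropped since the final dtype is int anyway):

```python
import pandas as pd

# Toy mix of the raw ZipCode shapes described in the thread (assumed).
df = pd.DataFrame({"ZipCode": [2572, "33952-1234", "12345.0"]})

# Cast everything to str, drop any "-XXXX" suffix, strip a trailing ".0"
# from float-like strings, then back to int to match map_df's int zips.
df["zip"] = (
    df["ZipCode"].astype(str)
    .str.split("-").str[0]
    .str.split(".").str[0]
    .astype(int)
)
```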
else:
    pooled_positives = tpooled_positives
    pooled_tests = tpooled_tests
## STEP 2: CALCULATE AS THOUGH THEY'RE RAW
I assume this is STEP 2 because the geo pooling had a STEP 1 in it, but it's a bit confusing that STEP 1 is somewhere else.
Co-authored-by: chinandrew <[email protected]>
        zipcode5.append(int(zipcode.split('-')[0]))
        fixnum += 1
    else:
        zipcode = int(float(zipcode))
Suggested change: zipcode = int(float(zipcode)) → zipcode = int(zipcode)
pretty sure this works
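Whether the simpler form works depends on what actually reaches this branch; a quick check of the three cases:

```python
# A genuine float: int() alone is enough.
assert int(12345.0) == 12345

# A float-LIKE string: int() alone raises, so int(float(...)) is needed.
try:
    int("12345.0")
    raise AssertionError("expected ValueError")
except ValueError:
    pass

# int(float(...)) handles both cases.
assert int(float("12345.0")) == 12345
```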
@amartyabasu, waiting on your review
I'll have it completed today.
EXPORT_DAY_RANGE = 40  # Number of dates to report

GEO_RESOLUTIONS = [
    # "county",
Is the county based aggregation not done because of small sample size?
Yes. There are few counties available with sample sizes larger than 50.
quidel_flutest/params.json.template
Outdated
"account": "[email protected]",
"password": "",
"sender": "",
"mode":"",
- "mode":"" has an extra comma at the end.
- I ran the pipeline with pull_start_date: "2020-07-01" and export_start_date: "2020-06-01". The daily CSVs got generated from 20200711 onward. Does that mean there was no data from 2020-07-01 to 2020-07-10?
- According to the implementation, should export_start_date always precede pull_start_date to account for the backfills?
- The 'flu_ag_smoothed_tests_per_device' signal does not report standard errors.
- Remember we only report a geo_id with sample_size larger than 50. There will be data from 2020-07-01 to 2020-07-10, but those days might not have a single geo_id with sample size larger than 50.
- Yes, export_start_date should always precede pull_start_date.
- Yes. I'm not sure of the definition of se for that signal.
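For proportion-style signals, the usual choice is the binomial standard error; tests_per_device is a ratio rather than a proportion, which may be why no se is defined for it. A reference sketch (not the pipeline's code):

```python
import math

def binomial_se(p, n):
    """Standard error of a sample proportion p (in [0, 1]) over n tests."""
    return math.sqrt(p * (1 - p) / n)

# e.g. 20% positivity over 50 tests
se = binomial_se(0.2, 50)
```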
Linter test:
Pytest:
res_group = res_group.merge(parent_group, how="left",
                            on="timestamp", suffixes=('', '_parent'))
res_group = res_group.drop(columns=[res_key, "state_id", "state_id" + '_parent'])
except:
In my opinion, a simpler if/else block would work better than the try/except block for the case when parent_group does not exist.
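A sketch of that suggestion, assuming parent_group is None when the parent geo has no data (names follow the snippet above; the real code may differ):

```python
import pandas as pd

res_group = pd.DataFrame({"timestamp": ["2020-07-01"], "positives": [3]})
parent_group = None  # stand-in for "no home-state data available"

# Explicit check instead of a bare `except:`, which would also swallow
# unrelated failures (KeyError, MergeError, ...).
if parent_group is not None:
    res_group = res_group.merge(parent_group, how="left",
                                on="timestamp", suffixes=("", "_parent"))
```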
…in params.json.template
How did you conduct this linter test where you got that info?
I simply ran pylint over the delphi_quidel_flutest module.
Weird, I didn't see those results. Could you try
I got 10/10 on my computer with:
As we're still receiving the source data for this, we are interested in starting to report it! (Although it will be internal-only, like the Quidel covid data.)
A fair amount of the logic in here has since been moved to delphi_utils (export_csv and geo_map). Other stylistic choices are out of date. It's unclear right now how much of the logic is the same between this and the quidel covid indicator. If they are similar, we could just copy the covid code over and modify names/connection info, rather than updating all of this code.
quidel_covidtest has age breakdowns of signals. Are those available for flu tests too? If those are easy to add, we should add them.
MIN_OBS = 50  # minimum number of observations in order to compute a proportion
POOL_DAYS = 7
should be in constants
@@ -0,0 +1 @@
*.csv
copy gitignore from quidel_covidtest. this is probably missing files/dirs
python -m venv env
source env/bin/activate
pip install ../_delphi_utils_python/.
pip install .
update with make commands
from . import geo_maps
from . import data_tools
from . import generate_sensor
from . import export
from . import pull
from . import run
not current style. also some of the functionality here has been moved to delphi_utils
remove. now in geomapper
remove. now in geomapper
we need to test xlsx files? Do we get some input data in that format?
quidel_flutest/.pylintrc
Outdated
does this file appear in other (newer) indicators?
```
p = 100 * X / N
```
If N < 50, we borrow 50 - N fake samples from its home state to shrink the estimate toward the state's mean, which means:
IIRC, we only do this in other indicators if 30 < N < 50. Check whether that also applies here.
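A sketch of the borrowing scheme as described above (illustrative only; check the actual indicator code for the 30 < N < 50 condition):

```python
def shrunk_positivity(x, n, state_p, min_obs=50):
    """Percent positive, borrowing (min_obs - n) pseudo-tests at the home
    state's rate state_p (in percent) when the local sample is too small."""
    if n >= min_obs:
        return 100 * x / n
    borrowed = min_obs - n  # fake samples lent by the home state
    return (100 * x + borrowed * state_p) / (n + borrowed)

# 10 positives out of 40 tests, home state at 20% positivity:
# (100*10 + 10*20) / 50 = 24.0, pulled toward the state mean from a raw 25.0
```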
Some decisions to make:
A mapping problem at the 5-digit zip code level:
This problem is not severe in the COVID test data: fewer than 10 zip codes are not included in 02_20_uszips.csv, and only a very small proportion of the data is related to those weird zip codes.
However, in the flu test data there are ~90 such zip codes. It is hard to manually check each one and fill in its mapping and population information. We may need to update our mapping file.
These zip codes are listed here:
{603, 622, 627, 674, 676, 683, 717, 726, 728, 732, 733, 736, 738, 754, 780, 792, 795, 907, 912, 919, 953, 957, 959, 2572, 2781, 15705, 20174, 27412, 27460, 28793, 28823, 29019, 29484, 29486, 29871, 30597, 30997, 32163, 32214, 32306, 32313,
32611, 32761, 33551, 33574, 33652, 35642, 37232, 47782, 48483, 48670, 48824, 48902, 50410, 60944, 68179, 72053,
75033, 75072, 75222, 75322, 75429, 75546, 75606, 76094, 76803, 76909, 76992, 76993, 77370, 77399, 78086, 78776,
79430, 80630, 84129, 85378, 86123, 86746, 89557, 91315, 92094, 92152, 92521, 92697, 93077,
95929, 99094, 99623}
Only 133,000 tests out of 7,519,726 are related to those zip codes through 2020-08-03.
(Remember to remove wip_ and change the pull_start_date to be earlier than 2020-05-08; it will take about half an hour to read all of the historical data.)