Explore and/or tidy the FDA_NDC_Product dataset #70
Comments
Hi, I'm interested in this task. I'm new to this project and to online collaboration in general, but not to cleaning and analyzing datasets for offline work. Do you think it would be a good fit?
Hey, welcome! Sounds like you'd definitely be a good fit with those skills, and collaborating on GitHub is fairly straightforward once you've used it for a while. The D4D github playground repo is a great resource for getting started, if you haven't checked it out yet and would like some practice. Is there a particular part of this task that you'd like to start with? Or any questions that I can answer?
Thanks, I'm excited to help out! I haven't seen the playground yet, thanks for the heads up, I'll check that out for sure. To start, I think I could tidy the dataset without too much difficulty. I'm familiar with basic tidyverse stuff and have read through the attached reference page and "Principles of Tidy Data". Are there any particular tools or other standards that you would prefer I use? Beyond that, I would like to at least address the "Potential questions" listed here. I think I saw somewhere that you prefer Jupyter or R Markdown for exploration/analysis. Would that be appropriate here, and is there a standard format I should follow?
Congrats on your first commit and pull request! 👍 Thanks for your help, and feel free to use whichever tools you're most comfortable with. Please just save the tidy result to a .csv file, which makes for easy uploading to data.world, where the datasets for this project live. A Jupyter notebook or an R Markdown file for any analysis work would be great. There's not really a standard format. The main thing I would ask is that you add comments or a quick explanation here and there to make it easier for others to follow what you've done, but that's about it. It would also be appreciated if you could knit the file to HTML (if you use R Markdown) or save the Jupyter notebook in such a way that the outputs of your work are shown (I'm not entirely sure how to do this, Jupyter is not my strong suit haha), so that it's easy to review what was done without rerunning the notebook/cells. Please save the files to the appropriate language folder in the repo (R or Python), in either the datawrangling, analysisviz, or notebook folder (these have to be reorganized a bit anyway, so it doesn't matter so much where it goes). Otherwise, it's really up to you how you want to do this.
Great, thanks!
I had forgotten to ask, but have you joined the D4D slack and the #p-drug-spending channel there?
I have, I'm "peter" on the channel.
Status
FDA_NDC_Product.csv dataset and is planning on tackling the exploration questions below.
Task
Explore and/or possibly tidy the FDA_NDC_Product.csv dataset, found on data.world
Data dictionary: https://github.com/Data4Democracy/drug-spending/blob/master/datadictionaries/FDA_NDC_Product.md
Tidy format reference: https://ramnathv.github.io/pycon2014-r/explore/tidy.html
What we're looking for
Tidying:
- Main columns that need tidying are: nonproprietaryname, substancename, active_numerator_strength, and active_ingred_unit. All of these columns may contain multiple values in one cell if a drug has multiple active ingredients. Ideal format is a separate row for each active ingredient (but other suggestions on better formatting are welcome)
- Check if nonproprietaryname and substancename have the same values (after converting both to matching format)
- fda_ndc_product_tidy.csv on data.world.
Potential questions for exploration:
FDA_NDC_Product.csv dataset. Try matching proprietaryname from the FDA_NDC_Product.csv dataset to the drugname_brand column in the spending_201x.csv datasets to address this question. After matching, what is the relationship between the nonproprietaryname and/or substancename columns from the FDA_NDC_Product.csv dataset and the drugname_generic column in the spending_201x.csv datasets?
How this will help
An option for matching the drugs in the Medicare spending datasets to therapeutic uses is to do so by active ingredients. It seems that the drugname_generic column in the spending datasets should be the main active ingredient that can be used for matching, but I had not considered drugs that may have multiple active ingredients (although drugname_generic appears to list multiple compounds for some drugs, so it may also include the active ingredients). The FDA_NDC_Product.csv dataset seems to contain a comprehensive list of the active ingredients in these drugs. Tidying the FDA_NDC_Product dataset and comparing it with the Medicare spending datasets is an important step in accurately matching drugs to therapeutic uses.
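As a rough illustration of the tidying asked for under "What we're looking for", here is a minimal Python sketch using only the standard library. It rests on assumptions that are not stated in this issue: that multiple values in substancename, active_numerator_strength, and active_ingred_unit are separated by ";" and listed in the same order, and that nonproprietaryname may separate ingredients with ";", ",", or the word "and". The function names and the sample row are hypothetical. nonproprietaryname is not split into rows here, only normalized for the comparison with substancename.

```python
import re

# Assumption: multi-valued cells use ";" as the separator.
SEPARATOR = ";"
INGREDIENT_COLUMNS = ("substancename", "active_numerator_strength", "active_ingred_unit")


def split_cell(value):
    """Split a multi-valued cell into a list of stripped, non-empty parts."""
    return [part.strip() for part in (value or "").split(SEPARATOR) if part.strip()]


def tidy_rows(rows):
    """Yield one output row per active ingredient.

    The i-th substance is paired with the i-th strength and unit (assumed
    to be listed in the same order); a missing value becomes "".
    """
    for row in rows:
        parts = {col: split_cell(row.get(col, "")) for col in INGREDIENT_COLUMNS}
        # Rows with no ingredient values at all are kept as a single row.
        n = max((len(values) for values in parts.values()), default=0) or 1
        for i in range(n):
            tidy = dict(row)
            for col, values in parts.items():
                tidy[col] = values[i] if i < len(values) else ""
            yield tidy


def normalize_names(value):
    """Reduce a name cell to a set of upper-case names, ignoring order and case.

    Assumption: names are separated by ";", "," or the word "and".
    """
    pieces = re.split(r"[;,]|\band\b", value or "", flags=re.IGNORECASE)
    return frozenset(piece.strip().upper() for piece in pieces if piece.strip())


# Hypothetical sample row, made up for illustration only.
example_rows = [
    {
        "proprietaryname": "Example Brand",
        "nonproprietaryname": "Acetaminophen and Codeine Phosphate",
        "substancename": "ACETAMINOPHEN; CODEINE PHOSPHATE",
        "active_numerator_strength": "300; 30",
        "active_ingred_unit": "mg/1; mg/1",
    }
]

tidy = list(tidy_rows(example_rows))
names_match = normalize_names(example_rows[0]["nonproprietaryname"]) == normalize_names(
    example_rows[0]["substancename"]
)
```

To run this against the real file, the rows could be read with csv.DictReader and the tidy rows written back out with csv.DictWriter. The separators should be checked against the data dictionary before trusting the result.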