USP Drug Classification data dictionary + tidying #33
* Pulling drug use classes out of the CMS PUF files for categorizing the Part D data
* Fixed directory creation, refactored df, shortened loop
* Fix gitignore to ignore XLSX/ZIP and move CMS data to its own dir
* Add drug names back into annual spending data, for ease of use
* Forgot to add notebook in last commit
* Fix exploration notebook after earlier changes
* Add .DS_Store files to .gitignore
* Remove Medicare drug spending dataset (migrated to data.world)
* Remove data in favor of using data.world
  - (External) Move all data files to data.world repository (https://data.world/data4democracy/drug-spending)
  - Remove data/ directory
  - Correct notebook code to work with data.world as a source
* Wrote a helper function that gets data from a URL, wrote function that downloads Part D data based on notebook
* Added docstring to function
* Added more functions that load data and wrangle it
* Added argument parser and squashed some bugs
* Removed dependence on openpyxl, since Pandas does the trick
* Notebook runs
* Minor change to command line arg and addition to help string
* Move comments into Markdown cells and add CSV output
* Added functionality to decide between input/output data formats; supports csv and feather at the moment
Markdown version of goals statement - first draft.
Cleaning drug manufacturer data sourced from CMS.
@cduvallet! The data summary and data dictionary are SO helpful! I've asked Matt or Daniela to review it because I'm not a Python user, but whether or not the data relates to what we're doing immediately, having all this documented so well is fantastic. Thank you!
I can review this today, unless @mattgawarecki is on it already.
I can also check tonight if these classifications work for the Part D data that I've been playing around with.
Nice work! This'll be really useful!
import pandas as pd

if __name__ == "__main__":
Is this supposed to be just a script to be executable from the command line? Or should this supply a function that can be run from within a larger Python program as well? If it's the former, you don't really need the `if __name__ == "__main__"` line.
I would actually suggest moving the code below into a function `tidy_kegg_data()` or something, and then have it execute that here. This would allow someone to run this from within a larger programme, if necessary.
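A minimal sketch of that layout, for illustration only. The parsing logic is not the PR's actual code: it assumes a simplified file in which each line starts with a level letter (`A` = category, `B` = class, `C` = drug), and the sample lines under the guard are made up.

```python
def tidy_kegg_data(lines):
    """Flatten level-prefixed lines into (category, class, drug) rows.

    Assumption: each data line starts with a level letter, A = category,
    B = class, C = drug. The real br08302.keg layout may differ.
    """
    rows = []
    category = drug_class = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line[0] not in "ABC":
            continue  # skip headers, separators, and blank lines
        level, name = line[0], line[1:].strip()
        if level == "A":
            category, drug_class = name, None
        elif level == "B":
            drug_class = name
        else:
            rows.append((category, drug_class, name))
    return rows


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when imported.
    # Made-up sample lines; a real script would read the .keg file instead.
    sample = ["A Antidepressants", "B Example class", "C example_drug"]
    for row in tidy_kegg_data(sample):
        print("\t".join(str(field) for field in row))
```

With this structure, another program can `import` the module and call `tidy_kegg_data()` directly, while running the file as a script still works.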
if __name__ == "__main__":

    fname = 'br08302.keg'
Is this data on data.world? It might be worth not assuming that the data exists on the local system, or at least check whether it exists on the local system.
There's a function in `scripts/read_data.py` that might make that easier (you might need to `git pull upstream master`).
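One way the local-existence check could look, as a rough sketch. This is not the function from `scripts/read_data.py`, which is not shown here; `ensure_local_file` and its `fetch` parameter are hypothetical names, and the caller would need to supply the real download URL.

```python
import os
from urllib.request import urlretrieve


def ensure_local_file(fname, url, fetch=urlretrieve):
    """Return the path to fname, downloading it from url only if missing.

    `fetch` defaults to urllib's urlretrieve(url, filename); it is a
    parameter so a different downloader can be swapped in.
    """
    if not os.path.exists(fname):
        fetch(url, fname)  # save the remote file to the local path
    return fname
```

The script could then call `ensure_local_file('br08302.keg', url)` before opening the file, so it no longer assumes the data is already on disk.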
Yes, both great points that I meant to address but forgot! Will have time tomorrow to fix. Thanks for pointing it out! :)
@dhuppenkothen I made the changes you recommended, it's much nicer now. I wasn't sure of the best way to interface with Also, it seems that there are currently two ways we're keeping track of, downloading, and tidying data:
From what I understood from @mattgawarecki, I think we're going with option 2? But let me know if not, and I can incorporate this into the
…aries Merge data-dictionaries branch in preparation for restructuring
@cduvallet I'll let @dhuppenkothen speak to
added direct link to the datasets of interest
* Reorganization FTW
* Reorganization FTW, part 2
* Add .gitignore
* Add READMEs to each subdirectory. Rename data dictionary template (now TEMPLATE) and remove suffix from manufacturer_datadict.md.
* Add link to data.world Python client
* Update main README to reflect new file structure
* Fix link to datadictionaries
* Really fix it this time
* Fix the other datadictionaries links to overview and template
* More streamlining and edits to README
Hey @cduvallet and @dhuppenkothen! Just checking in on the status of this PR. No rush intended on my end, just wanted to make sure there isn't anything blocking either of you that we need to take care of administratively.
@jenniferthompson Nope, I was just traveling this weekend so haven't gotten around to finalizing this. Will update if I need anything from y'all! :)
Okay, I think we should be ready to merge! @jenniferthompson double-check and let me know if anything needs to change?
@cduvallet The data-dictionaries branch looks great! Would you mind pushing that to your master branch so it'll show up on master here? I think that should do it! @dhuppenkothen did you have any further suggestions on the Python code?
@jenniferthompson I think I did it! Should be ready to merge if @dhuppenkothen doesn't have other comments.
Looks good to me!
Oops. I'll get this into
Continuing on issue #14, finalize the USP Drug Classification data dictionary, etc. Raw and tidy data are on data.world.
This data may or may not be useful - it has non-Medicare Part D medications and their respective classes/categories. The classes and categories are pretty self-explanatory (e.g. `Antidepressants`, `Antiparkinson Agents`, `Sleep Disorder Agents`) and can likely easily be tied to usage (depending on how we decide to define usage...).

Some follow-up tasks, if we decide to use this data: