Scrape Merck Manuals for drug names and uses #50
Comments
I would love to help out on this one! Are we gathering all the information on the drugs listed here?
@domingohui I think that's the goal, but I'm not as well acquainted with this task as some others in the project. I think @TBusen is heading up work on Merck Manual, if I recall correctly. @TBusen, do you see any opportunities to pair with @domingohui on this issue?
Absolutely! Although, as you indicated, it's not as easy as just sending a GET request and scraping. Here are some lessons learned, where I'm at so far, and some ideas on what I plan to try next.
Set the User-Agent header to a browser string such as "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36". If you don't add this, you get a 403 error.
To really automate this, we need to be able to navigate from the main page to the drug information page. The site expects a browser, so I think Selenium or similar is needed. I'm no expert in this area, so any ideas would be warmly welcomed. Some sources that might be useful:
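As a rough illustration of the User-Agent point above, here is a minimal sketch using only Python's standard library. The helper name build_request is hypothetical, and whether this header alone is enough to avoid the 403 is an assumption based on the comment above, not something verified here.

```python
import urllib.request

# User-Agent string quoted in the comment above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
)


def build_request(url):
    """Return a Request that identifies itself as a desktop browser.

    build_request is a hypothetical helper name used for illustration.
    """
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


# Building the request does not touch the network;
# urllib.request.urlopen(req) would actually send it.
req = build_request("http://www.merckmanuals.com/professional")
```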
I was able to navigate to the generic-to-brand-name drug page using Selenium. See my PR for what I did. Now that this is done, the next step is to loop through all the drug name links and scrape the pop-ups.
Thanks, @TBusen. I don't know if you've tried this already, but if you navigate to the pro version -> Drug Information -> Drugs by Name, Generic and Brand and click on a drug name, the website sends a request -
That appears to be the response from the consumer page, not the pro page. The two pages seem to return different information for the same drug. I see in the network inspector where that is coming from, and you're right that the output is ugly, but I don't think we want the consumer view.
Good news: I think we will be able to get the data from the pop-ups. I was able to pull it in by locating the first table element, clicking it, and then telling Selenium to look for an element that isn't visible in the body. The pop-up's main body has the class name lexi-main; from there, find all paragraph elements. I haven't looped through it yet since I didn't use the drug name. If you want to take my latest commit, the browser code ends at chrome.find_element_by_link_text('Drugs by Name, Generic and Brand').click(), which generates the table of drug names. @domingohui, I'll commit what I have if you want to try to get this to work for the drugs in the table, taking the drug name as an input to the loop.
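To make the "find lexi-main, then collect its paragraphs" step concrete, here is a minimal standard-library sketch that works on page markup once the pop-up has been loaded (for example, the string Selenium exposes as page_source). This is not the code from the commit mentioned above. The class ModalParagraphs, the function modal_paragraphs, and the sample markup are all hypothetical, and the sketch assumes reasonably well-formed HTML with explicitly closed tags.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ModalParagraphs(HTMLParser):
    """Collect the text of <p> elements nested inside class="lexi-main"."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside the lexi-main element
        self.in_p = False     # True while inside a <p> within lexi-main
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag == "p":
                self.in_p = True
                self.paragraphs.append("")
        elif "lexi-main" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID_TAGS:
            return
        if tag == "p":
            self.in_p = False
        self.depth -= 1

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def modal_paragraphs(page_source):
    """Return whitespace-normalized paragraph texts from the lexi-main element."""
    parser = ModalParagraphs()
    parser.feed(page_source)
    return [" ".join(p.split()) for p in parser.paragraphs if p.strip()]


# Placeholder markup, not real Merck Manuals content.
sample = (
    '<div class="lexi-main"><p>First  paragraph.</p>'
    "<p>Second <b>bold</b> paragraph.</p></div><p>Outside</p>"
)
paragraphs = modal_paragraphs(sample)
```

Looping over drug names would then amount to clicking each link, waiting for the pop-up, and passing the resulting markup to a function like this one.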
Hey all, I just got a reply from Merck regarding our permission to use this data -- it's not great. 😢 See my comment on #14. In light of Merck's respectful declination, should we continue to keep this issue open?
Under the circumstances, probably best to close it. 😢
Closing.
Task
The Merck Manuals website contains a listing of drugs, mapping generic names to brand names and listing usage indications (i.e., what the drug is prescribed for) with each one. We'd like to gather this data to build on our efforts to map drugs to their uses.
Start here: Merck Manuals Professional Version - Drug Information
This issue was spun off from #14.
Things you should know
The Merck Manuals website defaults to its consumer version. To see the professional version, one must select it explicitly. Hotlinks to the professional version redirect to the consumer version unless this selection is done beforehand. This issue can be circumvented by setting the HTTP Referer header to the value http://www.merckmanuals.com/professional.
Retrieving usage indicators for a drug may prove more complex than simply getting its name. Usage indicators are contained in a modal pop-up that appears when the user clicks on a drug name. Because the modal is controlled via JavaScript, the markup containing the desired information may not be visible to a basic "naive" scraper. This modal is a definitive guide to the drug in fine-grained detail, so some substantial text parsing may also be necessary.
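For the Referer workaround described above, one possible approach is to attach the header to a urllib opener so that every request made through it carries the value. The function name make_opener is hypothetical, and this sketch only shows how the header could be set; it does not confirm that the site honors it.

```python
import urllib.request

# Referer value given in the issue text.
PRO_URL = "http://www.merckmanuals.com/professional"


def make_opener(referer=PRO_URL):
    """Return an opener whose requests all include the given Referer header.

    make_opener is a hypothetical helper name used for illustration.
    """
    opener = urllib.request.build_opener()
    # addheaders is the opener's list of default headers; appending keeps
    # urllib's own defaults and adds the Referer alongside them.
    opener.addheaders.append(("Referer", referer))
    return opener


# No network traffic happens here; opener.open(url) would send a request.
opener = make_opener()
```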
What we're looking for
Output from this task should be one or more data files (CSV, feather, or otherwise). In this output, the following information should be recorded for each drug: generic name, brand name, and usage indicator(s).
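One possible shape for the CSV output is sketched below. The column names, the write_drugs helper, and the choice to join multiple usage indicators with "; " in a single cell are all assumptions made for illustration; the record shown is a placeholder, not real drug data.

```python
import csv
import io

# Hypothetical column names for the three fields requested above.
FIELDS = ["generic_name", "brand_name", "usage_indicators"]


def write_drugs(records, handle):
    """Write one CSV row per drug to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for rec in records:
        row = dict(rec)
        # Several indicators share one cell, separated by "; " (assumed convention).
        row["usage_indicators"] = "; ".join(rec["usage_indicators"])
        writer.writerow(row)


# Placeholder record written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_drugs(
    [
        {
            "generic_name": "generic-a",
            "brand_name": "BRAND-A",
            "usage_indicators": ["use 1", "use 2"],
        }
    ],
    buffer,
)
csv_text = buffer.getvalue()
```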
How this will help
A robust dataset that correlates drugs with the conditions they're used to treat will prove invaluable as we start to dig into Medicare data. With the detail the Merck Manuals provide, we may be able to provide the clearest picture to date as to trends in Medicare drug spending and create snapshots that show how the Medicare population's health has changed over time.