homework 1 scraping URL question #6

Open
melody1126 opened this issue Jan 19, 2022 · 1 comment

Comments

@melody1126

When we use a link that is not open to the public (unlike, say, Wikipedia) but requires login (like JSTOR, or any of the databases on the UChicago Library site), the link contains something like "proxy.uchicago.edu", and scraping returns the following:

"Shibboleth Authentication Request
If your browser does not continue automatically, click ..."

How can we get around this?

@JunsolKim

Collecting data from websites that require login can be complicated. A common approach is to use the Selenium package, which drives a real browser from Python.
For instance, the following code logs in to GitHub automatically using Selenium. Once logged in, you can access and collect GitHub content that requires login (note that Selenium may not work well on Google Colab).

[1] Install Selenium and chromedriver

pip install selenium
brew install chromedriver

These commands are for Mac users. Windows users should run pip install selenium and manually download chromedriver (see https://chromedriver.chromium.org/home).
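Before moving on to step [2], it can help to confirm that chromedriver is actually reachable, since the path differs between machines. The snippet below is a minimal sketch using only the Python standard library; the helper name find_chromedriver is made up for illustration and is not part of Selenium.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the executable if it is on PATH, else None.

    `find_chromedriver` is a hypothetical helper, not a Selenium function.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver is not on PATH; install it or note where you saved it")
    else:
        print("chromedriver found at", path)
```

If a path is printed, that is the value to pass to Selenium in step [2].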

[2] Run the following Python code

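from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Selenium 4 syntax. The path is where Homebrew puts chromedriver on Apple Silicon; adjust it for your machine.
driver = webdriver.Chrome(service=Service('/opt/homebrew/bin/chromedriver'))
driver.get('https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2FUChicago-Computational-Content-Analysis%2FFrequently-Asked-Questions')
driver.find_element(By.ID, 'login_field').send_keys('Your GitHub ID')
driver.find_element(By.ID, 'password').send_keys('Your GitHub PW')
driver.find_element(By.ID, 'password').send_keys(Keys.ENTER)  # pressing Enter submits the login form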
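
Once the browser is logged in, driver.page_source gives you the HTML of the current page, and driver.get_cookies() returns the session cookies as a list of dicts with 'name' and 'value' keys. As a rough sketch, those cookies can be replayed in a plain HTTP request, and the response can be checked for the "Shibboleth Authentication Request" page from the question. The helper names below (cookie_header, is_login_wall, fetch_with_cookies) are made up for illustration, and whether a given site accepts cookies replayed outside the browser is an assumption you would need to test.

```python
from urllib.request import Request, urlopen

# Text that appeared in the unauthenticated response quoted in the question.
LOGIN_WALL_MARKER = "Shibboleth Authentication Request"


def cookie_header(cookies):
    """Build a Cookie header value from Selenium-style cookie dicts."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def is_login_wall(html):
    """Return True if the HTML looks like the login redirect page, not real content."""
    return LOGIN_WALL_MARKER in html


def fetch_with_cookies(url, cookies):
    """Fetch url with the given cookies attached (hypothetical helper)."""
    request = Request(url, headers={"Cookie": cookie_header(cookies)})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in cookies; in practice these would come from driver.get_cookies().
    example = [{"name": "session", "value": "abc"}, {"name": "logged_in", "value": "yes"}]
    print(cookie_header(example))  # session=abc; logged_in=yes
```

If is_login_wall returns True for a fetched page, the cookies were not enough, and it is simpler to keep reading pages through the Selenium browser itself with driver.page_source.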
