Scrape IMDb Movies By Searching Name

Scrape movie details from IMDb with Python 3.

.
├── LICENSE
├── README.md
├── Scrape_IMDb.txt                Full scraped data, written line by line.
├── add_pic_clarity.py      5-     Improves the sharpness of the pictures.
├── content.py              3-     The main scraping script.
├── data.txt                       The final data as a list of lists.
├── excess_log.txt                 The title ids found on IMDb, written line by line.
├── find_lost.py            4-     Finds which movies are missing a summary or poster.
├── get_ttid.py             2-     Uses the given movie name to search for its title id (ttid) on the IMDb website.
├── get_url.py              1-     An encodeURIComponent()-like URL encoding script.
├── lost_id_1.txt                  Result of find_lost.py.
├── movie_id_sort.py        6-     Merges the data and sorts it.
├── movies.dat                     The original data.
├── searchMovUrlList.txt           The encoded URLs.
├── searchMovUrlList_byLine.txt    The encoded URLs, one per line.
└── table.html                     bs4

Install

This project uses Python 3, requests, and Beautiful Soup. Install them first if you don't have them locally.

python3 -m pip install requests
pip3 install beautifulsoup4

Usage

[New update]: Begin from the second step

First step: Get the searching URL

We are given a data set that contains the movie names. In the first step we use each name to build a URL that searches for the matching movie on the IMDb website.

$ python3 get_url.py

This script converts each movie name and its release year into IMDb's search format. Including the year makes the search more accurate, so that the first result in the list is the movie we want.
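The encoding step can be sketched as follows. This is a minimal illustration, not the code in get_url.py: the function names and the `SEARCH_BASE` endpoint are assumptions, and the real script may build its URLs differently. It relies on `urllib.parse.quote`, which with the extra safe characters below escapes the same set of characters as JavaScript's encodeURIComponent().

```python
from urllib.parse import quote

# Characters that encodeURIComponent() leaves unescaped, in addition to the
# letters, digits and "_.-~" that quote() never escapes.
_UNRESERVED = "!~*'()"

# Assumed search endpoint (hypothetical; check it against the real site).
SEARCH_BASE = "https://www.imdb.com/find?q="


def encode_uri_component(text: str) -> str:
    """Percent-encode text the way encodeURIComponent() does (UTF-8)."""
    return quote(text, safe=_UNRESERVED)


def build_search_url(title: str, year: str) -> str:
    """Build a search URL from a movie title and its release year."""
    return SEARCH_BASE + encode_uri_component(f"{title} ({year})")


if __name__ == "__main__":
    print(build_search_url("Toy Story", "1995"))
```

Spaces become `%20` and non-ASCII characters are encoded as UTF-8 bytes, while the parentheses around the year are left as they are.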

Second step: Get the movie's title id in IMDb

Use the hand-built URL to search for the movie on the website. Extract the <a> tag whose relative path contains the movie's unique title id.

$ python3 get_ttid.py

The results are saved, line by line, to excess_log.txt
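Extracting the title id from the search result can be sketched like this. The project itself uses Beautiful Soup; this illustration uses only the standard library so that it runs without extra packages. The class and function names are hypothetical, and the assumption that result links look like `/title/tt.../` should be checked against the live page.

```python
import re
from html.parser import HTMLParser

# Assumed shape of a result link: a relative path such as "/title/tt0114709/...".
TITLE_HREF = re.compile(r"^/title/(tt\d+)/")


class FirstTitleId(HTMLParser):
    """Remembers the title id from the first matching <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.title_id = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.title_id is not None:
            return
        href = dict(attrs).get("href") or ""
        match = TITLE_HREF.match(href)
        if match:
            self.title_id = match.group(1)


def first_title_id(html: str):
    """Return the first title id found in the HTML, or None if there is none."""
    parser = FirstTitleId()
    parser.feed(html)
    return parser.title_id


if __name__ == "__main__":
    sample = '<td><a href="/title/tt0114709/?ref_=fn_al_tt_1">Toy Story</a></td>'
    print(first_title_id(sample))
```

Returning `None` when nothing matches lets the caller log the movie as an error instead of storing a wrong id.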

Step three: Locate the Summary and Poster

Go to each movie's detail page to scrape the summary and poster of the film. Then store the data in Scrape_IMDb.txt

$ python3 content.py

IMDb hosts the poster pictures for us.
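One possible way to locate the summary and poster is sketched below. The README does not show which tags content.py reads, so this is an assumption: it supposes that the detail page exposes Open Graph `<meta>` tags (`og:description` and `og:image`). The names are hypothetical, and the sketch works on an HTML string so it can be tried without network access.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects the content of <meta property="og:..."> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("property") or ""
        if name.startswith("og:") and attr.get("content") is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(name, attr["content"])


def summary_and_poster(html: str):
    """Return (summary, poster_url); either value is None when missing."""
    parser = OpenGraphParser()
    parser.feed(html)
    return (parser.properties.get("og:description"),
            parser.properties.get("og:image"))


if __name__ == "__main__":
    page = ('<head><meta property="og:description" content="A short summary.">'
            '<meta property="og:image" content="https://example.com/poster.jpg"/></head>')
    print(summary_and_poster(page))
```

A `None` in either position marks a movie whose summary or poster was not found.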

Step four: Find the lost Summary and Poster

In step three, some movie details, such as the poster, may fail to scrape because the ttid is incorrect or for other reasons. To find them, compare the movie_id values in excess_log.txt and Scrape_IMDb.txt, and write the missing entries to lost_id_1.txt in the same format as excess_log.txt.

$ python3 find_lost.py

Then correct the ttid manually, and use the new output file to repeat Step three and Step four!
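The comparison amounts to a set difference over movie ids, which can be sketched as below. This is an illustration rather than the code in find_lost.py. The line format is an assumption: each line is supposed to start with the movie_id followed by a `::` separator, and both the separator and the function names are hypothetical.

```python
def movie_id_of(line: str, sep: str = "::") -> str:
    """Return the first field of a line (assumed to be the movie_id)."""
    return line.split(sep, 1)[0].strip()


def find_lost(searched_lines, scraped_lines, sep: str = "::"):
    """Return the searched lines whose movie_id has no scraped record."""
    scraped_ids = {movie_id_of(line, sep)
                   for line in scraped_lines if line.strip()}
    return [line for line in searched_lines
            if line.strip() and movie_id_of(line, sep) not in scraped_ids]


if __name__ == "__main__":
    searched = ["1::tt0114709", "2::tt0113497", "3::tt0000000"]
    scraped = ["1::summary::poster", "2::summary::poster"]
    for line in find_lost(searched, scraped):
        print(line)
```

Because the lost lines are returned unchanged, the output keeps the same format as the search log and can be fed back into step three after the ids are corrected.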

Step five: Improve the sharpness

The scraped poster is a preview and is not sharp enough. This script improves the sharpness of the images by modifying each picture's URL.

$ python3 add_pic_clarity.py
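The README does not say which part of the URL add_pic_clarity.py changes. The sketch below shows one way such a rewrite could work, under the assumption that a preview URL carries its resize options between a `._V1_` marker and the file extension. The function name and the sample URL are hypothetical.

```python
import re

# Assumed shape of a preview URL: "<base>._V1_<resize options>.<ext>".
_RESIZE_SUFFIX = re.compile(r"\._V1_.*?(\.\w+)$")


def full_size_poster_url(url: str) -> str:
    """Drop the resize options after "._V1_"; other URLs are returned unchanged."""
    return _RESIZE_SUFFIX.sub(r"._V1_\1", url)


if __name__ == "__main__":
    preview = "https://example.com/images/M/abc@._V1_UX182_CR0,0,182,268_AL_.jpg"
    print(full_size_poster_url(preview))
```

Since only the URL string changes, this step needs no network access, which matches the input and output table for add_pic_clarity.py.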

The final step: Sort data

Step four adds many new records, which are unordered. This step sorts the data and outputs it as a list of lists.

$ python3 movie_id_sort.py > data.txt
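The sorting step can be sketched as follows. As before, this is an illustration and not the code in movie_id_sort.py: it assumes each record is one line of `::`-separated fields whose first field is a numeric movie_id, and the function name is hypothetical.

```python
def sort_records(lines, sep: str = "::"):
    """Split each non-empty line into fields and sort by numeric movie_id."""
    records = [line.rstrip("\n").split(sep) for line in lines if line.strip()]
    # Compare ids as integers so that 2 sorts before 10.
    records.sort(key=lambda fields: int(fields[0]))
    return records


if __name__ == "__main__":
    lines = ["10::summary a::poster a\n", "2::summary b::poster b\n"]
    print(sort_records(lines))
```

Converting the id to an integer matters here: sorting the raw strings would place "10" before "2".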

Well done! Enjoy our script!

Input and Output

Step 2: get_ttid.py

input	output	log	network	data loss
movies.dat	excess_log.txt	err_log.txt	⭕	lost ‼️

Step 3: content.py

input	output	log	network	data loss
excess_log.txt	Scrape_IMDb.txt	scrape_err_log.txt	⭕	lost ‼️

Step 4: find_lost.py

input	output	log	network	data loss
excess_log.txt & Scrape_IMDb.txt	lost_id_1.txt	NULL	do not need	maybe ‼️

Step 5: add_pic_clarity.py

input	output	log	network	data loss
Scrape_IMDb.txt	Scrape_IMDb_pic_clarity.txt	NULL	do not need	no

The last step: movie_id_sort.py

input	output	log	network	data loss
Scrape_IMDb_pic_clarity.txt	data.txt	NULL	do not need	no

Contributors

Thanks to all the people who contribute. Feel free to dive in! Open an issue or submit PRs. @wwyqianqian @Darren2017
