GitHub - paluchasz/twic_web_crawl: I wanted to download many chess files from the website TWIC so I could add them to my database. I then updated my Python script and used cron to execute the script weekly so that my files are up to date from now on

Downloading TWIC games project

The Week in Chess (TWIC) website publishes a file of chess games every week, which I should be adding to my ChessBase TWIC database so that I always have the latest games available. Unfortunately, I haven't done this for about two years, so I now have to download more than 100 zip files.

Downloading and unzipping:

I did this using the pycurl library. The URLs share a common pattern in which only the number increases by 1 each week, so a for loop can generate them. Importantly, the URL must not contain "www": when I included it, I got a 301 (Moved Permanently) response saying that the URL had changed. The URL looks something like "http://theweekinchess.com/zips/twic1248g.zip"; I want the g.zip ending because I want a PGN file (reason below). With the previous version of the script, running $ time python3 practice_pycurl.py showed that it took over five minutes to download all the zip files. I then moved c = pycurl.Curl() and c.close() outside the for loop so that the Curl object is initialised and closed only once, which roughly halved the time to two minutes thirty seconds. Finally, I used the zipfile library to unzip each file and the os library to delete the leftover zip file.
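The approach described above could be sketched roughly as follows. This is not the repository's actual script: the function names (twic_url, download_range, unzip_and_remove) and the dest_dir parameter are made up for illustration, and only the URL pattern and the idea of reusing a single Curl object come from the text. pycurl is imported inside the download function so that the helpers can be loaded even where pycurl is not installed.

```python
import os
import zipfile

# URL pattern taken from the text above (note: no "www").
BASE_URL = "http://theweekinchess.com/zips/twic{number}g.zip"


def twic_url(number):
    """Build the download URL for one TWIC issue number."""
    return BASE_URL.format(number=number)


def download_range(first, last, dest_dir="."):
    """Download issues first..last (inclusive) with a single Curl object."""
    import pycurl  # third-party; imported here so the other helpers work without it

    curl = pycurl.Curl()  # created once, outside the loop
    results = []
    try:
        for number in range(first, last + 1):
            path = os.path.join(dest_dir, f"twic{number}g.zip")
            with open(path, "wb") as f:
                curl.setopt(curl.URL, twic_url(number))
                curl.setopt(curl.WRITEDATA, f)
                curl.perform()
            # Record the HTTP status so that a 301 or 404 can be spotted later.
            results.append((path, curl.getinfo(curl.RESPONSE_CODE)))
    finally:
        curl.close()  # closed once, after the loop
    return results


def unzip_and_remove(zip_path, dest_dir="."):
    """Extract a zip archive, delete the archive, and return the extracted names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    os.remove(zip_path)
    return names
```

A call such as download_range(1143, 1250) followed by unzip_and_remove on each returned path would cover the range mentioned later in this README.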

Reason for downloading pgn:

Had I downloaded the cbv ChessBase files instead, I did not see a way to combine them, so I would have had to append them to my TWIC database one at a time. With the PGN files I can first run $ cat * >twic1143-1250.pgn to merge all the files into one (to merge just two files, say, I can run $ cat file1 file2 >file1-2). Next, I wanted to remove all files except this one. To do this I first need to enable the extended pattern-matching option with $ shopt -s extglob and then run $ rm -v !("filename").
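For comparison, the same merge-and-clean-up step could be done from Python. This is only a sketch under stated assumptions: merge_pgn_files and its parameters are hypothetical names, and unlike $ cat * it only picks up files ending in .pgn. Files are copied in binary mode so that no assumption about the PGN text encoding is needed, and they are merged in plain name order.

```python
import glob
import os
import shutil


def merge_pgn_files(directory, merged_name, remove_inputs=False):
    """Concatenate every .pgn file in directory into merged_name.

    Returns the list of input paths that were merged, in name order.
    """
    merged_path = os.path.join(directory, merged_name)
    # Collect the inputs first and never treat the output file as an input.
    inputs = [
        path
        for path in sorted(glob.glob(os.path.join(directory, "*.pgn")))
        if os.path.abspath(path) != os.path.abspath(merged_path)
    ]
    with open(merged_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    if remove_inputs:
        # Rough equivalent of: shopt -s extglob; rm -v !("merged_name")
        for path in inputs:
            os.remove(path)
    return inputs
```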

Notes:

Documentation for pycurl: http://pycurl.io/docs/latest/quickstart.html

List of http status codes: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Curl Object: http://pycurl.io/docs/latest/curlobject.html

Documentation for zipfile: https://docs.python.org/3/library/zipfile.html#zipfile-objects

Automating the process:

First, I updated my script. The idea is to keep a separate text file containing the numbers of the TWIC files that still need to be downloaded. For each number (one per line) in the file, the function that downloads the PGN is called, and once the download has succeeded the number is deleted from the text file. Finally, the script writes the number of the file that needs to be downloaded next week. It took me ages to work out how to read and write a text file...
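One way the bookkeeping described above might look is sketched below. The names process_queue, queue_path and download are hypothetical, and the rule used for "next week's number" (largest number seen plus one) is an assumption rather than something the text specifies. The download argument is any function that takes an issue number and returns True on success.

```python
def process_queue(queue_path, download):
    """Try every queued issue number and rewrite the queue file.

    Numbers whose download succeeded are dropped, failed ones are kept,
    and the number after the largest one seen is queued for next week
    (assumption: issue numbers increase by 1 each week).
    """
    with open(queue_path) as f:
        numbers = [int(line) for line in f if line.strip()]
    if not numbers:
        return []

    # Keep only the numbers that could not be downloaded this time.
    remaining = [n for n in numbers if not download(n)]

    next_number = max(numbers) + 1
    if next_number not in remaining:
        remaining.append(next_number)

    # Opening with "w" truncates the file, so the old numbers are replaced.
    with open(queue_path, "w") as f:
        f.writelines(f"{n}\n" for n in remaining)
    return remaining
```

Reading the whole file into a list first and then rewriting it in one go avoids having to delete lines from a file that is still being iterated over.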

How do we know whether a download 'was successful'? In my function to download the PGN I added exception handling. This catches the BadZipFile error that is raised when the program tries to download a file that has not been uploaded to the website yet, presumably because what gets saved is then not a valid zip archive. I assume more things can go wrong, but I guess I will find out later.
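The success check could be sketched as below. try_unzip is a hypothetical helper name, not the function from the repository; zipfile.BadZipFile is the standard-library exception that zipfile.ZipFile raises when the file it is given is not a zip archive.

```python
import zipfile


def try_unzip(zip_path, dest_dir="."):
    """Extract zip_path into dest_dir.

    Returns True on success and False if the file is not a valid zip
    archive (for example, an error page saved under a .zip name).
    """
    try:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(dest_dir)
    except zipfile.BadZipFile:
        return False
    return True
```

Other failures, such as network errors raised by pycurl or a full disk, are not handled here and would still propagate to the caller.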

How did I set it up to run weekly? I wrote a cron job. Run $ crontab -e in a terminal (use -l instead to list the existing cron jobs) and add the line: 30 12 * * 2 /home/paluchasz/My_projects/downloading_twic_games.py, which runs the script every Tuesday at 12:30. I also added a shebang line at the top of the Python script: #!/usr/bin/python3. Because the cron entry runs the script directly rather than through python3, this line is needed, and the file must also be executable.

