GitHub - Jaegrqualm/practice-html-scraper: A basic HTML scraper with hardcoded URLs.


.vscode
.~lock.distrowatchdata-backup.csv#
.~lock.distrowatchdata.csv#
README
distrowatchdata.csv
distrowatchdata.csv.backup
htmlscrape.py
orderversions.py
test.py


## README
 
This is a simple HTML scraper written in Python.
It currently has two hardcoded URLs: one for distrowatch.com and one for the list of the top 100 distros on distrowatch.com.
Directory and file paths are also hardcoded.

The aim is to collect the package versions listed on the pages of the top 100 distros.
Output is a comma-delimited .csv file that most spreadsheet programs can open.
The output is poorly formatted and contains extra cells where there should be none, because distrowatch formats its pages irregularly.
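
As a rough illustration of the scrape-to-CSV flow described above, here is a minimal sketch that uses only the Python standard library. It is not the code in htmlscrape.py. The names `TableRowParser`, `extract_rows`, `normalize_rows` and `rows_to_csv` are hypothetical, and the sample HTML is made up rather than taken from distrowatch. It assumes the stray cells are empty and sit at the end of a row, which may not match the real pages.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return all table rows found in an HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def normalize_rows(rows):
    """Drop trailing empty cells, then pad every row to the same width."""
    trimmed = []
    for row in rows:
        cells = list(row)
        while cells and cells[-1] == "":
            cells.pop()
        trimmed.append(cells)
    width = max((len(r) for r in trimmed), default=0)
    return [r + [""] * (width - len(r)) for r in trimmed]


def rows_to_csv(rows):
    """Serialize rows as comma-delimited text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(normalize_rows(rows))
    return buffer.getvalue()


# Made-up sample input with irregular rows (not real distrowatch markup).
SAMPLE_HTML = """
<table>
  <tr><th>Package</th><th>Version</th></tr>
  <tr><td>pkg-a</td><td>1.2.3</td><td></td><td></td></tr>
  <tr><td>pkg-b</td></tr>
</table>
"""

print(rows_to_csv(extract_rows(SAMPLE_HTML)))
```

Trimming only trailing empty cells keeps the remaining columns aligned. Removing empty cells from the middle of a row would shift later values into the wrong column.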

To make the version numbers easier to work with, there is also a converter that maps version numbers to ordinal integers.
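
One possible way to map version numbers to ordinal integers is sketched below. It is not necessarily what orderversions.py does. The names `version_key` and `ordinal_map` are hypothetical, and the sketch assumes a version is compared by its numeric components only, so suffixes such as "rc1" or "a" are not handled in any meaningful way.

```python
import re


def version_key(version):
    """Turn a string like '1.10.2' into the tuple (1, 10, 2).

    Assumption: only runs of digits matter; everything else is ignored.
    """
    return tuple(int(n) for n in re.findall(r"\d+", version))


def ordinal_map(versions):
    """Map each version string to its 1-based rank in ascending order.

    Versions that produce the same numeric key share the same rank.
    """
    keys = sorted({version_key(v) for v in versions})
    rank = {key: i for i, key in enumerate(keys, start=1)}
    return {v: rank[version_key(v)] for v in versions}


# Made-up example versions.
print(ordinal_map(["1.10.0", "1.2.0", "1.9.3"]))
```

Comparing tuples of integers instead of raw strings puts "1.10.0" after "1.9.3", whereas a plain string sort would put it first.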