This project is about programmatically analyzing the front pages of newspapers.
It consists of two parts:
- The code to download the current day's newspapers from the Newseum website, and parse out text, bounding boxes, font sizes, and font faces.
- The analyses (mostly Jupyter notebooks) of the resulting data. See the `analysis/` subdirectory for more information.
For desired contributions, see `WISHLIST.md` :).
Requirements: Python 3, node, bash, qpdf, jq

On macOS (with Homebrew):

- qpdf: `brew install qpdf`
- jq: `brew install jq`
On Linux (Debian/Ubuntu), first run `sudo apt-get update`, then:

- qpdf: `sudo apt-get install qpdf`
- jq: `sudo apt-get install jq`
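To confirm both tools are on your PATH, you can check their versions (standard flags for both):

```
qpdf --version
jq --version
```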
Set up the Python environment:

```
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```

`psycopg2` is only necessary as a dependency if writing to postgres. `ipython` is there because I use it for development.
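If you are not writing to postgres, one way to skip `psycopg2` is to filter it out of the requirements (a sketch, assuming nothing else in `requirements.txt` depends on it):

```
# Install everything except psycopg2.
grep -v psycopg2 requirements.txt | pip install -r /dev/stdin
```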
Install the Node dependencies:

```
npm install
```
`./runDaily.sh` <-- This downloads the front page of today's newspapers into `data/[date]/`, performs the extractions, and loads them into a default postgres database. Note: you will need to run `createdb frontpages` to use the default settings.
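For example, a first run against a local postgres server might look like this (assuming `createdb` can reach the server with your default credentials):

```
createdb frontpages   # one-time: create the default database
./runDaily.sh         # download, decrypt, parse, and load today's front pages
```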
(What `runDaily.sh` is doing:)
`./download.sh` <-- This downloads the front page of today's newspapers into `data/[date]/`.
`./decrypt.sh` <-- This runs all the pdfs through a passwordless decrypt (automatically done by pdf viewers), and deletes the original pdfs.
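As a sketch, a passwordless decrypt with qpdf looks roughly like this (the filenames are placeholders, and `decrypt.sh`'s exact invocation may differ):

```
# Strip the empty-password encryption layer, then remove the original pdf.
qpdf --decrypt data/2018-01-01/paper.pdf data/2018-01-01/paper.decrypted.pdf \
  && rm data/2018-01-01/paper.pdf
```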
`./parse.sh` <-- This extracts xml files from the decrypted pdfs, and saves them in the same data directory.
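One common tool for this kind of extraction (text plus coordinates and font information) is poppler's `pdftohtml`. This is only an assumption about what `parse.sh` might wrap, not a description of its actual contents:

```
# Hypothetical example: emit per-textbox XML for one decrypted pdf.
pdftohtml -xml data/2018-01-01/paper.decrypted.pdf data/2018-01-01/paper
```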
`python parse_xml_to_db.py XML_FILES_DIR OUTPUT_DB OUTPUT_TABLE`, where `XML_FILES_DIR` is the `data/[date]` directory above. This aggregates the XML output from the earlier steps up to the pdf textbox level, and saves it into a database.
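For example (the date, database URI, and table name are placeholders; this assumes `OUTPUT_DB` accepts a URI like the `DB_URI` argument below):

```
python parse_xml_to_db.py data/2018-01-01 postgres://localhost/frontpages textboxes
```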
`python ingest_newspaper_metadata.py METADATA_FILE DB_URI OUTPUT_TABLE`, where `METADATA_FILE` is `data/[date]-metadata.json` -- a file output by `download.sh`. This writes metadata about newspapers that haven't been seen yet to a database table.
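And similarly for the metadata step (again, the URI and table name are placeholders):

```
python ingest_newspaper_metadata.py data/2018-01-01-metadata.json postgres://localhost/frontpages newspapers
```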
Inside the `analysis/` directory, there is a separate README for how to set up the notebook to play with the analyses.