Before running the code, install the BeautifulSoup library, which is used to scrape the articles:
pip install beautifulsoup4
To create SmartBook reports for the Ukraine-Russia crisis, you need to first scrape the CNN daily coverage articles. These will be clustered together within SmartBook to identify the major events.
You can run the code to scrape all of CNN's daily coverage within a given time period (start_date and end_date in mm-dd-yyyy format):
python cnn_ukraine_crawler_json.py --start_date <start_date_here> --end_date <end_date_here> --output_dir <output_dir_here>
Note: For optimal clustering performance, we recommend not using time periods longer than 2 weeks. For instance, if you want to generate situation reports for a one-month period, we recommend running SmartBook separately for two consecutive 2-week periods.
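If you need to cover a longer period, the split into 2-week windows can be scripted. The following is a minimal sketch, not part of SmartBook: the helper name split_into_windows and the per-window output directory names are hypothetical, and it assumes the crawler treats both start_date and end_date as inclusive.

```python
from datetime import datetime, timedelta

DATE_FMT = "%m-%d-%Y"  # mm-dd-yyyy, the format the crawler expects


def split_into_windows(start_date, end_date, max_days=14):
    """Split [start_date, end_date] into consecutive windows of at most max_days days.

    Assumption: both endpoints are inclusive, so windows never overlap.
    """
    start = datetime.strptime(start_date, DATE_FMT)
    end = datetime.strptime(end_date, DATE_FMT)
    if start > end:
        raise ValueError("start_date must not be after end_date")

    windows = []
    while start <= end:
        # A window spans max_days calendar days, clipped to the overall end date.
        window_end = min(start + timedelta(days=max_days - 1), end)
        windows.append((start.strftime(DATE_FMT), window_end.strftime(DATE_FMT)))
        start = window_end + timedelta(days=1)
    return windows


if __name__ == "__main__":
    # Print one crawler command per window; the output directory names are placeholders.
    for i, (s, e) in enumerate(split_into_windows("03-01-2022", "03-28-2022"), start=1):
        print(
            f"python cnn_ukraine_crawler_json.py --start_date {s} "
            f"--end_date {e} --output_dir articles_window_{i}"
        )
```

Each printed command can then be run on its own, and SmartBook run once per resulting directory.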
The above code creates raw text files in the output_dir, with each article having its own file and the file name corresponding to the article ID. The output_dir used here should be used as the input_dir when running the SmartBook code.
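To sanity-check the crawl before passing the directory to SmartBook, you can load the files back and inspect them. This is a minimal sketch under stated assumptions: the helper load_articles is hypothetical, the files are assumed to be UTF-8 encoded, and the article ID is taken to be the file name without its extension.

```python
from pathlib import Path


def load_articles(output_dir):
    """Return a dict mapping article ID to the raw text of that article.

    Assumptions: one article per file, UTF-8 encoding, and the article ID
    is the file name with any extension removed.
    """
    articles = {}
    for path in sorted(Path(output_dir).iterdir()):
        if path.is_file():  # skip any subdirectories
            articles[path.stem] = path.read_text(encoding="utf-8")
    return articles
```

For example, len(load_articles("<output_dir_here>")) gives the number of scraped articles, which is a quick way to confirm that the crawl produced output for the chosen dates.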