Command-line interface (CLI) program for downloading 10-K, 10-K/A, 10-Q, and 10-Q/A filings from the SEC EDGAR database. Note that the program is not optimized for efficiency (e.g., it does not use parallelization).
- Set up a Python environment (version 3.8.5)
pip install -r requirements.txt
- Download index files from SEC EDGAR for the period --start to --end (written to output/index).
python src/scraping.py download-index --user-agent 'ORG_NAME MAIL_ADDRESS' --start 2004 --end 2022
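As a rough illustration of what this step iterates over, the sketch below builds one index URL per quarter for a range of years. EDGAR publishes quarterly index files under its full-index directory; which index file the program actually downloads is not stated here, so the `master.idx` default and the helper name `index_urls` are assumptions.

```python
from typing import List

# Base URL of EDGAR's quarterly full-index directory.
EDGAR_FULL_INDEX = "https://www.sec.gov/Archives/edgar/full-index"


def index_urls(start: int, end: int, index_name: str = "master.idx") -> List[str]:
    """Return one index URL per quarter for the years start..end (inclusive).

    `index_name` is an assumption; the program may use a different index file.
    """
    if start > end:
        raise ValueError("start must not be greater than end")
    return [
        f"{EDGAR_FULL_INDEX}/{year}/QTR{quarter}/{index_name}"
        for year in range(start, end + 1)
        for quarter in range(1, 5)
    ]


urls = index_urls(2004, 2022)
print(len(urls))  # 19 years x 4 quarters = 76
print(urls[0])
```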
- Compute the number of available filings (written to output).
python src/scraping.py count-filings --start 1996 --end 2022 --form-type 10-k
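Counting filings amounts to tallying the form-type column of the index files. The sketch below assumes the pipe-delimited layout of EDGAR's `master.idx` (CIK, company name, form type, date filed, filename); the helper name `count_form_types` and the sample rows are invented for illustration.

```python
from collections import Counter
from typing import Iterable


def count_form_types(lines: Iterable[str]) -> Counter:
    """Tally filings per (lower-cased) form type from master.idx-style lines."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Data rows have five fields and a numeric CIK; header lines are skipped.
        if len(fields) == 5 and fields[0].isdigit():
            counts[fields[2].lower()] += 1
    return counts


# Invented example rows, not real filings.
sample = [
    "CIK|Company Name|Form Type|Date Filed|Filename",
    "--------------------------------------------------",
    "1000001|EXAMPLE CORP A|10-K|2020-03-02|edgar/data/1000001/0000000000-20-000001.txt",
    "1000002|EXAMPLE CORP B|10-Q|2020-03-03|edgar/data/1000002/0000000000-20-000002.txt",
    "1000003|EXAMPLE CORP C|10-K|2020-03-04|edgar/data/1000003/0000000000-20-000003.txt",
]
counts = count_form_types(sample)
print(counts["10-k"], counts["10-q"])
```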
- Download --form-type filings (written to output/filings/--form-type) and extract metadata from all downloaded filings (written to output/filings/--form-type/metadata.csv). Note: restrict the number of filings per quarter via -N, or set it to a sufficiently high number to download all available filings, e.g., 32,000 for 8-K, 10,000 for 10-K, or 13,000 for 10-Q.
python src/scraping.py download-filings --user-agent 'ORG_NAME MAIL_ADDRESS' --start 2012 --end 2013 --form-type 10-k -N 10000
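The --user-agent value is sent to EDGAR with each request. A minimal sketch of how a single filing request could be prepared with the standard library is shown below; it only builds the request and does not send it. The helper name `build_request` and the example path are hypothetical, and the program itself may use a different HTTP client.

```python
import urllib.request

# Filenames listed in the EDGAR index files are relative to this archive root.
EDGAR_ARCHIVES = "https://www.sec.gov/Archives/"


def build_request(filename: str, user_agent: str) -> urllib.request.Request:
    """Prepare (but do not send) a request for one filing listed in an index file."""
    url = EDGAR_ARCHIVES + filename.lstrip("/")
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical filename, used only to show the resulting URL.
req = build_request(
    "edgar/data/1000001/0000000000-20-000001.txt",
    "ORG_NAME MAIL_ADDRESS",
)
print(req.full_url)
```

To fetch the filing, the request would be passed to `urllib.request.urlopen(req)`, ideally with a short pause between successive requests.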
- Preprocess filings, i.e., remove markup tags, number-heavy tables, multiple newlines, etc. Note: The cleaned filing overwrites the raw filing to save disk space. It still contains markup tags for text-heavy tables ([TABLE] ... [/TABLE]) for debugging purposes; these tags are removed automatically during information extraction in the next step.
python src/parsing.py clean-filings --start 2013 --end 2013 --form-type 10-k
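The following is a simplified sketch of this kind of cleaning, not the program's actual implementation: it strips markup tags and collapses repeated whitespace and blank lines using regular expressions. Detection of number-heavy tables and the [TABLE] markers is left out, and the function name `clean_text` is an assumption.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")        # any markup tag such as <p> or </td>
SPACES_RE = re.compile(r"[ \t]+")      # runs of spaces and tabs
BLANK_LINES_RE = re.compile(r"\n{3,}")  # three or more consecutive newlines


def clean_text(raw: str) -> str:
    """Remove markup tags and normalize whitespace in a raw filing."""
    text = TAG_RE.sub(" ", raw)
    # Collapse spaces within each line and trim the line ends.
    lines = [SPACES_RE.sub(" ", line).strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    text = BLANK_LINES_RE.sub("\n\n", text)
    return text.strip()


print(clean_text("<p>Hello</p>\n\n\n\n<b>World</b>"))
```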
- Extract Item 1 (Business Description) or MD&A (Management Discussion and Analysis) sections from the respective --form-type according to flexible, hand-coded regex patterns (written to output/filings/--form-type with respective file suffixes). Note: Item 1 extraction is only applicable to 10-K filings.
python src/parsing.py extract-item1 --start 2020 --end 2020 --form-type 10-k
python src/parsing.py extract-mda --start 2020 --end 2022 --form-type 10-k
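To give a sense of how regex-based section extraction can work, here is a deliberately simple sketch for Item 1. The patterns are illustrative assumptions and much less flexible than the program's hand-coded ones. Because section headings also appear in the table of contents, the sketch keeps the longest candidate, which is one possible heuristic and not necessarily the one used here.

```python
import re
from typing import Optional

# Heading that opens Item 1, e.g. "Item 1. Business" at the start of a line.
START_RE = re.compile(
    r"^\s*item\s+1\s*[.:\-]?\s*business\b", re.IGNORECASE | re.MULTILINE
)
# Heading of a later item that closes the section.
END_RE = re.compile(r"^\s*item\s+(1a|1b|2)\b", re.IGNORECASE | re.MULTILINE)


def extract_item1(text: str) -> Optional[str]:
    """Return the longest span from an Item 1 heading to the next item heading."""
    candidates = []
    for start in START_RE.finditer(text):
        end = END_RE.search(text, start.end())
        stop = end.start() if end else len(text)
        candidates.append(text[start.start():stop].strip())
    # Table-of-contents entries yield short candidates; prefer the longest one.
    return max(candidates, key=len) if candidates else None


doc = (
    "Table of Contents\n"
    "Item 1. Business\n"
    "Item 1A. Risk Factors\n"
    "\n"
    "Item 1. Business\n"
    "We make widgets and sell them worldwide.\n"
    "Item 1A. Risk Factors\n"
    "Our business is risky.\n"
)
print(extract_item1(doc))
```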
- Helper function to sample filings from each quarter for ex post validation after setting a random seed --seed (written to output/sample).
python src/utils.py sample-filings --start 2020 --end 2022 --form-type 10-k --section-type item1 -N 4 --seed 2022
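Seeded sampling per quarter can be sketched as follows. The input mapping, the quarter labels, and the function name `sample_per_quarter` are assumptions for illustration; the point is that a fixed seed combined with a fixed iteration order makes the sample reproducible.

```python
import random
from typing import Dict, List


def sample_per_quarter(
    files_by_quarter: Dict[str, List[str]], n: int, seed: int
) -> Dict[str, List[str]]:
    """Draw up to n files from each quarter using a seeded random generator."""
    rng = random.Random(seed)
    sample = {}
    # Sort quarters and file names so the result does not depend on input order.
    for quarter in sorted(files_by_quarter):
        files = sorted(files_by_quarter[quarter])
        sample[quarter] = rng.sample(files, min(n, len(files)))
    return sample


# Hypothetical file names.
files = {
    "2020Q1": [f"filing_{i}.txt" for i in range(10)],
    "2020Q2": ["filing_a.txt", "filing_b.txt"],
}
print(sample_per_quarter(files, n=4, seed=2022))
```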
- Helper function to gather all sections in one large .txt file, with documents delimited by newlines (written to output/filings/--form-type). Note: Use --min-sec-length to filter out short and/or corrupt sections caused by parsing errors or omission. Empirically, a minimum length of 2,500 characters for 10-K/10-Q MD&A sections and 1,500 characters for Item 1 sections filters out most of the edge cases.
python src/utils.py gather-sections --form-type 10-k --section-type item1 --min-sec-length 1500
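The filtering and joining logic can be sketched as below. Each section is flattened to a single line so that a newline can serve as the document delimiter. Whether the program measures the length before or after flattening is not stated, so measuring the flattened text is an assumption, as is the function name `gather_sections`.

```python
from typing import Iterable


def gather_sections(sections: Iterable[str], min_sec_length: int = 2500) -> str:
    """Join sufficiently long sections into one string, one section per line."""
    kept = []
    for section in sections:
        # Replace internal newlines and repeated whitespace with single spaces.
        flat = " ".join(section.split())
        if len(flat) >= min_sec_length:
            kept.append(flat)
    return "\n".join(kept)


sections = ["x" * 3000, "too short", "y" * 1500 + "\n" + "z" * 1500]
corpus = gather_sections(sections)
print(len(corpus.split("\n")))  # number of sections kept
```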
The available shorthands and argument default values are listed below. They can also be displayed by running:
python edgar_scrape.py [-h | --help]
-h, --help
--user-agent=STR Agent to identify with SEC EDGAR (of the form 'ORG_NAME MAIL_ADDRESS')
--start=INT Start year for scraping [default: 1996].
--end=INT End year for scraping [default: 2020].
--form-type=STR Form type (one of: 10-k, 10-k/a, 10-q, 10-q/a) [default: 10-k].
-N INT, --no-of-filings=INT Number of filings to be sampled per quarter [default: 10].
--seed=INT Random seed for sampling [default: 2020].
--min-sec-length=INT Minimum length of section in characters [default: 2500].
- Accuracy: 93.25%
- TP: 352
- TN: 19
- FP: 0
- FN: 27
- Accuracy: 95.54%
- TP: 303
- TN: 16
- FP: 0
- FN: 15
- Accuracy: 91%
- TP: 363
- TN: 2
- FP: 1
- FN: 34
- Accuracy: 96.4%
- TP: 323
- TN: 1
- FP: 0
- FN: 12
- Accuracy: 97.5%
- TP: 370
- TN: 20
- FP: 2 (incl. ToC)
- FN: 8
- Accuracy: 98.21%
- TP: 314
- TN: 16
- FP: 2 (incl. ToC)
- FN: 4
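Assuming accuracy is defined in the usual way as the share of correct outcomes, (TP + TN) / (TP + TN + FP + FN), the figures can be recomputed from the counts. The sketch below does so for the last set of counts above; the function name `accuracy` is illustrative.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of correct outcomes among all evaluated filings."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one outcome is required")
    return (tp + tn) / total


# Counts from the last result block: TP 314, TN 16, FP 2, FN 4.
print(f"{accuracy(314, 16, 2, 4):.2%}")  # 98.21%
```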