Welcome to the srn_docs_stat repository, a tool created to streamline the process of downloading data from the SRN documents database API and performing simple summary statistical analyses on the acquired data. This repository contains a Python Jupyter Notebook for gaining insights into the dataset's characteristics.
The repository provides a range of tests and features:
Download only files that are not already present in your local directory, so the local copy can be updated as the database continues to evolve.
Record the files that could not be downloaded, along with the corresponding error messages.
Categorize the downloaded files based on their filetypes and optionally append appropriate suffixes to the local file names.
Analyze the distribution of filetypes among the locally downloaded files to help understand the composition of the data.
Link missing files to their corresponding companies, showing which companies have mostly missing files.
Evaluate the frequency distribution of years within the downloaded dataset, aiding in identifying temporal distribution.
Compute the maximum, minimum, median, and average number of pages of the .pdf files.
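The first two features (skipping files that already exist locally and recording failed downloads) could be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual code: `download_missing`, `doc_ids`, and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever request the notebook sends to the SRN API.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Iterable


def download_missing(
    doc_ids: Iterable[str],
    data_dir: Path,
    fetch: Callable[[str], bytes],  # hypothetical: returns the raw bytes of one document
) -> list[dict]:
    """Download documents not yet in data_dir and return a list of failure records."""
    data_dir.mkdir(parents=True, exist_ok=True)
    # Compare on the file stem so files saved with a suffix ("abc.pdf") still count.
    existing = {p.stem for p in data_dir.iterdir() if p.is_file()}
    failures = []
    for doc_id in doc_ids:
        if doc_id in existing:
            continue  # already downloaded in an earlier run
        try:
            content = fetch(doc_id)
        except Exception as exc:
            # Keep the error message so failed downloads can be reviewed later.
            failures.append({"id": doc_id, "error": str(exc)})
            continue
        (data_dir / doc_id).write_bytes(content)
    return failures
```

Passing the download function in as an argument keeps the sketch independent of the real API client and makes it easy to try with a stub.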
The repository's "output" folder contains a snapshot of the tables generated by the code. When you run the code, the up-to-date output tables are saved here.
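As an illustration of how one such table might be produced, the sketch below computes the page statistics listed above and writes them to a CSV file. The function names, the CSV format, and the file name in the usage note are assumptions for this example; the page counts are taken as given, since reading them from the .pdf files requires a PDF library that is not named here.

```python
import csv
import statistics
from pathlib import Path


def page_summary(page_counts):
    """Return max, min, median and mean for a non-empty list of page counts."""
    return {
        "max": max(page_counts),
        "min": min(page_counts),
        "median": statistics.median(page_counts),
        "mean": statistics.mean(page_counts),
    }


def save_table(row, out_path):
    """Write a single-row table (a dict) to out_path as CSV, creating the folder if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)
```

For example, `save_table(page_summary([10, 2, 4, 8]), "output/pdf_pages.csv")` would write one header line and one data line to a hypothetical `pdf_pages.csv` in the output folder.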
The "data" folder contains the actual documents that are downloaded from the SRN API. By default and before running the code, this folder is empty.
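Once the "data" folder has been filled, the filetype categorization and optional suffixing described in the feature list could look something like the sketch below. It guesses the type from the first bytes of each file; the signature table is deliberately small, the function names are hypothetical, and the notebook may detect filetypes differently.

```python
from collections import Counter
from pathlib import Path

# Leading bytes ("magic numbers") of two common formats; extend as needed.
SIGNATURES = {
    b"%PDF-": ".pdf",
    b"PK\x03\x04": ".zip",  # also the container format of .docx and .xlsx
}


def detect_suffix(path: Path) -> str:
    """Guess a file suffix from the first bytes of the file, or return 'unknown'."""
    with path.open("rb") as fh:
        head = fh.read(8)
    for magic, suffix in SIGNATURES.items():
        if head.startswith(magic):
            return suffix
    return "unknown"


def filetype_counts(data_dir: Path, rename: bool = False) -> Counter:
    """Count files per detected type; optionally append the suffix to each file name."""
    counts = Counter()
    # sorted() materializes the listing first, so renaming inside the loop is safe.
    for path in sorted(data_dir.iterdir()):
        if not path.is_file():
            continue
        suffix = detect_suffix(path)
        counts[suffix] += 1
        if rename and suffix != "unknown" and path.suffix != suffix:
            path.rename(path.with_name(path.name + suffix))
    return counts
```

The returned `Counter` gives the filetype distribution directly, for instance via `filetype_counts(Path("data")).most_common()`.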
To get started, simply clone this repository and follow the provided instructions in the documentation. This repository seeks to help understand the power and limitations of the SRN documents data.