
A repository to download documents and perform simple summary statistics on the downloads from SRN documents database API

trr266/srn_docs_stat

Repository files navigation

SRN Documents Statistics Repository

Overview

Welcome to the srn_docs_stat repository, a tool created to streamline the process of downloading data from the SRN documents database API and performing simple summary statistical analyses on the acquired data. This repository contains a Python Jupyter Notebook for gaining insights into the dataset's characteristics.

Functionality

The repository provides a range of tests and features:

1. Selective Downloads

Download only those files that are not already present in your local directory, so that the local copy can be updated as the database continues to evolve.
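The selective download step could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `download_missing`, the `fetch` callable, and the idea that each document has a string id without dots are assumptions made for the example. The real API call is passed in as `fetch` so that nothing about the SRN endpoints is assumed here.

```python
from pathlib import Path
from typing import Callable, Iterable


def download_missing(
    doc_ids: Iterable[str],
    data_dir: Path,
    fetch: Callable[[str], bytes],
) -> list[str]:
    """Download only documents that have no local file yet; return the new ids."""
    data_dir.mkdir(parents=True, exist_ok=True)
    # Compare by file stem so that files renamed with a suffix still count as present.
    present = {p.stem for p in data_dir.iterdir() if p.is_file()}
    downloaded = []
    for doc_id in doc_ids:
        if doc_id in present:
            continue  # already on disk, so skip the request
        (data_dir / doc_id).write_bytes(fetch(doc_id))
        present.add(doc_id)  # guards against duplicate ids in the input
        downloaded.append(doc_id)
    return downloaded
```

Running the function a second time with the same ids returns an empty list, which is what makes repeated updates cheap.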

2. Failed Downloads List

Record the files that could not be downloaded, along with the accompanying error messages.
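One way to collect such a list is sketched below. The function names, the `id`/`error` column names, and the CSV output format are hypothetical choices for this example; the notebook may record failures differently.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable


def collect_failures(
    doc_ids: Iterable[str],
    fetch: Callable[[str], bytes],
) -> list[dict[str, str]]:
    """Try each download and return one record per failure with its error message."""
    failures = []
    for doc_id in doc_ids:
        try:
            fetch(doc_id)
        except Exception as exc:  # broad on purpose: every error is logged, none is raised
            failures.append({"id": doc_id, "error": f"{type(exc).__name__}: {exc}"})
    return failures


def write_failures(failures: list[dict[str, str]], path: Path) -> None:
    """Save the failure records as a CSV table."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "error"])
        writer.writeheader()
        writer.writerows(failures)
```

Catching the exception per document lets a long download run finish even when individual files fail.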

3. Filetype Classification and Renaming

Categorize the downloaded files based on their filetypes and optionally append appropriate suffixes to the local file names.
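As an illustration, filetypes can be recognized from the first bytes of each file (its signature). The sketch below assumes this signature-based approach and a small hand-picked signature table; the notebook may classify files by another method, and `detect_suffix` and `add_suffix` are hypothetical names.

```python
from pathlib import Path

# A few well-known file signatures; extend the list as needed.
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"PK\x03\x04", ".zip"),  # also the container format of .docx and .xlsx
    (b"\x89PNG\r\n\x1a\n", ".png"),
]


def detect_suffix(path: Path) -> str:
    """Return a suffix guessed from the file's leading bytes, or '' if unknown."""
    with path.open("rb") as handle:
        head = handle.read(64)
    for signature, suffix in SIGNATURES:
        if head.startswith(signature):
            return suffix
    text = head.lstrip().lower()
    if text.startswith(b"<!doctype html") or text.startswith(b"<html"):
        return ".html"
    return ""


def add_suffix(path: Path) -> Path:
    """Append the detected suffix to the file name unless it is already there."""
    suffix = detect_suffix(path)
    if not suffix or path.suffix.lower() == suffix:
        return path
    target = path.with_name(path.name + suffix)
    path.rename(target)
    return target
```

Files with an unrecognized signature keep their original name, so the renaming step never guesses.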

4. Filetype Frequency Summary

Analyze the distribution of filetypes among the locally downloaded files. This helps in understanding the composition of the data.
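A frequency table of this kind can be built from the file suffixes alone. The sketch below assumes the files already carry suffixes and uses a hypothetical helper `filetype_shares`; files without a suffix are grouped under the label "(none)".

```python
from collections import Counter
from pathlib import Path


def filetype_shares(data_dir: Path) -> dict[str, tuple[int, float]]:
    """Map each suffix to (count, share of all files), most frequent first."""
    counts = Counter(
        p.suffix.lower() or "(none)" for p in data_dir.iterdir() if p.is_file()
    )
    total = sum(counts.values())
    return {
        suffix: (count, count / total) for suffix, count in counts.most_common()
    }
```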

5. Missing Files and Company Linkage

Link missing files to their corresponding companies, highlighting companies for which most files are missing.
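The linkage can be pictured as a per-company share of missing documents. In the sketch below, the record layout (dictionaries with `id` and `company` keys) is an assumption for illustration and may not match the metadata the SRN API returns.

```python
from collections import defaultdict


def missing_share_by_company(
    records: list[dict[str, str]],
    local_ids: set[str],
) -> dict[str, float]:
    """Return the share of missing documents per company, highest share first."""
    totals: dict[str, int] = defaultdict(int)
    missing: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record["company"]] += 1
        if record["id"] not in local_ids:
            missing[record["company"]] += 1
    shares = {company: missing[company] / totals[company] for company in totals}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))
```

A share close to 1.0 flags a company whose documents are mostly unavailable locally.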

6. Year Frequency Distribution

Tabulate the frequency of each year in the downloaded dataset to show how the documents are distributed over time.

7. Summary Statistics about .pdf files

Report the maximum, minimum, median, and average number of pages of the downloaded .pdf files.
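Given a list of page counts, these four statistics need only the standard library. The sketch assumes the page counts have already been extracted (for example with a PDF library such as pypdf); `page_stats` is a hypothetical helper name.

```python
from statistics import mean, median


def page_stats(page_counts: list[int]) -> dict[str, float]:
    """Return max, min, median, and mean of the given page counts."""
    if not page_counts:
        raise ValueError("page_counts must contain at least one value")
    return {
        "max": max(page_counts),
        "min": min(page_counts),
        "median": median(page_counts),
        "mean": mean(page_counts),
    }
```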

Output and Data

The repository's "output" folder contains a snapshot of the tables generated by the code. When the code is run, the up-to-date output tables are saved here.

The "data" folder contains the actual documents that are downloaded from the SRN API. By default and before running the code, this folder is empty.

Usage

To get started, simply clone this repository and follow the provided instructions in the documentation. This repository seeks to help understand the power and limitations of the SRN documents data.

License

