This script merges scientific paper datasets in .csv format while keeping track of where duplicates originate. Besides the merged output file, the script produces a copy of the subset of papers that lack an abstract, and an .xlsx file describing the number of unique records from each input set and the number of records found for every combination of input sets.
- Data Merging: Combines multiple datasets into a single file.
- Source Column Filling: Marks the source file for each data entry.
- Duplicate Removal: Identifies and removes duplicate entries based on titles, abstracts, and DOIs (see the sketch after this list).
- Annotation File Creation: Generates files for missing abstracts with columns for annotator judgments.
- Statistics Generation: Outputs statistics about the input and merged datasets.
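The duplicate check keys on the DOI where one is present and falls back to the combined title and abstract otherwise. Below is a minimal pandas sketch of that idea; the column names doi, title, and abstract are assumptions, and the actual script works on ASReview data objects rather than a raw DataFrame.

```python
import pandas as pd

def find_duplicates(df: pd.DataFrame) -> pd.Series:
    """Flag every record that repeats an earlier one (a sketch, not the script's exact logic)."""
    # Normalize DOIs so that case and URL-prefix variants compare equal.
    doi = (
        df["doi"]
        .str.lower()
        .str.replace(r"https?://(www\.)?doi\.org/", "", regex=True)
    )
    # Fall back to lowercased title + abstract where the DOI is missing.
    text = (df["title"].fillna("") + " " + df["abstract"].fillna("")).str.lower()
    key = doi.fillna(text)
    return key.duplicated(keep="first")

# Keep only the first occurrence of each record:
# merged = merged[~find_duplicates(merged)].reset_index(drop=True)
```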
- Python 3.x
- Libraries: pandas, numpy, matplotlib, argparse, warnings, pathlib, re, sys
- ASReview (asreview) for handling systematic review data
Before running the script, ensure you have Python installed and the required libraries; argparse, warnings, pathlib, re, and sys ship with the Python standard library. You can install the third-party libraries using pip:
pip install pandas numpy matplotlib asreview
The script is executed from the command line with the output file name and the list of input files as arguments.
python foras_project.py <output_file> <input_file1> <input_file2> ... <input_fileN>
python foras_project.py merged_output.csv dataset1.csv dataset2.csv dataset3.csv
This command merges dataset1.csv, dataset2.csv, and dataset3.csv into a new file named merged_output.csv.
- merge(output_file, input_files): Main function to merge input files into a single output file.
- output_annotation_df(annotation_df, annotators, output_file): Creates annotation files for missing abstracts.
- fill_source_columns(dataframe, column_names): Fills source columns in the merged dataframe.
- assign_mother_id(df): Assigns a unique mother ID (MID) to each entry.
- clean_columns(dataframe): Cleans up unnecessary columns from the dataframe.
- count_unique_records(dataframe): Counts unique records in the merged dataset.
- source_counts(dataframe, output_file): Counts the number of sources for each entry (see the sketch after this list).
- duplicated(asrdata, pid='doi'): Identifies duplicate entries based on DOI or text content.
- drop_duplicates(asrdata, pid='doi', inplace=False, reset_index=True): Removes duplicate entries from the dataset.
- main(): Entry point of the script.
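The overlap statistics behind the .xlsx report can be sketched in a few lines, assuming the merged frame carries one boolean source column per input file; the source_ column prefix here is an assumption for illustration, not the script's actual naming.

```python
import pandas as pd

def overlap_counts(df: pd.DataFrame) -> pd.Series:
    """Count records per combination of input sets (a sketch under the assumptions above)."""
    source_cols = [c for c in df.columns if c.startswith("source_")]
    # Each row maps to the tuple of input sets it occurs in, e.g. (True, False, True);
    # value_counts then yields one count per combination.
    return df[source_cols].value_counts()

# overlap_counts(merged).to_excel("overlap_statistics.xlsx")
```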
- Merged Dataset: A CSV file containing the merged records.
- Missing Abstracts File: Excel files for each annotator with records missing abstracts.
- Statistics File: A text file containing statistics about the input datasets and the merged dataset.
The FORAS Project preprocessing script is designed for cleaning and standardizing datasets before they are used for systematic reviews or data analysis. This script focuses on generalizing column names, cleaning data, and preparing datasets for further processing. It supports CSV input files and outputs the processed data to a specified CSV file.
- Column Generalization: Renames columns to standard names for titles, years, and DOIs.
- Label Preparation: Prepares the label column by renaming and cleaning values.
- DOI Cleaning: Cleans and standardizes DOI column values.
- Column Cleaning: Removes unnecessary columns from the dataset.
- Python 3.x
- Libraries: pandas, numpy, argparse, warnings, pathlib, sys, re
- ASReview (asreview) for handling systematic review data configurations
Ensure you have Python installed on your system along with the required libraries; argparse and the other standard-library modules need no installation. You can install the third-party libraries using pip:
pip install pandas numpy asreview
This script is executed from the command line, taking the input file name and the output file name as arguments.
python preprocess_script.py <input_file.csv> <output_file.csv>
python preprocess_script.py dataset.csv cleaned_dataset.csv
This command will preprocess dataset.csv and save the cleaned and standardized data to cleaned_dataset.csv.
- preprocess(file, output_file): Main function for preprocessing the input CSV file.
- clean_columns(dataframe): Removes unnecessary columns from the dataframe.
- generalize_title_column(dataframe): Standardizes the column name for titles.
- generalize_year_column(dataframe): Standardizes the column name for years (see the sketch after this list).
- generalize_doi_column(dataframe): Standardizes the column name for DOIs.
- prepare_label_column(dataframe): Prepares the label column by renaming and cleaning.
- clean_doi_column(dataframe): Cleans DOI column values.
- main(): Entry point of the script.
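As an illustration of the generalization steps, here is a minimal sketch of what a function like generalize_year_column might do; the regex and the header variants it accepts are assumptions, not the script's actual patterns.

```python
import re
import pandas as pd

def generalize_year_column(df: pd.DataFrame) -> pd.DataFrame:
    """Rename the first header that looks like a year column to 'year' (a sketch)."""
    for col in df.columns:
        # Accept variants such as "Year", "Pub Year", or "Publication Year".
        if re.fullmatch(r"(publication\s+|pub\s+)?year", col.strip(), flags=re.IGNORECASE):
            return df.rename(columns={col: "year"})
    return df
```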
The original dataset (dataset.csv) might have entries with varying column names, missing data, and unstandardized formats. For example:

| Title | Pub Year | record DOI | Included | Unnamed: 4 |
|---|---|---|---|---|
| Sample Research Article | 2020 | https://doi.org/10.1000/journal.2020 | 1 | NaN |
| Another Research Article | 2019 | 10.1001/journal.2019 | -1 | NaN |
- Column names are inconsistent (e.g., "Pub Year" instead of "year").
- The DOI column includes prefixes and is not lowercase.
- The "Included" column contains -1 to indicate exclusion, which is not standard.
- There is an unnecessary column ("Unnamed: 4").
After running python preprocess_script.py dataset.csv cleaned_dataset.csv, the dataset is cleaned and standardized:
| title | year | doi | included_dataset.csv |
|---|---|---|---|
| Sample Research Article | 2020 | 10.1000/journal.2020 | 1 |
| Another Research Article | 2019 | 10.1001/journal.2019 | |
- Column names are standardized to "title", "year", "doi", and "included_dataset.csv" to reflect the source file and inclusion status.
- The DOI is cleaned to remove the URL prefix and is all lowercase.
- The "Included" column is renamed to "included_dataset.csv", and -1 values are replaced with blanks to indicate exclusion more clearly.
- Unnecessary columns are removed.
This example demonstrates how the preprocessing script standardizes dataset entries, making them cleaner and more consistent for subsequent analysis or systematic review processes.
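In pandas terms, the two core transformations in this example could look roughly like the following sketch; the real clean_doi_column and prepare_label_column may differ in detail.

```python
import pandas as pd

def clean_doi_column(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase DOIs and strip URL prefixes such as https://doi.org/ (a sketch)."""
    df["doi"] = (
        df["doi"]
        .str.lower()
        .str.replace(r"https?://(dx\.)?doi\.org/", "", regex=True)
    )
    return df

def prepare_label_column(df: pd.DataFrame, source_name: str) -> pd.DataFrame:
    """Rename 'Included' after the source file and blank out -1 (exclusion) values."""
    label = f"included_{source_name}"
    df = df.rename(columns={"Included": label})
    df[label] = df[label].replace(-1, pd.NA)
    return df
```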
- Processed Dataset: A CSV file containing the preprocessed dataset with standardized column names and cleaned data.
The FORAS Project serving script is designed to facilitate the sampling and distribution of records for annotation by multiple annotators. This script takes a dataset, filters out previously annotated records, and randomly selects a specified number of new records to be annotated. Each annotator receives a personalized Excel file with the records to annotate, ensuring an efficient and organized annotation process.
- Prior Calibration Filtering: Excludes records already annotated in previous batches.
- Random Sampling: Randomly selects a specified number of records for annotation.
- Annotation File Generation: Generates personalized Excel files for each annotator with the records to be annotated.
- Important Column Preservation: Retains only essential columns such as title, abstract, DOI, and MID.
- Python 3.x
- Libraries: pandas, numpy, argparse, warnings, pathlib, sys
- ASReview (asreview) for handling systematic review data configurations
Ensure you have Python installed on your system along with the required libraries. You can install the necessary third-party libraries using pip:
pip install pandas numpy asreview
This script is executed from the command line, taking the dataset file, a prior calibration file, the number of records to sample, and the list of annotators as arguments.
python serve_script.py <dataset_file.csv> <prior_calibration_file.xlsx> <n_records> <annotator1> [<annotator2> ...]
python serve_script.py dataset.csv prior_annotations.xlsx 100 JohnDoe JaneDoe
This command will sample 100 records from dataset.csv, excluding any records found in prior_annotations.xlsx, and generate JohnDoe.xlsx and JaneDoe.xlsx for the annotators John Doe and Jane Doe, respectively.
- serve(file, prior_calibration_file, n_records, annotators): Main function for sampling and serving records for annotation.
- output_annotation_df(annotation_df, annotators): Generates personalized annotation Excel files for each annotator.
- sort_by_date(df): Placeholder function for future implementation to sort records by date.
- sample(df, n_records, prior_mid): Samples a specified number of records, excluding those with MIDs found in the prior calibration file (see the sketch after this list).
- main(): Entry point of the script, parsing command line arguments and initiating the serving process.
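The sampling step itself reduces to a filter plus pandas' sample, assuming the prior calibration file stores already-served identifiers in a MID column; the fixed random_state is an assumption added here for reproducibility, not necessarily the script's behavior.

```python
import pandas as pd

def sample_new_records(df: pd.DataFrame, prior_mid: pd.Series, n_records: int) -> pd.DataFrame:
    """Draw n_records rows whose MID has not been annotated before (a sketch)."""
    unseen = df[~df["MID"].isin(prior_mid)]
    return unseen.sample(n=n_records, random_state=42)

# prior_mid = pd.read_excel("prior_annotations.xlsx")["MID"]
# batch = sample_new_records(pd.read_csv("dataset.csv"), prior_mid, 100)
# batch[["MID", "title", "abstract", "doi"]].to_excel("JohnDoe.xlsx", index=False)
```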
- Annotator Excel Files: Each annotator specified in the command line arguments will receive an Excel file named after them containing the records to annotate. These files include columns for recording annotations related to title eligibility, inclusion criteria, and any other exclusion reasons.
The FORAS Project Serve Vocabulary script is specifically designed for the early calibration phase of annotation. Its primary function is to provide annotators with a diverse set of records from a dataset to examine the evolution of terms and concepts over time. The script selects a specified number of the oldest records, the same number of random records, and the same number of the newest records from the dataset (100 of each in the example below), aiming to capture the variance in vocabulary and thematic focus across different periods.
- Vocabulary Calibration: Helps identify changes in terms and concepts throughout the years within a dataset.
- Targeted Record Selection: Selects three groups of records: the oldest, a random sample, and the newest.
- Exclusion of Previously Annotated Records: Ensures that records previously used in calibration are not included again.
- Annotator-Specific Files: Generates personalized Excel files for each annotator with the selected records.
- Python 3.x
- Libraries: pandas, numpy, argparse, warnings, pathlib, sys
- ASReview (asreview) for systematic review data configurations
Ensure Python and the required libraries are installed on your system. Install the necessary Python libraries using pip:
pip install pandas numpy asreview
This script is executed from the command line, accepting the dataset file, a prior calibration file, the number of records to sample, and the list of annotators as arguments.
python serve_vocabulary_script.py <dataset_file.csv> <prior_calibration_file.xlsx> <n_records> <annotator1> [<annotator2> ...]
python serve_vocabulary_script.py dataset.csv prior_annotations.xlsx 100 JohnDoe JaneDoe
This command will process dataset.csv, excluding records found in prior_annotations.xlsx, and generate JohnDoe.xlsx and JaneDoe.xlsx for the annotators John Doe and Jane Doe, respectively. Each file will contain 300 records, including the oldest, randomly selected, and newest records from the dataset.
- serve(file, prior_calibration_file, n_records, annotators): Main function for selecting and serving records for vocabulary calibration.
- output_annotation_df(annotation_df, annotators): Generates personalized annotation Excel files for each annotator.
- sort_by_date(df): Sorts the dataframe by the year column in ascending order.
- old_random_new(df, n_records, prior_mid): Selects old, random, and new records from the dataframe (see the sketch after this list).
- main(): Entry point of the script, parsing command line arguments and initiating the record serving process.
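Here is a minimal sketch of the old/random/new selection. It assumes a numeric year column, that previously calibrated MIDs have already been filtered out, and that the random group is drawn from the middle slice so the three groups do not overlap; the real old_random_new may handle these details differently.

```python
import pandas as pd

def old_random_new(df: pd.DataFrame, n_records: int) -> pd.DataFrame:
    """Return the n oldest, n random (middle), and n newest records (a sketch)."""
    df = df.sort_values("year", ascending=True)
    oldest = df.head(n_records)
    newest = df.tail(n_records)
    # Sample the random group from the remaining middle slice.
    middle = df.iloc[n_records:-n_records]
    randoms = middle.sample(n=n_records, random_state=42)
    return pd.concat([oldest, randoms, newest])
```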
- Annotator Excel Files: Each annotator specified in the command line arguments will receive an Excel file containing the selected records for annotation. These files include necessary columns for recording annotations related to title eligibility, inclusion criteria, and other exclusion reasons.
The FORAS Project Insert Labels script is designed to integrate annotations from annotator-specific files back into the main dataset, referred to as the "motherfile". This process is crucial for consolidating annotated data, facilitating analysis, and ensuring that all annotations are accurately reflected in the central dataset. The script matches records based on a unique identifier (MID), then inserts the corresponding labels and batch information from the annotation file into the motherfile.
- Label Column Integration: Adds empty label columns to the motherfile based on names from the annotation file.
- Record Matching: Identifies matching records between the annotation file and the motherfile using the MID.
- Label Insertion: Inserts label values and batch information from the annotation file into the corresponding records in the motherfile.
- Python 3.x
- Libraries: pandas, numpy, sys, re
- ASReview (asreview) for handling systematic review data configurations
Ensure Python and the necessary libraries are installed on your system. You can install the required Python libraries using pip:
pip install pandas numpy asreview
This script is executed from the command line, taking the motherfile name, the annotation file name, and the batch identifier as arguments.
python insert_labels_script.py <motherfile.csv> <annotation_file.xlsx> <batch>
python insert_labels_script.py dataset.csv JohnDoe_annotations.xlsx Batch1
This command will process JohnDoe_annotations.xlsx, matching its records with those in dataset.csv based on MID, and insert the annotations along with the batch identifier (Batch1) into the motherfile.
- insert(motherfile, annotation_file, batch): Main function to insert label values from the annotation file into the motherfile (see the sketch after this list).
- add_label_columns(mother_df, annotation_df): Adds empty label columns to the motherfile based on the annotation file.
- set_label_values(mother_frame, annotation_frame, mother_index, annotation_index, columns, batch): Inserts label values and batch information into the motherfile.
- records_match(df1, df2, index1, index2): Checks if records in two dataframes match based on MID.
- main(): Entry point of the script, parsing command line arguments and initiating the label insertion process.
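A minimal sketch of the insertion idea, assuming both files share a MID column and that the annotation file's label columns all start with label; both are assumptions about the real file layout, and the script's actual matching logic may differ.

```python
import pandas as pd

def insert_labels(mother_df: pd.DataFrame, annotation_df: pd.DataFrame, batch: str) -> pd.DataFrame:
    """Copy label columns from the annotation frame into matching MID rows (a sketch)."""
    label_cols = [c for c in annotation_df.columns if c.startswith("label")]
    merged = mother_df.merge(annotation_df[["MID"] + label_cols], on="MID", how="left")
    # Record which batch the newly inserted annotations came from.
    merged.loc[merged["MID"].isin(annotation_df["MID"]), "batch"] = batch
    return merged

# updated = insert_labels(pd.read_csv("dataset.csv"),
#                         pd.read_excel("JohnDoe_annotations.xlsx"), "Batch1")
# updated.to_csv("dataset+labels.csv", index=False)
```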
- Updated Motherfile: The script outputs an updated version of the motherfile with added labels from the annotation file and batch information. The file is saved with a "+labels.csv" suffix to denote that labels have been inserted.
These scripts are part of the FORAS Project and aim to streamline the preparation of datasets for systematic reviews and analysis; they are intended for use in research and systematic review workflows.
This project is supported by a grant from the Dutch Research Council (grant no. 406.22.GO.048).