Skip to content

slr-kit is a framework to support the analysis of scientific literature based on topic modeling, allowing to extract useful statistics

License

Notifications You must be signed in to change notification settings

Thomas-Lohith/slr-kit

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

slr-kit: a tool to assist the writing of Systematic Literature Reviews

slr-kit is a tool that is intended to assist in the development of a Systematic Literature Review (SLR) of scientific papers related to a specific scientific field.

The tool is based on a semi-supervised workflow that leverages automatic techniques based on text mining to automatize key steps in the workflow, while some manual operations are necessary to provide the expert knowledge to the process. slr-kit aims at making the manual operations as efficient as possible.

It analyzes the abstracts of scientific papers to derive the topics associated to the papers, allowing the clustering of papers by topic and the derivation of useful statistics about the set of papers and their topics, such as the trend of publications per topic during the years, and the amount of papers published on different journals by topic.

As a byproduct, slr-kit can be used to manage a list of terms related on a specific topic, curated by an expert, which can be useful in many NLP applications.

Table of contents

  • Installation
  • Usage
  • The workflow on data from scratch (basic version)
  • Using existing data
  • Links to other docs

Installation

Using pip

slrkit is available on PyPi and can thus be installed as

pip install slrkit

The documentation below refers to the installed program though pip.

From git repository

Clone the repository:

git clone <repository address>

Install the dependencies:

pip install -r requirements.txt

In this case, the command can be run, from within the repository directory, as

python3 slrkit/slrkit.py

Install nltk corpora

After installing slrkit and its dependencies, the corpora for NLTK must be downloaded. The following command can be used:

python -m nltk.downloader popular

Usage

If slr-kit was installed from the git repository, you can launch the slrkit command. Use the -h or --help option to generate the following help:

$ slrkit --help
usage: slrkit [-h] [-C path] [--version]
              {init,import,journals,acronyms,preprocess,terms,fawoc,postprocess,topics,report,record,stopwords,build,readme}
                 ...

slrkit project handling tool

options:
  -h, --help            show this help message and exit
  -C path               Change directory to 'path' before running the specified command.
  --version, -V         show program's version number and exit

slrkit commands:
  {init,import,journals,acronyms,preprocess,terms,fawoc,postprocess,topics,report,record,stopwords,build,readme}
    init                Initialize a slr-kit project
    import              Import a bibliographic database converting to the csv format used by slr-kit.
    journals            Subcommand to extract and filter a list of journals. This command accepts a
                        subcommand. If none is given, "extract" is assumed.
    acronyms            Extract acronyms from texts.
    preprocess          Run the preprocess stage in a slr-kit project
    terms               Subcommand to extract and handle lists of terms in a slr-kit project. This
                        command accepts a subcommand. If none is given, "generate" is assumed.
    fawoc               Run fawoc in a slr-kit project. This command accepts a subcommand. If none is
                        given "terms" is assumed.
    postprocess         Run the postprocess stage in a slr-kit project
    topics              Subcommand to extract the topics from the documents in a slrkit project. This
                        command requires a subcommand.
    report              Run the report creation script in a slr-kit project.
    record              Record a snapshot of the project in the underlying git repository
    stopwords           Extracts the terms classified as stopwords from the terms file
    build               Run all the commands needed to re-create all the files after cloning a slr-kit
                        project.
    readme              Creates a README.md file for the project using information from the META.toml
                        file. The README.md is also committed to the git repository.

Each sub-command is explained in details the slrkit.md documentation file.

The workflow on data from scratch (basic version)

The organization of the information is based on projects. A project is basically a directory containing all the necessary data related to the SLR. Such data are manipulated using the slrkit commands.

All the data, being final or intermediate results, are stored in easy to inspect CSV files.

The workflow supported by slr-kit includes several optional steps that can be done to refine the final results. Here it is reported the mandatory steps to obtain the desired reports starting from the bibliographic material.

The basic steps are:

  • initialization of the project (automatic)
  • import of bibliographic data (automatic)
  • preprocessing of the abstracts (automatic)
  • extraction of terms (automatic)
  • classification of terms using FAWOC (manual)
  • topics extraction (automatic)
  • generation of reports (automatic)

Anytime during the various steps, you can run the command

slrkit record

that will produce a snapshot of the current information required by an slr-kit project. Note that the snapshots are maintained as standard git repositories. Therefore, you can use the regular git commands to operate on the repository (git log, git status, git checkout, etc.). Moreover, you can use the usual git push/pull to move the dataset to/from remotes.

Initialization of the project (automatic)

Create a new directory to host the slr-kit project, and move into the newly created directory (we use my-project as example project):

mkdir my-project && cd my-project

Initialize the project:

slrkit init my-project

Import of bibliographic data (automatic)

At the moment, only the RIS file format is supported as source data file. The RIS format is one of the many available on, e.g., Scopus. Make sure that the file contains the Abstract of the papers.

Copy the bibliographic data into the project directory, naming it my-project.ris.

To make slrkit aware of the name of the bibliographic data, set the value of the input_file parameter into the file configuration file slr-kit.conf/import.toml (this file was generated by the init command) as follows:

input_file = "my-project.ris"

To import the data into the project, run the command:

slrkit import

Preprocessing of the abstracts (automatic)

This operation performs several text processing operations on the abstracts, such as stemming and removal of punctuation, preparing the text for further processing.

The command to run the preprocessing is:

slrkit preprocess

Extraction of terms (automatic)

Extract the list of terms from the abstracts:

slrkit terms

Classification of terms using FAWOC (manual)

Run FAWOC and manually classify the terms:

slrkit fawoc terms

FAWOC is an interactive terminal program that allows to quickly (well, some time is required depending on the amount of terms) classify the terms.

For this basic example, you can distinguish the terms between RELEVANT (pressing r) and NOISE (press n) terms. If you make some mistake, press u to undo.

Postprocessing (automatic)

This operation processes the documents by managing the terms classified in the previous stage. Run it with

slrkit postprocess

Extraction of topics (automatic)

Run the text mining procedure with the command to generate the topics:

slrkit topics

Alternatively, you may want to run an optimization in the generation of the topics with:

slrkit topics optimize

This operation takes significantly more time, but produces results that are far more accurate.

HINT: Use the optimization everytime a "production ready" result is required.

Generation of reports (automatic)

Generate the reports with:

slrkit report

The reports are stored in a newly generated unique directory whose name has the format <timestamp>_<UUID>_reports/. In the directory, you will find some .md, .tex and .png files that report the graphs and tables with the results. The reports are generated according to some templates in Markdown and LaTeX format for being easily edited or directly included in scientific papers or websites.

Using existing data

slr-kit can also be used on an existing dataset, i.e., a set of papers that were collected by someone else and already (even partially) elaborated.

In this example, we assume that the dataset was properly recorded using slrkit record, and all the terms have been already classified.

Clone a repository with labelled data

A dataset is kept versioned using git. Therefore, if the dataset is shared in some online repository, you can simply download it with

git clone <repository>

A repository containing the classification of a dataset about Natural Language Processing is available here: https://github.com/robolab-pavia/slrkit_NLP_Nocera.

Regenerate all the necessary data

First, you need to re-build all the preprocessed data. This is done with one single command:

slrkit build

At this point, if necessary, you can refine the classification of terms by running the usual command

slrkit fawoc terms

Generate topics and reports

Now you are ready to generate the topics and the reports with the same commands explained above:

slrkit topics optimize

followed by

slrkit report

Links to other docs

About

slr-kit is a framework to support the analysis of scientific literature based on topic modeling, allowing to extract useful statistics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 99.6%
  • TeX 0.4%