Automates the initial steps of the PRISMA statement: split-string generation, article search, search result screening, and data saving to an Excel sheet for further analysis.
- Python 3.x
Note: package developed with Python 3.9.1
(Optional) Create a virtual environment:
python -m venv venv
# Windows
./venv/Scripts/Activate
# Linux
source ./venv/bin/activate
Install requirements:
pip install -r requirements.txt
Follow instructions in Pybliometrics: How to access Scopus.
The first time you run main.py, it'll request your API key when it attempts to search Scopus. Follow the instructions in the console.
You can also change your key by editing APIKey in C:\Users\<YOU>\.pybliometrics\config.ini under [Authentication].
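To check which key is currently configured without opening the file by hand, something like the following could work. It is a minimal sketch that assumes the config file lives at the path quoted above and that APIKey may hold either a single key or a comma-separated list; the helper name read_api_keys is hypothetical and not part of this project or of Pybliometrics.

```python
import configparser
from pathlib import Path

# Assumed location, based on the path quoted above
# (Path.home() resolves to C:\Users\<YOU> on Windows).
CONFIG_PATH = Path.home() / ".pybliometrics" / "config.ini"


def read_api_keys(config_text: str) -> list:
    """Return the key(s) found under [Authentication] -> APIKey (hypothetical helper)."""
    # interpolation=None so a '%' inside a key is not treated as a placeholder
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    raw = parser.get("Authentication", "APIKey", fallback="")
    # Assumption: several keys may be separated by commas
    return [key.strip() for key in raw.split(",") if key.strip()]


if __name__ == "__main__":
    if CONFIG_PATH.exists():
        keys = read_api_keys(CONFIG_PATH.read_text())
        print(f"{len(keys)} API key(s) configured")
    else:
        print(f"No config file found at {CONFIG_PATH}")
```

If no key is reported, running main.py once and following the console prompt is still the simplest way to create the file.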
Open main.py, look for the section named """ Split string generation """ and the comment # Create keyword groups. Edit, add, or delete keyword group variables to your liking.
After setting your keyword groups, run
python main.py
For an example and more details about keyword groups, see Use case.
The project contains two main classes: Splitter and Collector.
The Splitter class is responsible for generating all possible search strings (splits) from the defined keyword groups.
The Collector class is responsible for interacting with the Scopus API and retrieving results, as well as cleaning them up before saving them to a local directory.
Both classes come with methods to wrap all of their functionality and streamline the process of acquiring search strings and Scopus results. Nonetheless, it's also possible to use the other methods and alter the default functionality to suit any particular needs.
Splitter has the split() method, and Collector has the run() method. Start by looking into these if you want to understand how everything works.
The following class diagram presents an overview of the system:
The Splitter class uses a recursive depth-first search to generate all possible keyword combinations. It first builds an adjacency graph that represents the keyword groups as a tree, then walks that tree. The generated combinations are then parsed into splits that are searchable in Scopus.
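The graph-plus-DFS idea can be sketched as follows. This is an illustration only, not the Splitter's actual code: the function names build_graph and dfs, the ROOT node, and the (group index, keyword index) node labels are all assumptions made for the example.

```python
ROOT = "ROOT"  # hypothetical start node sitting above the first keyword group


def build_graph(groups):
    """Link every keyword of one group to every keyword of the next group.

    Nodes are (group_index, keyword_index) tuples.
    Assumes every group contains at least one keyword.
    """
    graph = {ROOT: []}
    parents = [ROOT]
    for g, group in enumerate(groups):
        nodes = [(g, k) for k in range(len(group))]
        for parent in parents:
            graph[parent] = nodes
        for node in nodes:
            graph[node] = []  # stays empty for the last group (leaves)
        parents = nodes
    return graph


def dfs(graph, groups, node=ROOT, path=()):
    """Recursive depth-first search; each root-to-leaf path is one combination."""
    if node != ROOT:
        g, k = node
        path = path + (groups[g][k],)
    children = graph[node]
    if not children:
        return [list(path)] if path else []
    combinations = []
    for child in children:
        combinations.extend(dfs(graph, groups, child, path))
    return combinations


if __name__ == "__main__":
    groups = [["A", "B"], ["C", "D"]]
    print(dfs(build_graph(groups), groups))
    # [['A', 'C'], ['A', 'D'], ['B', 'C'], ['B', 'D']]
```

Each combination holds exactly one keyword per group, so the number of combinations is the product of the group sizes.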
- add_kwgroup() and add_kwgroups(): add keyword groups to the Splitter. These keyword groups are then used to generate combinations and splits;
- generate_kwgraph(): generates an adjacency graph from the added keywords for use in depth-first search;
- generate_combinations(): uses recursive depth-first search to generate all possible keyword combinations;
- parse_combinations(): parses keyword combinations into searchable strings;
- split(): streamlines the split generation process by calling all other methods, as well as saving generated data to the local directory.
The Collector class comes with 3 methods: search(), screen(), and run().
- search(): takes the generated splits as input and searches Scopus. Results are saved in 3 different objects: a Pandas dataframe containing all data from search results (doi, title, etc.), and two lists containing the number of search results and their associated split;
- screen(): takes the generated dataframe as input and screens it for duplicates, unnecessary columns (e.g. funding data), conference reviews, and rows without a doi;
- run(): streamlines the whole process by calling search() and screen(), as well as saving the generated data to the local directory.
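To make the screening rules concrete, here is a small dependency-free sketch that applies them to a list of dicts instead of a Pandas dataframe. It is not the Collector's implementation: the function name screen_rows is hypothetical, the column names (doi, subtypeDescription, fund_acr, fund_no, fund_sponsor) are assumptions about what the search results contain, and the sample rows are made up.

```python
# Assumed names of the funding columns to drop
DROP_COLUMNS = {"fund_acr", "fund_no", "fund_sponsor"}


def screen_rows(rows):
    """Apply the screening rules described above to a list of result dicts."""
    seen_dois = set()
    kept = []
    for row in rows:
        doi = row.get("doi")
        if not doi:
            continue  # rows without a doi are removed
        if row.get("subtypeDescription") == "Conference Review":
            continue  # conference reviews are removed
        key = doi.lower()
        if key in seen_dois:
            continue  # duplicates are removed, keeping the first occurrence
        seen_dois.add(key)
        # unnecessary columns (here: funding data) are dropped
        kept.append({k: v for k, v in row.items() if k not in DROP_COLUMNS})
    return kept


if __name__ == "__main__":
    sample = [  # made-up rows for illustration
        {"doi": "10.1000/a", "title": "Paper A", "subtypeDescription": "Article", "fund_no": "123"},
        {"doi": "10.1000/a", "title": "Paper A", "subtypeDescription": "Article", "fund_no": "123"},
        {"doi": None, "title": "No DOI", "subtypeDescription": "Article"},
        {"doi": "10.1000/b", "title": "Proceedings", "subtypeDescription": "Conference Review"},
    ]
    for row in screen_rows(sample):
        print(row)
```

With a dataframe, the same rules would typically map to dropping columns, dropping rows with a missing doi, filtering on the document type, and dropping duplicates.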
Suppose you'd like to look for articles related to extended reality and its applications in brain-computer interfaces and gaming. You come up with the following keywords:
- Virtual Reality
- Augmented Reality
- Extended Reality
- Mixed Reality
- BCI
- Gaming
Then you wonder if Digital Twins are being used alongside these technologies, but it's an optional keyword. Suppose you come up with a search string that looks something like this:
("Virtual Reality" OR "Augmented Reality" OR ("Extended Reality" OR "Mixed Reality")) AND ("BCI" OR "Gaming") AND "Digital Twin"
and the same search string, but without the AND "Digital Twin" part. Your keyword groups (separated by the AND operators) are:
Group 1 = "Virtual Reality" OR "Augmented Reality" OR ("Extended Reality" OR "Mixed Reality")
Group 2 = "BCI" OR "Gaming"
Group 3 = "Digital Twin"
As such, your keyword group variables can be defined as:
group_1 = ["Virtual Reality", "Augmented Reality", "Extended Reality || Mixed Reality"]
group_2 = ["BCI", "Gaming"]
group_3 = ["Digital Twin", ""]
Note that ("Extended Reality" OR "Mixed Reality") became "Extended Reality || Mixed Reality" (drop the parentheses, swap OR with ||), and that group_3 has an empty string as one of its keywords. The empty string signals an optional keyword group, and will generate search strings that don't contain it (i.e. without AND "Digital Twin").
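To see how these conventions could translate into search strings, here is a rough sketch. It is not the Splitter's parse_combinations(): the function to_search_string is hypothetical, itertools.product stands in for the depth-first search, and the exact string format sent to Scopus is an assumption modelled on the example query above.

```python
from itertools import product


def to_search_string(combination):
    """Turn one keyword per group into an AND-joined search string (hypothetical helper)."""
    terms = []
    for keyword in combination:
        # "A || B" becomes ("A" OR "B"); an empty keyword yields no term at all
        quoted = [f'"{alt.strip()}"' for alt in keyword.split("||") if alt.strip()]
        if len(quoted) > 1:
            terms.append("(" + " OR ".join(quoted) + ")")
        elif quoted:
            terms.append(quoted[0])
    return " AND ".join(terms)


if __name__ == "__main__":
    group_1 = ["Virtual Reality", "Augmented Reality", "Extended Reality || Mixed Reality"]
    group_2 = ["BCI", "Gaming"]
    group_3 = ["Digital Twin", ""]

    # itertools.product is used here only as a stand-in for the DFS
    splits = [to_search_string(c) for c in product(group_1, group_2, group_3)]
    print(len(splits))  # 12 (3 x 2 x 2 combinations)
    for s in splits:
        print(s)
```

Because group_3 contains an empty string, half of the twelve strings carry AND "Digital Twin" and the other half omit it.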
Once the variables are declared, you can add them to the Splitter in two ways:
# Create Splitter first!
splitter = Splitter()
# Add one group
splitter.add_kwgroup(group_1)
# Add multiple groups
kw_groups = [group_2, group_3]
splitter.add_kwgroups(kw_groups)
To generate splits, call the split() method.
splits = splitter.split()
In the ./out folder, you'll find a new file called splits.txt. This file contains all the generated splits.
To search Scopus, all you need to do is call the run() method in the Collector class with splits as the parameter:
collector = Collector()
collector.run(splits)
Note: see Limitations about subscriber access and the run() method.
You'll find 3 new files in the ./out folder:
- search_results.txt: contains the splits that had fewer than 1000 results and whose results were saved, along with the number of results found. The limit is configurable through the threshold parameter of the Collector.collect() method, up to 5000;
- excluded_results.txt: contains the splits that were excluded from the search because they had too many results, along with the number of results for each split;
- final_results.xlsx: contains data on the documents collected from Scopus, as well as the split used to find each one.
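The division between search_results.txt and excluded_results.txt comes down to comparing each split's result count against the threshold. The sketch below illustrates that rule only. The function partition_by_threshold, the MAX_THRESHOLD constant, and the way the 5000 limit is enforced are assumptions, and the counts in the example are invented.

```python
MAX_THRESHOLD = 5000  # upper limit mentioned above; enforcing it this way is an assumption


def partition_by_threshold(counts, threshold=1000):
    """Separate (split, result_count) pairs into kept and excluded lists."""
    if threshold > MAX_THRESHOLD:
        raise ValueError(f"threshold must not exceed {MAX_THRESHOLD}")
    kept, excluded = [], []
    for split, n_results in counts:
        if n_results < threshold:
            kept.append((split, n_results))      # would go to search_results.txt
        else:
            excluded.append((split, n_results))  # would go to excluded_results.txt
    return kept, excluded


if __name__ == "__main__":
    # Invented counts, for illustration only
    counts = [('"Virtual Reality" AND "BCI"', 420), ('"Virtual Reality" AND "Gaming"', 3100)]
    kept, excluded = partition_by_threshold(counts)
    print("kept:", kept)
    print("excluded:", excluded)
```

A long excluded list usually means the keyword groups are too broad and would benefit from more specific keywords.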
Open up search_results.txt and excluded_results.txt to analyse the effectiveness of your splits. Open final_results.xlsx to continue with the PRISMA statement: analyse which articles aren't relevant to your research, exclude them, and continue!
To run the test suite, run pytest while in the root of the project (where test_suite.py is).
The Scopus API is limited for those without subscriber access. This means that, by default, you won't have access to these critical data:
- abstract text;
- author keywords.
prisma-automator assumes non-subscriber access by default. If you have an API Key with subscriber access, you'll need to override that default when calling the Collector class's run() method: collector.run(splits, subscriber=True).
For more on getting subscriber access, see Pybliometrics: How to access Scopus.
To sum it up, the quickest ways to get subscriber access (assuming your institution has it) are:
- connect from your institution's network;
- use your institution's VPN.
For limitations on API Key quotas, see Pybliometrics: API Key quotas and 429 error