The `slrkit` command helps to handle an slr-kit project. It automates all the phases of the document analysis. To do so, it uses a set of configuration files, which are stored in the slr-kit project and are created and managed by the `slrkit` command itself.

An slr-kit project is a collection of files generated by the slr-kit scripts. All of these files are derived from the set of documents that the user wishes to analyze. A project is also a `git` repository. The `slrkit` command initializes this repository when the project is first created and helps to track only the meaningful files.
An slr-kit project is a directory that contains all the files related to an analysis. This directory must also contain a `META.toml` file and a project configuration directory. The `META.toml` file contains all the metadata about the project. It must be a TOML version 1.0.0 file, and it must contain two dictionaries: `Project` and `Source`.

The `Project` dictionary contains information about the project, such as the name of the project, a description, the location of the configuration directory and some more. The allowed keys and their meaning are described in the following table:
Key | Description | Type |
---|---|---|
Author | Information about the author of the project | string |
Config | Name of the configuration directory | string |
Description | Description of the project | string |
Keywords | List of keywords related to the documents | list of strings |
Name | Name of the project | string |
The `Source` dictionary contains information about the source of the documents analyzed in the project. The allowed keys and their meaning are described in the following table:
Key | Description | Type |
---|---|---|
URL | URL of the site used to retrieve the documents | string |
Query | Query string used to retrieve the documents | string |
Date | Date on which the documents were retrieved | string |
Origin | Description of the origin of the documents | string |
The `Origin` key is meant to be used when the documents are retrieved without the help of a bibliographic search engine. In this case, the `URL` and `Query` keys shall be left empty.
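As an illustration, the following minimal sketch loads a `META.toml` file with Python and accesses the two dictionaries; the values shown in the comments are hypothetical.

```python
import tomllib  # standard library since Python 3.11

# Parse META.toml; tomllib returns plain Python dictionaries.
with open('META.toml', 'rb') as f:
    meta = tomllib.load(f)

project = meta['Project']
source = meta['Source']
print(project['Name'])       # e.g. 'my-review'
print(project['Config'])     # e.g. 'slrkit.conf'
print(source.get('Origin'))  # used when URL and Query are left empty
```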
The project configuration directory is located inside the project directory. Its name is saved in the `META.toml` file, in the `Project.Config` key. The default name for this directory is `slrkit.conf`, but a different name may be used.
This directory contains all the configuration files used by the project. The configuration files must be TOML v. 1.0.0 files. Information about each file can be found in the documentation of each `slrkit` sub-command. Each relative path included in the configuration files is considered relative to the project root directory. The configuration directory also contains a `log` directory that collects all the log files produced during the project. All the scripts that write a log use the `slr-kit.log` file saved in the `log` directory.

The `slrkit` command is the tool used to handle a project. It uses `META.toml` and the files in the project configuration directory to automate the operations. It is composed of sub-commands that handle and automate all the phases of a project. Usage:

```
python3 slrkit.py [-C /path/to/project] sub-command sub-command-arguments ...
```
The sub-commands are:

- `init`: initialize a slr-kit project;
- `import`: import a bibliographic database, converting it to the csv format used by slr-kit;
- `journals`: sub-command to extract and filter a list of journals. Requires a sub-command;
- `acronyms`: extract acronyms from texts;
- `preprocess`: run the preprocess stage in a slr-kit project;
- `terms`: sub-command to extract and handle lists of terms in a slr-kit project. Requires a sub-command;
- `fawoc`: run FAWOC in a slr-kit project;
- `topics`: extract topics from the documents of a slr-kit project;
- `report`: run the report creation script in a slr-kit project;
- `record`: record a snapshot of the project in the underlying git repository;
- `stopwords`: extract a list of terms classified as stopwords from the terms file;
- `build`: re-create the non-versioned files after a git clone.
Each command operates on the directory from which the `slrkit` command is run. The `-C` option changes the working directory to the specified one before running the sub-command.
Usually the workflow is the following:

- initialize a project with the `init` command. Fill in the information in the `META.toml` file and save in the project the bibliographic database with the information on the papers. The name of this file must be written in the `import.toml` file in the configuration directory;
- import the data of the bibliographic database into a `csv` file with the `import` command;
- (optional) create a list of the journals that have published the papers with the `journals extract` command. This list can be reviewed and classified to exclude papers from irrelevant journals;
- (optional) review the list of journals with the `fawoc journals` command;
- (optional) use the classification made in the step above to mark the papers that come from a discarded journal. This step can be done with the `journals filter` command;
- (optional) extract a list of acronyms with the `acronyms` command. This list can be reviewed to find the relevant acronyms;
- (optional) classify the acronyms with the `fawoc acronyms` command;
- select the stop-words that have to be filtered from the papers. The stop-words must be stored in one or more files, whose names must be included in the `preprocess.toml` file in the configuration directory;
- (optional) if there are lists of terms that are surely relevant, these lists must be stored in the project, and their names must be included in the `preprocess.toml` file;
- prepare the text for the elaboration with the `preprocess` command;
- generate the list of terms with the `terms generate` command;
- classify the terms with the `fawoc terms` command;
- extract the topics and retrieve the document-topic association with the `lda` command;
- prepare a report with some statistics about the papers with the `report` command.
Before running any command, it is highly recommended to review its settings file to check that everything is correct. Optionally, the `optimize_lda` (faster) or the `lda_grid_search` (slower) command can be used to find the best LDA model. The `record` command is designed to record the meaningful files of the project in a `git` repository. Its use is highly recommended. The `stopwords` command retrieves the list of stopwords identified during the classification of the terms. The list created by this command can be used to refine the generation of the terms.
The `init` command initializes the current directory as an slr-kit project. Usage:

```
python3 slrkit.py init [--author AUTHOR] [--description DESCRIPTION] [--no-backup] name
```

The `name` argument is the name of the project. It will be used as a prefix for all the suggested file names. The `--author` option allows specifying the project author, while the `--description` option allows specifying the project description. The command creates the `META.toml` file with the information given on the command line. The user shall complete the content of this file.
This command also creates the configuration directory. This directory is populated with all the configuration files handled by the `slrkit` command. The file format is TOML version 1.0.0. The name of each file is the name of the corresponding `slrkit` sub-command (e.g. `preprocess.toml` is the configuration file for the `slrkit preprocess` command), and each file contains a key for each parameter of the corresponding script. Refer to the documentation of each script and command for additional information about the configuration parameters. In each file, comments explain each parameter, and the output file name of each script is suggested with a good default name.
The `init` command also copies the `ga_param.toml` file to the configuration directory with the name `optimize_lda_ga_params.toml`. This file provides the parameters used by the `optimize_lda` command for the optimization. See the documentation of the `optimize_lda` command for more information.
This command can be executed on an already initialized project. In this case, the information in `META.toml` is updated with the values given on the command line; all the other fields are left untouched. The configuration files are updated as well: if one or more options are missing, they are filled with their default values, while the other values are not changed. The original `toml` files are backed up in the configuration directory before any modification. The backups have the same name as the original files, with the extension `.bak`. If the user gives the `--no-backup` option, no backup is performed.
The `init` sub-command also initializes the `git` repository of the project. A `.gitignore` file is provided. Its content is produced by collecting the output of the `to_ignore` function of each module. Each module is imported and, if a `to_ignore` function is defined, it is called with the content of the configuration file of the script as a dictionary. This function must return a list of file names to ignore. If there is something wrong in the configuration data, the function must raise a `ValueError` exception with the reason of the error. The message of the exception is used to create the error message shown to the user.
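As an illustration, a minimal sketch of a conforming `to_ignore` function follows; the configuration key used here is hypothetical.

```python
def to_ignore(config):
    """Return the file names that git should ignore for this script.

    config is the parsed TOML configuration of the script, as a dict.
    """
    output = config.get('output')  # hypothetical key name
    if not output:
        raise ValueError('the "output" entry is missing or empty')
    return [output]
```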
A first commit is recorded with:

- the `META.toml` file;
- all the configuration files;
- the provided `.gitignore`.
The `import` command imports a bibliographic database into the project, converting it to the `csv` format used by all the scripts. The output of this command will be called the abstracts file in the rest of this document. Usage:

```
python3 slrkit.py import [--list_columns]
```

The `import` sub-command uses the `import.toml` configuration file and runs the `import_biblio.py` script. It imports the database into a `csv` file usable by the other commands. Each paper is assigned a progressive identification number, stored in the `id` column. All the selected columns are imported from the input file. The citation count of each paper is also retrieved and imported in the `citation` column. If the `--list_columns` option is set, the command only outputs the list of available columns of the input file specified in the configuration file, and no data is imported.
The `import.toml` file has the following structure (an example follows the list):

- `input_file`: path to the bibliographical database to import. Important: this field is not pre-filled by the `init` command; the user must fill it before running the `import` command. This file is committed to the `git` repository by the `record` command;
- `type`: type of the database to import. Currently, the only supported type is `RIS`;
- `output`: name of the output file. It is pre-filled with `<project-name>_abstracts.csv`;
- `columns`: comma-separated list of columns to import. It is pre-filled with `title,abstract,year,journal,citation`.
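For a project named `example`, the parsed content of a filled-in `import.toml` might look like the following Python dictionary (the input file name is hypothetical):

```python
import_config = {
    'input_file': 'example_db.ris',  # filled in by the user
    'type': 'RIS',
    'output': 'example_abstracts.csv',
    'columns': 'title,abstract,year,journal,citation',
}
```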
The `journals` command allows the user to retrieve the list of journals and classify them, in order to filter out the irrelevant ones and the papers published in them. This command accepts two sub-commands:

- `extract`: extracts the list of journals from the abstracts file;
- `filter`: uses the manual classification of the list of journals to filter out the papers published in the irrelevant journals.

Usage:

```
python3 slrkit.py journals {extract, filter}
```
If the `journals` command is invoked without a sub-command, the `extract` sub-command is run.

The `extract` sub-command produces a list in the format used by FAWOC. The structure is the following:

- `id`: a progressive identification number;
- `term`: the name of the journal;
- `label`: the label added by FAWOC to the journal. This field is left blank by the `extract` sub-command;
- `count`: the number of papers published in the journal.

FAWOC will move the `count` field into the `fawoc_data.tsv` file.
The `extract` sub-command uses the `journals_extract.toml` configuration file and runs the `journal_lister.py` script. The `journals_extract.toml` file has the following structure:

- `abstract_file`: name of the abstracts file. It is pre-filled with `<project-name>_abstracts.csv`;
- `outfile`: name of the output file. It is pre-filled with `<project-name>_journals.csv`.
The `filter` sub-command filters the papers using the manual classification of the list of journals. It adds the `status` column to the abstracts file. This column has the value `good` for the papers published in a journal classified with the `relevant` or the `keyword` label. All the papers from journals not classified as `relevant` or `keyword` are marked with the `rejected` value in the `status` column.
The `filter` sub-command uses the `journals_filter.toml` configuration file and runs the `filter_paper.py` script. The `journals_filter.toml` file has the following structure:

- `abstract_file`: name of the abstracts file. This file is used as both input and output. It is pre-filled with `<project-name>_abstracts.csv`;
- `journal_file`: name of the journal list file produced by `journals extract`. It is pre-filled with `<project-name>_journals.csv`.
The `acronyms` command extracts acronyms from the papers. Its output format is suitable to be used with FAWOC to classify which acronyms are relevant. If the input file (the abstracts file) contains the `status` column created by the `journals filter` command, the `acronyms` command uses the value of that column to filter out the papers published in the rejected journals. The output of this command will be called the acronyms file in the rest of this document. Usage:

```
python3 slrkit.py acronyms
```

The `acronyms` sub-command uses the `acronyms.toml` configuration file and runs the `acronyms.py` script. The output is in `tsv` format and has the following structure (suitable for FAWOC):

- `id`: a progressive identification number;
- `term`: the acronym, in the form `extended-acronym | (abbreviation)`;
- `label`: the label added by FAWOC to the acronym. This field is left blank by the `acronyms` command.

No `fawoc_data` file is produced, so no `count` field is available for FAWOC.
After a correct execution, the command updates the `preprocess.toml` file, setting its `acronyms` field to the name of the output file. All the commands consider only the acronyms classified with the `relevant` or the `keyword` label; all the other acronyms are ignored.
The `acronyms.toml` file has the following structure:

- `datafile`: input file. It is pre-filled with the value `<project-name>_abstracts.csv`;
- `output`: output file. It is pre-filled with the value `<project-name>_acronyms.csv`;
- `columns`: name of the column of `datafile` with the text to elaborate. It is pre-filled with the value `abstract`.
The `preprocess` sub-command prepares the documents for the subsequent elaborations. Usage:

```
python3 slrkit.py preprocess
```

If the input file (the abstracts file) contains the `status` column created by the `journals filter` command, the `preprocess` command uses the value of that column to filter out the papers published in the rejected journals. It also filters the stop-words using the lists of words provided by the user. No default list of stop-words is used: the user must provide their own lists. This command also uses the acronyms file to search for the acronyms and mark them as relevant words. Only the acronyms with the `relevant` or the `keyword` label are considered. The `preprocess` command also marks as relevant all the terms provided by the user in the relevant terms lists. The user can also choose how the command marks these terms. The input of this command is the abstracts file. The output is the abstracts file without the papers discarded because they were published in rejected journals, with a new column added containing the preprocessed text of each paper. More information can be found in the `preprocess.py` section of the README.md.
The `preprocess` sub-command uses the `preprocess.toml` configuration file. This file has the following structure:

- `datafile`: name of the abstracts file that will be used as input. This field is pre-filled with `<project-name>_abstracts.csv`;
- `output`: output file name. This field is pre-filled with `<project-name>_preproc.csv`;
- `placeholder`: placeholder used to mark the barriers (the stop-words and the punctuation). This character is also used as prefix and suffix of the placeholders for the relevant terms and the acronyms. It is pre-filled with the character `@`;
- `stop-words`: lists of stop-words provided by the user. No other lists are used, so the user shall provide their own;
- `relevant-term`: lists of relevant terms. This field is peculiar: each element must be a list of at least one and at most two items. The first item is the name of a list of relevant terms. The second one, if present, is the marker to be used for all the terms in that list: each term is replaced with `<placeholder><marker><placeholder>`. If the marker is omitted, the command replaces each term with the placeholder, followed by the words of the term joined with the `_` character, followed by another placeholder (see the example after this list);
- `acronyms`: name of the acronyms file. If the `acronyms` command was run before, it is pre-filled with `<project-name>_acronyms.csv`;
- `target-column`: name of the column with the document text. It is pre-filled with `abstract`;
- `output-column`: name of the column added to the output, containing the preprocessed text. It is pre-filled with `abstract_lem`;
- `input-delimiter`: input file field delimiter. It is pre-filled with `\t`;
- `output-delimiter`: output file field delimiter. It is pre-filled with `\t`;
- `rows`: number of rows of the input file to process. If empty, all the rows are used;
- `language`: language of the text. Must be an ISO 639-1 two-letter code. Pre-filled with `en`;
- `regex`: csv file with dataset-specific regex substitutions to be applied to the text.
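For illustration, here is a hypothetical `relevant-term` value, shown as the parsed Python object; the file names and the `TOOL` marker are made up.

```python
relevant_term = [
    # No marker: a term like "machine learning" found in the text is
    # replaced with "@machine_learning@" (placeholder, words joined by
    # underscores, placeholder).
    ['example_relevant.csv'],
    # With a marker: every term of this list is replaced with "@TOOL@".
    ['example_tools.csv', 'TOOL'],
]
```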
The output of this command will be called the preprocess file in the rest of this document.
The `terms` command allows the user to generate and handle lists of terms. This command accepts one sub-command:

- `generate`: generates the list of terms that have to be classified.

Usage:

```
python3 slrkit.py terms {generate}
```
If the `terms` command is invoked without a sub-command, the `generate` sub-command is run.

The `generate` sub-command generates a list of terms from the documents in the preprocess file. This command runs the `gen_terms.py` script. The format of this list is the one used by FAWOC. The structure is the following:

- `id`: a progressive identification number;
- `term`: the n-gram;
- `label`: the label added by FAWOC to the n-gram. This field is left blank by the `terms generate` command.

This command also produces the `fawoc_data.tsv` file, with the following structure:

- `id`: the identification number of the term;
- `term`: the term;
- `count`: the number of occurrences of the term.
The output of this command will be called the terms file in the rest of this document.
The `terms generate` sub-command uses the `terms_generate.toml` configuration file. It has the following structure:

- `datafile`: name of the input file (the preprocess file). It is pre-filled with `<project-name>_preproc.csv`;
- `output`: name of the output file. It is pre-filled with `<project-name>_terms.csv`;
- `stdout`: if `true`, the command also prints the output to the standard output;
- `n-grams`: maximum size of an n-gram. All the n-grams with lengths from one word to this number of words are generated. By default, this field is filled with `4`;
- `min-frequency`: minimum number of occurrences of an n-gram. All the n-grams with fewer occurrences than this value are discarded. Pre-filled with `5`;
- `placeholder`: placeholder used to mark the barriers in the `preprocess` stage. All the n-grams containing this character, or containing words that start and end with this character, are discarded. It is pre-filled with the character `@`;
- `column`: column of the input file with the text to elaborate. Pre-filled with `abstract_lem`;
- `delimiter`: field delimiter used by the input file. Pre-filled with `\t`.
The `fawoc` command runs FAWOC on a list produced by the previous commands. This command accepts three sub-commands:

- `terms`: run FAWOC on the terms file;
- `journals`: run FAWOC on the journals file;
- `acronyms`: run FAWOC on the acronyms file.

Usage:

```
python3 slrkit.py fawoc [--input LABEL] [--width WIDTH]
```

The optional arguments are passed to FAWOC and override the corresponding values in the configuration file:

- `--input`: label to review;
- `--width`: width of the FAWOC window, in number of columns.

If the `fawoc` command is invoked without a sub-command, the `terms` sub-command is run. Each sub-command writes to its own profiler file in the `log` directory of the project.
The `fawoc terms` sub-command allows the user to classify the terms file. This command uses the `fawoc_terms.toml` configuration file, which has the following structure:

- `datafile`: file to classify. Pre-filled with `<project-name>_terms.csv`;
- `input`: label to review;
- `dry-run`: if `true`, FAWOC does not write anything to the `datafile` on exit;
- `no-auto-save`: if `true`, no auto-saving is performed;
- `no-profile`: if `true`, no data is written to the profiler file;
- `width`: width of the FAWOC window in columns.

The profiler file for this sub-command is `fawoc_terms_profiler.log` in the `log` directory of the project.
The `fawoc journals` sub-command allows the user to classify the journals file. This command uses the `fawoc_journals.toml` configuration file. Its structure is the same as that of `fawoc_terms.toml`; the only difference is that the `datafile` field is pre-filled with `<project-name>_journals.csv`. The profiler file for this sub-command is `fawoc_journals_profiler.log` in the `log` directory of the project.
The `fawoc acronyms` sub-command allows the user to classify the acronyms file. This command uses the `fawoc_acronyms.toml` configuration file. Its structure is the same as that of `fawoc_terms.toml`; the only difference is that the `datafile` field is pre-filled with `<project-name>_acronyms.csv`. The profiler file for this sub-command is `fawoc_acronyms_profiler.log` in the `log` directory of the project.
The `topics` command extracts topics from the documents of the project. This command accepts the following sub-commands:

- `extract`: extracts the topics from the documents;
- `optimize`: optimizes the parameters of the topic extraction algorithm and uses those parameters to extract the topics.

The sub-command is always required.
The `extract` sub-command trains an LDA model and outputs the extracted topics and the association between topics and documents. Usage:

```
python3 slrkit.py topics extract [--config CONFIG | --directory DIRECTORY] [--uuid UUID] [--id ID]
```

Optional arguments:

- `--config | -c CONFIG`: specifies a configuration file different from the default one;
- `--directory | -d DIRECTORY`: specifies the path to the directory with the results of the optimization phase;
- `--uuid | -u UUID`: UUID of the model stored in the results directory;
- `--id ID`: 0-based id of the model stored in the results directory. The association between id and model is stored in the `results.csv` file of the results directory. This file is sorted by coherence, so the id 0 corresponds to the best model. If both `--uuid` and this option are missing and `--directory` is present, `--id` is assumed with value 0.
The `--config` and `--directory` options are mutually exclusive. Also, the `--uuid` and `--id` options are mutually exclusive. The `--directory` option, in conjunction with `--uuid` or `--id`, allows the user to select one model from a run of the `optimize_lda` command (or the `lda_ga.py` script). If one of the `--uuid`/`--id` options is present, `--directory` is required, otherwise the command ends with an error. This command runs the `lda.py` script.
The `topics extract` sub-command uses, by default, the `lda.toml` configuration file, which has the following structure:

- `preproc_file`: name of the preprocess file. Pre-filled with `<project-name>_preproc.csv`;
- `terms_file`: name of the terms file. Pre-filled with `<project-name>_terms.csv`;
- `outdir`: path to the directory where the results are saved. Pre-filled with the path to the project directory;
- `text-column`: column of the preprocess file to elaborate. Pre-filled with `abstract_lem`;
- `title-column`: column of the preprocess file to use as document title. Pre-filled with `title`;
- `topics`: number of topics to extract. Pre-filled with `20`;
- `alpha`: alpha parameter of LDA. Pre-filled with `auto`;
- `beta`: beta parameter of LDA. Pre-filled with `auto`;
- `no_below`: keep the tokens that are contained in at least this number of documents. Pre-filled with `20`;
- `no_above`: keep the tokens that are contained in no more than this fraction of documents (fraction of the total corpus size, not an absolute number). Pre-filled with `0.5`;
- `seed`: seed to be used in training;
- `model`: if `true`, the LDA model is saved to the directory `<outdir>/lda_model`. The model is saved with the name "model";
- `no-relevant`: if set, use only the terms labelled as `keyword` in the terms file;
- `load-model`: path to a directory where a previously trained model is saved. Inside this directory, the model named "model" is searched for. The loaded model is used with the dataset file to generate the topics and the topic-document association;
- `no_timestamp`: if `true`, no timestamp is added to the output file names;
- `placeholder`: placeholder for the barriers. Pre-filled with `@`;
- `delimiter`: field delimiter used in the preprocess file. Pre-filled with `\t`.
The command sets the `PYTHONHASHSEED` environment variable to 0, so setting the `seed` value is enough to obtain reproducible runs. More information on the `PYTHONHASHSEED` variable can be found here.
The `optimize` sub-command runs the `lda_ga.py` script to find the best combination of parameters for an LDA model. Usage:

```
python3 slrkit.py topics optimize
```
The `topics optimize` sub-command uses the `optimize_lda.toml` configuration file, which has the following structure:

- `preproc_file`: name of the preprocess file. Pre-filled with `<project-name>_preproc.csv`;
- `terms_file`: name of the terms file. Pre-filled with `<project-name>_terms.csv`;
- `ga_params`: path of the file with the parameters used by the GA. Pre-filled with the absolute path to the `optimize_lda_ga_params.toml` file in the configuration directory;
- `outdir`: path to the directory where the results are saved. Pre-filled with the path to the project directory;
- `text-column`: column of the preprocess file to elaborate. Pre-filled with `abstract_lem`;
- `title-column`: column of the preprocess file to use as document title. Pre-filled with `title`;
- `seed`: seed to be used in training;
- `placeholder`: placeholder for the barriers. Pre-filled with `@`;
- `delimiter`: field delimiter used in the preprocess file. Pre-filled with `\t`;
- `no_timestamp`: if `true`, no timestamp is added to the output file names.
The `ga_params` file has the following structure (an example follows the list):

- `limits`: this section contains the ranges of the parameters:
  - `min_topics`: minimum number of topics;
  - `max_topics`: maximum number of topics;
  - `max_no_below`: maximum value of the no-below parameter. The minimum is always 1. A value of -1 means a tenth of the number of documents;
  - `min_no_above`: minimum value of the no-above parameter. The maximum is always 1.
- `algorithm`: this section contains the parameters used by the GA:
  - `mu`: number of individuals that pass each generation;
  - `lambda`: number of individuals that are generated at each generation;
  - `initial`: size of the initial population;
  - `generations`: number of generations;
  - `tournament_size`: number of individuals randomly selected for the selection tournament.
- `probabilities`: this section contains the probabilities used by the script:
  - `mutate`: probability of mutation;
  - `component_mutation`: probability of mutation of each individual component;
  - `mate`: probability of crossover (also called mating);
  - `no_filter`: probability that a new individual is created with no term filter (no_above = no_below = 1).
- `mutate`: this section contains the parameters of the Gaussian distributions used by the mutation for each parameter:
  - `topics.mu` and `topics.sigma`: mean value and standard deviation for the topics parameter;
  - `alpha_val.mu` and `alpha_val.sigma`: mean value and standard deviation for the value of the alpha parameter;
  - `beta.mu` and `beta.sigma`: mean value and standard deviation for the beta parameter;
  - `no_above.mu` and `no_above.sigma`: mean value and standard deviation for the no_above parameter;
  - `no_below.mu` and `no_below.sigma`: mean value and standard deviation for the no_below parameter;
  - `alpha_type.mu` and `alpha_type.sigma`: mean value and standard deviation for the type of the alpha parameter.
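A hypothetical `ga_params` content, shown here as the parsed Python dictionary; all the numeric values are illustrative, not recommended defaults.

```python
ga_params = {
    'limits': {
        'min_topics': 5, 'max_topics': 20,
        'max_no_below': -1,  # -1 = a tenth of the number of documents
        'min_no_above': 0.1,
    },
    'algorithm': {
        'mu': 20, 'lambda': 40, 'initial': 100,
        'generations': 20, 'tournament_size': 4,
    },
    'probabilities': {
        'mutate': 0.2, 'component_mutation': 0.5,
        'mate': 0.5, 'no_filter': 0.05,
    },
    'mutate': {
        'topics': {'mu': 0, 'sigma': 5},
        'alpha_val': {'mu': 0, 'sigma': 0.1},
        'beta': {'mu': 0, 'sigma': 0.1},
        'no_above': {'mu': 0, 'sigma': 0.1},
        'no_below': {'mu': 0, 'sigma': 10},
        'alpha_type': {'mu': 0, 'sigma': 1},
    },
}
```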
Refer to the documentation of the `lda_ga.py` script in README.md for more information about the behaviour of the script and the GA parameters. The script outputs all the trained models in `<outdir>/<date>_<time>_lda_results/models/<UUID>`. The command also outputs the topics and the document-topic correspondence for each trained model. For each trained model, a `toml` file is produced with all the parameters already set to use the corresponding model with the `lda.py` script or the `lda` command. These `toml` files are saved in `<outdir>/<date>_<time>_lda_results/toml/<UUID>.toml`, and can be loaded in the `lda.py` script or the `topics extract` command using its `--config` option. It also outputs a tsv file in `<outdir>/<date>_<time>_lda_results/results.csv` with the following format:
- `id`: progressive identification number;
- `topics`: number of topics;
- `alpha`: alpha value;
- `beta`: beta value;
- `no_below`: no-below value;
- `no_above`: no-above value;
- `coherence`: coherence score of the model;
- `times`: time spent evaluating this model;
- `seed`: seed used;
- `uuid`: UUID of the model;
- `num_docs`: number of documents;
- `num_not_empty`: number of documents that are not empty after filtering.
The script also outputs the extracted topics and the topic-document association produced by the best model. The topics are output in `<outdir>/lda_terms-topics_<date>_<time>.json` and the topics assigned to each document in `<outdir>/lda_docs-topics_<date>_<time>.json`. A txt file with a summary of the results is also produced, with name `<outdir>/lda_info_<date>_<time>.txt`. The command sets the `PYTHONHASHSEED` environment variable to 0, so setting the `seed` value is enough to obtain reproducible runs. More information on the `PYTHONHASHSEED` variable can be found here.
The `report` command produces some reports with statistics about the papers analyzed by the `lda` command. This command runs the `topic_report.py` script. Usage:

```
python3 slrkit.py report [docs_topics_file terms_topics_file]
```

With no arguments, the command searches for all the `lda_docs-topics*.json` and `lda_terms-topics*.json` files in the current directory and uses the most recent one of each type. Files with these names are the ones produced by the `lda` command; they contain the association between documents and topics and the association between terms and topics.
The `docs_topics_file` and `terms_topics_file` arguments allow the user to select a different pair of JSON files. The command uses the `report.toml` configuration file, which has the following structure:

- `abstract_file`: name of the abstracts file of the project. It is pre-filled with `<project-name>_abstracts.csv`;
- `dir`: output directory where the templates and the reports are saved. If empty, the current directory is used;
- `minyear`: minimum year to consider. If empty, the minimum year found in the data is used;
- `maxyear`: maximum year to consider. If empty, the maximum year found in the data is used;
- `plotsize`: number of topics to be displayed in each subplot saved in the report directory;
- `compact`: if true, the command creates a compact table for the topics;
- `no_stats`: if true, the topics table does not show the statistics about the terms.
On the first run, the command copies `report_template.md` and `report_template.tex` from the `report_template` directory of this repository to the current project. These two files are used to create the reports, and the user can customize the two copied templates as desired. The command creates a directory named `report<timestamp>` containing:

- the report in markdown format (called `report.md`);
- the report in LaTeX format (called `report.tex`);
- a figure in png format (called `reportyear.png`) used by the two reports above;
- a directory `tables` with some LaTeX files used by the LaTeX report.

For information about the reported statistics, refer to the `topic_report.py` documentation in the README file.
The `record` command creates a commit in the git repository of the project. This commit records all the data and the configuration of the project. Usage:

```
python3 slrkit.py record [--clean] [--rm] message
```

The `message` argument is the commit message to use for the commit. It cannot be the empty string. The optional arguments are:

- `--clean`: this flag tells the command to clean the repository index from the files not referenced in the configuration files. These files are left in the project, but they become untracked;
- `--rm`: this flag tells the command to clean the project by removing the files not referenced in the configuration files. This flag removes these files from the repository index and from the file-system. Use with caution.
The command records the following files:

1. the modifications made to the `META.toml` file;
2. all the modified configuration files;
3. all the modifications made to the `.gitignore` file;
4. the README.md file, if present;
5. the bibliographic database used as input by the `import` command;
6. the journals file;
7. the acronyms file;
8. the stop-words lists used by the `preprocess` command, if any;
9. the relevant terms lists used by the `preprocess` command, if any;
10. the terms file, with the corresponding `fawoc_data.tsv` file;
11. all the profiler files created by the `fawoc` sub-commands.
The names of the files listed from 5 to 11 are taken from the configuration files of the commands that generate or use them. These files are committed only if they exist in the project at the moment the `record` command is run. If one of these files is deleted, or its name is no longer referenced in the configuration files, the `record` command does not remove the file from the repository unless the `--clean` flag is set. With the `--rm` flag, the `record` command also deletes from the file-system the files that are no longer referenced in the configuration files. Use this option with caution.
The `record` command does not use any configuration file. It uses the `to_record` function of all the scripts used by the `slrkit` command to retrieve the list of files to record. The command imports each script and searches for the `to_record` function. If present, this function is called with the content of the configuration file of the script as a Python dictionary. The function must return a list of file names to record. If there is something wrong in the configuration data, the function must raise a `ValueError` exception with the reason of the error. The message of the exception is used by the `record` command to create the error message shown to the user.
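A minimal sketch of a conforming `to_record` function, analogous to the `to_ignore` sketch above; the key names are hypothetical.

```python
def to_record(config):
    """Return the file names that the record command must commit."""
    files = []
    for key in ('datafile', 'output'):  # hypothetical key names
        if not config.get(key):
            raise ValueError(f'"{key}" is missing from the configuration')
        files.append(config[key])
    return files
```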
The `stopwords` command extracts a list of terms classified as stopwords from the terms file. The command searches for the terms labelled as `stopword` in the terms file (the file that is the input of the `fawoc terms` command) and outputs the list of these terms (one per line). The created file is added to the `stop-words` list in `preprocess.toml`. Usage:

```
python3 slrkit.py stopwords [--no-add] output
```

The `output` argument is the output file of the command. The `--no-add` optional argument prevents the addition of the output file to the `stop-words` list in `preprocess.toml`.
The `build` command executes the commands required to re-create the files that are not versioned. This command is helpful after cloning a slrkit project; for more information, see the section about cloning below. The command executes the following commands, in order:

- `import`;
- `journals filter`;
- `preprocess`.

Usage:

```
python3 slrkit.py build
```
The `readme` command creates, and commits to git, a README.md file for the project. The information is taken from the `META.toml` file. More precisely, the following information is used:

- from the `Project` section:
  - `Name`;
  - `Author`;
  - `Description`;
- from the `Source` section:
  - `URL` if present, or `Origin`;
  - `Date`;
  - `Query`.

If one or more of these fields are empty, the command simply skips that part of the README. After the README is created, it is committed to the git repository. Usage:

```
python3 slrkit.py readme
```
This section documents some commands that are not directly available. They can be activated and used by modifying the code of the `slrkit` command. They are:

- `lda_grid_search`: grid search optimization of the LDA parameters.

To enable it, modify the `argparse` sub-parser of the `topics optimize` sub-command to accept a boolean option named `grid-search`. All the other code is ready.
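A sketch of the required change follows; the reconstruction of the sub-parser setup is hypothetical, only the added option matters.

```python
import argparse

# Hypothetical reconstruction of the relevant part of slrkit.py: the
# "topics optimize" sub-parser gains a boolean --grid-search option.
parser = argparse.ArgumentParser(prog='slrkit.py')
subparsers = parser.add_subparsers(dest='command')
optimize = subparsers.add_parser('optimize')
optimize.add_argument('--grid-search', action='store_true',
                      help='use the grid search instead of the GA')
```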
The `lda_grid_search` command performs a grid search on the LDA model parameters and outputs all the trained models. The command searches for the best combination (in terms of coherence) of the number of topics, alpha, beta, no-below and no-above parameters. It tries all the possible combinations of parameters, discarding the cases that result in all the documents being empty. This command uses the `lda_grid_search.toml` configuration file, which has the following format:
- `preproc_file`: name of the preprocess file. Pre-filled with `<project-name>_preproc.csv`;
- `terms_file`: name of the terms file. Pre-filled with `<project-name>_terms.csv`;
- `outdir`: path to the directory where the results are saved. Pre-filled with the path to the project directory;
- `text-column`: column of the preprocess file to elaborate. Pre-filled with `abstract_lem`;
- `title-column`: column of the preprocess file to use as document title. Pre-filled with `title`;
- `min-topics`: minimum number of topics to test. Pre-filled with 5;
- `max-topics`: maximum number of topics to test. Pre-filled with 20;
- `step-topics`: step used to create the grid of topics values. Pre-filled with 1;
- `seed`: seed to be used in training;
- `plot-show`: if `true`, a plot of the coherence is shown;
- `plot-save`: if `true`, the plot of the coherence is saved as `<outdir>/lda_plot.pdf`;
- `placeholder`: placeholder for the barriers. Pre-filled with `@`;
- `delimiter`: field delimiter used in the preprocess file. Pre-filled with `\t`.
The command runs the `lda_grid_search.py` script. Refer to its documentation in the README for the criteria used to set up the grid of parameters. Each trained model is assigned a UUID. The command outputs all the models in `<outdir>/<date>_<time>_lda_results/<UUID>`. It also outputs a tsv file in `<outdir>/<date>_<time>_lda_results/results.csv` with the following format:
- `id`: progressive identification number;
- `corpus`: descriptor of the corpus used. It has the form `(labels, no_below, no_above)`, where labels is the list of labels considered when filtering the documents (`relevant` and `keyword`, or `keyword` alone); `no_below` and `no_above` have the same meaning as below;
- `no_below`: no-below value;
- `no_above`: no-above value;
- `topics`: number of topics;
- `alpha`: alpha value;
- `beta`: beta value;
- `coherence`: coherence score of the model;
- `times`: time spent evaluating this model;
- `seed`: seed used;
- `uuid`: UUID of the model;
- `num_docs`: number of documents;
- `num_not_empty`: number of documents that are not empty after filtering.
The command sets the `PYTHONHASHSEED` environment variable to 0, so setting the `seed` value is enough to obtain reproducible runs. More information on the `PYTHONHASHSEED` variable can be found here.
A slrkit project is a git repository, so it is possible to record the work done and exchange it using a remote repository. Since the `record` command tracks only the configuration of a project and the files that cannot be recreated directly using the slrkit commands, cloning or pulling a slrkit project requires some steps to recreate the missing files. In particular, the following commands must be run:

- `import`: to recreate the abstracts file;
- `journals filter`: to mark the excluded papers. This is mandatory if a journals file is present in the repository;
- `preprocess`: to recreate the preprocess file used by the lda-related commands.

After these commands, the working directory is ready to run any lda-related command. The `build` command executes these commands in order.
The `slrkit.py` code tries to auto-discover the configuration parameters of a script. This is done using the `ArgParse` class from the `slrkit_utils.argument_parser` module of the `slrkit_utils` repository. This class works like the standard `ArgumentParser` class of the `argparse` python module, but it collects information about each argument and stores it in the `slrkit_arguments` dictionary. Using this dictionary, `slrkit.py` can find the name of each argument, its default value, whether it is optional or required, and all the other annotations. With this information, `slrkit.py` can automatically create the default configuration files, and can easily pass the values in the configuration file to the command.
A script can be run as a command if it is adapted to do so. First, the script must be importable from the `slrkit.py` code. Second, the script module must define a function named `init_argparse` that takes no arguments and returns the `ArgParse` object used by the script itself. Third, the script information must be registered in the `SCRIPTS` dictionary (see below). Finally, the module must have a function (whose name can be chosen freely) that accepts an argparse `Namespace` as argument and executes all the logic of the script. This `Namespace` is the one returned by the `ArgParse` object after the command line parsing.
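A minimal sketch of a script adapted in this way; everything except `init_argparse` and the custom keyword arguments is a hypothetical example.

```python
from slrkit_utils.argument_parser import ArgParse


def init_argparse():
    parser = ArgParse(description='example command')
    parser.add_argument('datafile', help='input file', input=True)
    parser.add_argument('--output', '-o', help='output file', output=True,
                        suggest_suffix='_example.csv')
    return parser


def example_main(args):
    # all the logic of the script, driven by the parsed Namespace
    print(args.datafile, args.output)


if __name__ == '__main__':
    example_main(init_argparse().parse_args())
```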
The `slrkit.py` code uses these features to handle and run the script. The script code is imported by the `slrkit.py` code. The `init_argparse` function and the `ArgParse` object are used to handle the arguments of the script, to create a default configuration file for the command, and to handle the configuration file and prepare the arguments for the script. The function with the logic of the script is called by `slrkit.py` with all the required arguments.
The `ArgParse` class is defined in the `slrkit_utils.argument_parser` module of the `slrkit_utils` repository. It is a subclass of the `argparse.ArgumentParser` class that collects information about the configured arguments. The collected information is stored in the `slrkit_arguments` attribute, a dictionary where each key is the name of an argument and the corresponding value is a dictionary that contains all the collected information. This is done with the overridden `add_argument` method.
This method collects standard information like:

- the name of the argument (stored as the key of `slrkit_arguments`);
- the name of the destination of the argument read from the command line (stored as `dest`);
- the default value (`value`);
- the type (`type`);
- the help string (`help`);
- the `choice` keyword argument, that is the collection of allowable values for this argument (`choice`).
The overridden method also accepts some custom attributes in the form of keyword arguments. They are:

- `input`: bool value, default False; flags an argument as an input file;
- `output`: bool value, default False; flags an argument as an output file;
- `non_standard`: bool value, default False; specifies that this argument must be handled in a special way (currently this attribute is not used);
- `logfile`: bool value, default False; specifies that this argument is the path of a logfile;
- `suggest_suffix`: str value, default None; suffix to suggest to the user for the value of this argument;
- `cli_only`: bool value, default False; specifies that this argument is intended to be used on the command line only.
These attributes are stored in the argument dictionary using their name as the key. In addition, the `required` attribute is stored in the dictionary; this is a boolean value that tells whether the argument is required or optional. The `action` attribute is also stored; this is the `Action` object used by the argument parser to handle the argument and to store its correct value. This attribute can be used to store the argument value from the configuration file in the same way the argument parser does.
The `input` attribute is used to detect which arguments are input files coming from other stages. The `output` attribute is used to identify which argument is the output file of a script. The dependency system of the `slrkit` command uses these attributes to correctly suggest the default names of the input and output files in the configuration files, and to suggest which command must be run if one or more inputs are missing. The file name suggestion in the configuration files also uses the `suggest_suffix` attribute: if an argument has this attribute set, its value is used to create the default value during the creation of the configuration file. The default name will be `<project name><suggest_suffix>`. The `logfile` attribute marks the argument with the path to the log file, in order to collect all the project logs in the `log` directory inside the project configuration directory.
The `slrkit.py` code creates the configuration files using the content of the `slrkit_arguments` attribute of the `ArgParse` object of each script that is configured as a command. For each script argument not flagged as `cli_only` or `logfile`, a corresponding entry is created in the configuration file. The entry has the same name as the key of the `slrkit_arguments` dictionary. The `value` value is used as the default value of each entry, unless `suggest_suffix` is specified; in that case, the file name suggestion is performed as specified above. For each entry, the text of the `help` value of `slrkit_arguments` is provided as a comment. Moreover, a comment stating whether the value is required or not is also produced.
In the `slrkit.py` code, the `SCRIPTS` dictionary stores the information regarding the scripts used as commands. The key of this dictionary is the name of the command. If a command has some sub-commands, the corresponding key will be `<command name>_<sub-command name>`. Each entry of this dictionary has the following structure (an example follows the list):

- `module`: name of the module of the script of the command, without the `.py` extension;
- `additional_init`: boolean value that tells whether this command requires additional actions to be performed during the project initialization. An example is the `optimize_lda` command, which requires the `optimize_lda_ga_params.toml` file to be copied into the configuration directory and the `ga_params` entry of `optimize_lda.toml` to be updated accordingly;
- `depends`: list of the dependencies of the command;
- `no_config`: boolean value that tells whether a command does not use a configuration file. If it is `True`, the command does not use a configuration file, and so no configuration file is created by the `init` command.
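For illustration, two hypothetical `SCRIPTS` entries might look like this; the module names are assumptions based on the scripts mentioned above.

```python
SCRIPTS = {
    'import': {
        'module': 'import_biblio',  # runs import_biblio.py
        'additional_init': False,
        'depends': [],              # no input produced by other commands
        'no_config': False,
    },
    'terms_generate': {
        'module': 'gen_terms',      # runs gen_terms.py
        'additional_init': False,
        'depends': ['preprocess'],  # its input is the preprocess output
        'no_config': False,
    },
}
```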
The `depends` list contains an element for each input file of the script that is produced by another command. Each element is the name of the command that produces that file. The order of the elements must be the same as that of the corresponding inputs in the `ArgParse` argument declaration. For instance, if a script takes two inputs, and the first declared one depends on the output of the `preprocess` command while the second one depends on the output of the `terms generate` command, the corresponding `depends` list will be `['preprocess', 'terms_generate']`.
The `slrkit.py` code uses the `depends` list in this way:

- the list of the inputs of a script (the arguments flagged as `input`) is retrieved. The order of definition of each argument is preserved;
- for each input, the corresponding entry (the entry with the same index) in the `depends` list is taken;
- this entry is used to find the output (the argument flagged as `output`) of the command named in the `depends` entry, on which this input depends;
- this information is used both to provide a default value for each input during the creation of the configuration files and to suggest which command must be run if an input is missing.
The commands listed in the `SCRIPTS` dictionary are the only ones that are handled in the configuration file creation phase of the `init` command.
The `prepare_script_arguments` function handles the content of a configuration file and creates the `Namespace` with the arguments for a script. The function takes the following arguments:

* `config`: content of the config file;
* `config_dir`: path to the config file directory;
* `confname`: name of the config file;
* `script_args`: information about the script arguments. This dictionary is the `slrkit_arguments` attribute of the `ArgParse` object of the script.

The function returns the `Namespace` with the argument values.
All the arguments are filled using the values in the configuration file. The arguments flagged as `cli_only` in `script_args` are filled with the default value taken from `script_args`. The argument flagged as `logfile` is filled with a path to a log file in the `log` directory inside the configuration directory. The arguments flagged as `non_standard` are not processed by the function and must be handled by the code that runs the command. The `prepare_script_arguments` function also returns a dictionary with the inputs and a dictionary with the outputs of the script. These dictionaries have the name of the argument as the key and the value of the argument as the item.
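A hypothetical call, sketching how `slrkit.py` might use the function to run the `preprocess` command; the surrounding variable names and the three-value return shape follow the description above.

```python
from pathlib import Path

# `preprocess` is the imported script module; `config` is the parsed
# content of preprocess.toml.
parser = preprocess.init_argparse()
args, inputs, outputs = prepare_script_arguments(
    config,                    # content of the config file
    Path('slrkit.conf'),       # path to the config file directory
    'preprocess.toml',         # name of the config file
    parser.slrkit_arguments,   # argument metadata collected by ArgParse
)
# `args` is the Namespace passed to the function with the script logic.
```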