The executable can be found at bin/benchmark-generator.jar
. To execute it, make sure that you use Java 1.8 or later. The usage is as follows:
java -jar bin/benchmark-generator.jar
-f,--format <arg> The format of output files [txt]. Default: "txt".
-h,--help Prints the help.
-i,--input <arg> A single TeX file or a directory which is scanned recursively for TeX files.
-o,--output <arg> The output file (if the input is a single TeX file) or the output directory.
-p,--prefix <arg> Only consider TeX files which starts with the given prefix(es).
-r,--role <arg> Only extract the logical text blocks with given semantic role(s).
[abstract, affiliation, authors, acknowledgements, body, caption, figure,
footnote, formula, heading, listing-item, table, title, reference]
-s,--suffix <arg> The suffix to use on creating the ground truth files. Default: ".txt"
The parameters --input
and --output
are mandatory.
For parameters --prefix
and --role
, you can define multiple values by setting the parameters multiple times. For example, if you wish to extract title and abstract, type
java -jar bin/benchmark-generator.jar --input [...] --output [...] --role title --role abstract
There is also a Makefile, defining a rule benchmark
that calls the executable with values adapted to our project:
BENCHMARK_GENERATOR_JAR = ./bin/benchmark-generator.jar
INPUT = ../benchmark/src/
OUTPUT = ../benchmark/groundtruth/
SUFFIX = ".body.txt"
ROLE = "body"
FORMAT = "txt"
benchmark:
@java -jar $(BENCHMARK_GENERATOR_JAR) \
--input $(INPUT) \
--output $(OUTPUT) \
--suffix $(SUFFIX) \
--role $(ROLE) \
--format $(FORMAT)
Call it by typing make benchmark
.
If you wish to compile the source codes, make sure your system fulfills the following prerequisites:
- Linux-based system.
- Git 2.5 or later.
- Java JDK 1.8 or later.
- Apache Maven 3.3 or later.
Further, you need the dependency commons
. Download it by typing
git clone https://github.com/ckorzen/commons.git
in a folder of your choice and install it by typing
cd commons
mvn -DskipTests install
Back to benchmark-generator/
, type
make compile-benchmark-generator
This will compile the source codes and recreate the executable bin/benchmark-generator.jar
.
Basically, the benchmark generator is separated into three subfolders.
bin
contains the executable, as seen above.tex-parser
contains the source codes to parse TeX files syntactically.
The TeX parser is based on a TeX grammar and built with JavaCC.
The JJ file is given by TeXParser.jj
tex-paragraph-parser
contains the source codes to (1) interpret the parsed TeX files in order to identify the logical text blocks and (2) serialize them to files.
The interpreter, given by TeXParagraphsIdentifier.java
, is rule-based.
Basically, the rules define how to interpret particular TeX commands and how they affect the logical text blocks.
They are given by element-references.csv
.
The serialization of logical text blocks is done by the given serializers in the serializer
folder. Currently, the logical text blocks can only be serialized to plain text. The formats json and xml will be added soon.