Benerator can be called with a Benerator file name as command line parameter, like
benerator test.ben.xml
If no file is specified, Benerator expects a file benerator.xml
in the current directory.
You can change the default behavior with Java VM parameters, e.g.
benerator.bat -Dfile.encoding=iso-8859-1 -Djava.io.tmpdir="C:\temp" test.ben.xml
Validation can also be turned off from the command line using a VM parameter:
mvn benerator:generate -Dbenerator.validate=false
or
benerator myproject.ben.xml -Dbenerator.validate=false
You can specify the following options on the command line:
Option | Description | Remarks |
---|---|---|
--version,-v | Display system and version information | |
--help,-h | Display help information | |
--list | List the available environments or systems. <type> may be env, db or kafka. | |
--clearCaches | Clear all caches | |
--mode | Activate Benerator mode strict, lenient or turbo | default is lenient |
--anonReport | Verify 'pct' percent of anonymized data and display an anonymization report. 'pct' is an integer, 100 for complete tracking | Enterprise Edition only |
The DbSnapshotTool creates a snapshot of a full database schema and stores it in a DbUnit XML file. It is invoked from the command line in Windows by calling
snapshot [VM-params] export-filename.dbunit.xml
or, on Unix and Mac OS X systems,
sh snapshot [VM-params] export-filename.dbunit.xml
If the export filename is left out, the snapshot will be stored in a file called snapshot.dbunit.xml.
You need the following VM parameters to configure database access.
Use them like -DdbUser=me:
Parameter | Description |
---|---|
dbUrl | The JDBC URL of the database |
dbDriver | The JDBC driver class name |
dbUser | user name |
dbPassword | user password |
dbSchema | Name of the schema to extract (defaults to the user name) |
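As a minimal sketch of how the VM parameters combine with the invocation format above, the following Python snippet assembles a snapshot command line. The connection values (an in-memory H2 URL, user `sa`, the password and the file name `demo.dbunit.xml`) are hypothetical placeholders, and the helper `snapshot_command` is not part of Benerator; it only builds and prints the command.

```python
import shlex


def snapshot_command(params, export_file=None):
    """Build the argument list: snapshot [VM-params] export-filename.dbunit.xml."""
    cmd = ["snapshot"]
    # Each parameter from the table is passed as a Java VM parameter: -Dname=value
    cmd += [f"-D{key}={value}" for key, value in params.items()]
    if export_file:
        cmd.append(export_file)
    return cmd


# Hypothetical connection settings, for illustration only
params = {
    "dbUrl": "jdbc:h2:mem:demo",
    "dbDriver": "org.h2.Driver",
    "dbUser": "sa",
    "dbPassword": "secret",
}

print(shlex.join(snapshot_command(params, "demo.dbunit.xml")))
```

On Unix and Mac OS X systems, the printed command would be prefixed with `sh` as described above. Leaving out `dbSchema` relies on the documented default (the user name).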
Benerator provides a Benchmark Tool to assess and compare the performance of typical generation and anonymization approaches.
It is especially useful if you want to assess the generation/anonymization performance of different hardware and software settings, such as the number of cores, operating system, Java virtual machine, system software configuration and Benerator Enterprise Edition's multithreading configuration.
The benchmarks perform a set of predefined, typical generation and anonymization tasks.
To invoke the Benchmark Tool with standard settings, just open a text console or terminal and enter
benerator-benchmark
The benchmark then runs for a few minutes and prints a measurement summary. In Benerator Community Edition, only single-threaded execution is supported, so the report will look something like this:
+---------------------------------------------------------------------------+
| Benchmark throughput of Benerator Community Edition 3.1.0-jdk-11 |
| on a Mac OS X 11.4 x86_64 with 8 cores |
| Java version 11.0.11 |
| OpenJDK 64-Bit Server VM 11.0.11+9 (AdoptOpenJDK) |
| Date/Time: 2021-09-17T10:41:43.510603+02:00[Europe/Berlin] |
| |
| Numbers are reported in million entities generated per hour |
+----------------------------------------------------------------+----------+
| Benchmark | 1 Thread |
+----------------------------------------------------------------+----------+
| gen-string.ben.xml | 37 |
+----------------------------------------------------------------+----------+
| gen-person-showcase.ben.xml | 26 |
+----------------------------------------------------------------+----------+
| anon-person-showcase.ben.xml | 31 |
+----------------------------------------------------------------+----------+
| anon-person-regex.ben.xml | 346 |
+----------------------------------------------------------------+----------+
| anon-person-hash.ben.xml | 386 |
+----------------------------------------------------------------+----------+
| anon-person-random.ben.xml | 576 |
+----------------------------------------------------------------+----------+
| anon-person-constant.ben.xml | 2,210 |
+----------------------------------------------------------------+----------+
The header reports the system settings; each of the following rows
displays a benchmark name and its performance, measured in million entities
generated per hour.
This means, for example, that the anon-person-constant.ben.xml benchmark anonymizes
2,210 million = 2.21 billion data sets per hour running in a single thread.
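To relate the reported unit to more familiar figures, the conversion can be sketched in a few lines of Python. The helper name `per_second` is made up for this illustration; the input value is taken from the report above.

```python
def per_second(million_per_hour):
    """Convert a benchmark figure (million entities per hour) to entities per second."""
    return million_per_hour * 1_000_000 / 3600


# anon-person-constant.ben.xml, single-threaded, from the report above
rate = per_second(2210)
print(f"{rate:,.0f} entities per second")
```

For the figure above, this is roughly 614 thousand data sets per second.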
The performance numbers above have been measured on a plain Macbook Air M1 of 2020.
For a Benerator Enterprise Edition installation running on a machine with several cores, the benchmark is executed for several characteristic executionMode settings in order to find their sweet spot.
A benchmark run on the same system with Benerator Enterprise Edition yields the following result:
+-----------------------------------------------------------------------------+
| Benchmark throughput of Benerator Enterprise Edition 3.1.0-jdk-11 |
| on a Mac OS X 11.4 x86_64 with 8 cores |
| Java version 11.0.11 |
| OpenJDK 64-Bit Server VM 11.0.11+9 (AdoptOpenJDK) |
| Date/Time: 2021-09-17T10:09:44.460364+02:00[Europe/Berlin] |
| |
| Numbers are reported in million entities generated per hour |
+------------------------------+----------+-----------+-----------+-----------+
| Benchmark | 1 Thread | 2 Threads | 4 Threads | 6 Threads |
+------------------------------+----------+-----------+-----------+-----------+
| gen-string.ben.xml | 243 | 379 | 809 | 747 |
+------------------------------+----------+-----------+-----------+-----------+
| gen-person-showcase.ben.xml | 88 | 162 | 249 | 193 |
+------------------------------+----------+-----------+-----------+-----------+
| anon-person-showcase.ben.xml | 83 | 165 | 241 | 187 |
+------------------------------+----------+-----------+-----------+-----------+
| anon-person-regex.ben.xml | 684 | 1,008 | 1,145 | 794 |
+------------------------------+----------+-----------+-----------+-----------+
| anon-person-hash.ben.xml | 923 | 1,344 | 1,142 | 1,187 |
+------------------------------+----------+-----------+-----------+-----------+
| anon-person-random.ben.xml | 1,250 | 1,655 | 1,254 | 1,274 |
+------------------------------+----------+-----------+-----------+-----------+
| anon-person-constant.ben.xml | 1,926 | 2,533 | 1,522 | 1,503 |
+------------------------------+----------+-----------+-----------+-----------+
Note that we have not only improved the performance of the Community Edition, but also optimized the Enterprise Edition to be several times faster than the Community Edition.
For your performance optimization in Enterprise Edition, note that additional threads bring additional performance, but once a certain level of concurrency is reached, performance no longer improves and may even deteriorate seriously. This may have one or more of several causes:
- Coordination and synchronization overhead
- More congestion of threads waiting at bottlenecks
- Having a serious workload on more threads than CPUs are available: the more threads a CPU has to switch between, the more time is lost on context switches, and you may end up spending more time switching than working.
- With more threads comes higher throughput, but also higher storage needs. When critical buffer size limits are exceeded, a system's processing capacity may go down significantly even though the overall CPU load looks relatively low.
The sweet spot, where you get optimum performance at the lowest concurrency, is usually where the number of threads equals the number of cores or is only slightly larger. As you might guess from the results, the test laptop has 4 cores. Actually, it has more, but its 4 high-performance cores are the only ones that matter for generation and anonymization performance.
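The sweet spot can be read off a report mechanically. The following Python sketch copies the throughput figures from the Enterprise Edition report above and determines, for each benchmark, the thread count with the highest throughput and its speedup over single-threaded execution. The names `THREADS`, `RESULTS` and `sweet_spot` are illustrative and not part of the Benchmark Tool.

```python
# Throughput in million entities per hour, copied from the Enterprise Edition report
THREADS = [1, 2, 4, 6]
RESULTS = {
    "gen-string": [243, 379, 809, 747],
    "gen-person-showcase": [88, 162, 249, 193],
    "anon-person-showcase": [83, 165, 241, 187],
    "anon-person-regex": [684, 1008, 1145, 794],
    "anon-person-hash": [923, 1344, 1142, 1187],
    "anon-person-random": [1250, 1655, 1254, 1274],
    "anon-person-constant": [1926, 2533, 1522, 1503],
}


def sweet_spot(throughputs, threads=THREADS):
    """Return (thread count, speedup over one thread) of the fastest setting."""
    best = max(range(len(threads)), key=lambda i: throughputs[i])
    return threads[best], throughputs[best] / throughputs[0]


for name, values in RESULTS.items():
    count, speedup = sweet_spot(values)
    print(f"{name:22s} best at {count} thread(s), speedup {speedup:.2f}x")
```

In this report, the heavier benchmarks (gen-string, the showcase benchmarks and anon-person-regex) peak at 4 threads, while the very lightweight anonymization benchmarks already peak at 2 threads.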
The Benchmark Tool has some command line parameters to configure its test runs.
For a short summary, type benerator-benchmark --help
The general invocation format is
benerator-benchmark [options] [name]
Name is an optional benchmark name. When specified, only this benchmark is executed. When left out, all benchmarks applicable to the given configuration are called.
Example: In order to only call the gen-string benchmark, type
benerator-benchmark gen-string
The command line options are as follows:
Option | Meaning | Default Setting |
---|---|---|
--ce | Run on Benerator Community Edition (CE) | true on CE |
--ee | Run on Benerator Enterprise Edition (EE) | true on EE and only available there |
--minSecs n | Choose a workload to have the benchmark run at least n seconds | 10 |
--maxThreads k | Use only up to k cores for testing (only on EE) | a bit more than the number of reported cores |
--env <spec> | Run the tests applicable to the specified system(s). <spec> may be an environment name, a system (denoted by environment#system) or a comma-separated list of these (without whitespace) | |
--mode m | Activate Benerator mode strict, lenient or turbo | lenient |
--list | List the names of the predefined benchmarks | |
--help | Print this help | |
A --minSecs setting of 30 requires the benchmark to run with a workload that needs at least 30 seconds to process. It is advisable to choose times large enough that the fixed initialization time of Benerator has little impact on the measurement and that the JVM HotSpot optimizer gets some time to make Benerator run even more efficiently. If --minSecs is not specified, a default of 10 seconds is used. This is too short for optimum measurements, but was chosen as a defensive measure to save your system's hard drive space: the file generation benchmarks could easily fill up a terabyte of hard disk space in a few minutes. Even on standard developer hardware, a file export of one minute can produce a file of 10 gigabytes. Generated files are deleted automatically after each benchmark run, but take care not to fill up your disk during a benchmark run.
By default, the Benchmark Tool tests thread settings from single-threaded to a concurrency
slightly larger than the number of cores of the system it is running on. Unfortunately,
some systems report a higher number of cores than are available for our tests.
For example, a MacBook Air M1 of 2020 reports 8 cores, but only 4 of them are
high-performance cores; the other 4 are efficiency cores which do not contribute performance to
Benerator. So a setting of --maxThreads 6 makes the Benchmark Tool go up to only 6 threads
instead of the 10 threads it would have used by default.
The reports above have been created using
benerator-benchmark --ce --minSecs 30 --maxThreads 6
In order to assess database processing performance, you need to configure the relevant database(s) in an environment repository, see Environment Files for details.
For a quick start, the Benchmark Tool comes with two built-in databases, h2 and hsqlmem, in an environment builtin.
This will run the database benchmarks on the built-in h2:
benerator-benchmark --env builtin#h2
In order to test all systems defined in the environment builtin, specify just the environment name:
benerator-benchmark --env builtin
You can test your own databases in a similar manner by defining them in an environment file
(e.g. local.env.properties) and providing it in the command line options as above.
You will notice that the database benchmarks may run significantly longer than the simple tests. This is because two access types, 'read' and 'write', are measured, and each of them is required to run for at least 'minSecs' seconds. For some databases, writing is significantly slower than reading (up to a factor of 20), so you need to write data for 10 minutes in order to read data for 30 seconds. The database benchmarks alleviate this a bit by performing two reads for each write (effectively halving the execution time), but they still take a long time. So please be patient.
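The arithmetic behind these run times can be sketched as a back-of-the-envelope estimate in Python. It assumes a simplified model that is not taken from the tool's implementation: writing an entity takes a fixed multiple of the time needed to read it, and the written data is read a given number of times. The function name `estimated_write_seconds` is hypothetical.

```python
def estimated_write_seconds(min_secs, write_slowdown, reads_per_write=2):
    """Estimate the seconds of writing needed so that reading lasts min_secs in total.

    Simplified model (an assumption): writing one entity takes write_slowdown
    times as long as reading it, and the data is read reads_per_write times.
    """
    return min_secs * write_slowdown / reads_per_write


# 30 seconds of reading, writes 20 times slower than reads
print(estimated_write_seconds(30, 20, reads_per_write=1) / 60, "minutes with one read per write")
print(estimated_write_seconds(30, 20) / 60, "minutes with two reads per write")
```

Under this model, the 10 minutes of writing mentioned above shrink to 5 minutes when each written data set is read twice.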
Testing Kafka performance is a bit tricky, so the Benchmark tool needs one dedicated topic per test. You can reuse pre-existing topics, but they must be empty when starting the tests. Otherwise, the benchmark may read pre-existing data leading to wrong performance metrics.
Currently, there are two Kafka benchmarks:
Name | required 'system' name | Description |
---|---|---|
kafka-small-entity | kafka_small_entity | Reads and writes messages with small entities in JSON format |
kafka-big-entity | kafka_big_entity | Reads and writes messages with big entities in JSON format (several KBs) |
A dev environment file (dev.env.properties) might look like this; you can use it to map the system names to topic names
which are actually available on your Kafka cluster:
kafka_small_entity.kafka.bootstrap.servers=localhost:9092
# use the following line to specify a topic for the kafka-small-entity benchmark
kafka_small_entity.kafka.topic=kafkaQueue1
kafka_small_entity.kafka.format=json
kafka_big_entity.kafka.bootstrap.servers=localhost:9092
# use the following line to specify a topic for the kafka-big-entity benchmark
kafka_big_entity.kafka.topic=kafkaQueue2
kafka_big_entity.kafka.format=json
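Since each Kafka benchmark needs its own dedicated topic, it can help to check the mapping before a run. The following Python sketch extracts the topic per system name from the environment file content shown above and verifies that no topic is used twice. It is a simplified reader for plain `key=value` lines and `#` comments, not a full implementation of the Java properties format, and the function name `topics_by_system` is made up for this example.

```python
SAMPLE = """\
kafka_small_entity.kafka.bootstrap.servers=localhost:9092
# use the following line to specify a topic for the kafka-small-entity benchmark
kafka_small_entity.kafka.topic=kafkaQueue1
kafka_small_entity.kafka.format=json
kafka_big_entity.kafka.bootstrap.servers=localhost:9092
# use the following line to specify a topic for the kafka-big-entity benchmark
kafka_big_entity.kafka.topic=kafkaQueue2
kafka_big_entity.kafka.format=json
"""


def topics_by_system(text):
    """Map each system name to the topic configured in '<system>.kafka.topic'."""
    topics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        system, _, setting = key.strip().partition(".")
        if setting == "kafka.topic":
            topics[system] = value.strip()
    return topics


topics = topics_by_system(SAMPLE)
print(topics)

# Each benchmark needs a dedicated topic, so no topic may appear twice
if len(set(topics.values())) != len(topics):
    raise ValueError("Two Kafka benchmark systems share the same topic")
```

To check a real file, the sample text would be replaced with the content of your own environment file, for example read via `open("dev.env.properties").read()`.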
The XMLCreator reads an XML Schema file and creates a number of XML files that comply with the schema. It can read XML annotations which provide Benerator configuration in the XML schema file. It is invoked from the command line and has the following parameter order:
createxml <schemaUri> <root-element> <filename-pattern> <file-count> [<properties file name(s)>]
Their meaning is as follows:
- schemaUri: the location (typically file name) of the XML schema file
- root-element: the XML element of the schema file that should be used as root of the generated XML file(s)
- filename-pattern: the naming pattern to use for the generated XML files. It has the form of a java.text.MessageFormat pattern and takes the number of the generated file as parameter {0}.
- file-count: the number of XML files to generate
- properties file name(s): an optional (space-separated) list of properties files to include in the generation process
Under Windows, an example call would be:
createxml myschema.xsd product-list products-{0}.xml 10000 perftest.properties
or, on Unix and Mac OS X systems,
sh createxml myschema.xsd product-list products-{0}.xml 10000 perftest.properties
to generate 10,000 XML files that comply with the XML Schema definition in file myschema.xsd
and have product-list as root element. The files will be
named products-1.xml, products-2.xml, products-3.xml, ...
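The effect of the filename pattern can be approximated with a short Python sketch. It only mimics the plain `{0}` placeholder by string replacement and does not reproduce java.text.MessageFormat itself, which additionally supports format types, quoting and locale-specific number formatting. The function name `expand` is illustrative.

```python
def expand(pattern, count):
    """List the file names for a pattern with a plain {0} placeholder.

    Approximation only: the placeholder is replaced by the file number
    (starting at 1), without MessageFormat's formatting rules.
    """
    return [pattern.replace("{0}", str(number)) for number in range(1, count + 1)]


print(expand("products-{0}.xml", 3))
```

For the first three files, this yields the names listed above.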