Distributions enable Benerator to generate numbers with desired distribution characteristics or following a certain sequence. A distribution may also be applied to groups of data objects to provide them with certain distribution characteristics.
Distributions come in two flavors:
- Sequences: Algorithms for generating numbers
- Weights: Functions which provide the probability of a certain number
For most common needs there exist predefined sequences and weights.
For special needs, you can define and add your own custom ones.
A Distribution is selected with a distribution attribute and can be parameterized with a min value, a max value and a granularity. The granularity is applied in such a way that any generated number is min plus an integer multiple of granularity. Thus, a configuration
<attribute name="price" type="double" distribution="increment" min="0.25" max="100" granularity="0.25"/>
yields the numbers
0.25, 0.50, 0.75, 1.00, ..., 99.75, 100.00
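The granularity rule above can be checked with a small Python sketch. It is not Benerator code: the function name granular_values is hypothetical, and Decimal is used here only to avoid floating-point rounding in the illustration.

```python
from decimal import Decimal

def granular_values(minimum, maximum, granularity):
    """List every value of the form min + k * granularity that does not exceed max."""
    lo, hi, step = Decimal(minimum), Decimal(maximum), Decimal(granularity)
    count = int((hi - lo) / step)  # largest integer k with lo + k * step <= hi
    return [lo + k * step for k in range(count + 1)]

# Same parameters as the price attribute above: min=0.25, max=100, granularity=0.25
values = granular_values("0.25", "100", "0.25")
print(values[:4], values[-1], len(values))
```

With these parameters the sketch produces 400 values, starting at 0.25 and ending at 100.00.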
A Sequence is basically a number generator. It can provide a custom random algorithm, a custom weighted number generator or a unique number generation algorithm.
The Sequences used most often are
A weight function is basically a mathematical function that tells which weight to apply to which number.
The most frequently used weight functions are
WeightedNumbers is a special component for creating a small set of numbers based on a weighted-number literal, for example 1^70,3^30 for generating 70% 1 values and 30% 3 values.
This is a very convenient and simple approach for controlling parent-child cardinalities in nested data generation.
<attribute name="n" type="int" distribution="new WeightedNumbers('1^70,3^30')"/>
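To make the meaning of a weighted-number literal concrete, the following Python sketch parses such a literal and draws values from it. It only illustrates the idea and is not how Benerator implements WeightedNumbers. The helper names are hypothetical, and treating a value without a ^weight part as weight 1 is an assumption of this sketch.

```python
import random

def parse_weighted_numbers(literal):
    """Parse a literal such as '1^70,3^30' into (value, weight) pairs."""
    pairs = []
    for token in literal.split(","):
        value, _, weight = token.strip().partition("^")
        # Assumption of this sketch: a missing weight counts as 1
        pairs.append((int(value), float(weight) if weight else 1.0))
    return pairs

def sample(literal, n, seed=42):
    """Draw n values with probabilities proportional to their weights."""
    pairs = parse_weighted_numbers(literal)
    values = [v for v, _ in pairs]
    weights = [w for _, w in pairs]
    return random.Random(seed).choices(values, weights=weights, k=n)

draws = sample("1^70,3^30", 1000)
print(draws.count(1), draws.count(3))  # roughly 700 and 300
```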
When using WeightedNumbers to determine the cardinality of an Entity part which is a container,
then the container type must be declared. Typical settings are
or, in some cases,
<part name='y' container='array' countDistribution="new WeightedNumbers('0^70,1^20,2^10')">
    <attribute name='z' pattern='AAA'/>
</part>
Distributing other data than numbers
'Other data' usually comes from a data source and is imported by an element that declares a source, for example:
<attribute name="code" type="string" source="codes.csv"/>
When iterating through data (e.g. imported from a file or database), Benerator's default behavior is to serve each item exactly once and in the order provided. When the end of the data set is reached, Benerator stops.
With cyclic="true", Benerator serves the imported data consecutively too, but does not stop when it reaches the end. Instead, it restarts iteration.
Beware: For SQL queries this means that the query is reissued, so it may have a different result set than the former invocation.
<attribute name="code" type="string" source="codes.csv" cyclic="true"/>
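The difference between the default and the cyclic behavior can be sketched in Python as follows. The list codes stands in for the content of codes.csv and is hypothetical. Note one simplification: itertools.cycle replays items from memory, whereas the text above says that an SQL source is queried again on each restart.

```python
from itertools import cycle, islice

codes = ["A", "B", "C"]  # hypothetical content of codes.csv

def serve(items, count, cyclic=False):
    """Serve up to `count` items in order; restart at the end only if cyclic."""
    source = cycle(items) if cyclic else iter(items)
    return list(islice(source, count))

print(serve(codes, 5))               # ['A', 'B', 'C']
print(serve(codes, 5, cyclic=True))  # ['A', 'B', 'C', 'A', 'B']
```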
But that is not really a distribution. We can do better and get probability effects:
When importing data from data sources, you can specify weights. They are different when importing simple data or entities:
Importing primitive data weights
When importing primitive data from a CSV file, each value is expected to be in a separate row. If a row has more than one column, the content of the second column is interpreted as weight. If there is no such column, a weight of 1 is assumed. Benerator automatically normalizes over all data objects, so there is no need for manual weight normalization. Remember to use a filename that indicates the weight character, using a suffix such as .wgt.csv.
If you, for example, create a CSV file roles.wgt.csv
customer,7
clerk,2
admin,1
and use it in a configuration like this:
<generate type="user" count="100">
    <attribute name="role" source="roles.wgt.csv" />
</generate>
this will create 100 users of which about 70 will have the role customer, 20 clerk and 10 admin.
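The weight handling described above (second column as weight, default weight 1, automatic normalization) can be sketched in Python. This is an illustration under those stated rules, not Benerator's own loader. The function name load_weighted is hypothetical.

```python
import csv
import io

# The three rows of the roles file from the example above
ROLES_CSV = "customer,7\nclerk,2\nadmin,1\n"

def load_weighted(text):
    """Return {value: probability}; a row without a second column gets weight 1."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    weighted = [(row[0], float(row[1]) if len(row) > 1 else 1.0) for row in rows]
    total = sum(weight for _, weight in weighted)
    # Normalize so that all probabilities sum to 1
    return {value: weight / total for value, weight in weighted}

probabilities = load_weighted(ROLES_CSV)
print(probabilities)
```

The weights 7, 2 and 1 sum to 10, so the normalized probabilities are 0.7, 0.2 and 0.1, which matches the expected split of about 70, 20 and 10 out of 100 users.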
Alternative Delimiters for importing weights
By default, the semicolon is the delimiter between commands: Benerator splits imports commands by their delimiter. The default separator can be overridden with the separator attribute:
<generate type="user" count="100">
    <attribute name="role" source="roles.wgt.csv" separator="|" />
</generate>
It is also possible to specify the separator for the whole project in your <setup> node with defaultSeparator:
<setup defaultSeparator="|">
    <generate type="user" count="100">
        <attribute name="role" source="roles.wgt.csv" />
    </generate>
</setup>
Weighing imported entities by attribute
When importing entities, one entity attribute can be chosen to represent the weight, using a distribution of the form weighted[attribute]. Remember to indicate that the source file contains entity data by using the correct file suffix, e.g. .ent.csv.
Example: If you are importing cities and want to weigh them by their population, you can define a CSV file cities.ent.csv
name,population
New York,8274527
Los Angeles,3834340
San Francisco,764976
and, for example, create addresses with city names weighted by population by specifying
<generate type="address" count="100" consumer="ConsoleExporter">
    <variable name="city_data" source="cities.ent.csv" distribution="weighted[population]"/>
    <id name="id" type="long" />
    <attribute name="city" script="city_data.name"/>
</generate>
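As a rough illustration of weighting entities by one of their attributes, the Python sketch below reads the city data from above and picks entities with probability proportional to the population column. It is a simplified model, not Benerator's implementation. The function name pick_weighted and the fixed seed are assumptions made for the example.

```python
import csv
import io
import random

CITIES_CSV = (
    "name,population\n"
    "New York,8274527\n"
    "Los Angeles,3834340\n"
    "San Francisco,764976\n"
)

def pick_weighted(text, weight_column, n, seed=1):
    """Pick n entities, each with probability proportional to weight_column."""
    entities = list(csv.DictReader(io.StringIO(text)))
    weights = [float(entity[weight_column]) for entity in entities]
    return random.Random(seed).choices(entities, weights=weights, k=n)

cities = [entity["name"] for entity in pick_weighted(CITIES_CSV, "population", 10000)]
print({name: cities.count(name) for name in sorted(set(cities))})
```

New York holds about 64% of the total population in this file, so it should account for roughly that share of the picks.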
Distributing unweighted Data
If the imported data does not come with weight information, you can apply a Distribution to control probability:
<attribute name="code" type="string" source="codes.csv" distribution="random"/>
For Weight Functions, all available data is loaded into RAM and then the Weight Function's number generation feature is used to generate indices of the data items.
Most Sequences implement data distribution as described above for Weight Functions, but this behavior can be programmed individually for each Sequence.
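The index-based approach described above can be modeled in a few lines of Python: hold all items in memory, evaluate a weight function for each index, and draw indices accordingly. This is a conceptual sketch only. The function name distribute and the example weight function 1/(i+1) are hypothetical and not part of Benerator.

```python
import random

def distribute(items, weight_function, n, seed=7):
    """Draw n items by generating weighted indices into an in-memory list."""
    data = list(items)  # all available data is held in memory
    weights = [weight_function(i) for i in range(len(data))]
    indices = random.Random(seed).choices(range(len(data)), weights=weights, k=n)
    return [data[i] for i in indices]

codes = ["A", "B", "C", "D"]  # hypothetical imported values
picked = distribute(codes, lambda i: 1.0 / (i + 1), 1000)
print({code: picked.count(code) for code in codes})
```

Because every item and every weight is materialized up front, memory use grows with the size of the data set, which is the reason for the size limits noted below.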
Attention: Most distributions load all available data to distribute into RAM.
Most sequences should not be applied to data sets of more than 100,000 elements; a weight function should be restricted to at most 10,000 elements.
'Unlimited' Sequences which are suitable for arbitrarily large data sets are