Because profiles created in 100 Gb environments are very large, they need to be sampled. However, sampling the biflows of a profile purely at random is likely to distort some important traffic characteristics, namely:
- packets to bytes ratio
- biflows to packets ratio
- biflows to bytes ratio
- IPv6 to IPv4 ratio
- proportional representation of L4 protocols in the profile (top 5)
- proportional representation of source and destination ports in the profile (top 10)
- proportional representation of average packet sizes in biflows
The profile sampler uses a genetic algorithm to find a sample of the profile (further referred to as a solution) in which none of the characteristics listed above (further referred to as metrics) deviates from the original profile by more than a configured percentage.
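The acceptance criterion can be illustrated with a minimal sketch. It covers only the four ratio metrics, and the `Biflow` record, its field names, and the helper functions are hypothetical names chosen for this example, not the sampler's actual code. It also assumes the profile contains at least one IPv4 biflow and that no original metric is zero.

```python
from dataclasses import dataclass


@dataclass
class Biflow:
    # Hypothetical, simplified record; a real profile row has more fields.
    packets: int
    nbytes: int
    ipv6: bool


def ratio_metrics(biflows):
    """Compute the four ratio metrics for a list of biflows."""
    count = len(biflows)
    packets = sum(b.packets for b in biflows)
    nbytes = sum(b.nbytes for b in biflows)
    ipv6 = sum(1 for b in biflows if b.ipv6)
    return {
        "packets_to_bytes": packets / nbytes,
        "biflows_to_packets": count / packets,
        "biflows_to_bytes": count / nbytes,
        # Assumption: at least one IPv4 biflow, so the divisor is not zero.
        "ipv6_to_ipv4": ipv6 / (count - ipv6),
    }


def relative_deviations(original, sample):
    """Relative deviation of every sample metric from the original metric."""
    return {
        name: abs(sample[name] - original[name]) / original[name]
        for name in original
    }


def is_acceptable(original, sample, max_deviation=0.005):
    """True if no metric deviates by more than max_deviation (0.005 = 0.5%)."""
    deviations = relative_deviations(original, sample)
    return all(d <= max_deviation for d in deviations.values())
```

A sample that keeps the same mix of biflows passes the check, while a sample that drops all IPv6 biflows fails it on the IPv6 to IPv4 ratio alone.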
Each individual in the population is encoded as a bitmap in which the bit at a given index indicates whether the biflow with the same index in the profile is present in the solution.
At the beginning of the genetic algorithm, several individuals are initialized randomly. This is done by setting a random number of bits (within the provided lower and upper sampling bounds) to one in the bitmap of each individual.
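A minimal sketch of the encoding and the random initialization might look as follows. The function names are hypothetical, and the bitmap is modeled as a plain list of 0/1 integers for readability; the real implementation may use a more compact representation.

```python
import math
import random


def random_individual(profile_size, min_sampling, max_sampling, rng):
    """Create a bitmap with a random number of ones within the sampling bounds."""
    low = math.ceil(profile_size * min_sampling)
    high = math.floor(profile_size * max_sampling)
    ones = rng.randint(low, high)  # both bounds are inclusive
    bitmap = [0] * profile_size
    # Pick `ones` distinct biflow indexes and mark them as selected.
    for index in rng.sample(range(profile_size), ones):
        bitmap[index] = 1
    return bitmap


def initial_population(profile_size, population_size,
                       min_sampling, max_sampling, seed=None):
    """Create the initial population; a fixed seed makes the run reproducible."""
    rng = random.Random(seed)
    return [
        random_individual(profile_size, min_sampling, max_sampling, rng)
        for _ in range(population_size)
    ]
```

For example, `initial_population(1000, 16, 0.1, 0.2, seed=1)` yields 16 bitmaps of length 1000, each selecting between 100 and 200 biflows.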
The genetic algorithm runs for a configurable number of generations or until an acceptable solution is discovered. Each generation consists of the following stages:
- Selection - select parents from the current population for mating
- Crossover - parents mate, producing the same number of offspring
- Mutation - every offspring mutates some of its genes
- Repair - add genes to offspring that have too few of them and remove genes from offspring that have too many, so that the sample size stays within the sampling bounds
- Fitness - compute the fitness value of every offspring
- check whether any of the offspring can be accepted as a solution to our problem
- the offspring become the parents in the next generation
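The repair stage in the list above can be sketched as follows. This is only an illustration under the assumption that surplus or missing genes are picked at random; the text does not specify how the sampler chooses them, and the function name is hypothetical.

```python
import random


def repair(bitmap, min_ones, max_ones, rng):
    """Return a copy of bitmap whose number of ones lies in [min_ones, max_ones].

    Assumption: the bits to flip are chosen uniformly at random.
    """
    repaired = list(bitmap)
    ones = [i for i, bit in enumerate(repaired) if bit]
    zeros = [i for i, bit in enumerate(repaired) if not bit]
    if len(ones) > max_ones:
        # Too many selected biflows: clear randomly chosen surplus bits.
        for index in rng.sample(ones, len(ones) - max_ones):
            repaired[index] = 0
    elif len(ones) < min_ones:
        # Too few selected biflows: set randomly chosen missing bits.
        for index in rng.sample(zeros, min_ones - len(ones)):
            repaired[index] = 1
    return repaired
```

An offspring that already lies within the bounds is returned unchanged.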
The fitness value of an individual is 100 minus the percentage deviations of all its metrics from the metrics of the original profile. Solutions with superb results in some metrics and terrible results in others are considered worse than solutions in which all metric deviations are mediocre. To further penalize solutions in which some metrics deviate by more than 1% from the original, all metric deviations are squared. In general, deviations are expected to drop below 1% for solutions with a fitness above 94.
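The exact formula is not spelled out here, so the following is only one plausible reading of the description: the squared percentage deviations are summed and subtracted from 100. The function name and the example values are made up for illustration.

```python
def fitness(deviations_percent):
    """Fitness as 100 minus the sum of squared percentage deviations.

    Assumption: this is one possible reading of the description, not
    necessarily the formula used by the sampler.
    """
    return 100.0 - sum(d * d for d in deviations_percent)


# All four metrics are mediocre (2% off each).
balanced = [2.0, 2.0, 2.0, 2.0]
# Three metrics are perfect, one is terrible (5% off).
lopsided = [0.0, 0.0, 0.0, 5.0]

print(fitness(balanced))  # 84.0
print(fitness(lopsided))  # 75.0
```

Although the lopsided solution has the lower average deviation (1.25% versus 2%), it receives the lower fitness. Squaring also shrinks deviations below 1% and amplifies deviations above 1%, which matches the penalty described above.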
Individuals for the next generation are selected using stochastic universal sampling, which increases the chance that individuals with a worse fitness value survive the selection process. This reduces the selection pressure and lowers the chance that the algorithm gets stuck in a local optimum.
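Stochastic universal sampling places equally spaced pointers over the cumulative fitness of the population, using a single random offset. A minimal sketch, assuming non-negative fitness values with a positive sum (the function name is hypothetical):

```python
import random


def stochastic_universal_sampling(fitnesses, count, rng):
    """Select `count` individual indexes using equally spaced pointers."""
    total = sum(fitnesses)
    step = total / count
    start = rng.uniform(0, step)  # the only random draw of the whole selection
    selected = []
    index = 0
    cumulative = fitnesses[0]
    for k in range(count):
        pointer = start + k * step
        # Advance to the individual whose fitness segment contains the pointer.
        while cumulative < pointer and index < len(fitnesses) - 1:
            index += 1
            cumulative += fitnesses[index]
        selected.append(index)
    return selected
```

Because the pointers are evenly spaced, every individual is selected either the floor or the ceiling of its expected number of times, so no single individual can dominate the selection by chance.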
Crossover is performed on 4 individuals using the standard two-point strategy.
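The text does not say how the 4 individuals are combined, so the sketch below illustrates only the two-point strategy itself on a single pair of parents. The function name is hypothetical.

```python
import random


def two_point_crossover(parent_a, parent_b, rng):
    """Swap the segment between two random cut points of two parents."""
    size = len(parent_a)
    # Two distinct cut points in 0..size, in ascending order.
    first, second = sorted(rng.sample(range(size + 1), 2))
    child_a = parent_a[:first] + parent_b[first:second] + parent_a[second:]
    child_b = parent_b[:first] + parent_a[first:second] + parent_b[second:]
    return child_a, child_b
```

Note that the children may end up with a different number of ones than their parents, which is why a repair stage follows later in the generation.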
The mutation operator has the greatest influence on the outcome of the genetic algorithm. It shuffles a random part of the genome of the solution. Shuffling suits this case well because it changes which biflows are selected without changing how many are selected, so the number of biflows stays within the sampling bounds. The mutation pressure is determined by the average fitness value in the population. If the average fitness value is below 80, more genes are mutated in an attempt to find more diverse solutions. If the average fitness value is 80 or higher, fewer genes are mutated in order to improve the solutions in the population incrementally.
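A minimal sketch of such a shuffle mutation follows. Several details are assumptions made for this example: the shuffled part is taken to be one contiguous segment, and the rates of 10% and 1% of the genome are invented placeholders, since the text gives only the fitness threshold of 80.

```python
import random


def mutation_span(genome_length, average_fitness,
                  threshold=80.0, high_rate=0.10, low_rate=0.01):
    """Number of genes to shuffle; the two rates are hypothetical values."""
    rate = high_rate if average_fitness < threshold else low_rate
    return max(2, int(genome_length * rate))


def shuffle_mutation(bitmap, span, rng):
    """Shuffle a random contiguous segment of `span` genes.

    Assumption: the mutated part of the genome is contiguous.
    """
    mutated = list(bitmap)
    span = min(span, len(mutated))
    start = rng.randrange(len(mutated) - span + 1)
    segment = mutated[start:start + span]
    rng.shuffle(segment)
    mutated[start:start + span] = segment
    return mutated
```

Shuffling only permutes existing bits, so the number of ones before and after the mutation is identical.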
The algorithm runs for a fixed number of generations or until it finds a solution which satisfies the criteria for acceptance.
Start the profile sampler by running ft-profile-sampler <args>.
Arguments:
-h, --help show this help message and exit
-u VALUE, --max-sampling VALUE
maximum sampling value (mandatory, must be between 0 and 1)
-l VALUE, --min-sampling VALUE
minimum sampling value (mandatory, must be between 0 and 1, lower than maximum sampling value)
-i FILE, --input FILE
path to a CSV file containing the input profile (mandatory)
-o FILE, --output FILE
path to a file where the sample should be written (mandatory)
-m FILE, --metrics FILE
path to a file where metrics of the result should be written (mandatory)
-d VALUE, --deviation VALUE
acceptable deviation of each key metric from the original profile metric
(must be between 0 and 1, default: 0.005)
-s VALUE, --seed VALUE
seed for the random number generator to reproduce a specific run
-g VALUE, --generations VALUE
number of generations (default: 500)
-p VALUE, --population VALUE
population size (default: 16)
-q, --quiet do not print any runtime information
-t VALUE, --port-limit VALUE
omit ports whose proportional representation in the profile is less than
a threshold when calculating fitness (must be between 0 and 1, default: 0.005)
-r VALUE, --proto-limit VALUE
omit protocols whose proportional representation in the profile is less than
a threshold when calculating fitness (default: 0.005)
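As an illustration of how the arguments fit together, the following sketch assembles a command line from the options listed above. The helper `build_command` and all file names are hypothetical, and the bounds check assumes that the limits 0 and 1 are themselves allowed.

```python
import shlex


def build_command(input_csv, output_csv, metrics_file,
                  min_sampling, max_sampling, deviation=None, seed=None):
    """Assemble an ft-profile-sampler command line (hypothetical helper)."""
    # Assumption: the bounds 0 and 1 are inclusive.
    if not 0 <= min_sampling < max_sampling <= 1:
        raise ValueError("need 0 <= min_sampling < max_sampling <= 1")
    command = [
        "ft-profile-sampler",
        "--input", input_csv,
        "--output", output_csv,
        "--metrics", metrics_file,
        "--min-sampling", str(min_sampling),
        "--max-sampling", str(max_sampling),
    ]
    # Optional arguments are added only when explicitly requested.
    if deviation is not None:
        command += ["--deviation", str(deviation)]
    if seed is not None:
        command += ["--seed", str(seed)]
    return command


command = build_command("profile.csv", "sample.csv", "metrics.txt",
                        min_sampling=0.1, max_sampling=0.2, seed=42)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch the sampler. Fixing the seed makes a specific run reproducible.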