Task for Methods and Models for Multivariate Data Analysis

ShumwayGordon/MMMDA_ITMO

Goal

Sampling of multivariate random variables

Dataset description

The “Ozone Level Detection Data Set” was used in this work. Columns 'WSR_PK', 'WSR_AV' (wind speed ratio peak and average), 'T_PK', 'T_AV' (temperature peak and average), 'KI' (K-index), 'TT' (T-Totals), and 'Precp' (precipitation) were used as features. The 'SLP' (sea level pressure) column was chosen as the target. All variables used are continuous.

Step 1. Choosing variables for sampling from dataset

First, 10 variables were selected from the original dataset: 3 as targets and the remaining 7 as predictors.

Predictors:

  • T85: continuous. T at 850 hPa level (or about 1500 m height)
  • U70: continuous. U wind - east-west direction wind at 700 hPa
  • HT50: continuous. Geopotential height at 500 hPa, it is about the same as height at low altitude
  • T70: continuous. T at 700 hPa level
  • T8: continuous. 8th measured temperature of a day
  • V85: continuous. V wind - north-south direction wind at 850 hPa
  • KI: continuous. K-Index

Target:

  • T_AV: continuous. Average T
  • T_PK: continuous. Peak T
  • T0: continuous. First measured temperature of a day

Step 2. Sampling of chosen target variables

In the first part of the second step, the target values were sampled using the inverse transform method. Figure 1 shows histograms of the target value distributions: blue represents the original data from the dataset, and orange represents the values generated by inverse transform sampling.

Figure 1 - results of sampling target values using inverse transform method (blue - original data, orange - sampled data)
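The inverse transform step can be illustrated with a minimal sketch. It is not the code used in this repository: it assumes the CDF is the empirical one built from the observed values, and the `original` list is a synthetic stand-in for a temperature column, not data from the ozone dataset. The function name `inverse_transform_sample` is hypothetical.

```python
import random

def inverse_transform_sample(data, n, rng):
    """Draw n values by inverting the empirical CDF of `data`."""
    xs = sorted(data)
    m = len(xs)
    samples = []
    for _ in range(n):
        u = rng.random()            # u ~ Uniform(0, 1)
        pos = u * (m - 1)           # fractional index into the sorted data
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        # Linear interpolation between neighbouring order statistics
        samples.append(xs[lo] + frac * (xs[hi] - xs[lo]))
    return samples

rng = random.Random(0)
# Synthetic stand-in for a target column such as T_AV (assumption)
original = [rng.gauss(20.0, 5.0) for _ in range(2000)]
generated = inverse_transform_sample(original, 2000, rng)
```

Because each draw is an interpolated quantile of the observed values, the generated sample never leaves the range of the original data, which is one reason the two histograms tend to overlap closely.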

In the second part of the second step, the target values were sampled using the accept-reject method. This required choosing a function h(x) similar in shape to the target value distribution t(x) and multiplying it by a scale factor M so that the target distribution lies completely below M·h(x). Figure 2 shows the target distributions and the functions chosen for the accept-reject method.

Figure 2 - target distribution values t(x) (blue) and chosen functions h(x) (orange); 'M' in the title stands for the scale factor value

Figure 3 shows histograms of the distributions of the initial and generated target values.

Figure 3 - results of sampling target values using accept-reject method
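
The accept-reject procedure can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the target density t(x) is a Beta(2, 5) density used as a stand-in, the proposal h(x) is Uniform(0, 1), and M = 2.5 is chosen because the stand-in density peaks at about 2.4576 (at x = 0.2), so M·h(x) stays above t(x) everywhere.

```python
import random

def target_pdf(x):
    # Stand-in for t(x): Beta(2, 5) density on [0, 1] (assumption)
    return 30.0 * x * (1.0 - x) ** 4

def proposal_pdf(x):
    # Stand-in for h(x): Uniform(0, 1) density (assumption)
    return 1.0

M = 2.5  # scale factor; must satisfy M * h(x) >= t(x) for every x

def accept_reject(n, rng):
    """Draw n values from target_pdf using proposals from Uniform(0, 1)."""
    accepted = []
    while len(accepted) < n:
        x = rng.random()            # candidate drawn from the proposal
        u = rng.random()            # uniform height under M * h(x)
        if u * M * proposal_pdf(x) <= target_pdf(x):
            accepted.append(x)      # keep candidates that fall under t(x)
    return accepted

rng = random.Random(1)
samples = accept_reject(5000, rng)
```

The closer M·h(x) hugs t(x), the fewer candidates are rejected; with a normalized proposal the expected acceptance rate is 1/M.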

Step 3. Estimation of relations between predictors and chosen target variables

At the third step, it was necessary to assess the relationship between target and predictor values. For this, a heatmap (Figure 4) was built with the corresponding values of the correlation between all the considered quantities.

Figure 4 - heatmap of absolute correlation coefficients for all considered values
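
The values behind such a heatmap can be computed as in the sketch below. It assumes the Pearson coefficient (the text does not name the coefficient used), and the three short columns are made-up numbers for illustration only, not values from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def abs_corr_matrix(columns):
    """Nested dict of absolute pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: abs(pearson(columns[a], columns[b])) for b in names}
            for a in names}

# Made-up illustrative values (assumption), keyed by column names from Step 1
columns = {
    "T_AV": [10.0, 12.0, 15.0, 18.0, 21.0],
    "T85": [2.0, 3.5, 6.0, 8.0, 11.0],
    "KI": [30.0, 12.0, 25.0, 8.0, 20.0],
}
matrix = abs_corr_matrix(columns)
```

Taking the absolute value keeps strong negative relationships as visible in the heatmap as strong positive ones.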

Step 4. Building a Bayesian network for a chosen set of variables. Structure is based on multivariate analysis.

At the fourth step, a Bayesian network was built based on a multivariate analysis of the selected features. The data for analysis was taken from the previous step, namely from the heatmap with correlation values. Figure 5 shows a graph that reflects the structure of the constructed Bayesian network.

Figure 5 - Graph of Bayesian network with a structure based on multivariate analysis

Step 5. Building a Bayesian network for the same set of variables using 2 algorithms for structural learning

The fifth step was to build a Bayesian network using algorithms for structural learning. The first network was built using the Hill Climb algorithm with the K2 score. Figure 6 shows a graph that reflects the structure of the constructed Bayesian network.

Figure 6 - Graph of BN built using the Hill Climb algorithm and K2

The second network was built using the evolutionary algorithm with the MI score. Figure 7 shows a graph that reflects the structure of the constructed Bayesian network.

Figure 7 - Graph of BN built using the evolutionary algorithm and MI
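
To give a feel for the MI score, the sketch below estimates pairwise mutual information from binned data. It assumes MI stands for mutual information and that continuous variables are discretized into equal-width bins; it does not reproduce the evolutionary search itself, and the function names and synthetic data are hypothetical.

```python
import math
import random
from collections import Counter

def discretize(values, n_bins):
    """Map continuous values to equal-width bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # fall back to 1.0 for a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(x_bins, y_bins):
    """Plug-in estimate of mutual information (in nats) from paired bin indices."""
    n = len(x_bins)
    joint = Counter(zip(x_bins, y_bins))
    px, py = Counter(x_bins), Counter(y_bins)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

rng = random.Random(2)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y_dependent = [v + 0.1 * rng.gauss(0.0, 1.0) for v in x]    # nearly a copy of x
y_independent = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # unrelated to x

xb = discretize(x, 5)
mi_dependent = mutual_information(xb, discretize(y_dependent, 5))
mi_independent = mutual_information(xb, discretize(y_independent, 5))
```

A structure-learning score built on this quantity rewards edges between variables that share a lot of information, so `mi_dependent` should come out clearly larger than `mi_independent`.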

Step 6. Analyzing quality of sampled target variables from the point of view of synthetic generation

At the sixth step, the quality of the sampled target variables was analyzed by building histograms of the initial and generated data. Figures 8 and 9 show histograms of the original and synthetically generated target values.

Figure 8 - Results of sampling target values using Bayesian network made with multivariate analysis

Figure 9 - Results of sampling target values using Bayesian network made with structure-learning methods
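
A histogram comparison of this kind can be sketched as below. The shared binning mirrors what overlaid histograms show; the overlap coefficient is an addition for illustration and is not a metric reported in this work. All data here are synthetic stand-ins, and the function names are hypothetical.

```python
import random

def shared_histograms(original, generated, n_bins=20):
    """Relative-frequency histograms of both samples on common bin edges."""
    lo = min(min(original), min(generated))
    hi = max(max(original), max(generated))
    width = (hi - lo) / n_bins or 1.0   # fall back to 1.0 if all values are equal

    def freq(values):
        counts = [0] * n_bins
        for v in values:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
        return [c / len(values) for c in counts]

    return freq(original), freq(generated)

def overlap(p, q):
    """Overlap coefficient: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(p, q))

rng = random.Random(3)
original = [rng.gauss(20.0, 5.0) for _ in range(2000)]   # stand-in for real data
close = [rng.gauss(20.0, 5.0) for _ in range(2000)]      # well-matched synthetic sample
shifted = [rng.gauss(35.0, 5.0) for _ in range(2000)]    # poorly matched synthetic sample

overlap_close = overlap(*shared_histograms(original, close))
overlap_shifted = overlap(*shared_histograms(original, shifted))
```

A single number like this can complement the visual comparison when several sampling methods need to be ranked against each other.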

Conclusion

In this work, several sampling methods were investigated and implemented for the ozone dataset. Comparing the histograms of the original and generated data showed that each method produced high-quality synthetic data for the dataset under consideration.
