Exploratory data analysis
Jona-Dervishi edited this page Dec 11, 2019 · 11 revisions
- Dataframe contains 1449 rows, where each row corresponds to a building
- Dataframe contains 6 columns (site_id, building_id, primary_use, square_feet, year_built, floor_count)
- year_built and floor_count contain missing values encoded as NaN: 774 and 1094 respectively
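The shape and per-column NaN counts listed above can be obtained with a few pandas calls. The sketch below uses a tiny made-up frame with the same column names; the values and the `buildings` variable name are assumptions for illustration, not the real building metadata.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the building metadata; all values are invented.
buildings = pd.DataFrame({
    "site_id": [0, 0, 1, 1],
    "building_id": [0, 1, 2, 3],
    "primary_use": ["Education", "Office", "Education", "Office"],
    "square_feet": [7432, 2720, 5376, 23685],
    "year_built": [2008.0, np.nan, 1991.0, np.nan],
    "floor_count": [np.nan, np.nan, 2.0, np.nan],
})

# (rows, columns) of the frame
print(buildings.shape)

# Number of NaN entries per column, keeping only columns that have any
missing_per_column = buildings.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

Running the same two calls on the real frame should reproduce the row, column and NaN counts reported in the list.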
Try imputing missing values using:
- sklearn
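import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must precede the IterativeImputer import
from sklearn.impute import SimpleImputer, IterativeImputer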
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
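imp = SimpleImputer(missing_values=np.nan, strategy='mean')
Here SimpleImputer stands in for whichever imputer is being tested. The strategies tried were mean, median, most_frequent, zero, knn, mice and hot-deck. SimpleImputer itself only accepts 'mean', 'median', 'most_frequent' and 'constant' as strategy values (zero presumably corresponds to strategy='constant' with fill_value=0), so the remaining strategies need a different imputer class.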
To validate each imputer we used:
from sklearn.model_selection import cross_val_score
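One way such a validation could look is sketched below: each imputer is placed in a pipeline with a downstream regressor and scored with `cross_val_score`. The synthetic data, the 30% missing rate, the choice of `BayesianRidge` as the downstream model and the `neg_mean_squared_error` scoring are all assumptions; the page does not say which target or metric was used.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)

# Synthetic regression data (hypothetical, not the building metadata).
X_full = rng.normal(size=(200, 4))
y = X_full @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# Blank out roughly 30% of the entries to mimic missing values.
X_missing = X_full.copy()
X_missing[rng.uniform(size=X_full.shape) < 0.3] = np.nan

scores = {}
for strategy in ("mean", "median", "most_frequent"):
    # Imputation happens inside the pipeline, so it is refit on every fold.
    model = make_pipeline(
        SimpleImputer(missing_values=np.nan, strategy=strategy),
        BayesianRidge(),
    )
    cv_scores = cross_val_score(
        model, X_missing, y, scoring="neg_mean_squared_error", cv=5
    )
    scores[strategy] = cv_scores.mean()

for strategy, score in scores.items():
    print(f"{strategy}: mean negative MSE = {score:.3f}")
```

Keeping the imputer inside the pipeline means its statistics are learned only from the training part of each fold, which avoids leaking information from the held-out part.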
- fancyimpute
from fancyimpute import KNN
KNN from fancyimpute works fine for smaller datasets, but it is not suitable for larger ones: it builds an adjacency matrix over all pairs of rows, so memory use grows quadratically with the number of rows and exhausts the allotted resources.
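A rough back-of-the-envelope estimate shows why the matrix becomes a problem. The sketch assumes a dense square matrix with one 8-byte float per pair of rows; the actual storage used by fancyimpute may differ, and `dense_matrix_bytes` is a hypothetical helper. The row counts are the two dataframe sizes reported on this page.

```python
def dense_matrix_bytes(n_rows: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value


# Row counts of the two dataframes described on this page.
for name, n_rows in [("building metadata", 1449), ("weather data", 139773)]:
    gib = dense_matrix_bytes(n_rows) / 2**30
    print(f"{name}: {n_rows} rows -> {gib:.2f} GiB")
```

Under these assumptions the 1449-row frame needs well under 1 GiB, while the 139773-row frame needs more than 100 GiB.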
- SimpleImputer is a univariate algorithm: it imputes values in the i-th feature dimension using only the non-missing values in that feature dimension.
- IterativeImputer is a multivariate imputation algorithm that uses the entire set of available feature dimensions to estimate the missing values.
- ExtraTreesRegressor fits a number of randomised decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
- BayesianRidge fits a Bayesian ridge model by optimising the regularisation parameters lambda (precision of the weights) and alpha (precision of the noise).
- KNeighborsRegressor predicts the target by local interpolation of the targets associated with the nearest neighbors in the training set.
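The regressors described above can each be passed to `IterativeImputer` through its `estimator` parameter. The sketch below is a minimal comparison on synthetic correlated features, scoring each estimator by the RMSE on entries that were deliberately blanked out. The data, the 20% missing rate, the hyperparameters and the masked-entry RMSE are assumptions for illustration; this is not the cross-validation procedure used on this page.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)

# Three noisy copies of one latent signal, so features are correlated.
base = rng.normal(size=(150, 1))
X_full = np.hstack(
    [base + rng.normal(scale=0.1, size=(150, 1)) for _ in range(3)]
)

# Blank out roughly 20% of the entries and remember where they were.
mask = rng.uniform(size=X_full.shape) < 0.2
X_missing = X_full.copy()
X_missing[mask] = np.nan

estimators = {
    "BayesianRidge": BayesianRidge(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(n_estimators=10, random_state=0),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
}

errors = {}
for name, estimator in estimators.items():
    imputer = IterativeImputer(estimator=estimator, max_iter=5, random_state=0)
    X_imputed = imputer.fit_transform(X_missing)
    # Compare imputed values with the known originals on the masked cells.
    errors[name] = np.sqrt(np.mean((X_imputed[mask] - X_full[mask]) ** 2))
    print(f"{name}: RMSE on masked entries = {errors[name]:.3f}")
```

With `max_iter=5` scikit-learn may emit a convergence warning; raising `max_iter` trades run time for a better chance of convergence.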
- Dataframe consists of 139773 rows and 9 columns (site_id, timestamp, air_temperature, cloud_coverage, dew_temperature, precip_depth_1_hr, sea_level_pressure, wind_direction, wind_speed)
Missing data per feature:
Feature | Number of missing values |
---|---|
air_temperature | 55 |
cloud_coverage | 69173 |
dew_temperature | 113 |
precip_depth_1_hr | 50289 |
sea_level_pressure | 10618 |
wind_direction | 6268 |
wind_speed | 304 |
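To put these counts in proportion, they can be divided by the 139773 rows of the dataframe. The snippet below does only that arithmetic on the numbers from the table; the variable names are illustrative.

```python
n_rows = 139773  # rows in the weather dataframe

# Missing-value counts copied from the table above.
missing_counts = {
    "air_temperature": 55,
    "cloud_coverage": 69173,
    "dew_temperature": 113,
    "precip_depth_1_hr": 50289,
    "sea_level_pressure": 10618,
    "wind_direction": 6268,
    "wind_speed": 304,
}

# Share of rows with a missing value, per feature.
missing_share = {
    feature: count / n_rows for feature, count in missing_counts.items()
}

# Print features from most to least affected.
for feature, share in sorted(
    missing_share.items(), key=lambda item: item[1], reverse=True
):
    print(f"{feature:<20} {share:6.1%}")
```

cloud_coverage and precip_depth_1_hr stand out: close to half and over a third of their values are missing, respectively, while air_temperature, dew_temperature and wind_speed are each missing in well under 1% of rows.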