Exploratory data analysis

Jona-Dervishi edited this page Dec 11, 2019 · 11 revisions

Building Metadata

  • Dataframe contains 1449 rows, where each row corresponds to a building
  • Dataframe contains 6 columns (site_id, building_id, primary_use, square_feet, year_built, floor_count)
  • year_built and floor_count contain missing values encoded as NaN: 774 (about 53% of rows) and 1094 (about 76% of rows), respectively
    Try imputing missing values using:
  1. sklearn
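    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
    from sklearn.impute import SimpleImputer, IterativeImputer
    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.linear_model import BayesianRidge
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.neighbors import KNeighborsRegressor

    # Strategies tried: mean, median, most_frequent, zero, knn, mice, hot-deck.
    # SimpleImputer itself accepts only 'mean', 'median', 'most_frequent' or 'constant'.
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')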
    To validate each imputer we used:
    from sklearn.model_selection import cross_val_score

    Scikit-learn documentation

  2. fancyimpute
    from fancyimpute import KNN
    KNN from fancyimpute works fine for smaller datasets, but it is not suitable for larger ones: it builds an adjacency matrix over the rows, which exhausts the allotted resources.
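
One way to compare the imputers listed above with cross_val_score is to put each imputer in a pipeline with a downstream regressor and score the pipeline. The sketch below is only an illustration under stated assumptions: the data is synthetic, the target `y` is hypothetical, and the choice of BayesianRidge as the downstream model and of negative MSE as the score are assumptions, not necessarily what was used for this project.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)

# Synthetic stand-in for the building table (assumption, not the real data)
n = 300
square_feet = rng.uniform(1_000, 100_000, n)
floor_count = np.clip(np.round(square_feet / 20_000 + rng.normal(0, 0.5, n)), 1, None)
year_built = rng.randint(1900, 2018, n).astype(float)
X = np.column_stack([square_feet, floor_count, year_built])
# Hypothetical target that depends on the features
y = 0.01 * square_feet + 50 * floor_count + rng.normal(0, 100, n)

# Remove roughly half of floor_count and year_built to mimic the missingness
X_missing = X.copy()
X_missing[rng.rand(n) < 0.5, 1] = np.nan
X_missing[rng.rand(n) < 0.5, 2] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "zero": SimpleImputer(strategy="constant", fill_value=0),
    "iterative_bayesian_ridge": IterativeImputer(
        estimator=BayesianRidge(), random_state=0),
    "iterative_extra_trees": IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
        random_state=0),
    "iterative_knn": IterativeImputer(
        estimator=KNeighborsRegressor(n_neighbors=5), random_state=0),
}

# Score each imputer by the cross-validated error of the full pipeline
scores = {}
for name, imputer in imputers.items():
    model = make_pipeline(imputer, BayesianRidge())
    cv_scores = cross_val_score(
        model, X_missing, y, scoring="neg_mean_squared_error", cv=5)
    scores[name] = cv_scores.mean()

# Higher (closer to zero) is better for negative MSE
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:26s} mean neg-MSE = {score:.1f}")
```

Fitting the imputer inside the pipeline means it is re-fitted on each training fold, so no information from the held-out fold leaks into the imputed values.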

Traits of used imputers

  • SimpleImputer is univariate: it imputes values in the i-th feature dimension using only the non-missing values in that feature dimension.
  • IterativeImputer is a multivariate imputation algorithm that uses the entire set of available feature dimensions to estimate the missing values.
  • ExtraTreesRegressor fits a number of randomised decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
  • BayesianRidge fits a Bayesian ridge model by optimising the regularisation parameters lambda (precision of the weights) and alpha (precision of the noise).
  • KNeighborsRegressor predicts the target by local interpolation of the targets associated with the nearest neighbors in the training set.
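
The difference between univariate and multivariate imputation can be seen on a toy array. The sketch below uses made-up numbers (the second column is exactly twice the first) and the default IterativeImputer estimator. It is meant only to illustrate the two behaviours.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import SimpleImputer, IterativeImputer

# Toy data (made up): column 1 is 2x column 0, with one value missing
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, np.nan],
              [5.0, 10.0]])

# Univariate: fills the gap with the mean of the observed values in column 1
simple = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate: regresses column 1 on column 0 and predicts the missing entry
iterative = IterativeImputer(random_state=0).fit_transform(X)

print("SimpleImputer fill:   ", simple[3, 1])     # mean of 2, 4, 6, 10 = 5.5
print("IterativeImputer fill:", iterative[3, 1])  # expected to land near 8
```

SimpleImputer ignores the relationship between the columns, while IterativeImputer exploits it, which matters when features such as square_feet and floor_count are correlated.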

Weather dataframe

  • Dataframe consists of 139773 rows and 9 columns (site_id, timestamp, air_temperature, cloud_coverage, dew_temperature, precip_depth_1_hr, sea_level_pressure, wind_direction, wind_speed)
    Missing data per feature:
| Feature | Number of missing values |
| --- | --- |
| air_temperature | 55 |
| cloud_coverage | 69173 |
| dew_temperature | 113 |
| precip_depth_1_hr | 50289 |
| sea_level_pressure | 10618 |
| wind_direction | 6268 |
| wind_speed | 304 |
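
Per-feature counts like the ones above can be produced with pandas. The sketch below runs on a tiny made-up frame that reuses a few of the weather column names. The frame contents and the `share` column are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with a few of the weather columns (not the real data)
weather = pd.DataFrame({
    "site_id": [0, 0, 1, 1],
    "air_temperature": [25.0, np.nan, 19.4, 21.1],
    "cloud_coverage": [np.nan, np.nan, 4.0, np.nan],
    "wind_speed": [0.0, 1.5, np.nan, 3.6],
})

# Number of NaNs per column, plus the fraction of rows affected
missing = weather.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "share": (missing / len(weather)).round(3),
})

print(summary.sort_values("missing", ascending=False))
```

Looking at the share alongside the raw count makes it easier to judge which features are sparse enough to need more than a simple fill.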