Descriptive statistics are numbers that describe and summarize a dataset. They capture the basic features of the data under consideration and give a quick overview of it through simple summary measures. The most commonly used summary measures are measures of central tendency and measures of variability (dispersion).
Measures of central tendency include the mean, median and mode. Each of these measures summarizes a dataset with a single value that describes the center of its distribution. The mean and median are computed from the values themselves, while the mode is found from the frequency of each data point. Each gives a typical, or average, value for the dataset. They can be computed for an entire population or for a sample of that population.
Measures of variability or dispersion include the variance or standard deviation, coefficient of variation, minimum and maximum values, IQR (interquartile range), skewness and kurtosis. These measures help us analyze how spread out the distribution of a dataset is and what shape it has.
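As a rough illustration of a few of these measures, the sketch below computes the range, IQR and coefficient of variation with Python's standard `statistics` module. The sample values are made up, and `method="inclusive"` is an assumption chosen so that the quartiles are linearly interpolated between the observed data points.

```python
import statistics

# Hypothetical sample; the values are made up for illustration.
data = [2, 4, 4, 5, 7, 9]

# Range: distance between the minimum and maximum values.
data_range = max(data) - min(data)

# Quartiles via linear interpolation between observed data points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data

# Coefficient of variation: sample standard deviation relative to the mean.
cv = statistics.stdev(data) / statistics.mean(data)

print("range:", data_range)
print("IQR:", iqr)
print("coefficient of variation:", round(cv, 3))
```

The coefficient of variation is unitless, which makes it handy for comparing the spread of variables measured on different scales.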
The describe function summarizes numerical and categorical data differently. For numerical data, you'll get:
- count: Number of non-NA/null observations
- mean: The arithmetic average
- std: The standard deviation
- min: The smallest (minimum) value
- 25%: The first quartile (25th percentile)
- 50%: The median (50th percentile)
- 75%: The third quartile (75th percentile)
- max: The largest (maximum) value
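A minimal sketch of the numerical case, assuming the describe function referred to here is pandas' `Series.describe` (the statistic names match its output). The column name `score` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical numeric column with one missing value.
df = pd.DataFrame({"score": [2, 4, 4, 5, 7, 9, float("nan")]})

summary = df["score"].describe()
print(summary)

# Individual statistics can be looked up by label.
print("median:", summary["50%"])
print("non-missing observations:", summary["count"])
```

Note that the missing value is excluded from `count` and from every other statistic.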
But if you have categorical data, you'll see:
- count: Number of non-NA/null observations
- unique: Number of unique values
- top: The most common value (the mode)
- freq: The frequency of the most common value
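The categorical case can be sketched the same way, again assuming pandas' `Series.describe`. The `colors` values are made up, and the `object` dtype is set explicitly so the column is treated as categorical text.

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
colors = pd.Series(["red", "blue", "red", "green", "red", None], dtype="object")

summary = colors.describe()
print(summary)

# The mode and how often it occurs.
print("most common value:", summary["top"])
print("its frequency:", summary["freq"])
```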
Central tendency refers to a central value that describes a probability distribution. It may also be called the center or location of the distribution. The most common measures of central tendency are the mean, median and mode.
- The most common measure of central tendency is the mean.
- For skewed distributions, or when there is concern about outliers, the median may be preferred because it is a more robust measure than the mean.
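To see why the median is considered more robust, the sketch below compares the mean and median of a small sample before and after adding a single extreme value. It uses Python's standard `statistics` module, and the salary figures are hypothetical.

```python
import statistics

# Hypothetical salaries in thousands; the values are made up for illustration.
salaries = [30, 32, 35, 35, 38, 40]
with_outlier = salaries + [400]  # add one extreme value

print("mean:", statistics.mean(salaries))      # 35
print("median:", statistics.median(salaries))  # 35.0
print("mode:", statistics.mode(salaries))      # 35

# The outlier pulls the mean far upward, but the median does not move.
print("mean with outlier:", round(statistics.mean(with_outlier), 1))
print("median with outlier:", statistics.median(with_outlier))
```

One extreme value more than doubles the mean, while the median and mode stay at 35.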
Dispersion indicates how far from the center the data values tend to lie. The most common measures of dispersion are variance, standard deviation and interquartile range (IQR).
- Variance, the average of the squared deviations from the mean, is the standard measure of spread.
- The standard deviation is the square root of the variance.
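The relationship between the two can be checked with a short sketch using Python's standard `statistics` module. The data values are hypothetical; the population versions divide by n, while the sample versions divide by n - 1.

```python
import math
import statistics

# Hypothetical data with mean 5; the values are made up for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population variance and standard deviation (divide by n).
pop_var = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

# Sample variance and standard deviation (divide by n - 1).
sample_var = statistics.variance(data)
sample_std = statistics.stdev(data)

print("population variance:", pop_var)
print("population std:", pop_std)
print("sample variance:", round(sample_var, 4))
print("sample std:", round(sample_std, 4))

# In both cases the standard deviation is the square root of the variance.
print(math.isclose(sample_std, math.sqrt(sample_var)))
```

Because the standard deviation is in the same units as the data, it is usually easier to interpret than the variance.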
Now we will look at measures of the shape of a distribution. Two statistical measures, skewness and kurtosis, convey information about the shape of the distribution of a dataset.
- Skewness is a measure of a distribution's symmetry or, more precisely, its lack of symmetry.
- It describes how asymmetrically the data are distributed around the mean of the dataset.
- It is based on the deviations from the mean: a positive value indicates a longer right tail, and a negative value a longer left tail.
- It is used to indicate the shape of the distribution of the data.
The rules of thumb for skewness values are:
- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
- If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.
- If the skewness is less than -1 or greater than 1, the data are highly skewed.
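The rule of thumb above can be sketched as a small helper. The function name `describe_skew` and the sample values are hypothetical, and treating the boundary values 0.5 and 1 as belonging to the lower category is an assumption, since the rule does not say which side they fall on. Skewness is computed here with pandas' `Series.skew`, which returns a bias-corrected sample skewness.

```python
import pandas as pd


def describe_skew(skewness: float) -> str:
    """Classify a skewness value using the rule of thumb (hypothetical helper)."""
    magnitude = abs(skewness)
    if magnitude <= 0.5:  # boundary handling is an assumption
        return "fairly symmetrical"
    if magnitude <= 1:
        return "moderately skewed"
    return "highly skewed"


# Hypothetical samples; the values are made up for illustration.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 2, 2, 3, 20])  # one large value stretches the right tail

for name, sample in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    value = sample.skew()
    print(name, round(value, 2), describe_skew(value))
```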
- Kurtosis is commonly described as the degree of peakedness of a distribution; it also reflects how heavy the tails are.
- Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly and have heavy tails.
- Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.
- The reference standard is a normal distribution, which has a kurtosis of 3.
- Often, excess kurtosis is presented instead of kurtosis, where excess kurtosis is simply kurtosis - 3.
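As a rough check of the reference value, the sketch below computes kurtosis as the fourth central moment divided by the squared variance, using NumPy on simulated data. The helper name `kurtosis`, the seed and the sample sizes are assumptions made for illustration.

```python
import numpy as np


def kurtosis(values) -> float:
    """Plain (non-excess) kurtosis: fourth central moment over squared variance."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (population variance)
    m4 = np.mean(deviations ** 4)  # fourth central moment
    return float(m4 / m2 ** 2)


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
normal_sample = rng.normal(size=100_000)
uniform_sample = rng.uniform(size=100_000)  # flat-topped, light-tailed

k_normal = kurtosis(normal_sample)
k_uniform = kurtosis(uniform_sample)

print("normal kurtosis (close to 3):", round(k_normal, 2))
print("normal excess kurtosis (close to 0):", round(k_normal - 3, 2))
print("uniform kurtosis (below 3):", round(k_uniform, 2))
```

Be aware that pandas' `Series.kurt` and the default settings of `scipy.stats.kurtosis` report excess kurtosis, so a normal distribution scores about 0 there rather than 3.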
- Adam Hayes. (2023). Descriptive Statistics: Definition, Overview, Types, Example. Investopedia.
- Prashant Banerjee. (2019). Descriptive Statistics with Python. GitHub Gist.
- Simranjeet Singh. (2023). The Ultimate Guide to Statistics: Part 1 - Descriptive Statistics. Towards AI.