Statistics Fundamentals — Quick Overview

Chetna Shahi

3 min read · Oct 5, 2021

Image Courtesy: https://studyonline.unsw.edu.au/blog/types-of-data

Descriptive Statistics

Describes our collected data

Analyzing Quantitative Data

Four aspects of analyzing Quantitative data:

Measures of Center
Measures of Spread
Shape of the data
Outliers

Histograms are used to visualize quantitative data by showing its frequency distribution.
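As a rough illustration of a frequency distribution, here is a minimal Python sketch that groups values into bins and prints a text histogram. The sample values and the bin width of 10 are made up for the example.

```python
from collections import Counter

# Made-up sample values, for illustration only
data = [12, 15, 21, 22, 25, 28, 31, 33, 34, 36, 38, 41, 47, 52]
BIN_WIDTH = 10  # assumed bin width

# Map each value to the lower edge of its bin, then count values per bin
counts = Counter((value // BIN_WIDTH) * BIN_WIDTH for value in data)

# Print one bar per bin; the bar length is the frequency
for start in sorted(counts):
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * counts[start]}")
```

The tallest bar marks the range where most observations fall.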

Measures of Center

Mean : Average of the values
Median : Middle value that splits the data into a lower 50% and an upper 50%. The data must be sorted before the median is calculated
Mode : Most frequently observed value in the dataset. A dataset can have multiple modes or no mode at all
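The three measures above can be sketched with Python's built-in statistics module. The data values are made up for the example.

```python
import statistics

# Made-up sample values, for illustration only
data = [2, 3, 3, 5, 7, 8, 8, 10]

mean = statistics.mean(data)        # sum of the values divided by their count
median = statistics.median(data)    # sorts internally, then takes the middle
modes = statistics.multimode(data)  # every value tied for most frequent

print(mean, median, modes)  # 5.75 6.0 [3, 8]
```

Here the dataset has two modes (3 and 8). When every value occurs equally often, multimode returns all of them, which is usually described as having no mode.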

Which is the best measure of center?

It depends on the situation. The median is generally better when the data has outliers, because it is not affected by the exact numerical values of the outliers.
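A small sketch of this effect, using made-up values: adding a single extreme value pulls the mean far away, while the median barely moves.

```python
import statistics

# Made-up values, for illustration only
values = [30, 32, 35, 38, 40]
with_outlier = values + [400]  # the same data plus one extreme value

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)           # 35 35
print(round(mean_after, 2), median_after)   # 95.83 36.5
```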

Measures of Spread

Measures of spread give us an idea of how far apart the data points are from one another.

Range : Difference between maximum and minimum
Interquartile Range (IQR) : Difference between Q3 and Q1
Standard Deviation : Typical distance of each observation from the mean
Variance : Average squared difference of each observation from the mean

The most common ways to measure spread are the 5 Number Summary and the Standard Deviation.

For symmetric data, the standard deviation is the best measure of spread, while for asymmetric data box plots and histograms work best.

5 Number Summary:

Minimum: Smallest number in the dataset
Q1: 25% of values fall below Q1
Q2: 50% of values fall below Q2 (the median)
Q3: 75% of values fall below Q3
Maximum: Largest number in the dataset
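The five numbers, along with the range and IQR, can be sketched as below. The data is made up, and five_number_summary is a hypothetical helper name. Quartile conventions differ between textbooks and tools; this sketch assumes the "inclusive" method of Python's statistics.quantiles, so other tools may give slightly different Q1 and Q3.

```python
import statistics

def five_number_summary(data):
    """Return the five-number summary plus range and IQR (hypothetical helper)."""
    # n=4 splits the data into quartiles; "inclusive" is one of several conventions
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": min(data),
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "max": max(data),
        "range": max(data) - min(data),
        "iqr": q3 - q1,
    }

# Made-up sample values, for illustration only
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]
summary = five_number_summary(data)
print(summary)
```

For this data the quartiles come out to 4.0, 7.0 and 10.0, so the IQR is 6.0 and the range is 14.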

Standard Deviation: How far, on average, the data points lie from the mean.

The differences from the mean are squared because squaring emphasizes the extremes, whereas absolute differences give equal weight to every deviation. Squaring also makes every term positive, so the deviations do not cancel out to zero when summed.
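A quick sketch of both points, with made-up data: the raw deviations from the mean sum to zero, and squaring gives the most extreme observation a larger share of the total than absolute differences do.

```python
# Made-up observations, for illustration only (their mean is 5)
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
deviations = [x - mean for x in data]

sum_raw = sum(deviations)                # positives and negatives cancel out
abs_devs = [abs(d) for d in deviations]  # absolute differences
sq_devs = [d ** 2 for d in deviations]   # squared differences

# Share of the total contributed by the single most extreme observation (9)
share_abs = max(abs_devs) / sum(abs_devs)  # 4 / 12
share_sq = max(sq_devs) / sum(sq_devs)     # 16 / 32

print(sum_raw, round(share_abs, 3), share_sq)  # 0.0 0.333 0.5
```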

Variance is the average squared difference of each observation from the mean.

Standard Deviation is the square root of the variance. It gives a single number showing the spread of the data, unlike the 5 Number Summary.
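The two definitions can be sketched step by step in Python. The data is made up, and the sketch assumes the population form (dividing by the number of observations), which matches "average squared difference". The sample form divides by n - 1 instead and is available as statistics.variance and statistics.stdev.

```python
import math
import statistics

# Made-up observations, for illustration only
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]

variance = sum(squared_diffs) / len(data)  # average squared difference
std_dev = math.sqrt(variance)              # square root of the variance

# Cross-check against the standard library's population versions
assert math.isclose(variance, statistics.pvariance(data))
assert math.isclose(std_dev, statistics.pstdev(data))

print(variance, std_dev)  # 4.0 2.0
```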

Standard Deviation is used to compare the spread of two different groups. In contexts such as finance, a higher standard deviation is associated with higher risk.

Standard deviation is used more often than variance because it is in the same units as the dataset, while variance is in squared units.

Shape

We can identify the shape of our data from a histogram. Distribution of the data can be:

Right-Skewed
Left-Skewed
Symmetric (for example, Normally Distributed)
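Besides reading a histogram, the shape can be checked numerically. The sketch below is an addition to the original notes: it computes the moment coefficient of skewness (positive for a right tail, negative for a left tail). The helper names, the sample data and the 0.5 cutoff are all assumptions chosen for illustration.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

def describe_shape(data, cutoff=0.5):
    """Label the shape; the cutoff is an arbitrary, assumed threshold."""
    g = skewness(data)
    if g > cutoff:
        return "right-skewed"
    if g < -cutoff:
        return "left-skewed"
    return "roughly symmetric"

# Made-up datasets, for illustration only
right = [1, 2, 2, 3, 3, 3, 4, 10]  # long tail towards large values
left = [1, 7, 8, 8, 9, 9, 9, 10]   # long tail towards small values
symmetric = [1, 2, 3, 4, 5]

for name, values in [("right", right), ("left", left), ("symmetric", symmetric)]:
    print(name, describe_shape(values))
```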

Outliers

Outliers are data points that lie far away from the rest of the values. When the data has outliers, use the 5 Number Summary approach instead of the Standard Deviation or mean.
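One common way to flag outliers from the quartiles is the 1.5 × IQR rule (Tukey's fences). This rule is not from the notes above; it is a widely used convention, sketched here with made-up data and a hypothetical find_outliers helper. The quartiles again assume the "inclusive" method of statistics.quantiles.

```python
import statistics

def find_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

# Made-up sample values, for illustration only
data = [1, 3, 4, 6, 7, 8, 10, 12, 15, 60]
outliers = find_outliers(data)
print(outliers)  # [60]
```

With these values Q1 is 4.5 and Q3 is 11.5, so the fences sit at -6.0 and 22.0 and only 60 falls outside them.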

Inferential Statistics

Uses our collected data to draw conclusions about a larger population.

Population: Entire group of interest

Parameter: Numeric summary of population

Sample: Subset of population

Statistic: Numeric summary of sample
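The four terms can be tied together with a small simulation. Everything here is assumed for illustration: a made-up population of 10,000 heights, a fixed random seed, and a sample of 100. In practice the parameter is usually unknown, and the statistic is what we use to estimate it.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Population: the entire group of interest (simulated heights in cm)
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: numeric summary of the population
parameter = statistics.mean(population)

# Sample: a subset of the population, drawn without replacement
sample = rng.sample(population, 100)

# Statistic: numeric summary of the sample
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The sample mean will generally be close to, but not exactly equal to, the population mean.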

Machine Learning and Artificial Intelligence aim to use collected data to draw conclusions about the entire population at the level of each individual.

References:

Udacity

classroom.udacity.com