Statistics Fundamentals — Quick Overview
Descriptive Statistics
Describes our collected data
Analyzing Quantitative Data
Four aspects of analyzing Quantitative data:
- Measures of Center
- Measures of Spread
- Shape of the data
- Outliers
Histograms are used to visualize quantitative data. A histogram shows the frequency distribution: how many observations fall into each range (bin) of values.
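As a rough illustration of what a histogram counts, the sketch below bins a small made-up dataset and prints a text histogram. The `ages` values, the bin width of 10, and the helper name `bin_counts` are all assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical sample: ages of 12 survey respondents.
ages = [21, 22, 22, 25, 27, 31, 33, 34, 38, 41, 45, 58]
BIN_WIDTH = 10  # assumed bin width; a different width gives a different picture


def bin_counts(values, width):
    """Count how many values fall into each [start, start + width) bin."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


histogram = bin_counts(ages, BIN_WIDTH)

# Print a simple text histogram: one '#' per observation in the bin.
for start, count in histogram.items():
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * count}")
```

The bar lengths are the frequencies, so the tallest bar marks the range where most observations fall.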
Measures of Center
- Mean : Average of values
- Median : Middle value of the sorted data; 50% of the values lie below it and 50% lie above it. The data must be sorted before the median is calculated
- Mode : Most frequently observed value in the dataset. A dataset can have several modes, or no mode at all
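The three measures of center above can be computed with Python's standard `statistics` module. This is a minimal sketch on a made-up `tickets` list; `statistics.multimode` needs Python 3.8 or later.

```python
import statistics

# Hypothetical dataset: number of daily support tickets over nine days.
tickets = [4, 8, 6, 5, 3, 8, 9, 5, 6]

mean_value = statistics.mean(tickets)      # sum of the values / number of values
median_value = statistics.median(tickets)  # sorts internally, then takes the middle value
modes = statistics.multimode(tickets)      # every value tied for most frequent

print("mean:", mean_value)
print("median:", median_value)
print("modes:", modes)
```

Here three values are tied for most frequent, so the dataset has multiple modes. When every value occurs exactly once, `multimode` returns all of them, which is usually read as "no mode".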
Which is the best measure of center?
It depends on the situation. The median is generally better when the data contains outliers, because it depends only on the order of the values and not on how extreme the outliers are. The mean, by contrast, is pulled toward extreme values.
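To see why the median is more robust, the sketch below adds a single extreme value to a made-up `salaries` list and compares how far each measure moves. The numbers are purely illustrative.

```python
import statistics

# Hypothetical monthly salaries (in thousands) for a small team.
salaries = [30, 32, 35, 36, 40]
with_outlier = salaries + [400]  # one extreme value added

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

The mean shifts by tens of thousands while the median barely moves, which is the behaviour described above.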
Measures of Spread
Measures of spread give us an idea of how far apart the data values are from one another.
- Range : Difference between maximum and minimum
- Interquartile Range (IQR) : Difference between Q3 and Q1
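- Standard Deviation : Typical distance of each observation from the mean (the square root of the variance)
- Variance : Average squared difference of each observation from the mean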
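A minimal sketch of the four measures listed above, using the standard `statistics` module on a made-up `times` list. Quartile conventions differ between tools; `method="inclusive"` is one choice and is an assumption here. `pvariance` and `pstdev` are the population versions, which divide by the number of observations.

```python
import statistics

# Hypothetical data: delivery times in minutes.
times = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_range = max(times) - min(times)

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(times, n=4, method="inclusive")
iqr = q3 - q1

variance = statistics.pvariance(times)  # average squared difference from the mean
std_dev = statistics.pstdev(times)      # square root of the variance

print("range:", data_range)
print("IQR:", iqr)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```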
The most common ways to measure spread are the 5 Number Summary and the Standard Deviation.
For symmetric distributions, standard deviation is usually the best measure of spread. For asymmetric (skewed) data, the 5 Number Summary, typically drawn as a box plot, and histograms work better.
5 Number Summary:
- Minimum: Smallest number in the dataset
- Q1: 25% of the values fall below Q1
- Q2: 50% of the values fall below Q2 (the median)
- Q3: 75% of the values fall below Q3
- Maximum: Largest number in the dataset
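The five values above can be collected in one small helper. This is a sketch only: `five_number_summary` is a hypothetical name, the `readings` data is made up, and the quartiles use the "inclusive" interpolation method of `statistics.quantiles`, which may differ slightly from the values other tools or hand methods produce.

```python
import statistics


def five_number_summary(values):
    """Return minimum, Q1, median (Q2), Q3 and maximum of a dataset."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical sensor readings.
readings = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
summary = five_number_summary(readings)

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

A box plot is a drawing of exactly these five numbers: the box spans Q1 to Q3, with a line at the median.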
Standard Deviation: How far, on average, each observation lies from the mean.
The differences from the mean are squared because squaring emphasizes the extremes, whereas absolute differences give every deviation equal weight. Squaring also makes every term positive, so the positive and negative differences do not cancel out to zero.
Variance is average squared difference of each observation from the mean.
Standard Deviation is the square root of the variance. It gives a single number describing the spread of the data, unlike the 5 Number Summary.
Standard Deviation is used to compare the spread of two different groups. In contexts such as finance, a higher standard deviation is often associated with higher risk.
Standard deviation is used more often than variance because it is expressed in the same units as the data, while variance is expressed in squared units.
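The reasoning above can be traced step by step. The sketch below computes the deviations by hand on a made-up `heights` list and uses the population form (dividing by the number of observations), matching the "average squared difference" wording. The sample form, which divides by n - 1, is available as `statistics.variance` and `statistics.stdev`.

```python
import math
import statistics

# Hypothetical data: heights in centimetres.
heights = [150, 160, 170, 180, 190]

mean = sum(heights) / len(heights)
deviations = [x - mean for x in heights]

# Raw deviations cancel each other out, so their sum says nothing about spread.
sum_of_deviations = sum(deviations)

# Absolute deviations weight every observation equally.
mean_abs_deviation = sum(abs(d) for d in deviations) / len(deviations)

# Squared deviations are all positive and emphasize the extremes.
variance = sum(d ** 2 for d in deviations) / len(deviations)  # units: cm squared
std_dev = math.sqrt(variance)                                 # units: cm

print("sum of deviations:", sum_of_deviations)
print("mean absolute deviation:", mean_abs_deviation)
print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
```

The hand-computed result should agree with `statistics.pstdev(heights)`.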
Shape
We can identify the shape of our data from a histogram. Distribution of the data can be:
- Right-Skewed
- Left-Skewed
- Symmetric (for example, normally distributed)
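Besides reading the shape off a histogram, skew can be estimated numerically. The sketch below uses the moment coefficient of skewness (the average cubed z-score), which is one common measure and not something these notes prescribe. The function names, the sample lists, and the 0.5 cut-off are assumptions for illustration.

```python
import statistics


def skewness(values):
    """Population moment coefficient of skewness: the mean cubed z-score.

    Not defined for constant data (standard deviation of zero).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((x - mean) / std) ** 3 for x in values) / len(values)


def describe_shape(values, tolerance=0.5):
    # The 0.5 cut-off is an arbitrary illustrative threshold, not a standard.
    s = skewness(values)
    if s > tolerance:
        return "right-skewed"
    if s < -tolerance:
        return "left-skewed"
    return "roughly symmetric"


right = [1, 1, 2, 2, 3, 3, 4, 10, 20]    # long tail toward large values
left = [-x for x in right]               # mirror image: long tail toward small values
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]

for name, data in [("right", right), ("left", left), ("symmetric", symmetric)]:
    print(name, "->", describe_shape(data))
```

In a right-skewed dataset the mean is typically pulled above the median, and in a left-skewed dataset below it.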
Outliers
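Outliers are data points that lie far away from the rest of the values. When outliers are present, describe the data with the 5 Number Summary rather than the mean and Standard Deviation, since both of those are strongly affected by extreme values.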
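One widely used convention for flagging outliers from the quartiles is the 1.5 x IQR rule (Tukey's fences). The notes do not name a specific rule, so treat the multiplier `k=1.5`, the helper name `find_outliers`, and the `scores` data as assumptions of this sketch.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical test scores with one suspicious entry.
scores = [52, 55, 57, 58, 60, 61, 63, 64, 66, 120]

print("outliers:", find_outliers(scores))
```

Whether a flagged value should be removed, corrected, or kept is a judgment call that depends on how the data was collected.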
Inferential Statistics
Uses our collected data to draw conclusions for a larger population.
Population: Entire group of interest
Parameter: Numeric summary of population
Sample: Subset of population
Statistic: Numeric summary of sample
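The four terms above can be made concrete with a simulation. In this sketch the population is generated artificially (heights drawn from a normal distribution with assumed mean 170 and standard deviation 10), and the sample size of 200 is arbitrary. In practice the population is rarely fully observed, which is why the parameter has to be estimated from a sample.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Population: the entire group of interest (simulated here).
population = [rng.gauss(170, 10) for _ in range(10_000)]
parameter = statistics.fmean(population)  # numeric summary of the population

# Sample: a subset of the population, drawn without replacement.
sample = rng.sample(population, 200)
statistic = statistics.fmean(sample)      # numeric summary of the sample

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The statistic lands close to the parameter but not exactly on it; inferential statistics is about quantifying that gap.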
Machine Learning and Artificial Intelligence aim to use collected data to draw conclusions about the entire population at an individual level.