Finding patterns in complex datasets lets us emphasize the aspects we think are important.
When classifying data, descriptive statistics can be used to inform the choice of classification.
| Operation       | Nominal | Ordinal | Interval | Ratio |
|-----------------|---------|---------|----------|-------|
| Equality        | x       | x       | x        | x     |
| Counts/Mode     | x       | x       | x        | x     |
| Rank/Order      |         | x       | x        | x     |
| Median          |         | ~       | x        | x     |
| Add/Subtract    |         |         | x        | x     |
| Mean            |         |         | x        | x     |
| Multiply/Divide |         |         |          | x     |
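The table above can be illustrated with Python's built-in `statistics` module. A minimal sketch; the variable names and example values are hypothetical:

```python
import statistics

# Hypothetical example values for each level of measurement
land_use = ["residential", "commercial", "residential", "park"]  # nominal
severity = [1, 3, 2, 2, 3]                                       # ordinal (ranked codes)
temps_c = [10.5, 12.0, 9.8, 11.2]                                # interval (no true zero)
incomes = [42000, 55000, 61000, 38000]                           # ratio (true zero)

statistics.mode(land_use)    # mode is valid at every level
statistics.median(severity)  # median needs at least ordinal data
statistics.mean(temps_c)     # mean needs at least interval data
incomes[1] / incomes[0]      # ratios only make sense for ratio data
```

Note the operations nest: anything valid for nominal data is also valid for ordinal, interval, and ratio data, and so on down the table.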
Highlight the central feature in a dataset.
Give context to measures of central tendency.
Most useful for categorical data.
A useful tool to inspect numeric data.
Probability of occurrence of values in a quantitative dataset.
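A frequency distribution can be sketched with the standard library alone. The bin width and sample values below are hypothetical:

```python
from collections import Counter

# Hypothetical measurements; bin into equal-width classes
values = [2.1, 3.4, 3.9, 5.0, 5.2, 6.8, 7.1, 7.3, 8.9, 9.5]
bin_width = 2.0

# Count how many values fall in each bin
bins = Counter(int(v // bin_width) * bin_width for v in values)
total = len(values)
for lower in sorted(bins):
    share = bins[lower] / total  # relative frequency ~ probability of occurrence
    print(f"{lower:4.1f}-{lower + bin_width:4.1f}: {'#' * bins[lower]}  ({share:.0%})")
```

Dividing each count by the total converts raw frequencies into the empirical probability of a value landing in that bin.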
Data rarely fits a normal distribution perfectly:
Near Normal
Skewed Normal
Highly Skewed
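One way to put numbers on these three shapes is moment-based sample skewness. A sketch with hypothetical thresholds (0.5 and 1.0 are common rules of thumb, not fixed standards):

```python
import statistics

def skewness(data):
    # Moment-based sample skewness: mean of cubed z-scores.
    # Assumes the data has nonzero spread.
    mu = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(((x - mu) / sd) ** 3 for x in data) / len(data)

def describe_shape(data, mild=0.5, strong=1.0):
    # Hypothetical cutoffs for labeling the distribution's shape
    g = skewness(data)
    if abs(g) < mild:
        return "near normal"
    if abs(g) < strong:
        return "skewed normal"
    return "highly skewed"
```

A symmetric sample like `[1, 2, 3, 4, 5]` has skewness 0, while a sample with one extreme value, such as `[1, 1, 1, 1, 10]`, is pulled strongly to the right.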
Often helpful to scale or aggregate a value by a unit of another value, e.g. time, area, or population.
Allows us to account for confounding variables that mask or hide patterns in our data.
Highly correlated
No correlation
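Correlation can be quantified with Pearson's r, which runs from -1 (perfect negative) through 0 (no linear relationship) to 1 (perfect positive). A self-contained sketch with made-up values:

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient: covariance / (sd_x * sd_y)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
pearson_r(x, [2, 4, 6, 8, 10])  # exactly 1: highly correlated
pearson_r(x, [2, 5, 1, 4, 3])   # close to 0: little to no correlation
```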
It isn't always straightforward to include multiple confounding variables. For example: COVID rates by age groups.
Can also allow us to compare between two or more variables in different units / scales.
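Both ideas can be sketched in a few lines: normalizing by population turns raw counts into comparable rates, and min-max scaling puts variables with different units on a common 0-1 scale. The area names, case counts, and populations are hypothetical:

```python
# Hypothetical case counts and populations for three areas
areas = {"A": (120, 40000), "B": (95, 12000), "C": (300, 150000)}

# Normalizing by population: B has the highest rate despite the fewest cases
rates = {k: cases / pop * 1000 for k, (cases, pop) in areas.items()}

def min_max(values):
    # Rescale any numeric variable onto a common 0-1 range
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

Here area C has the most raw cases but the lowest rate per 1,000 residents, which is exactly the kind of pattern raw counts hide.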
Unsupervised: an algorithm determines the class breaks from the data itself.
Supervised: the user defines the class breaks.
Vancouver dissemination area populations
One of the simplest classification schemes.
Another of the simplest classification schemes.
Slightly more complex classification scheme.
More complex: data is split using the Jenks natural breaks algorithm.
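The simpler schemes can be sketched in a few lines; Jenks itself, which minimizes within-class variance, is more involved. A sketch of equal-interval and quantile breaks, using hypothetical population values:

```python
import statistics

def equal_interval_breaks(values, k):
    # k classes of identical width between the min and max
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]

def quantile_breaks(values, k):
    # k classes with (roughly) equal counts of observations
    return statistics.quantiles(values, n=k)

pops = [120, 340, 560, 780, 1500, 2200, 4100, 9800]
equal_interval_breaks(pops, 4)  # evenly spaced break values
quantile_breaks(pops, 4)        # breaks follow the data's distribution
```

On skewed data like these populations, equal-interval breaks leave most observations in the lowest class, while quantile breaks spread them evenly.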
Informative to "experts", not accessible for all.
Supervised: User defines break values.
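Once break values are chosen (by the user or by an algorithm), assigning each observation to a class is a simple lookup. A sketch using the standard `bisect` module; the break values are hypothetical:

```python
from bisect import bisect_right

def classify(value, breaks):
    # Returns the class index (0..len(breaks)) for a value,
    # given an ascending list of break points
    return bisect_right(breaks, value)

breaks = [500, 1500, 5000]  # hypothetical user-chosen break values
[classify(v, breaks) for v in [120, 900, 1500, 8000]]
```

Note `bisect_right` puts a value equal to a break into the class above it; use `bisect_left` for the opposite convention.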
There are many classification methods that are too complex to perform by hand in this course.
The algorithm uses random starting points to group data into clusters.
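This describes k-means style clustering: pick random starting centroids, then alternate between assigning points to the nearest centroid and recomputing each centroid as its cluster's mean. A one-dimensional sketch with made-up values:

```python
import random

def kmeans_1d(data, k, iters=20, seed=0):
    # Randomly pick k starting centroids, then alternate
    # assign-to-nearest and update-centroid steps
    rng = random.Random(seed)
    centroids = rng.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

kmeans_1d([1, 2, 3, 50, 51, 52, 100, 101], 3)
```

Because the starting centroids are random, different seeds can converge to different clusterings; real implementations typically run several restarts and keep the best.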
Used for automated detection of outliers.
Fits training data to user-defined categories.
Multiple trees (>100) can be averaged to increase performance and generalization.
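The averaging idea can be sketched without a machine-learning library: draw bootstrap samples, fit one very simple "tree" (here just a one-split stump, a deliberate simplification) per sample, and average all their predictions. Every name and value below is hypothetical:

```python
import random
import statistics

def train_stump(sample):
    # A "tree" here is just a one-split regressor: split at the sample
    # median of x, predict the mean y on each side of the split
    xs, ys = zip(*sample)
    split = statistics.median(xs)
    left = [y for x, y in sample if x <= split] or list(ys)
    right = [y for x, y in sample if x > split] or list(ys)
    return lambda x, s=split, l=statistics.fmean(left), r=statistics.fmean(right): (
        l if x <= s else r)

def forest_predict(data, x, n_trees=100, seed=0):
    # Bootstrap-sample the data, fit one stump per sample,
    # then average the predictions across all trees
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]
        preds.append(train_stump(sample)(x))
    return statistics.fmean(preds)

data = [(x, 2 * x) for x in range(10)]
forest_predict(data, 8.0)  # ensemble average, smoother than any single stump
```

No single stump predicts well, but the average over many bootstrapped stumps is more stable, which is the intuition behind averaging large ensembles of trees.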
One of the most complex methods; capable of making predictions.