Section A.2: Univariate analysis Flashcards
Univariate Analysis
Univariate analysis is the simplest form of data analysis. It involves analysing one variable in a dataset to understand its characteristics
Dataset
A dataset is a multidimensional (multiple columns) heterogeneous (highly variable - many data types) data structure (a format for organising data)
Data structure
A specialised format for organising, processing, retrieving and storing data
Univariate data
Univariate data is data consisting of observations on a single characteristic or attribute. Because only one variable is involved, it says nothing about causes or relationships between variables.
Univariate data classifications (3)
1) ID - Used to uniquely identify a subject.
2) Numerical - Number based. Can be discrete or continuous
3) Categorical - Based on characteristics, can be ordinal or nominal.
Univariate analysis techniques (4)
Graphical
Tables
Descriptive statistics
Inferential statistics
Univariate data - Graphical analysis (3)
Histogram
Boxplot
Density curve
Histogram
A histogram displays the frequency of each value or group of values in numerical data
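The grouping step behind a histogram can be sketched in a few lines. This is a minimal illustration, assuming equal-width bins; the helper name `histogram_counts`, the `ages` sample and the bin width of 10 are hypothetical choices, not part of the flashcards.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter()
    for v in values:
        # Floor division maps a value to the lower edge of its bin.
        bin_start = (v // bin_width) * bin_width
        counts[bin_start] += 1
    # Sort by bin start so the bins read left to right.
    return dict(sorted(counts.items()))

ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 58]  # hypothetical sample
hist = histogram_counts(ages, bin_width=10)

# Print a rough text histogram: one '#' per observation in the bin.
for start, n in hist.items():
    print(f"{start}-{start + 10}: {'#' * n}")
```

Each bar's height is the bin's count, so changing `bin_width` changes the shape of the histogram without changing the data.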
Boxplot
A boxplot summarises data based on the 5-number summary: Minimum, First quartile (Q1), Median, Third quartile (Q3), and Maximum
It is beneficial for identifying outliers in data
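A rough sketch of how the 5-number summary can be computed and used to flag outliers. The 1.5 × IQR fences are a common boxplot convention assumed here, not something the flashcards specify; the function names and the `scores` sample are hypothetical, and quartile values depend on the method chosen ("inclusive" in this sketch).

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # quantiles(n=4) returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values):
    """Flag values beyond 1.5 * IQR from the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

scores = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical sample
summary = five_number_summary(scores)
outliers = iqr_outliers(scores)
print(summary)
print(outliers)
```

In this sample the value 30 lies far above the upper fence, so a boxplot drawn with this convention would show it as a separate point beyond the whisker.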
Bar chart
A frequency bar chart is a univariate chart that shows the frequency distribution of categorical data, with one bar per category
Pie charts
A frequency pie chart is a univariate chart that shows the frequency distribution of categorical data as “slices”, each indicating the share of one category
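Both charts are drawn from the same frequency table: bar heights use the counts, and pie slices use each category's share of the total. A minimal sketch, assuming a hypothetical `colours` sample:

```python
from collections import Counter

colours = ["red", "blue", "red", "green", "blue", "red"]  # hypothetical sample

# Bar chart input: number of observations per category.
freq = Counter(colours)

# Pie chart input: each category's share of all observations.
total = sum(freq.values())
shares = {category: count / total for category, count in freq.items()}

for category, count in freq.most_common():
    print(f"{category}: count={count}, share={shares[category]:.2f}")
```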
What are the two types of descriptive statistics?
Measure of central tendency - mean, median, mode
Measure of variability - range, IQR, variance, standard deviation etc.
What is descriptive statistics?
Descriptive statistics involves the generation of summary statistics from a sample of data, used to describe and gain insight into the features of the data set overall.
Define measures of central tendency
Measures of central tendency are statistical measures that use a single value to represent the central or typical value for a probability distribution. The three most common are mean, median, and mode.
Define measures of variability/dispersion
Measures of variability or dispersion are statistical measures that use a single value to represent how spread out the values in a data set are around the central point. Common univariate summary statistics for this are the range, interquartile range (IQR), quartiles, variance, and standard deviation.
Define: Mean
The mean is a univariate summary statistic (measure of central tendency) that is calculated as the sum of all data points divided by the number of data points.
Define: Median
The median is a univariate summary statistic (measure of central tendency) that is the middle-most value when values of data points are ordered by their magnitude. (Highest to lowest, or lowest to highest)
Define: Mode
The mode is a univariate summary statistic (measure of central tendency) that is the most commonly observed value in a distribution. A distribution of data can have 0 or more modes.
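The three measures of central tendency can be checked on a small sample using Python's standard `statistics` module. The `data` list is a hypothetical example; with an even number of points, `median` averages the two middle values, and `multimode` returns every value tied for the highest count.

```python
import statistics

data = [3, 5, 5, 6, 8, 9]  # hypothetical sample

# Mean: sum of all data points divided by the number of data points.
mean_value = statistics.mean(data)

# Median: middle value of the ordered data (average of the two
# middle values here, because the sample has an even length).
median_value = statistics.median(data)

# Mode(s): the most frequently observed value(s).
modes = statistics.multimode(data)

print(mean_value, median_value, modes)
```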
Define: Range
The range is a univariate summary statistic (measure of variability) that is the difference between the largest and smallest values in the data.
Define: Interquartile range
The interquartile range is the difference between the 75th (Q3) and 25th (Q1) percentiles of the data
Define: Quartile
A quartile is a quantile: one of the three cut points (Q1, Q2, Q3) that divide the data points into four equal-sized parts when ordered from lowest to highest
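Quartile values depend on the interpolation method, which is why different tools can report slightly different results for Q1, Q3 and the IQR on the same data. The sketch below compares the two methods offered by Python's `statistics.quantiles`; the `data` list is a hypothetical sample.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical sample

# "exclusive" (the default) treats the data as a sample from a larger
# population; "inclusive" treats the data as the whole population.
exclusive = statistics.quantiles(data, n=4, method="exclusive")
inclusive = statistics.quantiles(data, n=4, method="inclusive")

# IQR = Q3 - Q1 under each method.
iqr_exclusive = exclusive[2] - exclusive[0]
iqr_inclusive = inclusive[2] - inclusive[0]

print("exclusive:", exclusive, "IQR:", iqr_exclusive)
print("inclusive:", inclusive, "IQR:", iqr_inclusive)
```

The second quartile is the median under both methods; only Q1 and Q3, and therefore the IQR, differ.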
Define: Standard deviation
The standard deviation is the square root of the variance. It expresses how much the data points typically differ from the mean, in the same units as the data.
Define: Variance
The variance of the data is the average of the squared deviations from the mean.
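The definition can be followed step by step in code. This sketch divides by the number of data points (the population form, matching the definition above); the sample variance divides by n - 1 instead. The `data` list is a hypothetical example.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

# Step 1: the mean.
mean = sum(data) / len(data)

# Step 2: squared deviation of each point from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Step 3: variance = average of the squared deviations (population form).
variance = sum(squared_deviations) / len(data)

# Step 4: standard deviation = square root of the variance.
std_dev = math.sqrt(variance)

# Cross-check against the standard library's population variance.
library_variance = statistics.pvariance(data)

print(mean, variance, std_dev)
```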