V4 Flashcards
descriptive statistics
definition statistics
statistics is the study of the collection, analysis, interpretation, presentation, and organisation of data
univariate analysis
single variable
bivariate analysis
two variables (analysis of the relationship between them)
name the different means
- arithmetic mean (used the most)
- geometric mean (used when the numbers are more dispersed)
- harmonic mean
geometric mean
- a geometric mean is often used when comparing different items that have different numeric ranges
- it gives a single “figure of merit” for these items
- it normalises the ranges, so no single measurement dominates and all measurements count “equally”
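Base R has no built-in geometric mean, so the idea above can be sketched with a small helper. The name geometric_mean and the two ratings are made up for illustration; this is a minimal sketch that assumes all values are positive.

```r
# Hypothetical helper (not part of base R): geometric mean of positive values.
geometric_mean <- function(x) {
  stopifnot(all(x > 0))   # the log-based form is only defined for positive values
  exp(mean(log(x)))       # equivalent to prod(x)^(1 / length(x))
}

# Two made-up ratings on very different numeric ranges.
ratings <- c(4, 900)

mean(ratings)            # arithmetic mean: 452, dominated by the larger range
geometric_mean(ratings)  # 60, i.e. sqrt(4 * 900)
```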
harmonic mean
- used in situations involving rates and ratios, where it provides the truest average
- for example speeds such as 60 km/h
- the variable being averaged over needs to be part of the ratio (for example the km in km/h)
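A rough sketch of the harmonic mean for rates. The helper name harmonic_mean and the second speed (40 km/h) are assumptions added for illustration, and the example assumes both legs cover the same distance.

```r
# Hypothetical helper (not part of base R): harmonic mean of non-zero values.
harmonic_mean <- function(x) {
  length(x) / sum(1 / x)
}

# Speeds in km/h over two legs of equal distance (made-up numbers).
speeds <- c(60, 40)

harmonic_mean(speeds)  # 48: the true average speed over the whole trip
mean(speeds)           # 50: the arithmetic mean overstates it
```

The harmonic mean is the right choice here because the distance (km) is the fixed part of the ratio; if the time per leg were fixed instead, the arithmetic mean would apply.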
important arithmetic mean parameters
trim - argument of mean(): the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed -> removes extreme outliers
na.rm - argument of mean(): a logical value indicating whether NA values should be stripped before the computation proceeds
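A short illustration of how the two arguments change the result of mean(). The vector x is made up for the example.

```r
# Made-up data with one extreme value and one missing value.
x <- c(2, 3, 4, 5, 100, NA)

mean(x)                             # NA: a missing value propagates by default
mean(x, na.rm = TRUE)               # 22.8: NA dropped, outlier still included
mean(x, trim = 0.2, na.rm = TRUE)   # 4: lowest and highest 20% dropped first
```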
Median
- the number separating the higher half of the data from the lower half (robust to outliers, so it is advised to always report it)
- can be found by arranging all values from lowest to highest and picking the middle one
- in case of an even number of values, the median is then usually defined to be the arithmetic mean of the two middle values
- code : median()
important median parameters
na.rm - a logical value indicating whether NA values should be stripped before the computation proceeds
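The odd and even cases, and the effect of na.rm, can be checked with median() directly. The vectors below are made up for illustration.

```r
odd_values  <- c(7, 1, 5)      # sorted: 1 5 7
even_values <- c(7, 1, 5, 3)   # sorted: 1 3 5 7

median(odd_values)    # 5: the middle value
median(even_values)   # 4: the mean of the two middle values, 3 and 5

with_na <- c(even_values, NA)
median(with_na)                 # NA
median(with_na, na.rm = TRUE)   # 4
```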
mode
- value that appears most often in a set of data
- numerical value of mode is same as that of the mean and median in a perfect normal distribution (Gaussian distribution)
- but will be very different in highly skewed distributions
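Base R's mode() reports the storage mode of an object, not the statistical mode, so a helper is needed. The function stat_mode below is a hypothetical name and only one possible sketch; it returns every value tied for the highest count.

```r
# Hypothetical helper: the most frequent value(s) in x.
stat_mode <- function(x) {
  unique_values <- unique(x)
  counts <- tabulate(match(x, unique_values))  # occurrences of each unique value
  unique_values[counts == max(counts)]
}

# Made-up, right-skewed data.
skewed <- c(1, 2, 2, 3, 50)

stat_mode(skewed)  # 2
median(skewed)     # 2
mean(skewed)       # 11.6: pulled far away from the mode by the skew
```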
dispersion
- range of a data set is known as the difference between the min and max
- range() - gives the min and max values
- diff(range(x)) - gives the range of the input (max minus min)
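A quick check of the two calls on a made-up vector:

```r
x <- c(4, 9, 1, 7)

range(x)          # 1 9: the minimum and the maximum
diff(range(x))    # 8: the range as a single number
max(x) - min(x)   # 8: the same thing written out
```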
Quantiles
- dividing ordered data into q essentially equal-sized data subsets is the motivation for q-quantiles
- median = the 2-quantile (the 0.5 quantile), which splits the data into two halves
- quantiles are the data values marking the boundaries between consecutive subsets
- quantile() in R offers 9 different types of quantile computation (argument type, default 7)
- quantile()
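The points above can be tried out as follows. The data are made up, and the comparison of type 1 and type 7 is only meant to show that the chosen type can change the answer on small samples.

```r
x <- 1:9

quantile(x)                # quartiles by default: 1 3 5 7 9
quantile(x, probs = 0.5)   # 5: the same value as median(x)

# Two of the nine computation types on a small sample.
small <- c(1, 2, 3, 4)
q_type1 <- quantile(small, probs = 0.25, type = 1)   # 1: an actual data value
q_type7 <- quantile(small, probs = 0.25, type = 7)   # 1.75: interpolated (default)
```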
Variance
- variance measures the spread of a set of numbers
- a variance of zero indicates that all the values are identical
- variance is always non-negative
- var()
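A small sketch of what var() computes. Note that var() returns the sample variance (denominator n - 1); the vectors are made up for illustration.

```r
identical_values <- c(5, 5, 5)
var(identical_values)   # 0: no spread at all

x <- c(2, 4, 4, 6)
var(x)                                         # 8/3, about 2.67

# The same calculation by hand: squared deviations from the mean over n - 1.
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
```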
standard deviation
- a measure that is used to quantify the amount of the variation or dispersion of a set of data values
- a standard deviation close to 0 indicates that the data points tend to be very close to the mean; a standard deviation of exactly 0 means all values are identical
- sqrt(var(x)) or sd(x)
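To see how the standard deviation reflects spread around the same mean, compare two made-up vectors:

```r
tight <- c(9, 10, 11)   # values close to the mean of 10
wide  <- c(0, 10, 20)   # same mean, much more spread

sd(tight)   # 1
sd(wide)    # 10

# sd() is the square root of the sample variance.
sqrt(var(wide))   # 10
```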
Outliers
- an observation point that is distant from other observations
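- can be due to variability in the measurement
- can indicate a heavy-tailed distribution (high kurtosis)
- can be due to experimental error
- whether an observation is an outlier is subjective; one remedy is Winsorising -> adjusting the extreme data point to a less extreme value (for example the nearest percentile cut-off), or changing it to the mean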
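Winsorising can be sketched as clamping values to chosen percentile limits. The helper name winsorise, the default cut-offs and the data are all assumptions made for illustration; base R has no built-in function for this, and the right cut-offs depend on the data.

```r
# Hypothetical helper: clamp values below/above the given quantiles
# to those quantile values.
winsorise <- function(x, probs = c(0.05, 0.95)) {
  limits <- unname(quantile(x, probs = probs, na.rm = TRUE))
  pmin(pmax(x, limits[1]), limits[2])
}

# Made-up data with one extreme observation.
x <- c(1, 2, 3, 4, 100)

winsorised <- winsorise(x, probs = c(0.2, 0.8))
winsorised         # the extremes are pulled in, the middle values are unchanged
mean(x)            # 22: dominated by the outlier
mean(winsorised)   # much closer to the bulk of the data
```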