V4 Flashcards

descriptive statistics

1
Q

definition statistics

A

statistics is the study of the collection, analysis, interpretation, presentation, and organisation of data

2
Q

univariate analysis

A

single variable

3
Q

bivariate analysis

A

two variables (and the relationship between them)

4
Q

name the different means

A
  • arithmetic mean (used the most)
  • geometric mean (for numbers that are more dispersed, e.g. on different scales)
  • harmonic mean
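
A minimal R sketch of the three means on a small made-up vector. Base R only provides mean(), so the geometric and harmonic means are computed by hand here; the variable names are illustrative, not built-ins.

```r
x <- c(1, 2, 4)  # made-up positive values

arithmetic_mean <- mean(x)
geometric_mean  <- exp(mean(log(x)))  # n-th root of the product; needs x > 0
harmonic_mean   <- 1 / mean(1 / x)    # reciprocal of the mean reciprocal; needs x != 0

# For positive data: arithmetic >= geometric >= harmonic
c(arithmetic = arithmetic_mean,
  geometric  = geometric_mean,
  harmonic   = harmonic_mean)
```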
5
Q

geometric mean

A
  • a geometric mean is often used when comparing different items whose properties have different numeric ranges
  • it gives a single “figure of merit” for these items
  • it normalises the ranges, so the measurements count “equally” and no single range dominates
6
Q

harmonic mean

A

used in situations involving rates and ratios, where it provides the truest average

  • for example 60 km/h
  • > the variable needs to be part of the ratio (for example km)
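
A small worked example in R of why the harmonic mean suits rates, assuming a trip driven out and back over the same distance at two made-up speeds. The distance of 120 km is arbitrary; it cancels out.

```r
speeds <- c(60, 40)  # km/h on the way out and on the way back (made-up values)

# Harmonic mean of the two speeds
harmonic_mean <- length(speeds) / sum(1 / speeds)

# Cross-check: total distance divided by total time
distance      <- 120                     # km per leg (arbitrary)
total_time    <- sum(distance / speeds)  # hours: 2 h out + 3 h back
average_speed <- (2 * distance) / total_time

harmonic_mean   # matches average_speed
mean(speeds)    # the arithmetic mean overstates the true average speed
```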
7
Q

important arithmetic mean parameters

A

trim - fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed -> removes extreme outliers
na.rm - logical indicating whether NA values should be stripped before the computation proceeds
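
A short sketch of both arguments on a made-up vector with one missing value and one extreme value:

```r
x <- c(1, 2, 3, 4, 100, NA)  # made-up data

plain   <- mean(x)                            # NA, because x contains an NA
no_na   <- mean(x, na.rm = TRUE)              # mean of 1, 2, 3, 4, 100
trimmed <- mean(x, trim = 0.2, na.rm = TRUE)  # drops 20% from each end: mean of 2, 3, 4
```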

8
Q

Median

A
  • the number separating the higher half of data from the lower half (advised to always use this)
  • can be found by arranging all values from lowest to highest and picking the middle one
  • in case of an even number of values, the median is then usually defined to be the arithmetic mean of the two middle values
  • code : median()
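
A quick illustration with made-up vectors, covering the odd case, the even case, and missing values:

```r
odd_case  <- median(c(3, 1, 2))                 # sorted: 1 2 3 -> middle value
even_case <- median(c(4, 1, 3, 2))              # sorted: 1 2 3 4 -> mean of 2 and 3
with_na   <- median(c(1, NA, 3), na.rm = TRUE)  # NA is dropped first
```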
9
Q

important parameters median

A

na.rm - logical indicating whether NA values should be stripped before the computation proceeds

10
Q

mode

A
  • value that appears most often in a set of data
  • the mode has the same numerical value as the mean and median in a perfect normal (Gaussian) distribution
  • but it can be very different from them in highly skewed distributions
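
Base R has no function for the statistical mode (mode() returns the storage mode of an object), so a small helper is needed. The sketch below uses a hypothetical helper, stat_mode(), that returns the first most frequent value; ties are not reported.

```r
# Hypothetical helper: most frequent value in x (first one wins on ties)
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

x <- c(1, 2, 2, 3, 3, 3, 10)  # made-up, right-skewed data

stat_mode(x)  # most frequent value
median(x)     # middle value
mean(x)       # pulled upwards by the 10
```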
11
Q

dispersion

A
  • the range of a data set is the difference between the min and max
  • range() - gives the min and max values
  • diff(range(x)) gives the range of the input
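
For example, on a made-up vector:

```r
x <- c(4, 9, 1, 7)  # made-up data

min_max <- range(x)        # the smallest and largest value
spread  <- diff(range(x))  # max minus min
```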
12
Q

Quantiles

A
  • q-quantiles divide ordered data into q essentially equal-sized subsets
  • median = 2-quantile (the 0.5 quantile)
  • quantiles are the data values marking the boundaries between consecutive subsets
  • 9 different types of quantile computation (type argument of quantile())
  • quantile()
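
A brief sketch of quantile() on made-up data, showing the default quartiles, the link to the median, and how the type argument can change the result:

```r
x <- 1:10  # made-up data

quartiles <- quantile(x)  # 0%, 25%, 50%, 75%, 100% with the default type = 7

# The 0.5 quantile is the median
q50 <- unname(quantile(x, probs = 0.5))

# Different computation types can give different answers for the same data
q25_type7 <- unname(quantile(x, probs = 0.25, type = 7))  # interpolates between values
q25_type1 <- unname(quantile(x, probs = 0.25, type = 1))  # always returns an observed value
```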
13
Q

Variance

A
  • variance measures the spread of a set of numbers
  • a variance of zero indicates that all the values are identical
  • variance is always non-negative
  • var()
14
Q

standard deviation

A
  • a measure used to quantify the amount of variation or dispersion of a set of data values
  • a standard deviation close to 0 indicates data points tend to be very close to the mean (exactly 0 means all values are identical)
  • sqrt(var(x)) or sd(x)
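
A small check on made-up data that var() and sd() fit together as described. Note that both use the sample formula with n - 1 in the denominator.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up data with mean 5

v <- var(x)  # sum of squared deviations (32) divided by n - 1 (7)
s <- sd(x)   # square root of the variance

constant_var <- var(rep(3, 5))  # identical values -> variance of zero
```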
15
Q

Outliers

A
  • an observation point that is distant from other observations
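  • can be due to variability in the measurement
  • can indicate a heavy-tailed distribution (high kurtosis)
  • can be due to experimental error
  • is an observation an outlier? Subjective. Winsorising -> replace extreme data points with the nearest non-extreme value (e.g. a chosen percentile) instead of dropping them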
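
One common way to winsorise is to clamp values to chosen percentiles. The sketch below assumes that variant; winsorise() is a hypothetical helper, not a base R function, and the 10%/90% limits are arbitrary.

```r
# Hypothetical helper: clamp x to the given lower and upper quantiles
winsorise <- function(x, probs = c(0.05, 0.95)) {
  limits <- unname(quantile(x, probs = probs, na.rm = TRUE))
  pmin(pmax(x, limits[1]), limits[2])
}

x <- c(1:9, 100)                         # made-up data with one extreme value
w <- winsorise(x, probs = c(0.1, 0.9))

range(x)  # original extremes
range(w)  # extremes pulled in to the 10% and 90% quantiles
```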
16
Q

skewness

A
  • measure of the asymmetry of a distribution
  • long tail, fat tail
  • in R: library(psych) -> skew(x)
17
Q

Kurtosis

A
  • measure of the “peakedness” of the distribution
  • Mesokurtic - zero excess kurtosis
  • leptokurtic - positive excess kurtosis (higher)
  • platykurtic - negative excess kurtosis (lower)
  • in R: library(psych) -> kurtosi()
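
To see what these measures compute without installing psych, here is a hand-rolled base R sketch using the plain moment formulas. The helper names are hypothetical, and psych's skew() and kurtosi() may apply a different small-sample variant, so results can differ slightly.

```r
# Moment-based skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)   # made-up, symmetric
right_skewed <- c(1, 1, 1, 2, 10)  # made-up, long right tail

skewness(symmetric)         # zero for symmetric data
skewness(right_skewed)      # positive for a long right tail
excess_kurtosis(symmetric)  # negative: flatter than a normal distribution
```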
18
Q

when is distribution important ?

A
  • many models assume a normal distribution of the input data or the error term
  • Shapiro-Wilk test of normality (5% significance level) - shapiro.test(x); p > 0.05 -> no evidence against a normal distribution
  • > always good to plot/check distribution with other factors
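
A minimal sketch of the test on simulated data. The sample and the seed are made up; with real data, also inspect a histogram or Q-Q plot.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only
x <- rnorm(100, mean = 0, sd = 1)  # simulated normal sample

res <- shapiro.test(x)
res$p.value                        # compare against the 0.05 level

# TRUE means: no evidence against normality at the 5% level
looks_normal <- res$p.value > 0.05
```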
19
Q

plot()

A
  • create a plot window and call points() on x
20
Q

points()

A
  • add points (dots/lines) for the data in x to the current plot
21
Q

axis()

A

axis(side, at = where, labels = what)

22
Q

image()

A
  • useful when you have a matrix

  • create a heat map (2D plot) of the data

23
Q

boxplot()

A
  • produce box-and-whisker plots
  • notch - if the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ
  • varwidth - boxes are drawn with widths proportional to the sqrt of # of observations of group
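
A short sketch of both options on simulated groups of different sizes. The data frame d and its columns are made up, and pdf(NULL) is used only so the example runs without writing a file.

```r
set.seed(1)  # arbitrary seed
d <- data.frame(
  group = rep(c("a", "b"), times = c(30, 120)),
  value = c(rnorm(30, mean = 10), rnorm(120, mean = 12))
)

pdf(NULL)  # throwaway graphics device
b <- boxplot(value ~ group, data = d,
             notch = TRUE,     # draw notches around the medians
             varwidth = TRUE)  # box width proportional to sqrt(n)
invisible(dev.off())

b$n     # observations per group
b$conf  # lower and upper notch limits per group
```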
24
Q

histogram

A
  • hist()
  • freq - TRUE plots counts (frequencies), FALSE plots densities
  • breaks - breakpoints between histogram cells
  • plot - should a histogram be plotted
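
A small sketch of breaks and plot = FALSE on made-up data. The returned object holds both the counts (what freq = TRUE would draw) and the densities (what freq = FALSE would draw).

```r
x <- c(1, 1, 3, 5, 5, 5, 9)  # made-up data

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

h$breaks   # cell boundaries: 0 2 4 6 8 10
h$counts   # observations per cell
h$density  # counts / (n * cell width); areas sum to 1
```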
25
Q

heatmap()

A
  • plot a matrix, rows that are similar are put together

  • image plot combined with automated clustering and side colours (groups)

26
Q

par()

A

used to set or query graphical parameters
options - append .lab or .axis (e.g. cex.lab, cex.axis) to specify what the setting applies to
- cex -> magnification
- family, font
- lab, las, lty -> axis tick/label style and line type
- mai, mar, mex -> different types of margins
- mfrow, mfcol -> multiple plots in matrix shapes
- pch -> plotting symbol (point type)
- xaxs, xaxt -> control the x-axis

27
Q

steps in plotting (coding)

A
# create an output device (size in pixels)
png("my.png", width = 100, height = 100)
# set up parameters with par(); small margins so the plot fits the 100 x 100 device
par(mar = c(2, 2, 1, 1))
# create an empty plot; type = 'n' draws the axes but no data
plot(x = c(0, 100), y = c(0, 100), type = 'n')
# add points/lines/arrows (mdata holds your data)
points(mdata)
# save the plot to the png
dev.off()
28
Q

make multiple plots simple (code)

A
par(mfrow = c(2,2)) # 4 plots in 2 rows 2 col
# create 4 separate plots after
29
Q

more complex multiple plots

A
  • layout()
  • layout.show() to get a preview

nf
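
A minimal sketch of layout() with a preview. The layout matrix is made up (one wide plot on top, two below), nf is just a variable holding the value returned by layout(), and pdf(NULL) is used only so the example runs without writing a file.

```r
pdf(NULL)  # throwaway graphics device

# Figure 1 spans the top row; figures 2 and 3 share the bottom row
nf <- layout(matrix(c(1, 1,
                      2, 3), nrow = 2, byrow = TRUE))
layout.show(nf)  # preview: outlines and numbers of the figure regions

# Each subsequent high-level plot fills the next region
plot(1:10, main = "top")
hist(c(1, 2, 2, 3, 3, 3), main = "bottom left")
boxplot(c(1, 2, 3, 4, 10), main = "bottom right")

invisible(dev.off())
```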

30
Q

apply and subset example means

A

apply(subset(mdata, select = c("col1", "col2")), 2, mean)

31
Q

get means for a certain number of columns and use a column filter

A

apply(subset(mdata, Temp < 25, select = c("col1", "col2")), 2, mean)
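
To try both calls, here is a sketch with a small made-up mdata; the values are hypothetical, only the column names follow the cards above.

```r
# Hypothetical data frame with the columns used on the cards
mdata <- data.frame(
  Temp = c(20, 22, 30, 28),
  col1 = c(1, 2, 3, 4),
  col2 = c(10, 20, 30, 40)
)

# Column means (MARGIN = 2) over all rows
all_means <- apply(subset(mdata, select = c("col1", "col2")), 2, mean)

# Column means over the rows with Temp < 25 only
cool_means <- apply(subset(mdata, Temp < 25, select = c("col1", "col2")), 2, mean)
```

For plain numeric columns, colMeans() on the same subset gives the same result.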