R Details Flashcards

1
Q

How can you join vectors together?

A

Use c() with the names of the vectors you want to join, e.g. for girls and boys:
> children = c(girls, boys)

2
Q

How do you check the length of a vector?

A

> length(vectorname)

3
Q

When joining vectors, what is key to remember?

A

Don’t put + signs between the vectors, only commas (+ adds the values element by element instead of joining the vectors)
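A minimal sketch of the difference between commas and + signs. The values for girls and boys are made up for illustration only.

```r
# Hypothetical example vectors (not from the original cards)
girls <- c(2, 4, 6)
boys <- c(1, 3, 5)

children <- c(girls, boys)  # commas inside c() join the vectors end to end
added <- girls + boys       # + adds matching positions instead

length(children)  # six elements
length(added)     # still three elements
```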

4
Q

How do you extract particular elements/numbers from a vector or data set?

A

> nameofvector[1]
The square brackets tell R which position in the vector you want to be shown
A range of elements is written with a colon:
> nameofvector[1:7]
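As a small illustration of both forms, assuming a made-up vector called heights:

```r
# Hypothetical data for illustration
heights <- c(150, 152, 148, 160, 155, 149, 158, 162)

first <- heights[1]          # the single element in position 1
first_seven <- heights[1:7]  # the elements in positions 1 to 7
```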

5
Q

How do you see a vector without certain elements?

A

> nameofvector[-1]
Minus the first element

6
Q

Maximum value of the vector

A

> max(vectorname)

7
Q

How do you work out whether any elements of a vector match a given number?

A

> which(vectorname == 7)
Gives the positions of the elements that match (here, those equal to 7)
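A short sketch with a made-up vector called ages, showing that which() reports positions rather than the values themselves:

```r
# Hypothetical data for illustration
ages <- c(5, 7, 9, 7, 3)

positions <- which(ages == 7)  # positions of the matching elements
any_match <- any(ages == 7)    # TRUE if at least one element matches
```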

8
Q

Change the name of a vector

A

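> newname = nameofvector
This copies the vector under the new name; the old name remains until you remove it with rm(nameofvector)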

9
Q

How to calculate the sum of all elements

A

> sum(vector)

10
Q

Mean of elements

A

> mean(vector)

11
Q

Median of elements

A

> median(vector)

12
Q

Variance of elements

A

> var(vector)

13
Q

Standard deviation

A

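> std = function(x) sqrt(var(x))
> std(vector)
This defines your own function for the standard deviation (the square root of the variance); R also has a built-in sd(vector) that gives the same result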
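A quick check, on made-up scores, that a hand-written function based on the square root of var() agrees with R's built-in sd():

```r
# Hand-written standard deviation: square root of the variance
std <- function(x) sqrt(var(x))

# Hypothetical data for illustration
scores <- c(4, 8, 15, 16, 23, 42)

std(scores)
sd(scores)  # built-in equivalent
same <- isTRUE(all.equal(std(scores), sd(scores)))
```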

14
Q

Normality test example

A

Shapiro-Wilk test

15
Q

When should you use Shapiro-Wilk?

A

To test the null hypothesis: the data are drawn from a normal population
The p-value is the probability of seeing data at least this far from normal if the null hypothesis were true (it is not the probability that our data are normal)
A low p-value, lower than 0.05 (5%), allows us to reject the null hypothesis, suggesting the data are not normal
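A minimal sketch of running the test in R with shapiro.test(). The data are simulated here purely for illustration, and the 0.05 cut-off is the one quoted on this card.

```r
set.seed(42)                                 # makes the simulated data repeatable
sample_data <- rnorm(50, mean = 10, sd = 2)  # hypothetical measurements

result <- shapiro.test(sample_data)
result$p.value                               # p-value of the normality test

# Reject the null hypothesis of normality only if p is below 0.05
reject_null <- result$p.value < 0.05
```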

16
Q

What do you do if your data is not Normal?

A

Calculate a non-parametric measure of data spread, e.g. the interquartile range
> IQR(vector)
Or
Median absolute deviation (MAD): this finds the median of the absolute differences from the median and then multiplies it by a constant (1.4826), which makes it comparable with the standard deviation for normal data
> mad(vector)
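The MAD description can be checked by hand. This sketch uses a made-up vector with one outlier and compares the manual calculation with mad():

```r
# Hypothetical data with one extreme value
x <- c(2, 4, 4, 5, 7, 9, 30)

IQR(x)  # interquartile range
mad(x)  # median absolute deviation, scaled by 1.4826

# The same calculation written out step by step
by_hand <- 1.4826 * median(abs(x - median(x)))
```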

17
Q

What is the code for summary and what does it show you?

A

> summary(vector)
Reports:
Minimum
Maximum
Median
Mean
1st quartile
3rd quartile
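For reference, a small sketch on made-up data. Note that R prints the six values in the order minimum, 1st quartile, median, mean, 3rd quartile, maximum.

```r
# Hypothetical data for illustration
x <- c(3, 7, 8, 12, 15, 21)

s <- summary(x)
s         # prints the six summary values
names(s)  # the labels R uses for them
```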

18
Q

How do you graphically show that random data is approx normal?

A

“Normal probability plot”
Any curving will show that the distribution has short or long tails
The line is drawn through points formed by the 1st and 3rd quartiles
> qqnorm(vector, main = "normal (0,1)")
> qqline(vector)

19
Q

What does a data transformation do?

A

Attempts to make the data approximately normal so that parametric stats can be applied
If the data can’t be transformed to normality, non-parametric stats have to be used

20
Q

Common data transformation process

A

Logarithm of the data - log(x+1)
>qqnorm(log(vector+1))
>qqline(log(vector+1))
Test if it worked with a normality test
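One way to run that check, sketched on simulated skewed data (the data and variable names are assumptions for illustration). The p-values before and after the transformation can then be compared against 0.05.

```r
set.seed(1)             # makes the simulated data repeatable
counts <- rlnorm(100)   # hypothetical right-skewed data

# Shapiro-Wilk p-value before and after the log(x + 1) transformation
p_raw <- shapiro.test(counts)$p.value
p_log <- shapiro.test(log(counts + 1))$p.value

c(raw = p_raw, transformed = p_log)
```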

21
Q

Bar charts in R

A

> barplot(vector)

22
Q

How to generate a more informative barplot

A

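> table(vector)
> barplot(table(vector))
table() counts how often each value occurs, so the barplot shows one bar per value rather than one bar per observation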

23
Q

How to change the scale on a barplot

A

> barplot(table(vector)/length(vector))
Dividing the counts by the number of observations changes the scale from counts to relative frequencies
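A sketch of the rescaling step on its own, assuming the intended divisor is the number of observations, length(vector). The data are made up; the resulting rel_freq object is what would be passed to barplot().

```r
# Hypothetical data: number of children per family
children <- c(0, 1, 1, 2, 2, 2, 3)

counts <- table(children)              # how often each value occurs
rel_freq <- counts / length(children)  # counts rescaled to proportions

rel_freq
sum(rel_freq)  # proportions add up to 1
```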

24
Q

How to add labels to a barplot

A

> labels = as.vector(c("one", "two", "three"))
> barplot(table(vector)/length(vector), names.arg = labels, xlab = "Number of children", ylab = "relative frequency")
names.arg labels the bars; xlab and ylab label the x and y axes

25
Q

Histogram code

A

> hist(vector)

26
Q

How to upload a larger data set

A

> dataset = read.table("name of file", header = TRUE)
> attach(dataset)
> dataset
Typing the name of the data set shows its contents
> summary(dataset)
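A self-contained sketch of the same steps. Because no real file is available here, it first writes a small made-up table to a temporary file; the column names height and weight are assumptions. It uses dataset$height instead of attach(), which avoids name clashes.

```r
# Write a tiny hypothetical data file so the example can run anywhere
path <- tempfile(fileext = ".txt")
writeLines(c("height weight", "150 45", "160 52", "155 49"), path)

dataset <- read.table(path, header = TRUE)  # header = TRUE: first row holds column names
dataset                                     # shows the whole data set
summary(dataset)                            # summary of every column

mean_height <- mean(dataset$height)         # access one column with $
```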

27
Q

Binomial or chi squared

A

Nominal or frequency data
2 categories

28
Q

Chi-squared

A

Nominal or frequency
More than 2 categories

29
Q

Pearson product moment / Spearman rank

A

Interval or ratio data and measures with a reasonably normal distribution

2 conditions

Testing hypotheses about:
Correlation - relationship between two dependent variables

30
Q

Simple linear regression

A

Interval or ratio data and measures with a reasonably normal distribution
2 conditions
Testing hypotheses:
Regression - effect of an independent variable upon a dependent variable

31
Q

T test (unpaired)

A

Interval or ratio data and measures with a reasonably normal distribution

2 conditions

Testing hypotheses about
Means
Independent measures design

32
Q

T test (paired)

A

Interval or ratio data and measures with a reasonably normal distribution

2 conditions

Testing hypothesis about- means
Matched measures or repeated measures designs

33
Q

Analysis of variance - ANOVA
Parametric

A

Interval or ratio data and measures with a reasonably normal distribution

More than 2 conditions

Testing hypotheses about - means
Difference between means
Null hypothesis = there is no significant difference between the means of the conditions

34
Q

Multiple linear regression

A

Interval or ratio data and measures with a reasonably normal distribution

More than 2 conditions

Testing hypotheses about - regression (effect of 2 or more independent variables upon a dependent variable)

35
Q

Spearman rank

A

Ordinal data or non-normal distribution of measure

2 conditions

Testing hypotheses - correlation - relationship between two dependent variables

36
Q

Mann- Whitney

A

Ordinal data or non-normal distribution of measure

2 conditions

Testing hypotheses - medians

Independent measures

37
Q

Wilcoxon

A

Ordinal data or non normal distribution of measure

2 conditions

Testing hypotheses about - medians
Repeated measures

38
Q

Kruskal-Wallis

A

Ordinal data or non-normal distribution of measure

More than 2 conditions

Non-parametric analysis of variance

Independent measures

39
Q

Friedman

A

Ordinal or non-normal distribution of measure

More than 2 conditions

Non- parametric analysis of variance

Repeated measures

40
Q

Continuous variable

A

Take on any value within a given range

There are an infinite number of possible values, limited only by our ability to measure them eg distance

41
Q

Discrete variable

A

Only certain distinct values within a given range

The scale is still meaningful, but you can’t have in-between values such as half numbers

42
Q

Categorical variable

A

One in which the value taken by the variable is a non numerical category or class

43
Q

Ranked variable

A

Is a categorical variable in which the categories imply some order or relative position

Numerical values are usually assigned, but 4 is not necessarily twice as much as 2

44
Q

How to set class intervals

A
  1. Use intervals of equal length with midpoints at convenient round numbers
  2. For small data sets, use a small number of intervals
  3. For large data sets, use more intervals
45
Q

Stem leaf plots

A

Allow a summary of the data while retaining the original values
1. The stem consists of a column of figures, omitting the last digit
2. Add the final digit of each value (e.g. each weight) to the row for its stem
3. Put the “leaves” in order
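R can draw a plot of this kind with the built-in stem() function. A minimal sketch on made-up weights:

```r
# Hypothetical weights for illustration
weights <- c(52, 55, 58, 61, 63, 63, 67, 70, 74, 78)

# Prints the stems on the left of the bar and the ordered leaves on the right
stem(weights)
```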

46
Q

Interquartile range IQR

A

Based on the median

Divides the data into four equal groups and looks to see how far apart the extreme groups are

  1. Put the data in numerical order
  2. Find the overall median. Divide the data set into two subsets of equal size. If n for the whole set of data is odd, put the overall median in both subsets
  3. Find the median of the lower group. This is the first quartile
  4. Find the median for the upper group. This is the third quartile

Interquartile range is IQR=Q3-Q1
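The steps above can be sketched directly in R on made-up data. Note that the built-in IQR() uses a different default quartile rule, so for some data sets it may not give exactly the same answer as this hand method.

```r
# Hypothetical data (odd n, so the median goes into both halves)
x <- c(7, 1, 13, 5, 9, 3, 11)

sorted <- sort(x)                  # step 1: numerical order
n <- length(sorted)
half <- ceiling(n / 2)             # size of each half; includes the median when n is odd
lower <- sorted[1:half]            # step 2: lower subset
upper <- sorted[(n - half + 1):n]  #         upper subset

q1 <- median(lower)                # step 3: first quartile
q3 <- median(upper)                # step 4: third quartile
iqr_by_hand <- q3 - q1             # IQR = Q3 - Q1
```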

47
Q

What is a box whisker plot

A

A way to illustrate the IQR

A good way to demonstrate the differences between groups

48
Q

Standard deviation

A

A measure of spread around the mean
A bit like the average distance of the data from the mean

49
Q

Random variable?

A

Is the numerical outcome of a random experiment

50
Q

Binomial distribution

A

Variable is one with just two possible outcomes eg a single toss of a coin

  • one outcome = success and the other= failure
51
Q

What are the four attributes of the normal distribution?

A

Variously -
wide and flat
Or
Narrow and high

52
Q

Chi- squared test

A

Suitable for frequency data: counts of things
Do the number of individuals in different categories fit a null hypothesis of some sort (the expectation)

53
Q

Yates correction of 1df

A

Apply where there are only two categories of data (Eg. Male and female)

Subtract 0.5 from the absolute value of each O-E (i.e. ignoring the sign): |O-E| - 0.5
Continue rest of calculation as normal
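A worked sketch of the corrected calculation, written out by hand. The counts and the 1:1 expectation are made up for illustration.

```r
# Hypothetical counts in two categories
observed <- c(male = 30, female = 20)
expected <- rep(sum(observed) / 2, 2)          # null hypothesis: 1:1 ratio

# Yates correction: subtract 0.5 from each |O - E| before squaring
corrected <- abs(observed - expected) - 0.5
chi_sq <- sum(corrected^2 / expected)

# p-value from the chi-squared distribution with 1 degree of freedom
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)
```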

54
Q

Mann-Whitney test- detailed

A

Non parametric alternative to the unpaired t-test

Tests for a significant difference between the medians of two independent groups

Use this test when one or both groups have non-normal distribution

55
Q

Wilcoxon paired sample test- detailed

A

Non parametric alternative to the paired t-test

Tests for a significant difference between the medians of paired observations

Use this test when one or both groups have a non normal distribution (or cannot be induced to be normal)

56
Q

Kruskal-Wallis

A

Non-parametric one-way analysis of variance

Non-parametric alternative to one way ANOVA

57
Q

Friedman’s

A

Non-parametric alternative to two-way analysis of variance

Used to detect differences in medians between three or more treatments of the same subjects

Wide variations of the standard deviations for rows or columns of a data matrix suggest that we cannot use parametric ANOVA

58
Q

Parametric stats

A

Based on assumptions about the distributions of population from which the sample was taken

Evaluate hypotheses for a particular parameter usually the population mean
Quantitative data
Require assumptions about the distributional characteristics of the population distribution
- normal data
- equal variance

More powerful than non parametric test when assumptions are met

59
Q

Non parametric stats

A

Evaluate hypotheses for entire population distributions
Quantitative,ranked qualitative data

Require no assumptions about the shape of the distribution (distribution free), so are used with non-normal distributions and when the variances of the groups are not equal

Generally easy to compute

60
Q

List of parametric tests

A

Paired t test
Unpaired t test
Pearson correlation
ANOVA

61
Q

Non parametric test- examples

A

Wilcoxon rank sum test

Mann-Whitney U test

Spearman correlation

Kruskal Wallis test

Friedman

62
Q

Hierarchical clustering - what is it

A
  1. A way to find hierarchical patterns of similarity between sets of objects
  2. Not a test. There is no null hypothesis. No assumption about the distribution of the data
63
Q

When to use hierarchical clustering?

A

You have objects or things described by a large number of continuous or discrete variables
Some implementations also work with ordinal or categorical variables

Allows you to visualise this graphically (dendrogram or tree)

64
Q

Hierarchical clustering: three steps

A
  1. Data transformation (eg Z-scores)
  2. Matrix of similarities , differences or distances (eg Euclidean)
  3. Clustering algorithm (eg UPGMA, average neighbour)
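The three steps can be sketched with base R functions. The measurements are made up, and method = "average" in hclust() is used here as the UPGMA (average linkage) option.

```r
# Hypothetical objects described by two variables
measurements <- data.frame(
  length = c(10, 12, 30, 33, 11),
  width  = c(5, 6, 15, 14, 5)
)

z <- scale(measurements)               # step 1: transform to Z-scores
d <- dist(z, method = "euclidean")     # step 2: Euclidean distance matrix
tree <- hclust(d, method = "average")  # step 3: average linkage clustering

groups <- cutree(tree, k = 2)          # cut the tree into two clusters
# plot(tree) would draw the dendrogram
```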
65
Q

Principal component analysis (PCA)- what is it

A
  1. A data reduction technique
  2. Not a test. There is no null hypothesis. No assumption about the distribution of the data
66
Q

When to use principal component analysis (PCA)-
Ie which variables
- what do you explore
-what do you see

A

You have objects or things described by a large number of continuous or discrete variables (not ordinal or categorical)

You want to explore the differences between the objects as measured by all the variables simultaneously

Allows you to visualise this graphically (space- filling model)

67
Q

Multiple regression - what is it?

A
  1. An extension of linear regression to situations where there is more than one independent variable
  2. A data reduction technique. Seeks to explain a reasonable fraction of the variance in the dependent variable using only some of the independent variables
68
Q

When to use multiple regression?

A

You have objects or things described by a large number of continuous or discrete variables
These are distributed reasonably normally

69
Q

Why do you test for normality before performing a variance ratio test for the equality of variances

A

The variance test is sensitive to departures from normality

70
Q

Independent variables = random

A

Temperature = random

ANOVA

71
Q

Independent variables - fixed

A

Barley = 2 varieties

72
Q

Interval or ratio data and Measures with a reasonably normal distribution

A

categories ranked and have equal spacing between adjacent values

Only ratio-scaled data have a true zero
- zero is treated as a point of origin

73
Q

2 categories - which parametric tests

A

Binomial / chi squared
2 conditions:
Pearson product
Simple linear regression
T-test (unpaired and paired)

74
Q

Factorial analysis

A

Multiple hypothesis
describe variability among observed , correlated variables

75
Q

Describe graphs of both Pearson’s correlation and Spearman’s rank

A

Pearson’s = points follow a straight line

Spearman’s rank = the relationship only has to be monotonic (consistently rising or falling), not a straight line
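A small sketch of the difference, using a made-up relationship that always rises but is strongly curved. Both coefficients come from cor() with a different method argument.

```r
# Hypothetical data: y always increases with x, but not in a straight line
x <- 1:10
y <- exp(x)

pearson <- cor(x, y, method = "pearson")    # measures straight-line association
spearman <- cor(x, y, method = "spearman")  # measures association between the ranks

c(pearson = pearson, spearman = spearman)
```

Because the ranks of x and y agree perfectly, Spearman's coefficient reaches its maximum here while Pearson's falls short of it.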

76
Q

Simple linear regression

A

A regression model that estimates the relationship between one independent variable and one dependent variable using a straight line

77
Q

Regression analysis

A

Reliable method of identifying which variables have an impact on a topic of interest.
The process of performing a regression allows you to confidently determine which factors matter most, which factors can be ignored and how the factors influence each other

78
Q

Tests that are used for more than 2 categories

A

Chi squared
ANOVA
Kruskal-Wallis
Friedman

79
Q

Multiple linear regression

A

Regression model that estimates the relationship between a quantitative dependent variable and 2 or more independent variables using a straight line

80
Q

Principal component analysis regression - when is it used

A

For variables that are strongly correlated

The PCA technique is used in processing data where multicollinearity exists between the features / variables

81
Q

How to test for significant differences between medians of 2 paired observations
3 steps

A

1- calculate the difference within each pair
2- take the absolute differences
3- rank the absolute differences
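The three steps can be followed by hand in R and compared with wilcox.test(). The before and after values are made up, and the final step (summing the ranks of the positive differences) is how R's paired test forms its statistic.

```r
# Hypothetical paired measurements on the same five subjects
before <- c(10, 12, 9, 15, 11)
after  <- c(13, 11, 14, 19, 13)

d <- after - before      # step 1: difference within each pair
abs_d <- abs(d)          # step 2: absolute differences
ranks <- rank(abs_d)     # step 3: rank the absolute differences

v <- sum(ranks[d > 0])   # sum of ranks belonging to positive differences

# Built-in version of the same test
test <- wilcox.test(after, before, paired = TRUE)
```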

82
Q

Ordinal data meaning

A

Ordinal data violates the assumption of normal distribution
- categories within a variable that have a natural rank order

83
Q

Variance

A

Tells you the degree of spread in your data set

84
Q

Variance test - what does it do

A

Sees if the variance of 2 populations from which the samples have been drawn is equal or not

85
Q

SSyy = SSR + SSE

A

SSyy = total variation in y
SSR = variation explained by the regression
SSE = unexplained variation (error)
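The identity SSyy = SSR + SSE can be verified numerically for a fitted straight line. This sketch uses made-up data and lm(); the variable names are assumptions for illustration.

```r
# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)                          # least squares line

ss_yy <- sum((y - mean(y))^2)             # total variation in y
ss_r <- sum((fitted(fit) - mean(y))^2)    # variation explained by the regression
ss_e <- sum(residuals(fit)^2)             # unexplained variation (squared errors)

c(total = ss_yy, regression_plus_error = ss_r + ss_e)
```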

86
Q

How to calculate SSE

A

Sum of the squared errors: square the difference between each observed y and the y predicted by the line, then add them up

87
Q

The regression/least squares line…

A

Is the line with the smallest SSE

88
Q

What is regression analysis - equation

A

X= predictor variable
Y= response variable

Y = a + bX (a = intercept, b = slope)

89
Q

Regression analysis explained

A

Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables

It can be utilised to assess the strength of the relationship between variables and for modelling the future relationship between them

90
Q

What is SSR

A

SSR is the additional amount of explained variability in Y due to the regression model compared to the baseline model

91
Q

What is multicollinearity and why is it a problem

A

It is the phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy

It is a problem because it undermines the statistical significance of an independent variable

92
Q

What does it mean if the standard error of a regression coefficient is large

A

The coefficient will be less statistically significant

93
Q

What is PCA

A

Is a tool for exploring the structure of multivariate data

Data reduction technique
- allows us to reduce the number of variables to a manageable number of new variables or components

94
Q

Limitations of PCA

A

Variables must be continuous or on an interval scale

95
Q

Two types of PCA

A

Covariance matrix - applies more weight to some variables than others (those with larger variances)

Correlation matrix - expresses each variable with equal weight
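In base R, one way to get the two versions is prcomp() with and without scaling. This is a sketch on made-up data in which variable b has a much larger variance than the others.

```r
# Hypothetical measurements; b is on a much larger scale than a and c
measures <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(100, 220, 290, 410, 480, 630),
  c = c(3, 1, 4, 1, 5, 9)
)

pca_cov <- prcomp(measures, scale. = FALSE)  # covariance matrix PCA
pca_cor <- prcomp(measures, scale. = TRUE)   # correlation matrix PCA

pca_cov$rotation[, 1]  # first component is dominated by b
pca_cor$rotation[, 1]  # variables contribute on an equal footing
```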

96
Q

1 sample t test

A

Tests whether 1 sample mean is significantly different from a set mean

97
Q

2 sample t test

A

Test whether unknown population means are equal or not

98
Q

Unpaired t test

A

2 different categories eg different weights of lemurs
Or how many carrots boys and girls eat

99
Q

Paired test

A

Speed of a human wearing 1 type of shoe compared to another

The measurements must be paired because different humans run at different speeds no matter the shoe type
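A sketch of the paired and unpaired versions with t.test(), using made-up running times for five people who each wore both shoe types. The one-sample call and its set mean of 12 are also assumptions for illustration.

```r
# Hypothetical times for the same five runners in two shoe types
shoe_a <- c(12.1, 13.4, 11.8, 14.0, 12.9)
shoe_b <- c(11.8, 13.0, 11.9, 13.5, 12.4)

paired <- t.test(shoe_a, shoe_b, paired = TRUE)  # compares within each runner
unpaired <- t.test(shoe_a, shoe_b)               # treats the groups as independent
one_sample <- t.test(shoe_a, mu = 12)            # compares one mean with a set mean

c(paired = paired$p.value, unpaired = unpaired$p.value)
```

Pairing removes the runner-to-runner variation, which is why the two p-values can differ a lot on the same numbers.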

100
Q

One tailed

A

Only one way the results can go - directional results
The area of distribution is (for example) greater than the value specified in the null hypothesis

101
Q

Two tailed

A

Critical area of distribution is two sided and tests whether a sample is greater than or less than a certain range of values

Group A scored higher or lower than Group B

102
Q

Types of hierarchical clustering

A

Pubs in two towns = pre defined clusters (already close together)
Geographical mid point of all Swindon pubs and the mid point of Bath pubs, and measure that distance = centroid clustering
Average distance between every pub in Swindon and every pub in Bath = average linkage clustering
Closest pair of pubs one from each town = single linkage or nearest neighbour clustering
Take the most distant pair = complete linkage clustering

103
Q

Distance matrix

A

Condense univariate distances down to a single number
Add them up: Manhattan / city block distance

Or
Euclidean distance = square root of the sum of the squares of univariate distances
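Both distances are available through dist(). A sketch with two made-up points whose univariate distances are 3 and 4:

```r
# Two hypothetical objects measured on two variables
points <- rbind(p = c(0, 0), q = c(3, 4))

manhattan <- dist(points, method = "manhattan")  # 3 + 4
euclidean <- dist(points, method = "euclidean")  # sqrt(3^2 + 4^2)

c(manhattan = as.numeric(manhattan), euclidean = as.numeric(euclidean))
```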

104
Q

Requirements of Mann-Whitney

A

Rank the observations as if they were a single sample - e.g. smallest to largest
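A sketch of that ranking step on two made-up groups, compared with the statistic that wilcox.test() reports (R runs the Mann-Whitney test through this function).

```r
# Hypothetical independent groups
group_a <- c(1.1, 2.3, 3.8)
group_b <- c(2.9, 4.5, 5.2, 6.0)

# Rank all observations together, smallest to largest
all_ranks <- rank(c(group_a, group_b))
rank_sum_a <- sum(all_ranks[seq_along(group_a)])  # ranks belonging to group A

# Statistic for group A: rank sum minus the smallest possible rank sum
n_a <- length(group_a)
w <- rank_sum_a - n_a * (n_a + 1) / 2

test <- wilcox.test(group_a, group_b)
```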