R Details Flashcards
How can you join vectors together?
Use c() with the names of the data sets/vectors you want to join - eg for girls and boys this is the code
> children = c(girls, boys)
How do you check the length of the vector
> length(vector name)
When adding vectors what is key to remember?
Don’t put + signs - only commas
How to extract particular elements/numbers from a vector / data set?
> nameofvector[1]
The square brackets tell R which positions in the vector you want to be shown
A range of elements is written
nameofvector[1:7]
How do you see a vector without certain elements
> nameofvector[-1]
Minus the first element
Maximum value of the vector
> max(vectorname)
How to work out if any vectors match our number
> which(vectorname==7)
Will give you the position of those values that match
Change the name of a vector
> newname = nameofvector
How to calculate the sum of all elements
> sum(vector)
Mean of elements
> mean(vector)
Median of elements
> median(vector)
Variance of elements
> var(vector)
Standard deviation
> std = function(x) sqrt(var(x))
std(vector)
You have to teach r how to calculate standard deviation
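A quick sketch of the card above in a session; `vec` is made-up data, and note that base R already has a built-in sd() that gives the same answer as the hand-rolled function.

```r
# Teach R the standard deviation: square root of the variance
std <- function(x) sqrt(var(x))

vec <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical data
std(vec)
sd(vec)  # built-in equivalent, same result
```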
Normality test example
Shapiro-Wilk test
When should you use Shapiro-Wilk?
To test the null hypothesis: the data are drawn from a normal population
The p-value is the probability of getting data at least this far from normal if the null hypothesis is true
A low p-value (below 0.05/5%) allows us to reject the null hypothesis - meaning the alternative is accepted - the data are not normal
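A minimal sketch of the test in R on simulated data (the seed and sample size here are arbitrary choices):

```r
set.seed(1)       # arbitrary seed, for reproducibility
x <- rnorm(50)    # data drawn from a normal population
shapiro.test(x)   # a large p-value gives no evidence against normality
```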
What do you do if your data is not Normal?
Calculate a non-parametric measure of data spread eg interquartile range
>IQR(vector)
Or
Median absolute deviation (MAD) - this finds the median of the absolute differences from the median and then multiplies by a constant (1.4826), which makes it comparable with the standard deviation
>mad(vector)
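Both non-parametric spread measures on a small made-up data set:

```r
x <- c(1, 2, 4, 7, 11, 16, 22)  # hypothetical skewed data
IQR(x)  # interquartile range
mad(x)  # median absolute deviation, scaled by 1.4826
```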
What is the code for summary and what does it show you?
> summary(vector)
Reports:
Minimum
Maximum
Median
Mean
1st quartile
3rd quartile
How do you graphically show that random data is approx normal?
"Normal probability plot"
Any curving will show that the distribution has short or long tails
The line is drawn through points formed by the 1st and 3rd quartiles
> qqnorm(vector, main = "normal (0,1)")
> qqline(vector)
What does a data transformation do?
Attempts to approximate normality before parametric stats can be applied
If data can't be transformed to normality, non-parametric stats have to be used
Common data transformation process
Logarithm of the data - log(x+1)
>qqnorm(log(vector+1))
>qqline(log(vector+1))
Test if it worked with a normality test
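Putting the card together with simulated right-skewed data (the seed and exponential sample are arbitrary choices):

```r
set.seed(42)
skewed <- rexp(40)               # right-skewed, non-normal data
transformed <- log(skewed + 1)   # the log(x + 1) transformation
qqnorm(transformed)
qqline(transformed)
shapiro.test(transformed)        # re-test normality after transforming
```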
Barcharts in r
> barplot(vector)
How to generate a more informative barplot
> table(vector)
barplot(table(vector))
How to change the scale on a barplot
> barplot(table(vector)/length(vector))
How to add labels to a barplot
> labels = as.vector(c("one", "two", "three"))
> barplot(table(vector)/length(vector), names.arg = labels, xlab = "Number of children", ylab = "relative frequency")
xlab and ylab set the axis labels
Histogram code
> hist(vector)
How to upload a larger data set
> dataset = read.table("name of file", header = TRUE)
attach(dataset)
dataset*
*this will show you the attached data set
summary(dataset)
Binomial or chi squared
Nominal or frequency data
2 categories
Chi-squared
Nominal or frequency
More than 2 categories
Pearson product-moment / Spearman rank
Interval or ratio data and measures with a reasonably normal distribution
2 conditions
Testing hypotheses about:
Correlation - relationship between two dependent variables
Simple linear regression
Interval or ratio data and measures with a reasonably normal distribution
2 conditions
Testing hypotheses:
Regression - effect of an independent variable upon a dependent variable
T test
Interval or ratio data and measures with a reasonably normal distribution
2 conditions
Testing hypotheses about
Means
Independent measures design
T test
Interval or ratio data and measures with a reasonably normal distribution
2 conditions
Testing hypotheses about - means
Matched measures or repeated measures designs
Analysis of variance - ANOVA
Parametric
Interval or ratio data and measures with a reasonably normal distribution
More than 2 conditions
Testing hypotheses about - means
Difference between means
Null hypotheses= there is no significant difference between the means of two conditions
Multiple linear regression
Interval or ratio data and measures with a reasonably normal distribution
More than 2 conditions
Testing hypotheses about - regression (effect of 2 or more independent variables upon a dependent variable)
Spearman rank
Ordinal data or non-normal distribution of measure
2 conditions
Testing hypotheses - correlation - relationship between two dependent variables
Mann- Whitney
Ordinal data or non-normal distribution or measure
2 conditions
Testing hypotheses - medians
Independent measures
Wilcoxon
Ordinal data or non normal distribution of measure
2 conditions
Testing hypotheses about - medians
Repeated measures
Kruskal-Wallis
Ordinal data or non-normal distribution of measure
More than 2 conditions
Non-parametric analysis of variance
Independent measures
Friedman
Ordinal or non-normal distribution of measure
More than 2 conditions
Non- parametric analysis of variance
Repeated measures
Continuous variable
Take on any value within a given range
There are an infinite number of possible values, limited only by our ability to measure them eg distance
Discrete variable
Only certain distinct values within a given range
The scale is still meaningful - can't have half values
Categorical variable
One in which the value taken by the variable is a non numerical category or class
Ranked variable
Is a categorical variable in which the categories imply some order or relative position
Numerical values are usually assigned but 4 is not necessarily twice as much as 2
How to set class intervals
- Use intervals of equal length with midpoints at convenient round numbers
- For small data sets, use a small number of intervals
- For large data sets , use more intervals
Stem leaf plots
Allow a summary of the data , retaining the original values
1. The stem consists of a column of figures, omitting the last digit
2. Add the final digit of each value as a "leaf" in the matching row
3. Put the "leaves" in order
Interquartile range IQR
Based on the median
Divides the data into four equal groups and looks to see how far apart the extreme groups are
- Put the data in numerical order
- Find the overall median. Divide the data set into two equally sized subsets. If n for the whole data set is odd, put the overall median in both subsets
- Find the median for the lower group. This is the first quartile
- Find the median for the upper group. This is the third quartile
Interquartile range is IQR=Q3-Q1
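The steps above can be checked with quantile() (note that R's default quartile rule can differ slightly from the split-at-the-median method described here); the data are made up:

```r
x <- c(5, 7, 8, 9, 11, 14, 21, 30)  # hypothetical ordered data
q <- quantile(x, c(0.25, 0.75))     # Q1 and Q3
unname(q[2] - q[1])                 # IQR = Q3 - Q1
IQR(x)                              # same thing in one call
```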
What is a box whisker plot
A way to illustrate the IQR
A good way to demonstrate the differences between groups
Standard deviation
A measure of spread around the mean
A bit like the average distance of the data from the mean
Random variable?
Is the numerical outcome of a random experiment
Binomial distribution
A binomial variable is one with just two possible outcomes eg a single toss of a coin
- one outcome = success and the other= failure
What are the four attributes of the normal distribution?
Normal distributions vary in shape:
wide and flat
or
narrow and high
Chi- squared test
Suitable for frequency data: counts of things
Do the number of individuals in different categories fit a null hypothesis of some sort (the expectation)
Yates' correction (1 df)
Apply where there are only two categories of data (eg male and female)
Subtract 0.5 from each value of O-E, ignoring the sign: |O-E| - 0.5
Continue the rest of the calculation as normal
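R's chisq.test() only applies Yates' correction automatically to 2x2 tables, so for a two-category goodness-of-fit test the correction can be sketched by hand; the counts and 50:50 expectation here are hypothetical:

```r
obs <- c(male = 28, female = 12)               # hypothetical observed counts
exp <- c(20, 20)                               # expected under a 50:50 null
chi_sq <- sum((abs(obs - exp) - 0.5)^2 / exp)  # Yates: subtract 0.5 from |O - E|
pchisq(chi_sq, df = 1, lower.tail = FALSE)     # p-value on 1 df
```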
Mann-Whitney test- detailed
Non parametric alternative to the unpaired t-test
Tests for the significant difference between the median of two independent groups
Use this test when one or both groups have non-normal distribution
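In R this is wilcox.test() on two independent vectors (the groups here are hypothetical):

```r
a <- c(3, 5, 6, 8, 9)      # hypothetical group A
b <- c(7, 10, 12, 14, 15)  # hypothetical group B
wilcox.test(a, b)          # Mann-Whitney U / Wilcoxon rank-sum test
```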
Wilcoxon paired sample test- detailed
Non parametric alternative to the paired t-test
Tests for a significant difference between the medians of paired observations
Use this test when one or both groups have a non normal distribution (or cannot be induced to be normal)
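The same function with paired = TRUE gives the paired-sample version (before/after measurements are hypothetical):

```r
before <- c(12, 15, 11, 18, 14, 16)        # hypothetical first measurements
after  <- before + c(1, 2, 3, 4, 5, 6)     # same subjects measured again
wilcox.test(before, after, paired = TRUE)  # Wilcoxon signed-rank test
```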
Kruskal-Wallis
Non-parametric one-way analysis of variance
Non-parametric alternative to one way ANOVA
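A sketch with kruskal.test() on a formula of measurements against a grouping factor (data hypothetical):

```r
scores <- c(2, 3, 5, 8, 9, 11, 14, 16, 18)         # hypothetical measurements
group  <- factor(rep(c("a", "b", "c"), each = 3))  # three independent groups
kruskal.test(scores ~ group)
```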
Friedman’s
Non-parametric alternative to two-way analysis of variance
Used to detect differences in medians between three or more treatments of the same subjects
Wide variations of the standard deviations for rows or columns of a data matrix suggest that we cannot use parametric ANOVA
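friedman.test() takes a matrix with subjects as rows and treatments as columns (the values here are made up):

```r
# Rows = subjects, columns = treatments (hypothetical repeated measures)
m <- matrix(c(1, 2, 3,
              2, 4, 5,
              1, 3, 6,
              2, 5, 7), nrow = 4, byrow = TRUE)
friedman.test(m)
```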
Parametric stats
Based on assumptions about the distribution of the population from which the sample was taken
Evaluate hypotheses for a particular parameter usually the population mean
Quantitative data
Require assumptions about the distributional characteristics of the population distribution
- normal data
- equal variance
More powerful than non-parametric tests when assumptions are met
Non parametric stats
Evaluate hypotheses for entire population distributions
Quantitative or ranked qualitative data
Require no assumptions (distribution-free), so used with non-normal distributions and when the variances of the groups are not equal
Generally easy to compute
List of parametric tests
Paired t test
Unpaired t test
Pearson correlation
ANOVA
Non parametric test- examples
Wilcoxon rank sum test
Mann-Whitney U test
Spearman correlation
Kruskal Wallis test
Friedman
Hierarchical clustering - what is it
- A way to find hierarchical patterns of similarity between sets of objects
- Not a test. There is no null hypothesis. No assumption about the distribution of the data
When to use hierarchical clustering?
You have objects or things described by a large number of continuous or discrete variables
Some implementations also work with ordinal or categorical variables
Allows you to visualise this graphically (dendrogram or tree)
Hierarchical clustering: three steps
- Data transformation (eg Z-scores)
- Matrix of similarities , differences or distances (eg Euclidean)
- Clustering algorithm (eg UPGMA / average linkage)
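The three steps map onto scale(), dist() and hclust(); the data here are simulated, and the seed and dimensions are arbitrary:

```r
set.seed(7)
m  <- matrix(rnorm(40), nrow = 10)   # 10 objects x 4 variables (simulated)
z  <- scale(m)                       # 1. transform to Z-scores
d  <- dist(z, method = "euclidean")  # 2. distance matrix
hc <- hclust(d, method = "average")  # 3. average-linkage (UPGMA) clustering
plot(hc)                             # dendrogram
```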
Principal component analysis (PCA)- what is it
- A data reduction technique
- Not a test. There is no null hypothesis. No assumption about the distribution of the data
When to use principal component analysis (PCA)?
(ie which variables, what you explore, what you see)
You have objects or things described by a large number of continuous or discrete variables (not ordinal or categorical)
You want to explore the differences between the objects as measured by all the variables simultaneously
Allows you to visualise this graphically (space- filling model)
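A sketch with prcomp() on R's built-in iris measurements (four continuous variables); scale. = TRUE gives the correlation-matrix form:

```r
p <- prcomp(iris[, 1:4], scale. = TRUE)  # correlation-matrix PCA
summary(p)  # proportion of variance explained by each component
head(p$x)   # scores of the objects on the new components
```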
Multiple regression - what is it?
- An extension of linear regression to situations where there is more than one independent variable
- A data reduction technique. Seeks to explain a reasonable fraction of the variance in the dependent variable using only some of the independent variables
When to use multiple regression?
You have objects or things described by a large number of continuous or discrete variables
These are distributed reasonably normally
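A minimal simulated sketch: two hypothetical predictors in one lm() formula, giving one coefficient per independent variable.

```r
set.seed(3)
x1 <- rnorm(30)  # hypothetical predictors
x2 <- rnorm(30)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(30, sd = 0.3)  # simulated response
fit <- lm(y ~ x1 + x2)  # regression on two independent variables
summary(fit)            # intercept plus a coefficient for each predictor
```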
Why do you test for normality before performing a variance ratio test for the equality of variances
The variance test is sensitive to departures from normality
Independent variables = random
Temperature = random
ANOVA
Independent variables - fixed
Barley = 2 varieties
Interval or ratio data and measures with a reasonably normal distribution
Categories are ranked and have equal spacing between adjacent values
Only ratio scales have a true zero
- zero is treated as a point of origin
2 categories - which parametric tests
Binomial / chi squared
2 conditions:
Pearson product-moment correlation
Simple linear regression
T-test (unpaired and paired)
Factor analysis
Multiple hypotheses
Describes variability among observed, correlated variables
Describe the graphs of both Pearson's correlation and Spearman's rank
Pearson's = straight line (linear relationship)
Spearman's rank = only has to be monotonic (consistently rising or falling)
Simple linear regression
A regression model that estimates the relationship between one independent variable and one dependent variable using a straight line
Regression analysis
Reliable method of identifying which variables have an impact on a topic of interest.
the process of performing a regression allows you to confidently determine which factors matter most , which factors can be ignored and how the factors influence each other
Tests that are used for more than 2 categories
Chi squared
ANOVA
Kruskal-Wallis
Friedman
Multiple linear regression
Regression model that estimates the relationship between a quantitative dependent variable and 2 or more independent variables using a straight line
Principal component analysis regression - when is it used
For variables that are strongly correlated
The PCA technique is used in processing data where multicollinearity exists between the features/variables
How to test for significant differences between medians of 2 paired observations
3 steps
1 - calculate the differences between the paired observations
2 - take the absolute differences
3 - rank the absolute differences
Ordinal data meaning
Ordinal data violates the assumption of normal distribution
- categories within a variable that have a natural rank order
Variance
Tells you the degree of spread in your data set
Variance test - what does it do
Sees if the variance of 2 populations from which the samples have been drawn is equal or not
SSyy = SSR + SSE
SSyy = total variation in y
SSR = variation explained by the regression
SSE = error (unexplained) variation
How to calculate SSE
Sum of the squared estimate of errors
The regression/least squares line…
Is the line with the smallest SSE
What is regression analysis - equation
X= predictor variable
Y= response variable
Y= a+bx
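Fitting Y = a + bX by least squares in R, with hypothetical x and y values:

```r
x <- c(1, 2, 3, 4, 5)            # predictor (hypothetical)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)  # response (hypothetical)
fit <- lm(y ~ x)                 # least-squares fit of Y = a + bX
coef(fit)                        # a = intercept, b = slope
```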
Regression analysis explained
Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables
It can be utilised to assess the strength of the relationship between variables and for modelling the future relationship between them
What is SSR
SSR is the additional amount of explained variability in Y due to the regression model compared to the baseline model
What is multicollinearity and why is it a problem
It is the phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy
It is a problem because it undermines the statistical significance of an independent variable
What does it mean if the standard error of a regression coefficient is large
The coefficient will be less statistically significant
What is PCA
Is a tool for exploring the structure of multivariate data
Data reduction technique
- allows us to reduce the number of variables to a manageable number of new variables or components
Limitations of PCA
Variables must be continuous or on an interval scale
Two types of PCA
Covariance matrix - applies more weight to some variables than others
Correlation matrix - expresses each variable with equal weight
1 sample t test
Tests whether one sample mean is significantly different from a set mean
2 sample t test
Tests whether two unknown population means are equal or not
Unpaired t test
2 different categories eg different weights of lemurs
Or how many carrots boys and girls eat
Paired test
Speed of a human wearing 1 type of shoe compared to another
The measurements must be paired because humans run at different speeds regardless of shoe type
One tailed
Only one way the results can go - directional results
The critical area of the distribution is one-sided - eg only values greater than that specified in the null hypothesis
Two tailed
Critical area of distribution is two sided and tests whether a sample is greater than or less than a certain range of values
Eg did Group A score higher or lower than Group B
Types of hierarchical clustering
Pubs in two towns = pre-defined clusters (already close together)
Geographical midpoint of all Swindon pubs to the midpoint of Bath pubs, measuring that distance = centroid clustering
Average distance between every pub in Swindon and every pub in Bath = average linkage clustering
Closest pair of pubs, one from each town = single linkage or nearest neighbour clustering
Take the most distant pair = complete linkage clustering
Distance matrix
Condenses the univariate distances down to a single number
Add them up: Manhattan / city block distance
Or
Euclidean distance = square root of the sum of the squares of univariate distances
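Both distances on two hypothetical objects, by hand and via dist():

```r
a <- c(1, 2)                             # two hypothetical objects
b <- c(4, 6)
sum(abs(a - b))                          # Manhattan / city block distance
sqrt(sum((a - b)^2))                     # Euclidean distance
dist(rbind(a, b), method = "manhattan")  # same, via dist()
dist(rbind(a, b))                        # Euclidean is the default
```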
Requirements of Mann-Whitney
Rank observations as if they were a single sample - eg smallest to largest