03. Review of Basic Data Analytic Methods Using R Flashcards
What is a factor
A factor is a vector whose values are limited to a fixed set of categories, called levels (much like a cell with a data-validation drop-down list). When you first create a factor, the distinct values you supply become its levels; after that, every new value is checked against this set of levels.
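A minimal sketch of this behaviour, using made-up shirt sizes (the data and variable names are illustrative only):

```r
# Hypothetical data: the distinct values become the levels
sizes <- factor(c("small", "large", "medium", "small"))

# Levels are inferred on creation and stored in sorted order
levels(sizes)

# Assigning a value outside the levels triggers a warning and stores NA
sizes[2] <- "huge"
is.na(sizes[2])
```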
What is a list
A list is a vector of values which can contain different data types
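For instance, a list might hold a string, a number and a numeric vector side by side (the names below are hypothetical):

```r
# Hypothetical record: each element has a different type or length
person <- list(name = "Ada", age = 36, scores = c(90, 85))

# Inspect the type of every element
str(person)

# Elements are accessed by name with $ or [[ ]]
person$scores[2]
person[["name"]]
```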
What is an array
An array can only contain one type of data. Arrays can be multi-dimensional, i.e. rows, columns, sheets, workbooks, etc.
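As a rough illustration, a three-dimensional array can be pictured as 4 "sheets" of 2 rows by 3 columns (the dimensions here are arbitrary):

```r
# 24 integers arranged as 2 rows x 3 columns x 4 sheets
a <- array(1:24, dim = c(2, 3, 4))

dim(a)

# Index by row, column, sheet; values are filled column by column
a[2, 3, 4]   # the last element, 24
```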
What is a dataframe
A dataframe is a table whose columns are vectors or factors, all of the same length. Each column holds a single data type, but different columns can hold different data types.
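A small sketch with invented customer data shows columns of different types sharing one length (column names are assumptions for illustration):

```r
# Hypothetical data: character, numeric and factor columns of equal length
df <- data.frame(
  customer = c("a", "b", "c"),
  spend = c(10.5, 20, 7.25),
  offer = factor(c("none", "offer1", "offer2")),
  stringsAsFactors = FALSE
)

# One type per column, different types across columns
sapply(df, class)

# Select a single row across all columns
df[2, ]
```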
What is a record
A single row of a dataframe, i.e. one observation across all columns
What is a pairs plot
It plots every variable against every other variable. Also known as a SPLOM (scatterplot matrix).
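One way to draw such a plot in base R is the pairs() function; this sketch assumes the built-in iris dataset as example data:

```r
# Plot the four numeric iris measurements against each other,
# colouring the points by species
measurements <- iris[, 1:4]
pairs(measurements, col = iris$Species)
```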
What do T-Tests use
Samples of the population (not the full population)
What are T-Tests used for
Samples of the population (not the full population) are tested against a NULL hypothesis (i.e. checking whether an observed difference is statistically significant)
What can an ANOVA test be used for
ANOVA is used in hypothesis testing when you have more than two sample populations to compare
What is hypothesis testing
Deciding, based on sample data, whether to reject the null hypothesis in favour of the alternative hypothesis
What is statistical power
Statistical power is the probability that a test correctly rejects the null hypothesis when the alternative hypothesis is true (i.e. 1 minus the probability of a Type 2 error)
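Power can be explored with power.t.test() from the stats package. The effect size, standard deviation and sample sizes below are arbitrary assumptions chosen for illustration:

```r
# Power of a two-sample t-test to detect a difference of 0.5 (sd = 1)
small <- power.t.test(n = 10, delta = 0.5, sd = 1, sig.level = 0.05)
large <- power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = 0.05)

# Larger samples give higher power for the same effect
small$power
large$power

# Leave n out to solve for the sample size per group needed for 80% power
needed <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
ceiling(needed$n)
```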
What is a parametric distribution
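A distribution that is fully described by a fixed set of parameters. In this context it usually means the data is assumed to follow a normal distribution (described by its mean and standard deviation)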
How is the standard deviation calculated
The standard deviation is the square root of the variance
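A short sketch of the calculation by hand, checked against R's built-in sd(). The numbers are made up, and the sample formula (dividing by n - 1) is assumed because that is what sd() uses:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
variance <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is its square root
std_dev <- sqrt(variance)

# Should agree with the built-in functions
all.equal(variance, var(x))
all.equal(std_dev, sd(x))
```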
The T-Test
Parametric. Can be one-sample or two-sample.
“Student's t-test” is another name for the two-sample t-test that assumes both populations have equal variance.
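Both forms are available through t.test(). This sketch uses simulated data, so the means, spreads and seed are assumptions rather than real measurements:

```r
set.seed(1)
x <- rnorm(20, mean = 100, sd = 10)
y <- rnorm(20, mean = 105, sd = 10)

# One-sample: is the mean of x different from 100?
one_sample <- t.test(x, mu = 100)

# Two-sample with equal variances assumed
two_sample <- t.test(x, y, var.equal = TRUE)

one_sample$p.value
two_sample$p.value

# Degrees of freedom: n - 1 and n1 + n2 - 2 respectively
one_sample$parameter
two_sample$parameter
```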
Welch’s test
Parametric, but can cope with the two samples having different standard deviations (so it is always a two-sample test)
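In R, t.test() runs Welch's test by default (var.equal = FALSE). A sketch with simulated samples of deliberately different spread and size (all values are illustrative):

```r
set.seed(2)
a <- rnorm(15, mean = 50, sd = 2)
b <- rnorm(25, mean = 52, sd = 8)

# The default var.equal = FALSE gives the Welch test
welch <- t.test(a, b)

welch$method
welch$p.value

# Welch adjusts the degrees of freedom, typically to a non-integer value
welch$parameter
```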
Wilcoxon Rank Sum Test
Non-parametric. It tests whether two populations of numbers are identically distributed.
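The test is available as wilcox.test(). The sketch below assumes two simulated, clearly non-normal samples:

```r
set.seed(3)
# Skewed (exponential) data, where a normality assumption would be doubtful
x <- rexp(20, rate = 1)
y <- rexp(20, rate = 0.5)

# Two-sample Wilcoxon rank sum test
res <- wilcox.test(x, y)

res$method
res$p.value
```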
ANOVA
We use ANOVA tests to perform multiple comparisons across more than two populations of data.
Example: You have an online shop that gives two offers or none at all. We want to find out whether the offers are affecting the number of purchases.
The point of the ANOVA is to determine whether the variance in the dataset is due to the spread of values within each group or because of the spread of values between groups.
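The online shop example could be sketched as follows. The purchase counts are simulated, and the group names, means and sample sizes are assumptions for illustration only:

```r
set.seed(4)
# Hypothetical data: 30 customers per group
shop <- data.frame(
  offer = factor(rep(c("none", "offer1", "offer2"), each = 30)),
  purchases = c(rnorm(30, 10, 3), rnorm(30, 12, 3), rnorm(30, 13, 3))
)

# One-way ANOVA: do mean purchases differ between the offer groups?
fit <- aov(purchases ~ offer, data = shop)
summary(fit)

# Follow-up pairwise comparisons between the groups
TukeyHSD(fit)
```

ANOVA only says whether some group differs; the TukeyHSD() follow-up indicates which pairs of groups differ.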
What is the p value
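The p-value is the probability of seeing results at least as extreme as those observed, assuming the NULL hypothesis is true. It is not the probability that the NULL is true. A low p-value means the observed data would be very unlikely under the NULL, so we reject it.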
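As a rough sketch, the p-value of a one-sample t-test can be computed by hand from the t statistic and compared with t.test(). The sample values and the hypothesised mean are made up:

```r
x <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7)
mu0 <- 5   # mean under the NULL hypothesis (assumed for illustration)

# t statistic: distance of the sample mean from mu0 in standard errors
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# Two-sided p-value: area in both tails of the t distribution
p_manual <- 2 * pt(-abs(t_stat), df = length(x) - 1)

# Should match the built-in test
all.equal(p_manual, t.test(x, mu = mu0)$p.value)
```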
What is a vector
A vector can only consist of one class and represents a single column
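A quick sketch of the one-class rule: mixing types in c() does not fail, it coerces everything to a common class (values are arbitrary):

```r
v <- c(1.5, 2, 3)
class(v)        # "numeric"

# Mixing types coerces every element to a single class
mixed <- c(1, "a", TRUE)
class(mixed)    # "character"
mixed
```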
What is a matrix
A two dimensional array
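For example (the values are arbitrary):

```r
# 2 rows x 3 columns, filled column by column
m <- matrix(1:6, nrow = 2)

dim(m)
m[2, 3]        # row 2, column 3: the value 6

# A matrix is an array with exactly two dimensions
is.array(m)
```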
If we reject the NULL hypothesis but it is actually true what type of error is this
Type 1 error
How does the ANOVA test work
A bit like clustering, it compares the between-groups mean sum of squares (between-group variance) with the within-groups mean sum of squares (within-group variance). Their ratio is the F statistic; a large value suggests the group means really do differ.
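The two mean sums of squares can be worked out by hand. This sketch uses three tiny made-up groups, and the variable names are illustrative:

```r
# Hypothetical purchases per customer in three groups
groups <- list(
  none   = c(3, 4, 5),
  offer1 = c(5, 6, 7),
  offer2 = c(8, 9, 10)
)
values <- unlist(groups)
k <- length(groups)   # number of groups
n <- length(values)   # total number of observations
grand_mean <- mean(values)

# Between-group sum of squares: how far group means sit from the grand mean
ss_between <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand_mean)^2))

# Within-group sum of squares: spread of values around their own group mean
ss_within <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))

# Mean sums of squares and their ratio, the F statistic
ms_between <- ss_between / (k - 1)
ms_within <- ss_within / (n - k)
f_stat <- ms_between / ms_within   # 19 for this data

# p-value from the upper tail of the F distribution
p_value <- pf(f_stat, k - 1, n - k, lower.tail = FALSE)
```

Here the between-group mean square (19) is much larger than the within-group mean square (1), which is what drives the large F statistic.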