Book - Chapter 3 Basic Analytics in R Flashcards
What is the function to import data
read.csv
What does the head function do
Displays the first few rows (six by default) of the imported dataset
What does the summary function do
Provides descriptive statistics, such as mean and median, for each data column
When referring to a column in a dataset what symbol should you use
The $
How would you plot linear regression
lm() fits the regression; abline() adds the fitted line to a plot
What does our software use
Command-line interface
How do you set a working directory
setwd()
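A minimal sketch of the import-and-inspect cards above. The file name `cards_demo.csv`, the columns `x` and `y`, and the values are made up for illustration; the file is written to a temporary directory so the example is self-contained.

```r
old_wd <- setwd(tempdir())      # set the working directory; setwd() returns the previous one

# Hypothetical data written to a CSV file in the working directory
write.csv(data.frame(x = 1:10, y = 2 * (1:10) + 1),
          "cards_demo.csv", row.names = FALSE)

dat <- read.csv("cards_demo.csv")  # import the data
head(dat)                          # first six rows by default
summary(dat)                       # min, quartiles, median and mean per column
mean(dat$y)                        # $ selects a single column

fit <- lm(y ~ x, data = dat)       # fit a linear regression of y on x
plot(dat$x, dat$y)                 # scatterplot of the raw data
abline(fit)                        # add the fitted regression line

setwd(old_wd)                      # restore the original working directory
```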
What are the categorical/qualitative attribute types
Nominal and ordinal
What are the numeric/quantitative attribute types
Interval and ratio
What are nominal data types
ZIP codes, nationality, street names, gender, employee ID numbers, true or false
What is an ordinal data type
Ordered names, for example:
quality of diamonds
academic grades
magnitude of earthquakes
What are interval data types
Numeric with no true zero, for example temperature in Celsius or Fahrenheit
What are ratio data types
Numeric with a true zero, for example age or temperature in Kelvin
What is a vector
Set of values of the same data type.
What can you use to create vectors
The combine function, c()
What dimension is a vector
They are dimensionless
What is a two-dimensional array
Matrix
What is an array
An n-dimensional set of values of a homogeneous data type
What do nrow and ncol do
As arguments to matrix() they define the number of rows and columns; as functions, nrow() and ncol() return them
What is a data frame
Like a spreadsheet: a list of columns in which all columns are the same length
Can data frames store different data types
Yes
What is a list
A collection of objects, such as vectors, that can be of different types and different lengths
What is a factor
A categorical variable.
It has a fixed set of values (levels) and uses integer codes to represent the different values
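The data structures on the cards above can be sketched as follows. All object names and values here are illustrative only.

```r
v  <- c(2, 4, 6)                          # vector built with the combine function c()
m  <- matrix(1:6, nrow = 2, ncol = 3)     # matrix: a 2-dimensional array
a  <- array(1:24, dim = c(2, 3, 4))       # array with three dimensions
df <- data.frame(id = 1:3,                # data frame: columns of different types,
                 grade = c("A", "B", "A"))  # all of the same length
l  <- list(numbers = v, label = "demo")   # list: elements differ in type and length
f  <- factor(c("low", "high", "low"),     # factor with a fixed set of levels
             levels = c("low", "high"))

dim(v)          # NULL: a plain vector has no dim attribute
nrow(m)         # 2
ncol(m)         # 3
levels(f)       # "low" "high"
as.integer(f)   # integer codes behind the factor: 1 2 1
```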
What does variance mean
The average squared distance from the mean
What does standard deviation mean
The square root of variance
What does range mean
Minimum to maximum
What is interquartile range
The span from the 25th to the 75th percentile of the size-ordered data
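A short sketch of the dispersion measures above on a made-up sample. Note that R's var() and sd() compute the sample versions, dividing by n - 1 rather than n.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative values, mean = 5

var(x)      # sample variance: sum((x - mean(x))^2) / (length(x) - 1) = 32 / 7
sd(x)       # standard deviation: sqrt(var(x))
range(x)    # minimum and maximum: 2 9
IQR(x)      # 75th percentile minus 25th percentile: 5.5 - 4 = 1.5
quantile(x, c(0.25, 0.75))   # the two quartiles behind the IQR
```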
Why do we visualise
To get a sense of the data
What should we visualise
Mean versus median. Standard deviation. Quantiles. Correlations between variables
What does Anscombe's quartet do
Illustrates the importance of visualising data. Uses four data sets with nearly identical summary statistics. Each data set is plotted as a scatterplot and then fitted with the line that results from applying linear regression
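Anscombe's quartet ships with base R as the `anscombe` data frame (columns x1 to x4 and y1 to y4), so the card above can be reproduced directly. This is only a sketch of one way to do it.

```r
# Fit a separate linear regression to each of the four data sets
fits <- lapply(1:4, function(i) {
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
})

# Intercept and slope of each fit: all close to 3.0 and 0.5
coefs <- t(sapply(fits, coef))
round(coefs, 2)

# The scatterplots, however, look very different
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fits[[i]])   # regression line for this data set
}
par(op)
```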
What should you do if the data is skewed
Take the logarithm of the data (if the values are all positive)
What does bimodal mean
The distribution has two peaks (modes)
What is data cleansing
Eliminating dirty data
What is the plot function
Scatter plot where x is the index and y is the value
What is a barplot function
Bar plot with vertical or horizontal bars
What is a dot chart
Cleveland dot plot
What is plot(density(data))
Density plot. A continuous histogram
What is a stem function
Stem-and-leaf plot
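The single-variable plotting functions on the cards above, sketched on a made-up, right-skewed sample. The binning used for the bar plot is an arbitrary choice for illustration.

```r
set.seed(1)
x <- rexp(200)                 # hypothetical sample, all positive and right-skewed

plot(x)                        # scatterplot: index on the x-axis, value on the y-axis
barplot(table(cut(x, 5)))      # bar plot of counts in five bins
dotchart(head(sort(x), 10))    # Cleveland dot plot of the ten smallest values
plot(density(x))               # density plot, a smoothed histogram
stem(x)                        # stem-and-leaf plot, printed to the console
plot(density(log(x)))          # skewed and all positive: inspect the log instead
```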
How many variables can a scatter plot have
Up to 5, for example by mapping variables to x position, y position, point size, colour and shape
What does a loess line do?
Fits a smooth nonlinear curve to the data
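A minimal sketch of adding a loess curve to a scatterplot. The sine-shaped data are invented, and loess() is called with its default settings.

```r
set.seed(42)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)   # hypothetical nonlinear relationship plus noise

smooth <- loess(y ~ x)               # local regression with default span
plot(x, y)                           # raw data
lines(x, predict(smooth))            # fitted smooth curve
```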
What charts can be used to visualise multiple variables.
Barplot and dotchart
When would you use a hexbinplot
When dealing with large data sets
What is a pairwise plot
A scatter plot matrix
What is the seasonality effect?
A peak or dip in a time series that recurs at the same time every year
What is the basic concept for hypothesis testing?
To form an assertion and test it with data
What is the null hypothesis
No difference
What does it mean if the regression coefficient is zero
That is the null hypothesis: the variable does not affect the outcome
What is the basic testing approach
To compare the observed sample means
What does a large observed difference between the sample means indicate
That the null hypothesis should be rejected
For the difference in means how can this be tested
Student's t-test or Welch's t-test
What is Student's t-test
A test that assumes the distributions of the two populations have equal but unknown variances
In Student's t-test, if each population is normally distributed with the same mean and the same variance, what distribution does the test statistic follow
The t-statistic follows a t-distribution with n1 + n2 - 2 degrees of freedom
If the observed value of t is far enough from zero what should you do
Reject the null hypothesis
What is a significance level
A small probability (alpha) used as the threshold for rejecting the null hypothesis
What is the significance level of the test
The probability of rejecting the null hypothesis when the null hypothesis is actually true
What is the normal significance level
0.05
What is different in a two sided hypothesis test
The probabilities in both tails of the t-distribution must sum to the significance level
What is the P value
The area under the tail(s) of the distribution beyond the observed test statistic
What is a confidence interval
An interval estimate of a population parameter or characteristic based on sample data
How is the confidence interval used
It is used to indicate the uncertainty of a point estimate
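The difference-of-means cards above can be tried with t.test() from base R. The two samples below are simulated purely for illustration; `var.equal = TRUE` requests Student's t-test, while the default is Welch's.

```r
set.seed(123)
a <- rnorm(30, mean = 100, sd = 5)   # hypothetical sample from population 1
b <- rnorm(30, mean = 105, sd = 5)   # hypothetical sample from population 2

student <- t.test(a, b, var.equal = TRUE)  # Student's t-test: equal variances assumed
welch   <- t.test(a, b)                    # Welch's t-test: variances may differ

student$statistic   # observed t
student$parameter   # degrees of freedom: 30 + 30 - 2 = 58
student$p.value     # two-sided p-value (area in both tails)
student$conf.int    # 95% confidence interval for the difference in means

alpha <- 0.05
reject_null <- student$p.value < alpha     # TRUE means reject at the 5% level
```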
What is the Wilcoxon rank-sum test
A nonparametric test that makes no assumption about the distributions of the populations. A robust test for a difference in means
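In base R the rank-sum test is available as wilcox.test(). A small sketch on simulated, non-normal samples (the data are assumptions for illustration):

```r
set.seed(7)
a <- rexp(25, rate = 1)      # hypothetical skewed sample 1
b <- rexp(25, rate = 0.5)    # hypothetical skewed sample 2

res <- wilcox.test(a, b)     # two-sample Wilcoxon rank-sum (Mann-Whitney) test
res$statistic                # the rank-based test statistic W
res$p.value                  # compare against the chosen significance level
```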
What is the Type I error
Rejection of the null hypothesis when the null hypothesis is true
What is the Type II error
Acceptance of the null hypothesis when the null hypothesis is false
What is significance
Probability of a false positive
What is power
Probability of a true positive
What is effect size
The size of the observed difference
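Significance, power and effect size are linked, and base R's power.t.test() solves for whichever one is left out. A sketch with assumed numbers (a true difference of 5 and a standard deviation of 5):

```r
# Power of a two-sample t-test with 30 observations per group
pw <- power.t.test(n = 30, delta = 5, sd = 5, sig.level = 0.05)
pw$power      # probability of a true positive under these assumptions

# Sample size per group needed to reach 90% power for the same effect
n_needed <- power.t.test(delta = 5, sd = 5, sig.level = 0.05, power = 0.9)$n
ceiling(n_needed)
```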
What do you use if there are more than two populations
ANOVA
What does ANOVA stand for
Analysis of variance
What is the F statistic in ANOVA
A measure of how different the means are relative to the variability within the groups
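A one-way ANOVA can be sketched with aov() from base R. The three groups below are simulated, and the name `scores` is just a placeholder.

```r
set.seed(99)
scores <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  value = c(rnorm(20, 50, 5), rnorm(20, 52, 5), rnorm(20, 58, 5))
)

fit <- aov(value ~ group, data = scores)   # one-way analysis of variance
tab <- summary(fit)[[1]]                   # ANOVA table as a data frame
tab

# F = between-group mean square / within-group mean square
f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]
```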