4. descriptive statistics Flashcards
stats + graphs
Descriptive statistics
Measures of central tendency and variability together make up descriptive statistics
statistics
The science of analysing, reviewing and drawing conclusions from data
Some basic statistical numbers include:
Mean, median and mode
Minimum and maximum value
Percentiles
Variance and Standard Deviation
Covariance and Correlation
Probability distributions
dataset
Collection of data, often presented in a table
eg: mtcars
1. dataset_name: Prints the data set
2. ?dataset_name: Gives complete information about the data set in the help window
3. dim(): Finds the dimensions of the data set
4. names(): Views the names of the variables of the data set
5. rownames(): Gives the name of each row in the first column
6. $variable_name: Prints all values that belong to a variable
7. sort(): Sorts the values
8. summary(): Gives the statistical summary of the whole data set (see the sketch after this list)
- It gives six statistical numbers for each variable:
1. Min
2. First quartile (25th percentile)
3. Median
4. Mean
5. Third quartile (75th percentile)
6. Max
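A quick sketch pulling the exploration functions above together on mtcars (illustrative only; outputs omitted):
eg: Data_Cars <- mtcars
dim(Data_Cars) # number of rows and columns
names(Data_Cars) # variable (column) names
rownames(Data_Cars) # row names (car models)
sort(Data_Cars$hp) # horsepower values in ascending order
summary(Data_Cars) # the six statistical numbers for each variable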
Min and Max
min() and max() are built-in math functions in R that give the lowest and highest values in the data set
- max(dataset$variable_name)
eg: Data_Cars <- mtcars
max(Data_Cars$hp)
min(Data_Cars$hp)
- To find the index position of the min and max values in the table:
**which.max() and which.min()**
Combine which.max() and which.min() with the rownames() function to get the names of the cars with the largest and smallest horsepower:
Data_Cars <- mtcars
rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]
Outliers
Max and min can also be used to detect outliers. An outlier is a data point that differs markedly from the rest of the observations.
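A minimal sketch of one common way to flag outliers, the 1.5 * IQR rule, applied to mtcars horsepower (the rule itself is an added assumption, not part of this card):
Data_Cars <- mtcars
hp <- Data_Cars$hp
q <- quantile(hp, c(0.25, 0.75)) # first and third quartiles
spread <- IQR(hp) # interquartile range
hp[hp < q[1] - 1.5 * spread | hp > q[2] + 1.5 * spread] # values far outside the quartiles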
central tendencies
- mean()
- median()
- mode
mean
The average value
It is normally calculated as the sum of all values divided by the number of values, but in R the mean() function is already present
3 types:
1. Arithmetic mean – mean(x)
2. Geometric mean – prod(x)^(1/length(x))
prod(x) – product of all values of x
^ – power operator
length(x) – number of elements in x
3. Harmonic mean – 1/(mean(1/x))
eg: data_cars <- mtcars
mean(data_cars$wt)
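A small sketch comparing the three means on an arbitrary vector (values chosen only for illustration):
x <- c(1, 2, 4, 8, 16)
mean(x) # arithmetic mean
prod(x)^(1/length(x)) # geometric mean
1/(mean(1/x)) # harmonic mean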
median
Median - The middle value
median()
eg:
x<- c(1,2,3)
median(x)
o/p:
2
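For an even number of values, median() averages the two middle values:
eg:
x <- c(1, 2, 3, 4)
median(x)
o/p:
2.5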
Mode
Mode - The most common value
R doesn’t have a built-in function to calculate the mode; however, we can create our own function to find it
eg:
m <- function(x) {
  t <- table(x)                            # frequency table of the values
  n <- as.numeric(names(t[t == max(t)]))   # value(s) with the highest frequency
  return(n)
}
val<- c(1, 2, 3, 3, 4, 5)
mn <- m(val)
print(mn)
o/p:
3
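Because the function keeps every value whose count equals the maximum, it returns all modes when there is a tie (illustrative example reusing m() from above):
eg:
val2 <- c(1, 1, 2, 2, 3)
m(val2)
o/p:
1 2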
range
Difference between the highest value and the lowest value
The range can be computed by two methods:
1. range() function:
range(vector of values, na.rm = FALSE)
eg: a <- c(1, 2, 3, 4, 10, NaN)
range(a, na.rm=TRUE)
2. max()-min()
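Note (an added clarification): range() returns the minimum and maximum as a pair rather than their difference; wrapping it in diff() gives a single number that matches max() - min():
a <- c(1, 2, 3, 4, 10)
range(a) # 1 10
diff(range(a)) # 9
max(a) - min(a) # 9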
Variability
- Also known as statistical dispersion
- Measures of central tendency and variability together make up descriptive stats
Following are some of the measures of variability that R offers to differentiate between data sets:
1) Variance
2) Standard deviation
3) Range
4) Mean deviation
5) Interquartile range
variance
- Variance is a measure of how far values lie from the mean value
In layman's terms, variance measures how far a set of data (numbers) are spread out from their mean (average) value.
var(x)
-x : numeric vector
eg:
x <- c(1, 2, 3, 4, 5, 6, 7)
var(x)
Output:
4.667
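As a sanity check, var() matches the usual sample-variance formula, the sum of squared deviations from the mean divided by n - 1 (a minimal sketch):
x <- c(1, 2, 3, 4, 5, 6, 7)
sum((x - mean(x))^2) / (length(x) - 1) # same value as var(x)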
Standard deviation
Square root of variance is standard deviation
sd(x)
x- Numeric vector
eg:
x2 <- c(1, 2, 3, 4, 5, 6, 7)
sd(x2)
Output:
2.160247
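A one-line check that sd() is simply the square root of var():
x2 <- c(1, 2, 3, 4, 5, 6, 7)
sqrt(var(x2)) # equals sd(x2)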
interquartile and quartile dev
interquartile
-Difference of third and first quartiles
IQR(x)
quartile deviation
- Half of the interquartile range (IQR divided by 2)
IQR(x)/2
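A small sketch computing both on an arbitrary vector (values chosen only for illustration):
x <- c(2, 4, 6, 8, 10, 12, 14)
IQR(x) # interquartile range: third quartile minus first quartile
IQR(x)/2 # quartile deviation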
Correlation in R
A statistical measure that indicates how strongly two variables are related; it can also involve the relationship between multiple variables
- Correlation generally lies between -1 and +1
Pearson correlation testing in R
1. Pearson correlation coefficient implementation in R:
cor()- Computes correlation coefficient
cor(x, y, method = "pearson")
where:
x, y: numeric vectors with the same length
method: correlation method
eg:
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)
result = cor(x, y, method = "pearson")
cat("Pearson correlation coefficient is:", result)
o/p:
Pearson correlation coefficient is: 0.5357143
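cor() also accepts a whole data frame or matrix and then returns pairwise correlations between all the variables (a sketch using three mtcars columns; the column choice is arbitrary):
Data_Cars <- mtcars
cor(Data_Cars[, c("mpg", "hp", "wt")])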
cor.test(): Computes the test for correlation between paired samples
cor.test(x, y, method = "pearson")
eg:
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)
result = cor.test(x, y, method = "pearson")
print(result)
o/p:
Pearson's product-moment correlation
data: x and y
t = 1.4186, df = 5, p-value = 0.2152
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3643187 0.9183058
sample estimates:
cor
0.5357143
covariance
- Covariance measures the relationship between two random variables
- Like correlation, it measures the linear dependency between a pair of random variables (bivariate data)
cov(x, y, method)
where:
x, y- Represents data vectors
method- Type of method to be used to compute covariance (Default is Pearson)
eg:
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
print(cov(x, y))
print(cov(x, y, method = "pearson"))
print(cov(x, y, method = "kendall"))
print(cov(x, y, method = "spearman"))
**Output:**
[1] 30.66667
[1] 30.66667
[1] 12
[1] 1.666667
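A quick check of the link between the two measures: correlation is just the covariance scaled by the two standard deviations (sketch reusing the vectors above):
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
cov(x, y) / (sd(x) * sd(y)) # equals cor(x, y)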