4. descriptive statistics Flashcards
stats + graphs
Descriptive statistics
Measures of central tendency and variability together make up descriptive statistics
statistics
The science of analysing, reviewing and drawing conclusions from data
Some basic statistical numbers include:
Mean, median and mode
Minimum and maximum value
Percentiles
Variance and Standard Deviation
Covariance and Correlation
Probability distributions
dataset
Collection of data, often presented in a table
eg: mtcars
1. dataset_name: Prints the data set
2. ?dataset_name: Gives complete information about the data set in the help window
3. dim(): Finds the dimensions of the data set (rows and columns)
4. names(): Shows the names of the variables in the data set
5. rownames(): Gives the name of each row (for mtcars, the car names shown in the first column)
6. dataset_name$variname: Prints all values that belong to a variable
7. sort(): Sorts the values
8. summary(): Gives the statistical summary of the whole data set
- It gives six statistical numbers for each numeric variable:
1. Min
2. First quartile (25th percentile)
3. Median
4. Mean
5. Third quartile (75th percentile)
6. Max
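The inspection functions listed above can be tried on the built-in mtcars data set. This is a minimal sketch; the variable name Data_Cars follows the later cards, and dims is just an illustrative name.

```r
# Inspect the built-in mtcars data set with the functions listed above
Data_Cars <- mtcars

dims <- dim(Data_Cars)            # number of rows and columns
print(dims)
print(names(Data_Cars))           # variable (column) names
print(head(rownames(Data_Cars)))  # first few row names (car models)
print(sort(Data_Cars$hp))         # horsepower values in ascending order
print(summary(Data_Cars$hp))      # Min, 1st Qu., Median, Mean, 3rd Qu., Max
```

dim() reports 32 rows and 11 columns for mtcars.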
Min and Max
min() and max() are built-in functions in R that give the lowest and highest values in a data set
- max(dataset_name$variname)
eg: Data_Cars <- mtcars
max(Data_Cars$hp)
min(Data_Cars$hp)
- To find the index position of the Min and Max values in the table:
**- which.max() - which.min()**
Combine which.max() and which.min() with the rownames() function to get the name of the car with the largest and smallest horsepower:
Data_Cars <- mtcars
rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]
Outliers
Max and min can also be used to detect outliers. An outlier is a data point that differs markedly from the rest of the observations.
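Looking at max() and min() alone is an informal check. One common convention, used here as an assumption rather than something the card specifies, is the 1.5 * IQR rule: flag any value more than 1.5 interquartile ranges below the first quartile or above the third. A sketch on mtcars horsepower (the names lower, upper and outlier_cars are illustrative):

```r
Data_Cars <- mtcars
hp <- Data_Cars$hp

# Quartiles and interquartile range of horsepower
q1  <- quantile(hp, 0.25, names = FALSE)
q3  <- quantile(hp, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: values beyond 1.5 * IQR from the quartiles are flagged (assumed rule)
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

is_outlier   <- hp < lower | hp > upper
outlier_cars <- rownames(Data_Cars)[is_outlier]
print(outlier_cars)
```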
central tendencies
- mean()
- median()
- mode
mean
The average value
It is normally calculated as the sum of the values divided by the number of values, but R already provides the mean() function
3 types:
1. Arithmetic mean - mean(x)
2. Geometric mean - prod(x)^(1/length(x))
prod(x) - product of all values of x
^ - power sign
length(x) - number of elements in x
3. Harmonic mean - 1/(mean(1/x))
eg: data_cars <- mtcars
mean(data_cars$wt)
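The three formulas above can be compared on one small vector. This is a sketch with made-up values, chosen so the results are easy to check by hand.

```r
x <- c(1, 2, 4)

arithmetic_mean <- mean(x)                  # (1 + 2 + 4) / 3
geometric_mean  <- prod(x)^(1 / length(x))  # cube root of 1 * 2 * 4
harmonic_mean   <- 1 / mean(1 / x)          # reciprocal of the mean reciprocal

print(c(arithmetic_mean, geometric_mean, harmonic_mean))
```

For positive values the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean.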
median
Median - The middle value
median()
eg:
x<- c(1,2,3)
median(x)
o/p:
2
Mode
Mode - The most common value
R has no built-in function to calculate the statistical mode (the built-in mode() returns an object's storage mode instead); however, we can write our own function to find it
eg:
m<- function(x) {
t <- table(x)
n <- as.numeric(names(t[t==max(t)]))
return(n)
}
val<- c(1, 2, 3, 3, 4, 5)
mn <- m(val)
print(mn)
o/p:
3
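A data set can have more than one mode. The table-based approach above returns every tied value, as this sketch shows (stat_mode is a hypothetical name for the same function body):

```r
# Same table-based approach as the card above, under a descriptive (hypothetical) name
stat_mode <- function(x) {
  counts <- table(x)                                # frequency of each distinct value
  as.numeric(names(counts[counts == max(counts)]))  # value(s) with the highest count
}

single <- stat_mode(c(1, 2, 3, 3, 4, 5))  # one mode
tied   <- stat_mode(c(1, 1, 2, 2, 3))     # two values share the highest count
print(single)
print(tied)
```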
range
Difference between the highest value and the lowest value
The range can be found by two methods:
1. range() function (returns the minimum and the maximum; take their difference with diff()):
range(x, na.rm = FALSE)
eg: a <- c(1, 2, 3, 4, 10, NaN)
range(a, na.rm=TRUE)
2. max(x) - min(x)
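Both methods can be checked side by side. A short sketch using the vector from the card (spread1 and spread2 are illustrative names):

```r
a <- c(1, 2, 3, 4, 10, NaN)

r <- range(a, na.rm = TRUE)   # minimum and maximum, ignoring the NaN
print(r)

spread1 <- diff(r)                                       # method 1: difference of range()
spread2 <- max(a, na.rm = TRUE) - min(a, na.rm = TRUE)   # method 2: max() - min()
print(c(spread1, spread2))
```

Without na.rm = TRUE the NaN propagates, and range(a) returns NaN for both ends.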
Variability
- Also known as statistical dispersion
- Measures of central tendency and variability together make up descriptive stats
Following are some of the measures of variability that R offers to differentiate between data sets:
1) Variance
2) Standard deviation
3) range
4) Mean deviation
5) interquartile range
variance
- Variance is a measure of how far values are from the mean value
In layman's terms, variance is a measure of how far a set of data (numbers) is spread out from its mean (average) value.
var(x)
-x : numeric vector
eg:
x <- c(1, 2, 3, 4, 5, 6, 7)
var(x)
Output:
4.667
Standard deviation
Square root of variance is standard deviation
sd(x)
x- Numeric vector
eg:
x2 <- c(1, 2, 3, 4, 5, 6, 7)
sd(x2)
Output:
2.160
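var() and sd() in R use the sample formulas, dividing by n - 1 rather than n. A sketch that recomputes both by hand for the same vector (manual_var and manual_sd are illustrative names):

```r
x <- c(1, 2, 3, 4, 5, 6, 7)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

print(c(manual_var, var(x)))
print(c(manual_sd, sd(x)))
```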
Interquartile range and quartile deviation
Interquartile range
- Difference between the third and first quartiles
IQR(x)
Quartile deviation
- The interquartile range divided by 2
IQR(x)/2
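A small sketch of both quantities, using the vector from the variance card. The quartiles come from quantile() with R's default settings.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)

q <- quantile(x, c(0.25, 0.75), names = FALSE)  # first and third quartiles
iqr_manual    <- q[2] - q[1]                    # third quartile minus first quartile
quartile_dev  <- IQR(x) / 2                     # quartile deviation

print(c(iqr_manual, IQR(x), quartile_dev))
```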
Correlation in R
Statistical measure that indicates how strongly two variables are related; it can also describe relationships among multiple variables
- The correlation coefficient lies between -1 and +1
Pearson correlation testing in R
1. Pearson correlation coefficient implementation in R:
cor() - Computes the correlation coefficient
cor(x, y, method = "pearson")
where:
x, y: numeric vectors with the same length
method: correlation method
eg:
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)
result = cor(x, y, method = "pearson")
cat("Pearson correlation coefficient is:", result)
o/p:
Pearson correlation coefficient is: 0.5357143
cor.test() - Computes the test for correlation between paired samples
cor.test(x, y, method = "pearson")
eg:
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)
result = cor.test(x, y, method = "pearson")
print(result)
o/p:
Pearson’s product-moment correlation
data: x and y
t = 1.4186, df = 5, p-value = 0.2152
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3643187 0.9183058
sample estimates:
cor
0.5357143
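The object returned by cor.test() is a list, so the individual numbers in the printed output can be pulled out by name. A sketch using the same data (r, p and ci are illustrative names):

```r
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(1, 3, 6, 2, 7, 4, 5)

result <- cor.test(x, y, method = "pearson")

r  <- unname(result$estimate)  # correlation coefficient
p  <- result$p.value           # p-value of the test
ci <- result$conf.int          # 95 percent confidence interval

print(r)
print(p)
print(ci)
```

Here the p-value is above 0.05, so at the 5% level the test does not reject the hypothesis that the true correlation is 0.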
covariance
- It measures the relationship between two random variables
- Like correlation, it measures the linear dependency between a pair of random variables (bivariate data)
cov(x, y, method)
where:
x, y- Represents data vectors
method- Type of method to be used to compute covariance (Default is Pearson)
eg:
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
print(cov(x, y))
print(cov(x, y, method = "pearson"))
print(cov(x, y, method = "kendall"))
print(cov(x, y, method = "spearman"))
**Output: **
[1] 30.66667
[1] 30.66667
[1] 12
[1] 1.666667
Conversion of covariance into correlation in R
cov2cor(x)
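x - a square covariance matrix
eg:
x <- rnorm(2)
y <- rnorm(2)
X <- cov(cbind(x, y))      # covariance matrix of x and y
print(X)
print(cor(cbind(x, y)))    # correlation matrix computed directly
print(cov2cor(X))          # same result, converted from the covariance matrix
**Output (values vary between runs, since rnorm() is random): **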
x y
x 0.0742700 -0.1268199
y -0.1268199 0.2165516
x y
x 1 -1
y -1 1
x y
x 1 -1
y -1 1
Difference between covariance and correlation
Correlation describes the intensity and direction of the linear link between two variables, whereas covariance shows how much two variables vary together.
https://www.geeksforgeeks.org/covariance-and-correlation-in-r-programming/
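The difference shows up when the units of a variable change: covariance scales with the data, while correlation is covariance divided by both standard deviations and so stays the same. A sketch using the vectors from the covariance card (r_from_cov is an illustrative name):

```r
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)

# Correlation is covariance rescaled by the two standard deviations
r_from_cov <- cov(x, y) / (sd(x) * sd(y))
print(c(r_from_cov, cor(x, y)))

# Multiplying x by 10 multiplies the covariance by 10, but leaves the correlation unchanged
print(c(cov(x, y), cov(10 * x, y)))
print(c(cor(x, y), cor(10 * x, y)))
```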
Data visualisation
- In R, we can create visually appealing data visualisations by writing a few lines of code
- By using data visualisation techniques, we can work with large datasets and efficiently obtain key insights about them.
R visualisation packages
- plotly
- ggplot2
- tidyquant
- taucharts
- ggiraph
- geofacet
- googleVis
- RColorBrewer
- dygraphs
- shiny
R graphics
Graphics are used to examine marginal distributions, relationships between variables, and summary of very large data.
standard graphics:
1. scatterplots
2. pie charts
3. line chart
4. barplots
5. Histogram
Key elements of a statistical graphic
1. Data - the data that is processed to generate the output.
2. Aesthetic mappings - control the relation between graphics variables and data variables. In a scatter plot, for example, they map the temperature variable of a data set to the x variable.
3. Geometric objects - express each observation with a shape such as a point, using the aesthetic mappings. Two variables in the data set are mapped to the x and y variables of the plot.
4. Statistical transformations - calculate statistics from the data for the plot, for example approximating the data with a regression line over the x and y coordinates, or counting occurrences of certain values.
5. Scales - map the data values to values in the coordinate system of the graphics device.
6. Coordinate systems - play an important role in how the data is plotted.
* Cartesian
* Polar
7. Faceting - splits the data into subgroups and draws a sub-graph for each group.
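These elements map closely onto the layers of a ggplot2 plot. The sketch below assumes the ggplot2 package is installed and uses mtcars. The choice of variables and the axis label are illustrative, not taken from the cards.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +         # data + aesthetic mappings
  geom_point() +                                    # geometric object: one point per car
  geom_smooth(method = "lm", formula = y ~ x) +     # statistical transformation: regression line
  scale_x_continuous(name = "Weight (1000 lbs)") +  # scale for the x-axis
  coord_cartesian() +                               # Cartesian coordinate system
  facet_wrap(~ cyl)                                 # faceting: one panel per cylinder count

print(p)
```

Swapping coord_cartesian() for coord_polar() would draw the same layers in a polar coordinate system.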
Advantages and disadvantages of data visualisation in R
Advantages:
1. Easy to understand and analyse
2. More efficient, as it allows a lot of information to be displayed in a small space
3. Features such as geographic maps and GIS are useful to businesses when location is a relevant factor, because maps can show business insights from various locations
Disadvantages:
1. It costs more
2. It creates distraction: data visualisation apps can produce highly complex, graphics-rich reports and charts, which may entice users to focus more on the form than on the function
pie chart
- Circular chart which is divided into different segments according to the ratio of data
- The total value of the pie is 100%, and each segment shows its fraction of the whole pie
pie(x, labels, col, main, radius)
x- Data vector
labels- Names for each slice
col- Colour for each slice
main - title of the pie chart
radius - radius of the pie chart, a value between -1 and +1
eg:
e <- c(1, 2, 3, 4, 5)
pie(e, labels = c(6, 7, 8, 9, 10), main = "pie")
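Since each slice is a fraction of the whole, it often helps to put the percentage in the label. A sketch with made-up values and hypothetical slice names, using the col and radius arguments described above:

```r
x <- c(10, 20, 30, 40)
slice_names <- c("A", "B", "C", "D")   # hypothetical slice names

# Share of each slice as a percentage of the total
pct <- round(100 * x / sum(x))
slice_labels <- paste0(slice_names, " (", pct, "%)")

pie(x,
    labels = slice_labels,
    col = rainbow(length(x)),          # one colour per slice
    main = "Share of each slice",
    radius = 0.8)
```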
Bar chart
- Bar chart uses rectangular bars to visualise data
- Bar charts can be displayed horizontally or vertically
- The length or height of the bars is proportional to the values they represent
barplot(height, xlab, ylab, main, names.arg)
- height: data vector to be represented on the y-axis
- xlab: label given to the x-axis
- ylab: label given to the y-axis
- names.arg: names of each observation on the x-axis
- main: title of the bar chart
eg:
library(ggplot2)
data <- data.frame(
Category = c("A", "B", "C", "D", "E"),
Value = c(10, 20, 15, 25, 30)
)
ggplot(data, aes(x = Category, y = Value)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Bar Graph Example",
x = "Category",
y = "Value")
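The example above uses ggplot2. The same chart can be drawn with the base barplot() function and the parameters listed on this card. A sketch using the same values (mids is an illustrative name):

```r
values <- c(10, 20, 15, 25, 30)

# barplot() draws the chart and returns the x positions of the bar midpoints
mids <- barplot(values,
                names.arg = c("A", "B", "C", "D", "E"),
                xlab = "Category",
                ylab = "Value",
                main = "Bar Chart Example",
                col = "skyblue")
print(mids)
```

Adding horiz = TRUE draws the bars horizontally instead of vertically.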
scatterplots
- A “scatter plot” is a type of plot used to display the relationship between two numerical variables, and plots one dot for each observation.
- It needs two vectors of same length, one for the x-axis (horizontal) and one for the y-axis (vertical)
syntax:
plot(x, y, type, xlab, ylab, main)
x - data on the x-axis
y - data on the y-axis
type - specifies which type of plot is drawn
eg: "l" - lines, "p" - points, …
xlab - label for the x-axis
ylab - label for the y-axis
main - title of the plot
eg:
temperature <- c(65, 70, 75, 80, 85, 90)
ice_cream_sold <- c(100, 120, 140, 160, 180, 200)
plot(temperature, ice_cream_sold,
xlab = "Temperature (Fahrenheit)",
ylab = "Ice Cream Cones Sold",
main = "Temperature vs Ice Cream Sales",
col = "blue",
pch = 16)
abline(lm(ice_cream_sold ~ temperature), col = "red")
Histogram
- A type of bar chart that shows how frequently values fall into each of a set of value ranges (bins)
- A histogram is used to show a distribution, whereas a bar chart is used to compare different entities.
- The height of each bar represents the number of values present in the given range.
hist(x, other parameters)
parameters:
x - vector that contains numerical values
main - title of the chart
col - colour of the bars
border - border colour of each bar
xlab - describes the x-axis
ylab - describes the y-axis
xlim - specifies the range of values on the x-axis
ylim - specifies the range of values on the y-axis
eg:
data <- c(65, 70, 75, 80, 85, 90)
hist(data, main = "Histogram of Example Data", xlab = "Value", ylab = "Frequency", col = "lightblue",
border = "black")
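hist() also returns the bins it computed, which makes it possible to see exactly which ranges the bars stand for. A sketch using the same data with plot = FALSE, so nothing is drawn (h is an illustrative name):

```r
data <- c(65, 70, 75, 80, 85, 90)

h <- hist(data, plot = FALSE)  # compute the bins without drawing

print(h$breaks)  # bin boundaries chosen by R
print(h$counts)  # number of values falling in each bin
```

The counts always add up to the number of values, and there is one more break than there are bins.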
line graph
- Used in exploratory data analysis to check data trends by observing the pattern of the line
- Line graphs are often used for time series data analysis
plot(x, y, type, col, main, xlab, ylab)
eg:
data <- c(65, 70, 75, 80, 85, 90)
plot(data, type = "l", col = "blue", main = "Line Graph of Example Data", xlab = "Index", ylab = "Value")