4. descriptive statistics Flashcards

stats + graphs

1
Q

Descriptive statistics

A

Measures of central tendency and measures of variability together make up descriptive statistics

2
Q

statistics

A

The science of analysing, reviewing, and drawing conclusions from data
Some basic statistical measures include:

Mean, median and mode
Minimum and maximum value
Percentiles
Variance and Standard Deviation
Covariance and Correlation
Probability distributions

3
Q

dataset

A

Collection of data, often presented in a table
eg: mtcars
1. dataset_name: Prints the data set
2. ?dataset_name: Gives the complete information about the data set in the help window
3. dim(): Finds the dimensions (rows and columns) of the data set
4. names(): Shows the names of the variables in the data set
5. rownames(): Gives the name of each row (shown in the first column of the printed table)
6. dataset_name$variable: Prints all values that belong to a variable
7. sort(): Sorts the values
8. summary(): Gives the statistical summary of the whole data set
- It gives six statistical numbers for each numeric variable:
1. Min
2. First quartile (25th percentile)
3. Median
4. Mean
5. Third quartile (75th percentile)
6. Max
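
The inspection functions listed on this card can be tried together on mtcars. This is only a minimal sketch: the variable names (dims, vars, cars, hp_sorted) are made up for illustration, and mtcars is assumed to be the copy that ships with base R.

```r
# mtcars ships with base R (datasets package), so nothing needs installing.
Data_Cars <- mtcars

dims <- dim(Data_Cars)           # number of rows, number of columns
vars <- names(Data_Cars)         # variable (column) names
cars <- rownames(Data_Cars)      # row names, here the car models
hp_sorted <- sort(Data_Cars$hp)  # one variable, sorted in ascending order

print(dims)
print(vars)
print(head(cars, 3))
print(summary(Data_Cars$hp))     # Min, 1st Qu., Median, Mean, 3rd Qu., Max
```

Calling summary(Data_Cars) instead prints the same six numbers for every variable at once.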

4
Q

Min and Max

A

min() and max() are built-in functions in R that give the lowest and highest values in a data set
- max(dataset_name$variable)
eg: Data_Cars <- mtcars
max(Data_Cars$hp)
min(Data_Cars$hp)

  • To find the index position of the min and max values in the table:
    - which.max()
    - which.min()

Combine which.max() and which.min() with the rownames() function to get the name of the car with the largest and smallest horsepower:
Data_Cars <- mtcars
rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]

Outliers
Max and min can also be used to detect outliers. An outlier is a data point that differs markedly from the rest of the observations.
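
One possible way to check whether an extreme value really is an outlier is sketched here. The card does not say which rule to use, so the 1.5 * IQR fences (Tukey's rule) are an assumption, and the variable names are illustrative.

```r
Data_Cars <- mtcars
hp <- Data_Cars$hp

# Extremes and the car that owns the largest value.
hp_max <- max(hp)
hp_min <- min(hp)
car_max <- rownames(Data_Cars)[which.max(hp)]

# Assumption: flag values more than 1.5 * IQR beyond the quartiles (Tukey's rule).
q <- quantile(hp, c(0.25, 0.75))
upper <- q[[2]] + 1.5 * IQR(hp)
lower <- q[[1]] - 1.5 * IQR(hp)
outliers <- hp[hp > upper | hp < lower]

print(c(min = hp_min, max = hp_max))
print(car_max)
print(outliers)
```

Under this rule only the maximum horsepower is flagged; the minimum lies inside the fences.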

5
Q

central tendencies

A
  1. mean()
  2. median()
  3. mode
6
Q

mean

A

The average value
It is normally calculated as the sum of the values divided by the number of values, but R already provides the mean() function
3 types:
1. Arithmetic mean: mean(x)
2. Geometric mean: prod(x)^(1/length(x))
prod(x): product of all values of x
^: power sign
length(x): number of elements in x
3. Harmonic mean: 1/(mean(1/x))

eg: data_cars <- mtcars
mean(data_cars$wt)
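
The three formulas above can be compared on one small vector. This is a sketch with made-up data; the exp(mean(log(x))) form of the geometric mean is not on the card and is included only as an equivalent that avoids very large products.

```r
x <- c(1, 2, 4, 8)  # illustrative positive values

arithmetic <- mean(x)                     # sum(x) / length(x)
geometric  <- prod(x)^(1 / length(x))     # n-th root of the product
geometric2 <- exp(mean(log(x)))           # same value, safer for long vectors
harmonic   <- 1 / mean(1 / x)             # reciprocal of the mean reciprocal

print(c(arithmetic = arithmetic, geometric = geometric, harmonic = harmonic))
```

For positive data the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean.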

7
Q

median

A

Median - The middle value
median()
eg:
x<- c(1,2,3)
median(x)
o/p:
2
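
A short sketch of two cases the example above does not cover: an even number of values, where median() averages the two middle values, and missing values, which need na.rm = TRUE. The vectors are made up for illustration.

```r
odd     <- c(7, 1, 3)      # sorted: 1 3 7, middle value is 3
even    <- c(4, 1, 3, 2)   # sorted: 1 2 3 4, middle values are 2 and 3
with_na <- c(1, NA, 3)

m_odd  <- median(odd)
m_even <- median(even)                  # mean of the two middle values
m_na   <- median(with_na, na.rm = TRUE) # drop NA before computing

print(c(m_odd, m_even, m_na))
```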

8
Q

Mode

A

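Mode - The most common value
R has no built-in function for the statistical mode (the built-in mode() returns an object's storage type), so we can write our own function to find it
eg:
# Return the most frequent value(s) of a numeric vector
m <- function(x) {
  t <- table(x)                           # frequency of each distinct value
  n <- as.numeric(names(t[t == max(t)]))  # value(s) with the highest count
  return(n)
}
val <- c(1, 2, 3, 3, 4, 5)
mn <- m(val)
print(mn)
o/p:
3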

9
Q

range

A

Difference between the highest value and the lowest value
The range can be found by two methods:
1. range() function: returns the minimum and the maximum (take their difference to get a single number)
range(vector of values, na.rm = FALSE)
eg: a <- c(1, 2, 3, 4, 10, NaN)
range(a, na.rm = TRUE)
2. max() - min()
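
Both methods can be checked against each other on the same vector. This is a minimal sketch; diff() is not mentioned on the card and is used here only to turn the (min, max) pair into a single number.

```r
a <- c(1, 2, 3, 4, 10, NaN)

lo_hi   <- range(a, na.rm = TRUE)                      # minimum and maximum
spread1 <- diff(lo_hi)                                 # method 1: max - min via range()
spread2 <- max(a, na.rm = TRUE) - min(a, na.rm = TRUE) # method 2: max() - min()

print(lo_hi)
print(c(spread1, spread2))
```

Without na.rm = TRUE the NaN propagates and both methods fail to return a usable number.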

10
Q

Variability

A
  1. Also known as statistical dispersion
  2. Measures of central tendency and variability together make up descriptive statistics
    The following are some of the measures of variability that R offers to differentiate between data sets:
    1) Variance
    2) Standard deviation
    3) range
    4) Mean deviation
    5) interquartile range
11
Q

variance

A

- Variance is a measure of how far values lie from the mean value
In layman's terms, variance is a measure of how far a set of data (numbers) is spread out from its mean (average) value.
var(x)
- x: numeric vector
eg:
x <- c(1, 2, 3, 4, 5, 6, 7)
var(x)
Output:
4.666667
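
To see where that number comes from, the variance can be recomputed by hand. This is a sketch: note that var() gives the sample variance, dividing by n - 1 rather than n.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)

deviations <- x - mean(x)                          # distance of each value from the mean
var_manual <- sum(deviations^2) / (length(x) - 1)  # sample variance (n - 1 denominator)
var_builtin <- var(x)
sd_from_var <- sqrt(var_builtin)                   # standard deviation is its square root

print(c(var_manual, var_builtin, sd_from_var))
```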

12
Q

Standard deviation

A

The square root of the variance is the standard deviation
sd(x)
x: numeric vector
eg:
x2 <- c(1, 2, 3, 4, 5, 6, 7)
sd(x2)
Output:
2.160247

13
Q

interquartile and quartile dev

A

Interquartile range
- Difference between the third and first quartiles
IQR(x)
Quartile deviation
- The interquartile range divided by 2
IQR(x)/2
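
A small sketch of both quantities, using quantile() to show the quartiles that IQR() subtracts. The data are illustrative, and R's default quantile method is assumed.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)

q <- quantile(x, c(0.25, 0.75))    # first and third quartiles
iqr_manual <- q[[2]] - q[[1]]      # third quartile minus first quartile
iqr_builtin <- IQR(x)
quartile_dev <- IQR(x) / 2         # quartile deviation (semi-interquartile range)

print(q)
print(c(iqr_builtin, quartile_dev))
```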

14
Q

Correlation in R

A

Statistical measure that indicates how strongly two variables are related; it can also describe relationships among multiple variables
- A correlation coefficient always lies between -1 and +1
Pearson correlation testing in R
1. Pearson correlation coefficient implementation in R:
cor(): Computes the correlation coefficient
cor(x, y, method = "pearson")
where:
x, y: numeric vectors with the same length
method: correlation method
eg:
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)
result = cor(x, y, method = "pearson")
cat("Pearson correlation coefficient is:", result)
o/p:
Pearson correlation coefficient is: 0.5357143

2. cor.test(): Computes a test of association between paired samples
cor.test(x, y, method = "pearson")
eg:
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)
result = cor.test(x, y, method = "pearson")
print(result)
o/p:
Pearson's product-moment correlation
data: x and y
t = 1.4186, df = 5, p-value = 0.2152
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3643187 0.9183058
sample estimates:
cor
0.5357143
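
The coefficient reported above can be reproduced from the deviations about the means. This is a sketch of the textbook Pearson formula applied to the same x and y; the helper names dx, dy, r_manual and p_value are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(1, 3, 6, 2, 7, 4, 5)

dx <- x - mean(x)   # deviations of x from its mean
dy <- y - mean(y)   # deviations of y from its mean

# Pearson r: sum of cross-products over the root of the product of sums of squares.
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(x, y, method = "pearson")

# cor.test() returns a list; individual pieces can be pulled out by name.
p_value <- cor.test(x, y, method = "pearson")$p.value

print(c(r_manual, r_builtin))
print(p_value)
```

Here the cross-products sum to 15 and both sums of squares are 28, so r = 15/28.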

15
Q

covariance

A
  1. It measures the relationship between two random variables: how much they vary together
  2. Like correlation, it measures the linear dependency between a pair of random variables or bivariate data
    cov(x, y, method)
    where:
    x, y: data vectors
    method: type of method used to compute the covariance (default is "pearson")
    eg:
    x <- c(1, 3, 5, 10)
    y <- c(2, 4, 6, 20)
    print(cov(x, y))
    print(cov(x, y, method = "pearson"))
    print(cov(x, y, method = "kendall"))
    print(cov(x, y, method = "spearman"))
    Output:
    [1] 30.66667
    [1] 30.66667
    [1] 12
    [1] 1.666667
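
As a rough check on those outputs, the default (Pearson) covariance can be computed from its definition, and the Spearman value reproduced as the covariance of the ranks. This is a sketch using the same x and y; the variable names are illustrative.

```r
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)

# Sample covariance: mean cross-product of deviations, with an n - 1 denominator.
cov_manual  <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
cov_builtin <- cov(x, y)

# Spearman covariance is the ordinary covariance of the ranks.
cov_ranks    <- cov(rank(x), rank(y))
cov_spearman <- cov(x, y, method = "spearman")

print(c(cov_manual, cov_builtin))
print(c(cov_ranks, cov_spearman))
```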
16
Q

Conversion of covariance into correlation in R

A

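cov2cor(x)
x: a square covariance matrix
eg:
x <- rnorm(2)
y <- rnorm(2)
X <- cov(cbind(x, y))    # 2 x 2 covariance matrix
print(X)
print(cor(cbind(x, y)))  # correlation computed directly
print(cov2cor(X))        # same result, converted from the covariance matrix
Output (values vary between runs because rnorm() draws random numbers; with only two points the correlation is always exactly 1 or -1):
           x          y
x  0.0742700 -0.1268199
y -0.1268199  0.2165516

   x  y
x  1 -1
y -1  1

   x  y
x  1 -1
y -1  1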
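
A reproducible variant of the same idea, sketched on three mtcars columns instead of random numbers. The choice of columns is arbitrary, and the manual rescaling line is included only to show what the conversion does.

```r
df <- mtcars[, c("mpg", "wt", "hp")]  # arbitrary numeric columns for illustration

V <- cov(df)               # square covariance matrix
R_from_cov <- cov2cor(V)   # rescaled to a correlation matrix
R_direct   <- cor(df)      # correlation computed directly

# Same conversion by hand: divide each entry by the two standard deviations.
s <- sqrt(diag(V))
R_manual <- V / outer(s, s)

print(round(R_from_cov, 3))
```

The diagonal of the result is always 1, since every variable is perfectly correlated with itself.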

17
Q

Difference between covariance and correlation

A

Correlation describes the intensity and direction of the linear link between two variables, whereas covariance shows how much two variables vary together.

https://www.geeksforgeeks.org/covariance-and-correlation-in-r-programming/
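
One way to see the difference in practice is to change the units of one variable: the covariance changes with the scale, while the correlation does not. This is a sketch with illustrative data; the factor of 10 is arbitrary.

```r
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)

cov_xy <- cov(x, y)
cor_xy <- cor(x, y)

# Rescale x (for example, a change of units).
cov_scaled <- cov(10 * x, y)   # covariance is multiplied by 10
cor_scaled <- cor(10 * x, y)   # correlation is unchanged

# Correlation is covariance standardised by both standard deviations.
cor_from_cov <- cov_xy / (sd(x) * sd(y))

print(c(cov_xy, cov_scaled))
print(c(cor_xy, cor_scaled, cor_from_cov))
```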

18
Q

Data visualisation

A
  1. In R we can create visually appealing data visualisations by writing a few lines of code
  2. By using data visualisation techniques, we can work with large data sets and efficiently obtain key insights from them.
19
Q

R visualisation packages

A
  1. plotly
  2. ggplot2
  3. tidyquant
  4. taucharts
  5. ggiraph
  6. geofacet
  7. googleVis
  8. RColorBrewer
  9. dygraphs
  10. shiny
20
Q

R graphics

A

Graphics are used to examine marginal distributions, relationships between variables, and summary of very large data.
standard graphics:
1. scatterplots
2. pie charts
3. line chart
4. barplots
5. Histogram

21
Q

Key elements of a statistical graphic

A
  1. Data - the input that is processed to generate the output.
  2. Aesthetic mappings - control the relation between graphics variables and data variables. In a scatter plot, for example, they map the temperature variable of a data set to the x variable.
  3. Geometric objects - express each observation by a point (or other shape) using the aesthetic mappings, so that two variables in the data set become the x and y variables of the plot.
  4. Statistical transformations - allow us to compute statistical summaries of the data in the plot, such as approximating the data with a regression line in x and y coordinates or counting occurrences of certain values.
  5. Scales - used to map the data values to values in the coordinate system of the graphics device.
  6. Coordinate systems - the coordinate system plays an important role in the plotting of the data.
    * Cartesian
    * Polar
  7. Faceting - used to split the data into subgroups and draw sub-graphs for each group.
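
A rough sketch of how these elements might appear in a single ggplot2 call. It assumes the ggplot2 package is installed; the choice of mtcars columns and the axis label are illustrative, not taken from the card.

```r
library(ggplot2)  # assumption: ggplot2 is installed

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                 # data + aesthetic mappings
  geom_point() +                                            # geometric object: one point per car
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistical transformation: regression line
  scale_x_continuous(name = "Weight (1000 lbs)") +          # scale for the x axis
  coord_cartesian() +                                       # coordinate system
  facet_wrap(~ cyl)                                         # faceting: one panel per cylinder count

print(p)
```

Each line after the first adds one element, so removing a line shows what that element contributes to the plot.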
22
Q

Advantages and disadvantages of data visualisation in R

A

adv:
1. Easy to understand and analyse
2. More efficient, as a lot of information can be displayed in a small space
3. Location-based features such as geographic maps and GIS can be very relevant to wider businesses when location is an important factor, since maps can show business insights from various locations

disadv:
1. It costs more
2. It creates distraction: data visualisation apps can create highly complex, graphics-rich reports and charts, which may entice users to focus more on the form than the function

23
Q

pie chart

A
  1. Circular chart divided into segments according to the ratio of the data
  2. The whole pie represents 100%, and each segment shows its fraction of the whole
    pie(x, labels, col, main, radius)
    x: data vector
    labels: names for each slice
    col: colour for each slice
    main: title of the pie chart
    radius: radius of the pie chart, between -1 and +1
    eg:
    expenses <- c(1, 2, 3, 4, 5)
    pie(expenses, labels = c(6, 7, 8, 9, 10), main = "pie")
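
A slightly fuller sketch that uses every parameter listed above and labels each slice with its share of the whole. The expense figures are hypothetical and chosen only so the percentages are easy to read.

```r
# Hypothetical monthly expenses, purely illustrative.
expenses <- c(Rent = 50, Food = 30, Travel = 15, Other = 5)

# Each slice's share of the whole pie, as a percentage.
pct <- round(100 * expenses / sum(expenses))
slice_labels <- paste0(names(expenses), " (", pct, "%)")

pie(expenses,
    labels = slice_labels,
    col = rainbow(length(expenses)),
    main = "Share of expenses",
    radius = 0.9)
```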
24
Q

Bar chart

A
  1. A bar chart uses rectangular bars to visualise data
  2. Bar charts can be displayed horizontally or vertically
  3. The length or height of the bars is proportional to the values they represent
    barplot(data, xlab, ylab, names.arg, main)
    - data: data vector (the bar heights) represented on the y axis
    - xlab: label given to the x axis
    - ylab: label given to the y axis
    - names.arg: names of each observation on the x axis
    - main: title of the bar chart

eg (using the ggplot2 package rather than barplot()):
library(ggplot2)
data <- data.frame(
  Category = c("A", "B", "C", "D", "E"),
  Value = c(10, 20, 15, 25, 30)
)
ggplot(data, aes(x = Category, y = Value)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Bar Graph Example",
       x = "Category",
       y = "Value")
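
The same chart can be sketched with base R's barplot() and the arguments listed on this card, which needs no extra package. The horiz = TRUE call is an addition to show the horizontal layout mentioned in point 2.

```r
values <- c(10, 20, 15, 25, 30)
categories <- c("A", "B", "C", "D", "E")

# Vertical bars; barplot() returns the midpoint of each bar on the category axis.
mids <- barplot(values,
                names.arg = categories,
                xlab = "Category",
                ylab = "Value",
                main = "Bar Graph Example",
                col = "skyblue")

# Horizontal bars: the value axis and category axis swap places.
barplot(values,
        names.arg = categories,
        horiz = TRUE,
        xlab = "Value",
        ylab = "Category",
        main = "Bar Graph Example (horizontal)",
        col = "skyblue")
```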

25
Q

scatterplots

A
  1. A "scatter plot" is a type of plot used to display the relationship between two numerical variables; it plots one dot for each observation.
  2. It needs two vectors of the same length, one for the x-axis (horizontal) and one for the y-axis (vertical)
    syn:
    plot(x, y, type, xlab, ylab, main)
    x: data on the x axis
    y: data on the y axis
    type: specifies which type of plot is drawn
    eg: "l" for lines, "p" for points, ...
    xlab: label for the x axis
    ylab: label for the y axis
    main: title of the plot
    eg:
    temperature <- c(65, 70, 75, 80, 85, 90)
    ice_cream_sold <- c(100, 120, 140, 160, 180, 200)
    plot(temperature, ice_cream_sold,
         xlab = "Temperature (Fahrenheit)",
         ylab = "Ice Cream Cones Sold",
         main = "Temperature vs Ice Cream Sales",
         col = "blue",
         pch = 16)

    # Add the fitted regression line
    abline(lm(ice_cream_sold ~ temperature), col = "red")
26
Q

Histogram

A
  1. A type of bar chart that shows how frequently values fall within each of a set of value ranges
  2. The histogram is used for the distribution, whereas a bar chart is used for comparing different entities.
  3. The height of each bar represents the number of values present in the given range.
    hist(x, other parameters)
    parameters:
    x: vector that contains numerical values
    main: title of the chart
    col: colour of the bars
    border: colour of the border of each bar
    xlab: describes the x axis
    ylab: describes the y axis
    xlim: specifies the range of values on the x axis
    ylim: specifies the range of values on the y axis
    eg:
    data <- c(65, 70, 75, 80, 85, 90)
    hist(data, main = "Histogram of Example Data", xlab = "Value", ylab = "Frequency",
         col = "lightblue", border = "black")
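
To make the "number of values in each range" concrete, the ranges can be set explicitly and the counts read back from the object hist() returns. This is a sketch: the breaks argument and the chosen cut points are additions not listed on the card.

```r
data <- c(65, 70, 75, 80, 85, 90)

# Assumption: explicit ranges (60,70], (70,80], (80,90], chosen for illustration.
h <- hist(data,
          breaks = c(60, 70, 80, 90),
          main = "Histogram of Example Data",
          xlab = "Value",
          ylab = "Frequency",
          col = "lightblue",
          border = "black")

print(h$breaks)  # edges of the ranges
print(h$counts)  # number of values falling in each range
```

Each range here contains two of the six values, so all three bars have the same height.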
27
Q

line graph

A
  1. Used in exploratory data analysis to check data trends by observing the pattern of the line
  2. Line graphs are used for time series data analysis
    plot(x, y, type, col, main, xlab, ylab)
    eg:
    data <- c(65, 70, 75, 80, 85, 90)
    plot(data, type = "l", col = "blue", main = "Line Graph of Example Data",
         xlab = "Index", ylab = "Value")