4. descriptive statistics Flashcards by Manasa Madhuri Shika

Descriptive statistics

Measures of central tendency and variability together make up descriptive statistics

statistics

The science of analysing, reviewing and drawing conclusions from data
Some basic statistical numbers include:

Mean, median and mode
Minimum and maximum value
Percentiles
Variance and Standard Deviation
Covariance and Correlation
Probability distributions

dataset

Collection of data, often presented in a table
eg: mtcars
1. dataset_name: Prints the data set
2. ?dataset_name: Gives complete information about the data set in the help window
3. dim(): Finds the dimensions of the data set (number of rows and columns)
4. names(): Shows the names of the variables in the data set
5. rownames(): Gives the name of each row (shown in the first column)
6. dataset_name$variable: Prints all values that belong to a variable
7. sort(): Sorts the values
8. summary(): Gives the statistical summary of the whole data set
- It gives six statistical numbers for each variable:
1. Min
2. First quartile (25th percentile)
3. Median
4. Mean
5. Third quartile (75th percentile)
6. Max
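
The functions listed on this card can be tried together on mtcars. The following is only a minimal sketch; the variable names (car_data, dims, vars, first_rows) are illustrative and not part of the original card.

```r
# Explore the built-in mtcars data set with the functions from this card.
car_data <- mtcars                           # illustrative name for the data set

dims <- dim(car_data)                        # number of rows and columns
vars <- names(car_data)                      # names of the variables (columns)
first_rows <- head(rownames(car_data), 3)    # names of the first three rows

print(dims)
print(vars)
print(first_rows)
print(head(sort(car_data$mpg)))              # the six smallest mpg values
print(summary(car_data$mpg))                 # Min, 1st Qu., Median, Mean, 3rd Qu., Max
```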

Min and Max

min() and max() are built-in functions in R that return the lowest and highest values in a data set
- max(dataset_name$variable)
eg: Data_Cars <- mtcars
max(Data_Cars$hp)
min(Data_Cars$hp)

To find the index position of the min and max values in the table:
- which.max()
- which.min()

Combine which.max() and which.min() with the rownames() function to get the names of the cars with the largest and smallest horsepower:
Data_Cars <- mtcars
rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]

Outliers
Max and min can also be used to detect outliers. An outlier is a data point that differs markedly from the rest of the observations.
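
One possible way to check whether the minimum or maximum is an outlier is sketched below. It assumes the common 1.5 x IQR rule, which is not stated on the original card, and the names fence_low, fence_high and outlier_cars are illustrative.

```r
# Sketch: flag outliers in horsepower with the 1.5 * IQR rule (an assumption,
# not a rule given on the card).
hp <- mtcars$hp

q <- quantile(hp, c(0.25, 0.75))         # first and third quartiles
fence_low  <- q[1] - 1.5 * IQR(hp)       # values below this count as outliers
fence_high <- q[2] + 1.5 * IQR(hp)       # values above this count as outliers

is_outlier <- hp < fence_low | hp > fence_high
outlier_cars <- rownames(mtcars)[is_outlier]

print(c(min = min(hp), max = max(hp)))   # the two extreme values
print(outlier_cars)                      # cars whose hp lies outside the fences
```

If the minimum or maximum falls outside the fences, it is a candidate outlier worth inspecting.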

central tendencies

mean()
median()
mode

mean

The average value
It is normally calculated as the sum of the values divided by the number of values, but R already provides the mean() function
3 types:
1. Arithmetic mean: mean(x)
2. Geometric mean: prod(x)^(1/length(x))
prod(x): product of all values of x
^: power sign
length(x): number of elements in x
3. Harmonic mean: 1/(mean(1/x))

eg: data_cars <- mtcars
mean(data_cars$wt)
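
The three formulas above can be compared on one small vector. This is a minimal sketch with made-up values chosen so that the results are easy to check by hand.

```r
# Compare the arithmetic, geometric and harmonic means of the same values.
x <- c(1, 2, 4)                                # illustrative data

arithmetic_mean <- mean(x)                     # (1 + 2 + 4) / 3
geometric_mean  <- prod(x)^(1 / length(x))     # cube root of 1 * 2 * 4 = 8
harmonic_mean   <- 1 / mean(1 / x)             # 3 / (1 + 0.5 + 0.25)

print(arithmetic_mean)
print(geometric_mean)
print(harmonic_mean)
```

For positive values, the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean.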

median

Median - The middle value
median()
eg:
x<- c(1,2,3)
median(x)
o/p:
2
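
When the number of values is even there is no single middle value, and median() returns the average of the two middle values. A small sketch with illustrative data:

```r
# median() sorts the values first, so the input order does not matter.
odd_values  <- c(7, 1, 3)       # sorted: 1 3 7   -> middle value is 3
even_values <- c(7, 1, 3, 5)    # sorted: 1 3 5 7 -> average of 3 and 5

median_odd  <- median(odd_values)
median_even <- median(even_values)

print(median_odd)
print(median_even)
```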

Mode

Mode - The most common value
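R has no built-in function to calculate the statistical mode (mode() returns the storage type of an object instead); however, we can create our own function to find it
eg:
m <- function(x) {
  t <- table(x)                            # count how often each value occurs
  n <- as.numeric(names(t[t == max(t)]))   # value(s) with the highest count
  return(n)
}
val <- c(1, 2, 3, 3, 4, 5)
mn <- m(val)
print(mn)
o/p:
[1] 3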

range

Difference between the highest value and the lowest value
The range can be found by two methods:
1. range() function: returns the lowest and highest values; diff(range(x)) gives the difference between them
range(vector of values, na.rm = FALSE)
eg: a <- c(1, 2, 3, 4, 10, NaN)
range(a, na.rm=TRUE)
2. max() - min()
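
Both methods can be checked against each other on the vector from the example. This is a minimal sketch; the names limits, spread_1 and spread_2 are illustrative.

```r
# Two ways to get the spread between the lowest and highest value.
a <- c(1, 2, 3, 4, 10, NaN)

limits   <- range(a, na.rm = TRUE)                     # lowest and highest value
spread_1 <- diff(limits)                               # highest minus lowest
spread_2 <- max(a, na.rm = TRUE) - min(a, na.rm = TRUE)

print(limits)
print(spread_1)
print(spread_2)
```

Without na.rm = TRUE, the NaN in the vector would propagate and no usable limits would be returned.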

Variability

Also known as statistical dispersion
Measures of central tendency and variability together make up descriptive statistics
The following measures of variability can be computed in R to differentiate between data sets:
1) Variance
2) Standard deviation
3) Range
4) Mean deviation
5) Interquartile range

variance

- Variance is a measure of how far values lie from the mean value
In layman's terms, variance is a measure of how far a set of numbers is spread out from its mean (average) value.
var(x)
- x: numeric vector
- var() computes the sample variance (it divides by n - 1)
eg:
x <- c(1, 2, 3, 4, 5, 6, 7)
var(x)
Output:
[1] 4.666667
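
To see what var() does, the sample variance can be recomputed by hand and the standard deviation derived from it. A minimal sketch using the same vector; manual_var is an illustrative name.

```r
# Recompute the sample variance step by step and compare with var() and sd().
x <- c(1, 2, 3, 4, 5, 6, 7)

deviations <- x - mean(x)                          # distance of each value from the mean
manual_var <- sum(deviations^2) / (length(x) - 1)  # divide by n - 1 (sample variance)

print(manual_var)
print(var(x))          # same value as manual_var
print(sqrt(var(x)))    # square root of the variance
print(sd(x))           # same value as the line above
```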

Standard deviation

The standard deviation is the square root of the variance
sd(x)
x: numeric vector
eg:
x2 <- c(1, 2, 3, 4, 5, 6, 7)
sd(x2)
Output:
[1] 2.160247

Interquartile range and quartile deviation

Interquartile range
- Difference between the third and first quartiles
IQR(x)
Quartile deviation
- The interquartile range divided by 2
IQR(x)/2
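
A short sketch of both measures on a small vector, assuming R's default quantile method; the names q, iqr_value and quartile_dev are illustrative.

```r
# Interquartile range and quartile deviation of a small vector.
x <- c(1, 2, 3, 4, 5, 6, 7)

q <- quantile(x, c(0.25, 0.75))    # first and third quartiles
iqr_value    <- IQR(x)             # third quartile minus first quartile
quartile_dev <- IQR(x) / 2         # half of the interquartile range

print(q)
print(iqr_value)
print(quartile_dev)
```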

Correlation in R

A statistical measure that indicates how strongly two variables are related; it can also describe the relationship between multiple variables
- The correlation coefficient lies between -1 and +1
Pearson correlation testing in R
1. Pearson correlation coefficient implementation in R:
cor(): Computes the correlation coefficient
cor(x, y, method = "pearson")
where:
x, y: numeric vectors with the same length
method: correlation method
eg:
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)
result = cor(x, y, method = "pearson")
cat("Pearson correlation coefficient is:", result)
o/p:
Pearson correlation coefficient is: 0.5357143

2. cor.test(): Tests for correlation between paired samples
cor.test(x, y, method = "pearson")
eg:
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)
result = cor.test(x, y, method = "pearson")
print(result)
o/p:
Pearson's product-moment correlation
data: x and y
t = 1.4186, df = 5, p-value = 0.2152
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3643187 0.9183058
sample estimates:
cor
0.5357143
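
The object returned by cor.test() is a list, so individual numbers can be pulled out instead of printing the whole report. A minimal sketch with the same data; the 0.05 threshold is a conventional choice and an assumption here, not something the card specifies.

```r
# Extract single results from the cor.test() output.
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(1, 3, 6, 2, 7, 4, 5)

result <- cor.test(x, y, method = "pearson")

r_value <- unname(result$estimate)   # the correlation coefficient
p_value <- result$p.value            # p-value of the test
ci      <- result$conf.int           # 95 percent confidence interval

print(r_value)
print(p_value)
print(ci)

# Assumed significance level of 0.05 (illustrative).
if (p_value < 0.05) {
  cat("The correlation is statistically significant\n")
} else {
  cat("No statistically significant correlation was found\n")
}
```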

covariance

It measures the relationship between two random variables
Like correlation, it measures the linear dependency between a pair of random variables or bivariate data
cov(x, y, method)
where:
x, y: data vectors
method: type of method used to compute the covariance (default is "pearson")
eg:
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
print(cov(x, y))
print(cov(x, y, method = "pearson"))
print(cov(x, y, method = "kendall"))
print(cov(x, y, method = "spearman"))
Output:
[1] 30.66667
[1] 30.66667
[1] 12
[1] 1.666667

Conversion of covariance into correlation in R

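cov2cor(x)
x: square covariance matrix
eg:
x <- rnorm(2)
y <- rnorm(2)
mat <- cbind(x, y)   # combine the two vectors into a two-column matrix
X <- cov(mat)        # covariance matrix of x and y
print(X)
print(cov2cor(X))
Output (the covariance values vary from run to run because rnorm() draws random numbers):
x y
x 0.0742700 -0.1268199
y -0.1268199 0.2165516

x y
x 1 -1
y -1 1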
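
For a reproducible check, the same conversion can be run on columns of mtcars and compared with cor(). This is only a sketch; the choice of columns and the names cov_mat and cor_from_cov are illustrative.

```r
# Convert a covariance matrix to a correlation matrix and compare with cor().
vars <- mtcars[, c("mpg", "wt", "hp")]   # illustrative choice of columns

cov_mat      <- cov(vars)                # square covariance matrix
cor_from_cov <- cov2cor(cov_mat)         # rescaled so that the diagonal is 1

print(round(cov_mat, 3))
print(round(cor_from_cov, 3))
print(all.equal(cor_from_cov, cor(vars)))   # both routes give the same matrix
```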

Difference between covariance and correlation

Correlation describes the intensity and direction of the linear link between two variables, whereas covariance shows how much two variables vary together.
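
One way to see the difference is to change the unit of one variable: the covariance changes with the unit, while the correlation does not. The sketch below assumes the mtcars weight column (recorded in units of 1000 lbs) and an illustrative conversion to kilograms.

```r
# Covariance depends on the measurement units; correlation does not.
mpg   <- mtcars$mpg
wt_lb <- mtcars$wt * 1000        # wt is stored in units of 1000 lbs
wt_kg <- wt_lb * 0.45359237      # the same weights expressed in kilograms

cov_lb <- cov(wt_lb, mpg)
cov_kg <- cov(wt_kg, mpg)        # differs from cov_lb by the conversion factor
cor_lb <- cor(wt_lb, mpg)
cor_kg <- cor(wt_kg, mpg)        # identical to cor_lb

# Correlation is the covariance divided by both standard deviations.
cor_manual <- cov_lb / (sd(wt_lb) * sd(mpg))

print(c(cov_lb = cov_lb, cov_kg = cov_kg))
print(c(cor_lb = cor_lb, cor_kg = cor_kg, cor_manual = cor_manual))
```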

https://www.geeksforgeeks.org/covariance-and-correlation-in-r-programming/

Data visualisation

In R we can create visually appealing data visualisations by writing a few lines of code
By using data visualisation techniques, we can work with large data sets and efficiently obtain key insights from them.

R visualisation packages

plotly
ggplot2
tidyquant
taucharts
ggiraph
geofacet
googleVis
RColorBrewer
dygraphs
shiny

R graphics

Graphics are used to examine marginal distributions and relationships between variables, and to summarise very large data sets.
Standard graphics:
1. Scatterplots
2. Pie charts
3. Line charts
4. Barplots
5. Histograms

Key elements of a statistical graphic

1. Data - The data set that is processed to generate the output.
2. Aesthetic mappings - Control the relation between graphic variables and data variables. In a scatter plot, for example, they map the temperature variable of a data set to the x variable.
3. Geometric objects - Express each observation, for example as a point, using the aesthetic mappings, which map two variables in the data set to the x and y variables of the plot.
4. Statistical transformations - Compute statistical summaries of the data in the plot, for example approximating the data with a regression line over the x and y coordinates, or counting occurrences of certain values.
5. Scales - Map the data values to values in the coordinate system of the graphics device.
6. Coordinate systems - The coordinate system plays an important role in the plotting of the data.
* Cartesian
* Polar
7. Faceting - Splits the data into subgroups and draws a sub-graph for each group.
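
The seven elements can be pointed out in a single ggplot2 call. The sketch below assumes the ggplot2 package is installed (it is skipped otherwise); the choice of mtcars columns and the axis titles are illustrative.

```r
# Sketch: one plot that touches each element listed above.
# Assumes ggplot2 is installed; nothing is drawn if it is missing.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +          # data + aesthetic mappings
    geom_point() +                                     # geometric object: one point per car
    geom_smooth(method = "lm", se = FALSE) +           # statistical transformation: regression line
    scale_x_continuous(name = "Weight (1000 lbs)") +   # scale for the x axis
    scale_y_continuous(name = "Miles per gallon") +    # scale for the y axis
    coord_cartesian() +                                # Cartesian coordinate system
    facet_wrap(~ cyl)                                  # faceting: one panel per cylinder count

  print(p)
}
```

Swapping coord_cartesian() for coord_polar() would switch to the polar coordinate system.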

Advantages and disadvantages of data visualisation in R

adv
1. Understanding: data is easy to understand and analyse
2. Efficiency: a lot of information can be displayed in a small space
3. Location: features such as geographic maps and GIS are particularly relevant to wider businesses when location is an important factor, because maps can show business insights from various locations

disadv:
1. Cost: it can cost more
2. Distraction: data visualisation apps create highly complex, graphics-rich reports and charts, which may entice users to focus more on the form than on the function

pie chart

Circular chart divided into segments according to the ratio of the data
The pie as a whole represents 100%, and each segment shows a fraction of the whole pie
pie(x, labels, col, main, radius)
x: data vector
labels: names for each slice
col: colour for each slice
main: title of the pie chart
radius: radius of the pie chart, between -1 and +1
eg:
e <- c(1, 2, 3, 4, 5)
pie(e, labels = c(6, 7, 8, 9, 10), main = "pie")
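
A slightly fuller sketch that labels each slice with its share of the whole. The category names and values are made up for illustration.

```r
# Pie chart with percentage labels (illustrative data).
spending <- c(Rent = 40, Food = 30, Travel = 20, Other = 10)

pct <- round(100 * spending / sum(spending))              # share of each slice in percent
slice_labels <- paste0(names(spending), " (", pct, "%)")  # e.g. "Rent (40%)"

pie(spending,
    labels = slice_labels,
    col    = rainbow(length(spending)),   # one colour per slice
    main   = "Monthly spending",
    radius = 0.8)
```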

Bar chart

A bar chart uses rectangular bars to visualise data
Bar charts can be displayed horizontally or vertically
The length or height of the bars is proportional to the values they represent
barplot()
height: data vector to be represented on the y axis
xlab: label given to the x axis
ylab: label given to the y axis
names.arg: names of each observation on the x axis
main: title of the bar chart

eg (using the ggplot2 package):
library(ggplot2)
data <- data.frame(
  Category = c("A", "B", "C", "D", "E"),
  Value = c(10, 20, 15, 25, 30)
)
ggplot(data, aes(x = Category, y = Value)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Bar Graph Example",
       x = "Category",
       y = "Value")
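
The same chart can also be drawn with the base barplot() function and the parameters described on this card. A minimal sketch using the same illustrative categories and values:

```r
# Base R version of the bar chart, using barplot().
values     <- c(10, 20, 15, 25, 30)
categories <- c("A", "B", "C", "D", "E")

bar_mids <- barplot(values,
                    names.arg = categories,   # names shown under each bar
                    xlab = "Category",
                    ylab = "Value",
                    main = "Bar Graph Example",
                    col  = "skyblue")

print(bar_mids)   # barplot() returns the x positions of the bar midpoints
```

Passing horiz = TRUE to barplot() draws the bars horizontally instead.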

scatterplots

1. A "scatter plot" is a type of plot used to display the relationship between two numerical variables; it plots one dot for each observation.
2. It needs two vectors of the same length, one for the x-axis (horizontal) and one for the y-axis (vertical)
Syntax:
plot(x, y, type, xlab, ylab, main)
x: data on the x axis
y: data on the y axis
type: specifies which type of plot is drawn, eg: "l" for lines, "p" for points
xlab: label for the x axis
ylab: label for the y axis
main: title of the plot
eg:
temperature <- c(65, 70, 75, 80, 85, 90)
ice_cream_sold <- c(100, 120, 140, 160, 180, 200)
plot(temperature, ice_cream_sold,
     xlab = "Temperature (Fahrenheit)",
     ylab = "Ice Cream Cones Sold",
     main = "Temperature vs Ice Cream Sales",
     col = "blue", pch = 16)
abline(lm(ice_cream_sold ~ temperature), col = "red")

Histogram

1. A type of bar chart that shows how many values fall into each of a set of value ranges
2. The histogram is used for showing a distribution, whereas a bar chart is used for comparing different entities.
3. The height of each bar represents the number of values present in the given range.
hist(x, other parameters)
parameters:
x: vector that contains numerical values
main: title of the chart
col: colour of the bars
border: border colour of each bar
xlab: describes the x axis
ylab: describes the y axis
xlim: specifies the range of values on the x axis
ylim: specifies the range of values on the y axis
eg:
data <- c(65, 70, 75, 80, 85, 90)
hist(data,
     main = "Histogram of Example Data",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightblue", border = "black")
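
hist() also returns the value ranges and counts it used, which makes it easy to see what each bar represents. A minimal sketch on mtcars; the breaks argument and the chosen ranges are illustrative additions that the card does not list.

```r
# Histogram with explicit value ranges, plus the counts behind the bars.
mpg <- mtcars$mpg

h <- hist(mpg,
          breaks = seq(10, 35, by = 5),   # value ranges: 10-15, 15-20, ..., 30-35
          main   = "Histogram of mpg",
          xlab   = "Miles per gallon",
          ylab   = "Frequency",
          col    = "lightblue",
          border = "black",
          xlim   = c(10, 35))

print(h$breaks)   # boundaries of the value ranges
print(h$counts)   # number of values in each range (the bar heights)
```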

line graph

1. Used in exploratory data analysis to check data trends by observing the pattern of the line
2. Line graphs are used for time series data analysis
plot(x, y, type, col, main, xlab, ylab)
eg:
data <- c(65, 70, 75, 80, 85, 90)
plot(data, type = "l", col = "blue",
     main = "Line Graph of Example Data",
     xlab = "Index", ylab = "Value")
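
For time series it is often useful to compare more than one line. The sketch below uses made-up monthly figures for two hypothetical products and adds the second series with lines().

```r
# Two time series on one line graph (all numbers are made up for illustration).
months    <- 1:6
product_a <- c(65, 70, 75, 80, 85, 90)
product_b <- c(60, 72, 71, 78, 90, 95)

y_limits <- range(c(product_a, product_b))   # make room for both series

plot(months, product_a, type = "l", col = "blue",
     ylim = y_limits,
     main = "Monthly Sales of Two Products",
     xlab = "Month", ylab = "Units sold")
lines(months, product_b, col = "red", lty = 2)   # add the second series as a dashed line
legend("topleft",
       legend = c("Product A", "Product B"),
       col = c("blue", "red"), lty = c(1, 2))
```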

stats + graphs (27 cards)