Statistics Flashcards

1
Q

How do we use statistics in data science?

A

Data scientists use statistics to gather, review, analyze, and draw conclusions from data, as well as apply quantified mathematical models to appropriate variables.

2
Q

What is central tendency?

A

A descriptive summary of a dataset through a single value that reflects the centre of the data distribution

3
Q

What is the mean?

A

The arithmetic average: the sum of all values divided by the number of values.

4
Q

What is the median?

A

The middle value when the data is sorted. If the number of elements is even, it is the mean of the two middle values. Unlike the mean, the median is largely unaffected by outliers.
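
As a rough illustration of how an outlier pulls the mean but barely moves the median, here is a minimal sketch using Python's standard `statistics` module. The numbers and variable names are made up for the example.

```python
import statistics

# Hypothetical values, chosen only for illustration
values = [30, 32, 35, 38, 40]
values_with_outlier = values + [400]   # one extreme value added

mean_clean = statistics.mean(values)                      # 35
median_clean = statistics.median(values)                  # 35

mean_outlier = statistics.mean(values_with_outlier)       # about 95.83
median_outlier = statistics.median(values_with_outlier)   # 36.5 (mean of 35 and 38)

print(mean_clean, median_clean)
print(mean_outlier, median_outlier)
```

The single extreme value nearly triples the mean, while the median shifts by only 1.5.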

5
Q

What is the mode?

A

The value that occurs most frequently. If two or more values tie for the highest frequency, the set is multimodal.
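
A small sketch of finding the mode, assuming Python 3.8 or later for `statistics.multimode`. The two lists are invented examples.

```python
import statistics

single_mode_data = [1, 2, 2, 3, 4]   # 2 occurs most often
tied_data = [1, 1, 2, 2, 3]          # 1 and 2 tie for most frequent

mode_value = statistics.mode(single_mode_data)   # 2
all_modes = statistics.multimode(tied_data)      # [1, 2]

print(mode_value)
print(all_modes)
```

`multimode` returns every value that shares the highest count, which makes a multimodal set easy to detect: the returned list has more than one element.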

6
Q

What is variability?

A

The extent to which data points diverge from the average value, and from each other

7
Q

What is the range?

A

The difference between the highest and lowest values

8
Q

What is the interquartile range?

A

The spread of the middle half of a distribution: the difference between the third quartile (Q3) and the first quartile (Q1).
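
The range and the interquartile range can be compared on the same data. This is a minimal sketch with made-up numbers; it uses `statistics.quantiles` with its default method, and other quartile conventions can give slightly different cut points.

```python
import statistics

# Hypothetical dataset with one fairly large value at the top
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 21]

data_range = max(data) - min(data)             # 21 - 2 = 19

q1, q2, q3 = statistics.quantiles(data, n=4)   # three quartile cut points
iqr = q3 - q1                                  # spread of the middle half

print(data_range)
print(q1, q2, q3, iqr)
```

The range depends only on the two extreme values, whereas the IQR ignores the lowest and highest quarters of the data.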

9
Q

What is standard deviation?

A

The square root of the variance. It describes the typical distance of data points from the mean, in the same units as the data.

10
Q

What is variance?

A

The average of the squared distances from the mean.
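
To make the link between variance and standard deviation concrete, here is a short sketch with invented data. It computes the population variance by hand and compares it with the `statistics` module. Note that the sample versions (`statistics.variance`, `statistics.stdev`) divide by n - 1 instead of n.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values; the mean is 5

mean = sum(data) / len(data)

# Population variance: average of the squared distances from the mean
manual_variance = sum((x - mean) ** 2 for x in data) / len(data)
manual_std = math.sqrt(manual_variance)

population_variance = statistics.pvariance(data)   # divides by n
population_std = statistics.pstdev(data)
sample_variance = statistics.variance(data)        # divides by n - 1

print(manual_variance, manual_std)
print(population_variance, population_std, sample_variance)
```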

11
Q

What is correlation?

A

The strength and direction of the relationship between two or more variables in a dataset. Positive correlation means larger values of x tend to correspond to larger values of y. Negative correlation means larger values of x tend to correspond to smaller values of y. Weak or no correlation means there is little or no apparent relationship.

12
Q

What is covariance?

A

A measure of the joint variability of two variables

13
Q

What is the correlation coefficient?

A

A statistical measure of the strength and direction of the linear relationship between two variables, ranging from -1 to 1
np.corrcoef
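
The relationship between covariance and the correlation coefficient can be sketched in plain Python. The data and the helper names `covariance` and `pearson_r` are hypothetical; the sample (n - 1) form of covariance is assumed here. With NumPy installed, `np.corrcoef(x, y)` returns a 2x2 matrix whose off-diagonal entry is the same coefficient.

```python
import math

def covariance(x, y):
    """Sample covariance: mean product of paired deviations, dividing by n - 1."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    std_x = math.sqrt(covariance(x, x))   # covariance of x with itself is its variance
    std_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (std_x * std_y)

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)   # 1.5
r_xy = pearson_r(x, y)      # about 0.775

print(cov_xy, r_xy)
```

Covariance depends on the units of x and y, while dividing by the two standard deviations rescales it to the unit-free range from -1 to 1.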

14
Q

What is the population?

A

A set of all the elements you’re interested in.

15
Q

What is a sample?

A

A representative subset of the population - it should preserve the essential statistical features of the overall population

16
Q

What is an outlier?

A

An observation that lies an abnormal distance from other values in a dataset.

Outliers can be caused by:

  • errors in data entry or measurement
  • sampling problems and unusual conditions
  • natural variation
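
One common way to flag candidate outliers is the 1.5 × IQR rule. This rule is a widely used convention, not something the card itself specifies; the data and the threshold below are assumptions for illustration.

```python
import statistics

# Hypothetical measurements with one suspicious value
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12]

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (an assumed convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence)
print(outliers)
```

A flagged value is only a candidate: whether it is an error or natural variation still has to be judged from the context of the data.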
17
Q

What are percentiles?

A

Percentiles indicate the percentage of scores that fall below a particular value. For example, the 95th percentile is the value below which 95% of the scores fall.

18
Q

What are quartiles?

A

Percentiles that divide the dataset into four equal parts. Each dataset has three quartiles.
The first quartile (25th percentile) separates the smallest 25% of items from the rest of the dataset.
The second quartile (50th percentile) is the median. Approximately 25% of the items lie between the first and second quartiles, and another 25% between the second and third quartiles.
The third quartile (75th percentile) separates the largest 25% of items from the rest of the dataset.

statistics.quantiles(x,n=4)
np.percentile(y,95)
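
A minimal sketch of the two calls above in context, using only the standard library. The scores are invented. `statistics.quantiles(x, n=100)` returns 99 cut points, so index 94 is the 95th percentile. `np.percentile(y, 95)` serves the same purpose in NumPy, though its default interpolation may give slightly different values on small datasets.

```python
import statistics

scores = list(range(1, 100))   # hypothetical scores 1..99

# Three cut points that split the data into four parts
quartiles = statistics.quantiles(scores, n=4)

# 99 cut points; the element at index 94 is the 95th percentile
percentiles = statistics.quantiles(scores, n=100)
p95 = percentiles[94]

print(quartiles)
print(p95)
```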

19
Q

What is a joint plot?

A

A joint plot comprises three plots: a bivariate graph showing the relationship between x and y, a histogram at the top showing the distribution of x, and a histogram on the right showing the distribution of y.
sns.jointplot