Statistics Flashcards

1
Q

What is the central limit theorem?

A

The central limit theorem states that if you take sufficiently large random samples, with replacement, from a population with mean μ and finite standard deviation σ, then the distribution of the sample means will be approximately normal, regardless of the shape of the population distribution. The sample means are centered on μ with standard deviation σ/√n, where n is the sample size.
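
A small simulation can make the theorem concrete. The sketch below assumes an exponential population with mean 1 and standard deviation 1, which is clearly not normal. The function name, sample size, and seed are illustrative choices, not part of the card.

```python
import random
import statistics

def sample_means(sample_size, num_samples, rate=1.0, seed=0):
    """Draw repeated samples from an exponential population and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

means = sample_means(sample_size=50, num_samples=2000)
print(round(statistics.fmean(means), 3))  # close to the population mean, 1.0
print(round(statistics.stdev(means), 3))  # close to 1 / sqrt(50), about 0.141
```

A histogram of `means` would look roughly bell-shaped even though the population itself is strongly right-skewed.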

2
Q

What is an outlier? How can outliers be determined in a dataset?

A

Outliers are data points that differ markedly from the other observations in the dataset.
Depending on the learning process, an outlier can worsen the accuracy of a model and sharply decrease its efficiency.

Two common methods for detecting outliers are:
Standard deviation/z-score
Interquartile range (IQR)
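
Both methods can be sketched in a few lines. The thresholds used here (|z| > 3 and 1.5 × IQR beyond the quartiles) are common conventions assumed for illustration, and the data is made up.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11] * 3 + [10, 95]  # 20 made-up readings
print(zscore_outliers(data))  # [95]
print(iqr_outliers(data))     # [95]
```

Note that the z-score method needs a reasonably large sample: with very few points, a single extreme value inflates the standard deviation so much that it can never exceed a threshold of 3.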

3
Q

How is missing data handled in statistics?

A

Prediction of the missing values
Assignment of individual (unique) values
Deletion of rows that have missing data
Mean imputation or median imputation
Using random forests, which support missing values
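
As one illustration of the options above, mean or median imputation can be sketched as follows. The function name and the use of `None` as the missing-value marker are assumptions made for this example.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [23, None, 31, 27, None, 64]  # made-up data with two gaps
print(impute(ages))          # [23, 29.0, 31, 27, 29.0, 64]
print(impute(ages, "mean"))  # [23, 36.25, 31, 27, 36.25, 64]
```

The median fill is less affected by the large value 64 than the mean fill is.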

4
Q

What is exploratory data analysis?

A

Exploratory data analysis is the process of performing initial investigations on data to understand it better.
These investigations are done to discover patterns, spot abnormalities, form and test hypotheses, and check whether assumptions hold.

5
Q

What is the meaning of selection bias?

A

Selection bias is a phenomenon that involves the selection of individual or grouped data in a way that is not considered to be random. Randomization plays a key role in performing analysis and understanding model functionality better.

If correct randomization is not achieved, then the resulting sample will not accurately represent the population.

6
Q

State the case where the median is a better measure when compared to the mean.

A

When the data contains outliers that skew it positively or negatively, the median is preferred, because it is much less affected by extreme values than the mean.

7
Q

What type of data does not have a log-normal distribution or a Gaussian distribution?

A

Exponentially distributed data follows neither a log-normal distribution nor a Gaussian distribution. In fact, categorical data of any type will not follow these distributions either.
Example: Duration of a phone call, time until the next earthquake, etc.

8
Q

What is the meaning of the five-number summary in Statistics?

A

The five-number summary consists of five values that together describe the entire range of the data, as shown below:
Low extreme (Min)
First quartile (Q1)
Median
Upper quartile (Q3)
High extreme (Max)
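
The summary can be computed with Python's standard library. This is a sketch that relies on the default quartile convention of `statistics.quantiles`; other tools may use slightly different interpolation rules and give slightly different Q1 and Q3 values.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)  # (1, 2.5, 5.0, 7.5, 9)
```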

9
Q

What are population and sample in Inferential Statistics, and how are they different?

A

A population is the complete set of observations (data) of interest. The sample is a small portion of that population. Working with the whole population raises the computational cost, and not all of its data points may be available.
In short:

We calculate the statistics using the sample.
Using these sample statistics, we make conclusions about the population.

10
Q

What is skewness?

A

Skewness measures the lack of symmetry in a data distribution. In skewed data, the mean, the median, and the mode typically differ noticeably from one another. Skewed data does not follow a normal distribution.

11
Q

What is kurtosis?

A

Kurtosis describes how heavy the tails of a distribution are, that is, how prone it is to producing extreme values, compared with a normal distribution. It is often read as a measure of the outliers present in the distribution. A high value of kurtosis indicates that many outliers are present in the data. To address this, we can either add more data to the dataset or remove the outliers.

12
Q

What is correlation?

A

Correlation measures the strength and direction of the linear relationship between two quantitative variables. Unlike covariance, correlation is standardized, so it tells us how strong the relationship is between two variables. The value of correlation between two variables ranges from -1 to +1.
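
The link between covariance and correlation can be shown directly. The sketch below assumes the sample (n − 1) versions of both quantities and the Pearson definition of correlation; the data is made up.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]
print(covariance(hours, score))   # 5.0, which depends on the units of the data
print(correlation(hours, score))  # approximately 1.0, unit-free
```

Rescaling `score` (say, multiplying by 10) changes the covariance but leaves the correlation unchanged.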

13
Q

What are left-skewed and right-skewed distributions?

A

A left-skewed distribution is one where the left tail is longer than the right tail. Here, it is important to note that typically mean < median < mode.
Similarly, a right-skewed distribution is one where the right tail is longer than the left one. Here, typically mean > median > mode.

14
Q

What is the difference between Descriptive and Inferential Statistics?

A

Descriptive statistics: Descriptive statistics is used to summarize a sample set of data, using measures like the standard deviation or the mean.
Inferential statistics: Inferential statistics is used to draw conclusions about a population from sample data that are subject to random variation.

15
Q

What are the types of sampling in Statistics?

A

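Simple random: Every member has an equal chance of being selected
Cluster: Population divided into clusters, and whole clusters are randomly selected
Stratified: Population divided into distinct groups (strata), with samples drawn from each
Systematic: Picks every nth member in the data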

16
Q

What is the meaning of covariance?

A

Covariance measures how two random variables vary together. A positive covariance means the variables tend to increase together, a negative covariance means one tends to decrease as the other increases, and a value near zero suggests no linear relationship.

17
Q

If a distribution is skewed to the right and has a median of 20, will the mean be greater than or less than 20?

A

If the given distribution is right-skewed, then the mean will typically be greater than 20, while the mode will typically be less than 20.

18
Q

The standard normal curve has a total area under the curve of one, and it is symmetric around zero. True or False?

A

True. The standard normal curve has a total area of one under it and is symmetric around zero. Because of this symmetry, the mean, median, and mode are all equal to zero.

19
Q

In an observation, there is a high correlation between the time a person sleeps and the amount of productive work he does. What can be inferred from this?

A

First, correlation does not imply causation here. A high correlation only tells us that sleep time and productive work tend to move together in a roughly linear way. It does not tell us that more sleep causes more productive work: a third factor, or influence in the reverse direction, could explain the association.

20
Q

What is the relationship between the confidence level and the significance level in statistics?

A

The significance level (α) is the probability of rejecting the null hypothesis when it is actually true. The confidence level is the long-run proportion of confidence intervals, built by the same procedure, that would contain the true population parameter.
Both significance and confidence level are related by the following formula:
Significance level = 1 − Confidence level

21
Q

What are the examples of symmetric distribution?

A

A symmetric distribution is one where the left side of the distribution mirrors the right side around the center.
There are many examples of symmetric distributions, but the following three are the most widely used:
Uniform distribution
Binomial distribution (symmetric when p = 0.5)
Normal distribution

22
Q

What is the relationship between mean and median in a normal distribution?

A

In a normal distribution, the mean is equal to the median. Comparing a dataset’s mean and median is a quick first check, but equality alone does not prove normality, because any symmetric distribution has the same property.

23
Q

What is the difference between the first quartile, the second quartile, and the third quartile?

A

Quartiles describe the distribution of data by splitting the ordered data into four equal portions. The three boundaries between these portions are called quartiles.
That is,
The lower quartile (Q1) is the 25th percentile.
The middle quartile (Q2), also called the median, is the 50th percentile.
The upper quartile (Q3) is the 75th percentile.

24
Q

How do the standard error and the margin of error relate?

A

Margin of error = Z * Standard error, where Standard error = σ / root(n)
Therefore, the margin of error increases when the standard error increases, and decreases as the sample size n grows.
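
A short worked example may help. The values σ = 10, n = 100, and z = 1.96 (a 95% confidence level) are assumed purely for illustration.

```latex
\mathrm{SE} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{100}} = 1,
\qquad
\mathrm{MOE} = z \cdot \mathrm{SE} = 1.96 \times 1 = 1.96
```

Quadrupling the sample size to n = 400 halves the standard error to 0.5, and with it the margin of error to 0.98.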

25
Q

What is one sample t-test?

A

The one-sample t-test is a statistical hypothesis test in which we check whether the mean of the sample data is significantly different from a hypothesized population mean.

26
Q

What is an alternative hypothesis?

A

The alternative hypothesis (denoted by H1) is the statement that must be true if the null hypothesis is false. That is, it is a statement used to contradict the null hypothesis. It is the opposing point of view that is supported when the data lead us to reject the null hypothesis.

27
Q

Given a left-skewed distribution that has a median of 60, what conclusions can we draw about the mean and the mode of the data?

A

Given that it is a left-skewed distribution, the mean will typically be less than the median, i.e., less than 60, and the mode will typically be greater than 60.

28
Q

What are the types of biases that we encounter while sampling?

A

Sampling biases are errors that occur when a sample taken from a large population does not represent it fairly in statistical analysis. Three common types of bias are:
The selection bias
The survivorship bias
The undercoverage bias

29
Q

Briefly explain the procedure to measure the length of all sharks in the world.

A

Define the confidence level (usually around 95%)
Use sample sharks to measure
Calculate the mean and standard deviation of the lengths
Determine t-statistics values
Determine the confidence interval in which the mean length lies
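
The steps above can be sketched in code. This example makes two loud assumptions: the shark lengths are simulated, made-up numbers, and the critical value comes from the normal distribution, which closely approximates the t value only when the sample is large. A small sample would call for the t-distribution critical value instead.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Confidence interval for the mean, using a normal approximation to t."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return mean - critical * standard_error, mean + critical * standard_error

rng = random.Random(42)
lengths_m = [rng.gauss(2.5, 0.6) for _ in range(200)]  # hypothetical sample, in metres
low, high = mean_confidence_interval(lengths_m)
print(round(low, 2), round(high, 2))
```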

30
Q

How does the width of the confidence interval change with the confidence level?

A

The width of the confidence interval reflects how precise the estimate is. As the confidence level increases, the width also increases.
The following also apply:
Wide confidence interval: Less precise, so less useful for decisions
Narrow confidence interval: More precise, but at a fixed sample size it carries a higher risk of missing the true value

31
Q

What is the meaning of degrees of freedom (DF) in statistics?

A

Degrees of freedom (DF) is the number of values in a calculation that are free to vary. It is mostly used with the t-distribution and not with the z-distribution.

As DF increases, the t-distribution gets closer to the normal distribution. If DF > 30, the t-distribution is usually treated as a close approximation of the normal distribution.

32
Q

What is the law of large numbers in statistics?

A

The law of large numbers states that as the number of trials increases, the average of the results gets closer to the expected value.

Example: The proportion of heads when flipping a fair coin is likely to be closer to 0.5 after 100,000 flips than after 100 flips.
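
The coin example can be simulated. This is a minimal sketch with an arbitrary seed; the exact proportions printed depend on the random number generator, so no specific outputs are claimed.

```python
import random

def heads_proportion(num_flips, seed=0):
    """Simulate fair coin flips and return the observed share of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

for flips in (100, 10_000, 100_000):
    print(flips, heads_proportion(flips))
```

With more flips, the printed proportion tends to settle closer to 0.5.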

33
Q

What are some of the properties of a normal distribution?

A

A normal distribution, regardless of its mean and standard deviation, has a bell-shaped curve that is symmetric about its mean.
Following are some of the important properties:
Unimodal: It has only one mode.
Symmetrical: Left and right halves of the curve are mirrored.
Central tendency: The mean, median, and mode are at the midpoint.

34
Q

What are some of the low and high-bias Machine Learning algorithms?

A

Low bias: SVM, decision trees, KNN algorithm, etc.
High bias: Linear and logistic regression

35
Q

What is the use of Hash tables in statistics?

A

Hash tables are data structures that store key-value pairs. A hash function computes, from each key, an index into an array of buckets, so the value associated with a key can be found quickly.
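
One plausible statistical use, offered here as an illustration rather than something the card states, is building a frequency table. Python's `dict` is a hash table, so the sketch below maps each category (key) to its count (value).

```python
def frequency_table(values):
    """Count how often each value occurs, using a dict as the hash table."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return counts

brands = ["Audi", "BMW", "Audi", "Kia", "BMW", "Audi"]  # made-up categorical data
table = frequency_table(brands)
print(table)                      # {'Audi': 3, 'BMW': 2, 'Kia': 1}
print(max(table, key=table.get))  # Audi, the mode
```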

36
Q

What are some of the techniques to reduce underfitting and overfitting during model training?

A

For reducing underfitting:
Increase model complexity
Increase the number of features
Remove noise from the data
Increase the number of training epochs

For reducing overfitting:
Increase training data
Stop early while training
Lasso regularization
Use random dropouts

37
Q

What are the types of data?

A

Categorical – Describe category or groups
Example – Car Brands
Numerical – Represent numbers
These are of two types:
Discrete
Example – Grade, Number of Objects
Continuous
Example – Weight, Height, Area

38
Q

Difference between Population and Sample?

A

The Population is a collection of all items of interest while the Sample is the subset of the population. The numbers obtained from the population are called Parameters while the numbers obtained from the sample are called Statistics. Sample data are used to make conclusions on Population data.

39
Q

What are the Measures of Central Tendency?

A

A measure of central tendency is a single value that describes (represents) the central position within the dataset. The three most common measures of central tendency are Mean, Median, and Mode.

40
Q

What are the Measures of Dispersion?

A

Dispersion or variability describes how items are distributed from each other and the centre of a distribution.

The measure of dispersion is a statistical method that helps to know how the data points are spread in the dataset.

There are 4 methods to measure the dispersion of the data:
Range
Interquartile Range
Variance
Standard Deviation

41
Q

What is the difference between Probability and Likelihood?

A

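Probability: How likely an outcome is, given a fixed model. Example with only two possibilities:
Choose to bat
Doesn’t choose to bat
P(choose to bat) = P(doesn’t choose to bat) = ½ = 0.5

Likelihood: How plausible a model or parameter value is, given the observed data. Choosing to bat first will depend on
Weather conditions (rainfall, wind speed)
Dew on pitch
Humidity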

42
Q

What is the empirical rule?

A

The empirical rule is often called the 68 – 95 – 99.7 rule or three-sigma rule. It states that for a normal distribution:
About 68% of the data will be within one standard deviation of the mean

About 95% of the data will be within two standard deviations of the mean

About 99.7% of the data will be within three standard deviations of the mean
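
These percentages can be checked against the standard normal cumulative distribution function. The helper name below is an illustrative choice.

```python
from statistics import NormalDist

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    standard_normal = NormalDist()
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))  # 0.6827, 0.9545, 0.9973
```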

43
Q

What is Resampling and what are the common methods of resampling?

A

Resampling is a method that consists of repeatedly drawing samples from the observed data.
Bootstrapping, for example, selects cases at random with replacement from the original sample.

Two common resampling methods are:
K-fold cross-validation
Bootstrapping
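
Bootstrapping can be sketched as follows. The sample values, the number of resamples, and the simple percentile interval are assumptions made for illustration; other bootstrap interval methods exist.

```python
import random
import statistics

def bootstrap_means(sample, num_resamples=2000, seed=0):
    """Resample with replacement and record the mean of each resample."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(num_resamples)]

def percentile_interval(values, confidence=0.95):
    """Take the central share of the sorted bootstrap statistics."""
    ordered = sorted(values)
    tail = (1 - confidence) / 2
    low_index = int(tail * len(ordered))
    high_index = int((1 - tail) * len(ordered)) - 1
    return ordered[low_index], ordered[high_index]

sample = [12, 15, 9, 21, 18, 14, 11, 17, 16, 13]  # made-up observations
boot = bootstrap_means(sample)
print(percentile_interval(boot))  # rough 95% interval for the mean
```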

44
Q

What is Hypothesis Testing?

A

Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution.

There are 3 steps in Hypothesis Testing:
State Null and Alternate Hypothesis
Perform Statistical Test
Reject or fail to reject the Null Hypothesis

45
Q

What is the Null and Alternate Hypothesis?

A

A null and alternate hypothesis is used in statistical hypothesis testing.

Null Hypothesis
It states that the population parameter is equal to the assumed value
It is an initial claim based on previous analysis or experience

Alternate Hypothesis
It states that the population parameter differs from the assumed value (not equal to, greater than, or less than it)
It is what you might believe to be true or want to prove true

46
Q

What is a p-value and its role in Hypothesis Testing?

A

In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct.

A p-value is a statistical measurement used to assess how compatible the observed data are with the null hypothesis.
The lower the p-value, the stronger the evidence against the null hypothesis.
A p-value of 0.05 or lower is generally considered statistically significant.
The p-value can be reported instead of, or in addition to, a preselected significance level for hypothesis testing.
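
As an illustration, here is a two-sided p-value for a one-sample z-test. This sketch assumes the population standard deviation is known, which is what makes the z-test appropriate; all the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """P(result at least as extreme as observed | null hypothesis is true)."""
    z = (sample_mean - null_mean) / (sigma / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_sided_p_value(sample_mean=52, null_mean=50, sigma=10, n=100)
print(round(p, 4))  # 0.0455, below the usual 0.05 threshold
```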

47
Q

What is the Chi-square test?

A

The chi-square test is a statistical method used to compare the observed and expected frequencies of categorical variables in a dataset.

Example: A food delivery company wants to find the relationship between gender, location and food choices of people in India.

It is used to determine whether an observed association between 2 categorical variables is:
Due to chance or
Due to relationship

48
Q

What is Bayes’ Theorem?

A

In probability theory and statistics, Bayes’ Theorem describes the probability of an event based on prior knowledge of conditions related to it. Essentially, the theorem allows us to update our beliefs about a random event as we learn new evidence.

For example, if the risk of customer churn increases the longer a user has been inactive, Bayes’ Theorem allows us to more accurately assess the churn risk for users, because we can condition the probability of churn to how long the user has been inactive.

P(A|B) = P(B|A)*P(A)/P(B)

49
Q

How do Probability Mass Functions and Probability Density Functions differ?

A

Probability mass functions describe discrete distributions. Using a probability mass function, we can determine the probability that a random variable is exactly equal to a target value x.

Probability density functions describe continuous probability distributions. The probability of any single exact value is zero, so we instead determine the probability that the variable falls within a range, which is found by calculating the area under the curve over that interval.
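
The contrast can be shown with one discrete and one continuous case. The binomial and standard normal distributions are chosen here only as familiar examples.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """PMF: probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_interval_probability(a, b, mu=0.0, sigma=1.0):
    """PDF-based probability: area under the normal curve between a and b."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(b) - dist.cdf(a)

print(binomial_pmf(5, 10, 0.5))                     # 0.24609375
print(round(normal_interval_probability(0, 1), 4))  # 0.3413
```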

50
Q

What is the difference between independent and dependent events in probability? Provide an example for each.

A

Independent events do not affect each other’s outcomes, while for dependent events the outcome of one changes the probability of the other.

For example, if you were asked to toss a coin 100 times, a coin flip would be an independent event because the probability of each successive flip would remain 50-50. Getting heads on the first flip does not influence your chances of either a heads or tails on the second. Drawing a card from a deck (without replacement) would be a dependent event, because with each draw the deck gets smaller, affecting the outcome of each successive draw.

51
Q

What is the difference between discrete and continuous variables? Provide an example for each.

A

Discrete variables are countable, while continuous variables are measurable. A discrete variable would be the number of Fabergé eggs created: there are only so many in the world, and no more are being produced. Other examples of a discrete variable could be the number of students in a class, or the amount of money in your wallet. A continuous variable, on the other hand, would be something like age, because you can measure it ever more precisely, e.g. 33 years, 9 months, 2 days, 5 hours, 4 seconds, and so on. Continuous values are infinitely divisible.

You can turn a continuous variable into a discrete variable, by making it countable. For example, you could count a toddler’s age in months.

52
Q

What is Hypothesis testing?

A

Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution. First, a tentative assumption is made about the parameter or distribution. This assumption is called the null hypothesis and is denoted by H0. An alternative hypothesis (denoted Ha), which is the opposite of what is stated in the null hypothesis, is then defined. The hypothesis-testing procedure involves using sample data to determine whether or not H0 can be rejected. If H0 is rejected, the statistical conclusion is that the alternative hypothesis Ha is true.

53
Q

What are Type I and Type II errors?

A

A Type I error is a false positive conclusion, while a Type II error is a false negative conclusion.

54
Q

What is Alpha and Beta in Hypothesis testing?

A

α (Alpha) is the probability of a Type I error in any hypothesis test: incorrectly rejecting the null hypothesis.

β (Beta) is the probability of a Type II error in any hypothesis test: incorrectly failing to reject the null hypothesis. (1 – β is the power.)

55
Q

What are one- and two-tailed alternative hypotheses?

A

A one-tailed (or one-sided) hypothesis specifies the direction of the association between the predictor and outcome variables. The prediction that patients with attempted suicides will have a higher rate of tranquilizer use than control patients is a one-tailed hypothesis.

A two-tailed hypothesis states only that an association exists; it does not specify the direction. The prediction that patients with attempted suicides will have a different rate of tranquilizer use, either higher or lower than control patients, is a two-tailed hypothesis.

56
Q

What is a normal distribution?

A

Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean.

57
Q

You call 2 UberX’s and 3 Lyfts. If the time that each takes to reach you is IID, what is the probability that all the Lyfts arrive first? What is the probability that all the UberX’s arrive first?

A

All Lyfts first

Probability that the first car is a Lyft = 3/5
Probability that the second car is a Lyft = 2/4
Probability that the third car is a Lyft = 1/3
Therefore, probability that all the Lyfts arrive first = (3/5) * (2/4) * (1/3) = 1/10

All Ubers first

Probability that the first car is an Uber = 2/5
Probability that the second car is an Uber = 1/4
Therefore, probability that all the Ubers arrive first = (2/5) * (1/4) = 1/10
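
Both answers can be verified by brute force. Because the arrival times are IID, every arrival order of the five cars is equally likely, so the sketch below simply enumerates all 120 orders.

```python
from fractions import Fraction
from itertools import permutations

cars = ["Lyft", "Lyft", "Lyft", "Uber", "Uber"]
orders = list(permutations(range(len(cars))))  # 120 equally likely arrival orders

def probability(event):
    """Share of arrival orders for which the event holds."""
    hits = sum(1 for order in orders if event([cars[i] for i in order]))
    return Fraction(hits, len(orders))

all_lyfts_first = probability(lambda arrival: arrival[:3] == ["Lyft"] * 3)
all_ubers_first = probability(lambda arrival: arrival[:2] == ["Uber"] * 2)
print(all_lyfts_first, all_ubers_first)  # 1/10 1/10
```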

58
Q

On a dating site, users can select 5 out of 24 adjectives to describe themselves. A match is declared between two users if they match on at least 4 adjectives. If Alice and Bob randomly pick adjectives, what is the probability that they form a match?

A

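Fix Alice’s 5 adjectives. Bob forms a match if exactly 4, or all 5, of his picks are among them:
P(match) = (C(5,4) * C(19,1) + C(5,5)) / C(24,5) = 96/42504

≈ 0.0023
= 0.23%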
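
The count can be reproduced with `math.comb`. The function and parameter names are illustrative.

```python
from fractions import Fraction
from math import comb

def match_probability(total=24, picks=5, needed=4):
    """Probability that two random picks share at least `needed` adjectives."""
    favorable = sum(
        comb(picks, k) * comb(total - picks, picks - k)
        for k in range(needed, picks + 1)
    )
    return Fraction(favorable, comb(total, picks))

p_match = match_probability()
print(p_match, round(float(p_match), 4))  # 4/1771 0.0023
```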

59
Q

You have two coins, one of which is fair and comes up heads with probability 1/2, and the other of which is biased and comes up heads with probability 3/4. You randomly pick a coin and flip it twice, and get heads both times. What is the probability that you picked the fair coin?

A

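By Bayes’ Theorem:
P(fair | HH) = P(HH | fair) * P(fair) / (P(HH | fair) * P(fair) + P(HH | biased) * P(biased))
= (1/4 * 1/2) / (1/4 * 1/2 + 9/16 * 1/2)
= 4/13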
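
The same result can be computed with exact fractions. The function name and its parameter are illustrative.

```python
from fractions import Fraction

def posterior_fair(heads_in_a_row=2):
    """P(fair coin | observed run of heads), with equal priors on the two coins."""
    prior = Fraction(1, 2)
    likelihood_fair = Fraction(1, 2) ** heads_in_a_row
    likelihood_biased = Fraction(3, 4) ** heads_in_a_row
    evidence = prior * likelihood_fair + prior * likelihood_biased
    return prior * likelihood_fair / evidence

print(posterior_fair())  # 4/13
```

Each additional head shifts belief further toward the biased coin.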

60
Q

Bobo the amoeba has a 25%, 25%, and 50% chance of producing 0, 1, or 2 offspring, respectively. Each of Bobo’s descendants also have the same probabilities. What is the probability that Bobo’s lineage dies out?

A

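Let p be the probability that the lineage dies out. Conditioning on Bobo’s number of offspring, each of whose lineages must die out independently:
p = 1/4 + (1/4)p + (1/2)p^2
This gives 2p^2 - 3p + 1 = 0, so p = 1/2 or p = 1. The extinction probability is the smaller root, so p = 1/2.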
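
The smaller root can also be found numerically. The sketch below iterates the offspring equation starting from 0, a standard approach for branching processes that converges to the smallest non-negative solution. The function name and iteration count are illustrative.

```python
def extinction_probability(offspring_probs, iterations=200):
    """Iterate q -> sum(p_k * q**k) from q = 0; converges to the extinction probability."""
    q = 0.0
    for _ in range(iterations):
        q = sum(p * q**k for k, p in enumerate(offspring_probs))
    return q

print(round(extinction_probability([0.25, 0.25, 0.5]), 6))  # 0.5
```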