Statistics Flashcards
What is the central limit theorem?
The central limit theorem states that if you have a population with mean μ and finite standard deviation σ and repeatedly take sufficiently large random samples of size n from it with replacement, then the distribution of the sample means will be approximately normal, whatever the shape of the population. The sample means are centered on μ with a standard deviation of about σ/√n.
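A small simulation can make the theorem concrete. The sketch below is only an illustration: the exponential population, the sample size of 50, and the 2,000 repetitions are all assumed values, not part of the flashcard.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical, strongly right-skewed population (exponential, mean about 1).
population = [random.expovariate(1.0) for _ in range(100_000)]


def sample_mean(n):
    """Draw n values with replacement and return their mean."""
    return statistics.fmean(random.choices(population, k=n))


n = 50  # assumed sample size
sample_means = [sample_mean(n) for _ in range(2_000)]

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = pop_sd / n ** 0.5  # sigma / sqrt(n)

print("population mean:", pop_mean, "mean of sample means:", mean_of_means)
print("sd of sample means:", sd_of_means, "sigma / sqrt(n):", expected_se)
```

Even though the population is skewed, the sample means cluster around the population mean, and their spread is close to σ/√n.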
What is an outlier? How can outliers be determined in a dataset?
Outliers are data points that differ markedly from the other observations in the dataset.
Depending on the learning process, outliers can noticeably reduce the accuracy and efficiency of a model.
Two common methods for detecting outliers are:
Standard deviation/z-score: flag points that lie more than about 3 standard deviations from the mean
Interquartile range (IQR): flag points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR
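Both methods can be sketched in a few lines. The data below is made up, and the z-score cutoff of 2 is an assumption chosen because in a sample of only ten points no value can reach a z-score of 3; with larger samples a cutoff of 3 is more usual.

```python
import statistics

# Made-up values; 95 is the suspicious one.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]

# Method 1: z-score. Flag points far from the mean in standard-deviation units.
mean = statistics.fmean(data)
sd = statistics.stdev(data)
z_cut = 2.0  # assumed cutoff for this tiny sample
z_outliers = [x for x in data if abs(x - mean) / sd > z_cut]

# Method 2: IQR. Flag points outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

The IQR method is less affected by the outlier itself, because quartiles barely move when one extreme value is added, while the mean and standard deviation do.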
How is missing data handled in statistics?
Prediction of the missing values
Assignment of individual (unique) values
Deletion of rows that have missing data
Mean imputation or median imputation
Using models such as random forests, where the implementation supports missing values
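Row deletion and mean or median imputation, listed above, are simple enough to sketch directly. The column of ages is invented for illustration, and missing entries are assumed to be stored as None.

```python
import statistics

# Made-up column; None marks a missing value (an assumption of this sketch).
ages = [25, None, 31, 40, None, 28]

# Deletion: keep only the rows where the value is present.
observed = [x for x in ages if x is not None]

# Mean and median imputation: replace each missing value with a summary
# of the observed values.
mean_fill = statistics.fmean(observed)
median_fill = statistics.median(observed)

mean_imputed = [mean_fill if x is None else x for x in ages]
median_imputed = [median_fill if x is None else x for x in ages]

print("after deletion:", observed)
print("mean imputed:", mean_imputed)
print("median imputed:", median_imputed)
```

Deletion shrinks the dataset, while imputation keeps every row but reduces the variability of the column, so the choice depends on how much data is missing and why.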
What is exploratory data analysis?
Exploratory data analysis is the process of performing investigations on data to understand the data better.
These initial investigations are used to discover patterns, spot anomalies, test hypotheses, and check whether assumptions hold.
What is the meaning of selection bias?
Selection bias is a phenomenon that involves the selection of individual or grouped data in a way that is not considered to be random. Randomization plays a key role in performing analysis and understanding model functionality better.
If correct randomization is not achieved, then the resulting sample will not accurately represent the population.
State the case where the median is a better measure when compared to the mean.
When the data contains many outliers or is strongly skewed, the median is preferred. Extreme values pull the mean toward them, while the median is barely affected, so it gives a more representative measure of the center.
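A quick numeric illustration, using invented salary figures (in thousands) with a single extreme value:

```python
import statistics

# Hypothetical salaries in thousands; 400 is the extreme value.
salaries = [30, 32, 35, 36, 38, 40, 400]

mean_salary = statistics.fmean(salaries)
median_salary = statistics.median(salaries)

print("mean:", round(mean_salary, 1))
print("median:", median_salary)
```

The mean lands far above what six of the seven people earn, while the median stays in the middle of the typical values.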
What type of data does not have a log-normal distribution or a Gaussian distribution?
Exponentially distributed data follows neither a log-normal distribution nor a Gaussian distribution. Categorical data of any kind does not follow these distributions either.
Example (exponential): Duration of a phone call, time until the next earthquake, etc.
What is the meaning of the five-number summary in Statistics?
The five-number summary is a set of five values that together describe the entire range of the data:
Low extreme (Min)
First quartile (Q1)
Median
Upper quartile (Q3)
High extreme (Max)
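The five values can be computed with the standard library. This is a sketch on made-up data; note that several quartile conventions exist, and the "inclusive" method used here is just one of them, so other tools may report slightly different Q1 and Q3.

```python
import statistics

# Made-up, already sorted observations.
data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

# quantiles(n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

summary = (min(data), q1, median, q3, max(data))
print("min, Q1, median, Q3, max:", summary)
```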
What are population and sample in Inferential Statistics, and how are they different?
A population is the complete set of observations (data) of interest. A sample is a smaller subset drawn from that population. Working with the whole population is often impractical: the large volume of data raises the computational cost, and not all data points in the population may be available.
In short:
We calculate the statistics using the sample.
Using these sample statistics, we make conclusions about the population.
What is skewness?
Skewness measures the lack of symmetry in a data distribution. In skewed data, the mean, the median, and the mode typically differ from one another. Strongly skewed data is not well described by a normal distribution.
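One common way to put a number on skewness is the average of the cubed z-scores (the moment-based definition). The sketch below assumes that definition and uses invented datasets; other skewness formulas with small-sample corrections also exist.

```python
import statistics


def skewness(values):
    """Moment-based skewness: mean of cubed z-scores (population formula)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((x - mean) / sd) ** 3 for x in values])


symmetric = [1, 2, 3, 4, 5]          # made-up symmetric data
right_skewed = [1, 1, 2, 2, 3, 20]   # long right tail
left_skewed = [-x for x in right_skewed]  # mirror image, long left tail

print("symmetric:", skewness(symmetric))
print("right-skewed:", skewness(right_skewed))
print("left-skewed:", skewness(left_skewed))
```

A value near zero indicates symmetry, a positive value a longer right tail, and a negative value a longer left tail.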
What is kurtosis?
Kurtosis describes how heavy the tails of a distribution are compared with a normal distribution; unlike skewness, it does not compare one tail with the other. It is often read as a measure of how prone the distribution is to outliers. A high kurtosis indicates that extreme values are more frequent in the data. Depending on the analysis, this can be addressed by collecting more data, by investigating or removing the outliers, or by using methods that are robust to them.
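Kurtosis can be estimated as the average of the fourth powers of the z-scores. The sketch below assumes that moment-based definition and reports excess kurtosis (kurtosis minus 3, so a normal distribution scores about 0); both datasets are invented.

```python
import statistics


def excess_kurtosis(values):
    """Mean of fourth-power z-scores minus 3 (population formula)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((x - mean) / sd) ** 4 for x in values]) - 3.0


# Flat data with no extreme values: tails lighter than a normal distribution.
flat = list(range(1, 101))

# Mostly identical values with two extreme points, one in each tail.
heavy_tailed = [0] * 98 + [-50, 50]

print("flat:", excess_kurtosis(flat))
print("heavy-tailed:", excess_kurtosis(heavy_tailed))
```

The heavy-tailed dataset is perfectly symmetric, yet its kurtosis is very high, which shows that kurtosis reflects extreme values in both tails rather than an imbalance between them.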
What is correlation?
Correlation measures the strength and direction of the linear relationship between two quantitative variables. Unlike covariance, which depends on the units of the variables, correlation is standardized, so it tells us how strong the relationship is. The value of correlation between two variables ranges from -1 to +1.
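The difference from covariance shows up when one variable is rescaled. The sketch below computes both by hand (sample covariance and Pearson correlation) on made-up study data; the variable names are illustrative only.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


hours = [1, 2, 3, 4, 5]                   # hypothetical hours studied
score = [52, 55, 61, 64, 70]              # hypothetical test scores
score_rescaled = [s * 10 for s in score]  # same scores in different units

print("covariance:", covariance(hours, score), covariance(hours, score_rescaled))
print("correlation:", correlation(hours, score), correlation(hours, score_rescaled))
```

Rescaling the scores multiplies the covariance by ten but leaves the correlation unchanged, which is why correlation can be compared across variable pairs.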
What are left-skewed and right-skewed distributions?
A left-skewed distribution is one where the left tail is longer than the right tail. Here, it is important to note that typically mean < median < mode.
Similarly, a right-skewed distribution is one where the right tail is longer than the left one. Here, typically mean > median > mode.
What is the difference between Descriptive and Inferential Statistics?
Descriptive statistics: Descriptive statistics summarizes a sample of data using measures such as the mean or the standard deviation.
Inferential statistics: Inferential statistics draws conclusions about a population from sample data that is subject to random variation.
What are the types of sampling in Statistics?
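Simple random: Every member of the population has an equal chance of being selected
Cluster: The population is divided into clusters, and whole clusters are selected at random
Stratified: The population is divided into distinct groups (strata), and a random sample is drawn from each group
Systematic: Every nth member of the data is picked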
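The four schemes can be sketched on a toy population. Everything here is assumed for illustration: the population of IDs 1 to 100, the two strata, the ten clusters, and the sample sizes.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

population = list(range(1, 101))  # hypothetical member IDs 1..100

# Simple random: every member has the same chance of being picked.
simple = random.sample(population, 10)

# Systematic: random starting point, then every k-th member.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified: split into distinct groups, then sample within each group.
strata = {
    "low": [x for x in population if x <= 50],
    "high": [x for x in population if x > 50],
}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

# Cluster: split into clusters, then keep every member of a few random clusters.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster_sample = [x for cluster in random.sample(clusters, 2) for x in cluster]

print("simple:", simple)
print("systematic:", systematic)
print("stratified:", stratified)
print("cluster:", cluster_sample)
```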
What is the meaning of covariance?
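Covariance measures how two random variables vary together. A positive covariance means the variables tend to increase or decrease together, a negative covariance means one tends to increase when the other decreases, and a covariance near zero indicates no linear relationship. Its magnitude depends on the units of the variables, so it shows the direction of the relationship rather than its strength.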
If a distribution is skewed to the right and has a median of 20, will the mean be greater than or less than 20?
If the given distribution is right-skewed, then the mean is typically greater than 20, while the mode is typically less than 20.
The standard normal curve has a total area under it equal to one, and it is symmetric around zero. True or False?
True. The standard normal curve has a total area under the curve equal to one and is symmetric around zero. Because of this symmetry, the mean, the median, and the mode are all equal to zero.
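These properties can be checked numerically with the standard library's NormalDist. The sketch below approximates the area with a simple sum over the range -8 to 8; the range and step size are assumptions that are fine for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Approximate the total area under the density with a Riemann sum.
step = 0.001
grid = [-8 + i * step for i in range(16_001)]
area = sum(std_normal.pdf(x) for x in grid) * step

# Symmetry: the density at x matches the density at -x.
is_symmetric = all(
    abs(std_normal.pdf(x) - std_normal.pdf(-x)) < 1e-12 for x in (0.5, 1.0, 2.5)
)

print("area:", round(area, 6))
print("symmetric around zero:", is_symmetric)
print("mean, median, mode:", std_normal.mean, std_normal.median, std_normal.mode)
```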
In an observation, there is a high correlation between the time a person sleeps and the amount of productive work they do. What can be inferred from this?
First, correlation does not imply causation here. Correlation only measures the strength and direction of the linear relationship between sleep and productive work. A high correlation means that the two tend to change together in a consistent, roughly linear way; it does not show that more sleep causes more productive work.
What is the relationship between the confidence level and the significance level in statistics?
The significance level (α) is the probability of rejecting the null hypothesis when it is actually true, that is, of seeing a result extreme enough to be called significant purely by chance. The confidence level is the proportion of confidence intervals, built from repeated samples, that would contain the true population parameter.
Both significance and confidence level are related by the following formula:
Significance level = 1 − Confidence level
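The formula can be applied directly. The sketch below also looks up the two-sided critical z-value for each confidence level, which assumes a normal sampling distribution; the function names are illustrative.

```python
from statistics import NormalDist


def significance_level(confidence_level):
    """Significance level = 1 - confidence level."""
    return 1.0 - confidence_level


def two_sided_critical_z(confidence_level):
    """z-value that leaves alpha / 2 in each tail of the standard normal."""
    alpha = significance_level(confidence_level)
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for confidence in (0.90, 0.95, 0.99):
    alpha = significance_level(confidence)
    z = two_sided_critical_z(confidence)
    print(confidence, round(alpha, 2), round(z, 3))
```

A higher confidence level means a smaller significance level and a larger critical value, so the evidence needed to reject the null hypothesis is stronger.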
What are the examples of symmetric distribution?
A symmetric distribution is one where the data on the left side of the median mirrors the data on the right side of the median.
There are many examples of symmetric distribution, but the following three are the most widely used ones:
Uniform distribution
Binomial distribution (symmetric when p = 0.5)
Normal distribution
What is the relationship between mean and median in a normal distribution?
In a normal distribution, the mean is equal to the median. Comparing a dataset’s mean and median is a quick first check for normality, but it is not sufficient: other symmetric distributions, such as the uniform distribution, also have an equal mean and median.
What is the difference between the first quartile, the second quartile, and the third quartile?
Quartiles describe the distribution of data by splitting the ordered data into four equal portions; the three boundaries between these portions are called quartiles.
That is,
The lower quartile (Q1) is the 25th percentile.
The middle quartile (Q2), also called the median, is the 50th percentile.
The upper quartile (Q3) is the 75th percentile.
How do the standard error and the margin of error relate?
Margin of error = z × Standard error, where Standard error = σ / √n for a sample mean
Therefore, the margin of error increases when the standard error increases, and it decreases as the sample size n grows.
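A worked example may help. This sketch assumes the margin of error for a sample mean is the critical z-value times the standard error, with the standard error taken as σ / √n; the values σ = 15, n = 100, and the 95% confidence level are invented.

```python
import math
from statistics import NormalDist


def margin_of_error(sigma, n, confidence_level=0.95):
    """Critical z-value times the standard error of the mean (sigma / sqrt(n))."""
    standard_error = sigma / math.sqrt(n)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence_level) / 2.0)
    return z * standard_error


base = margin_of_error(sigma=15.0, n=100)          # standard error = 1.5
more_spread = margin_of_error(sigma=30.0, n=100)   # standard error = 3.0
larger_sample = margin_of_error(sigma=15.0, n=400) # standard error = 0.75

print("base:", round(base, 2))
print("larger standard deviation:", round(more_spread, 2))
print("larger sample:", round(larger_sample, 2))
```

Doubling the standard deviation doubles the margin of error, while quadrupling the sample size halves it.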