General DS questions Flashcards

1
Q

What are the assumptions required for a linear regression?

A
  • **Linear relationship** between the independent variable x and the dependent variable y.
  • Independence: the value of one observation should not depend on or be affected by the value of another observation (e.g. different people’s heights and weights are independent: one person’s height does not affect another person’s weight. However, if we measure the same person’s weight several times during the day, those measurements will be related).
  • Homoscedasticity: the variance of the errors (the differences between the actual and predicted values) is the same no matter what value the independent variable takes, e.g. if we are predicting people’s weights, the error should be roughly the same for tall and short people.
  • Normality: the errors (the differences between the actual values and the predicted values) should be normally distributed. It he
2
Q

How do you explain technical aspects of your results to stakeholders with no technical background?

A

Start with a short answer and then give a more elaborated answer
- Know the stakeholder’s background and pitch the explanation at their level
- Use visuals and graphs
- Focus on the results and their implications rather than the methodology
- Provide a summary
- Create room for questions

3
Q

How can you avoid overfitting the model?

A

Overfitting: the model fits the training dataset too closely (including its noise) and therefore performs poorly on the validation and test datasets. Ways to avoid it:
- Keeping the model simple, taking fewer variables and parameters
- Using cross-validation techniques
- Training with more data
- Using data augmentation that increases the number of samples
- Using ensembling (Bagging and boosting)
- Using regularization techniques to penalize certain model parameters if they are likely to cause overfitting
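
As a rough illustration of the regularization point, the sketch below fits a one-feature model without an intercept and adds an L2 (ridge) penalty on the slope. The data and the `ridge_slope` helper are made up for illustration; a larger penalty `lam` shrinks the fitted slope toward zero.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)^2) + lam * w^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # hypothetical data, roughly y = 2x

unpenalized = ridge_slope(xs, ys, lam=0.0)   # ordinary least squares
penalized = ridge_slope(xs, ys, lam=10.0)    # slope is pulled toward 0
```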

4
Q

List different types of relationships in SQL

A
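  • One to one: each row in one table matches at most one row in another, e.g. one table holds employee IDs and names, another holds employee IDs and job descriptions, linked by EmployeeID.
  • One to many: e.g. a table of departments and a table of employees, where each department has many employees and each employee belongs to one department.
  • Many to many: a student can enroll in multiple courses and a course can have multiple students (usually implemented with a junction table).
  • Self-referencing: a table has a foreign key to itself, e.g. an employee row that references its manager’s row.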
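
The four relationship types can be sketched with an in-memory SQLite database, driven from Python's built-in `sqlite3` module. All table and column names here are hypothetical, chosen to mirror the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT);
-- One to many: each employee row points at exactly one department.
CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT,
    department_id INTEGER REFERENCES department(id),
    -- Self-referencing: a manager is another row in the same table.
    manager_id INTEGER REFERENCES employee(id)
);
-- One to one: the primary key doubles as a foreign key to employee.
CREATE TABLE job_description (
    employee_id INTEGER PRIMARY KEY REFERENCES employee(id),
    description TEXT
);
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT);
-- Many to many: a junction table pairs students with courses.
CREATE TABLE enrollment (
    student_id INTEGER REFERENCES student(id),
    course_id INTEGER REFERENCES course(id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO student VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO course VALUES (10, 'SQL'), (20, 'Stats');
INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")

# Count the courses each student is enrolled in via the junction table.
courses_per_student = conn.execute(
    "SELECT s.name, COUNT(*) FROM student s "
    "JOIN enrollment e ON e.student_id = s.id "
    "GROUP BY s.id ORDER BY s.id"
).fetchall()
```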
5
Q

What is the goal of A/B testing?

A

Summary: A statistical method to compare two versions (e.g., of a web page) to see which one performs better.
Key Terms:
* Control Group: The group that does not receive the treatment.
* Treatment Group: The group that receives the treatment being tested.

It eliminates guesswork and supports data-driven decisions when optimizing a product or website.
Users are randomly assigned to two or more versions, and the outcomes of the versions are compared.
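
One common way to compare the two groups is a two-proportion z-test on conversion rates. The sketch below is a minimal version of that test; the conversion counts are hypothetical, and other tests may fit other metrics better.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical counts: control converts 100/1000, treatment 130/1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
```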

5
Q

What is Probability and What Are Distributions?

A

Summary: Probability is the study of randomness and uncertainty. Distributions show how data points are spread out.

Key Terms:
* Normal Distribution (Bell Curve): A symmetric distribution where most values cluster around a central mean, with fewer values toward the extremes.
* **Binomial Distribution:** A distribution representing the number of successes in a fixed number of independent trials, with a constant probability of success.
* **Uniform Distribution:** A distribution where all outcomes are equally likely.
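
A small sketch of the binomial and (discrete) uniform cases using only the standard library. The helper name `binomial_pmf` is an assumption for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 5 heads in 10 fair coin flips: 252 / 1024.
p_five_heads = binomial_pmf(5, 10, 0.5)

# A discrete uniform distribution: every face of a fair die has probability 1/6.
die_pmf = {face: 1 / 6 for face in range(1, 7)}
```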

5
Q

What are Descriptive Statistics?

A

Summary: These are basic statistics that summarize and describe the features of a dataset.

Key Terms:
Mean: The average value of a dataset.
Median: The middle value when the data is sorted.
Mode: The most frequently occurring value in the dataset.
Variance: The average squared deviation of the values from the mean.
Standard Deviation: The square root of the variance; it indicates how far a typical data point lies from the mean, in the same units as the data.
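
These statistics can be computed with Python's built-in `statistics` module. The sample below is made up, and the population versions (`pvariance`, `pstdev`) are used; `variance` and `stdev` give the sample versions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

mean = statistics.mean(data)            # 5
median = statistics.median(data)        # 4.5
mode = statistics.mode(data)            # 4
variance = statistics.pvariance(data)   # population variance: 4
std_dev = statistics.pstdev(data)       # square root of the variance: 2.0
```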

6
Q

What is Regression Analysis?

A

Summary: A method to model and analyze relationships between variables, often used for prediction.

Key Types:
* Linear Regression: Predicts a continuous outcome by finding the best-fitting line through the data points.
* **Logistic Regression:** Predicts a binary outcome (e.g., yes/no) using a logistic function to model the probability of a particular class.
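
To make the logistic function concrete, the sketch below turns a linear score into a probability and then into a yes/no label. The coefficients are hand-picked for illustration rather than fitted, and the 0.5 threshold is a common default, not a requirement.

```python
import math

def predict_proba(x, weight, bias):
    """Logistic function applied to a linear score: returns P(y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

def predict_label(x, weight, bias, threshold=0.5):
    """Turn the probability into a yes/no decision."""
    return int(predict_proba(x, weight, bias) >= threshold)

# Hypothetical, hand-picked coefficients (not fitted to any data).
weight, bias = 1.5, -3.0
label_low = predict_label(0.0, weight, bias)    # score -3.0, probability below 0.5
label_high = predict_label(3.0, weight, bias)   # score +1.5, probability above 0.5
```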

6
Q

What is Clustering?

A

Summary: Grouping data points into clusters based on similarity, often used for exploratory data analysis.

Key Algorithms:
* **k-Means Clustering:** Partitions the data into k clusters where each data point belongs to the cluster with the nearest mean.
* Hierarchical Clustering: Builds a hierarchy of clusters by either merging or splitting them based on similarity.
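
A minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data, assuming the starting centroids are supplied by the caller. Real implementations also handle initialization and convergence checks, which are omitted here.

```python
def k_means_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]  # made-up data
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
```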

6
Q

What Are Inferential Statistics?

A

Summary: Techniques that allow you to make predictions or inferences about a population based on a sample of data.

Key Terms:
* **Hypothesis Testing:** A method for testing a hypothesis about a population parameter using sample data.
* **p-Value:** The probability of observing data at least as extreme as the data actually observed, if the null hypothesis is true; a low p-value suggests that the null hypothesis may not be true.
* **Confidence Intervals:** A range of values that is likely to contain the population parameter, providing an estimate with a stated level of confidence.
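
As a sketch of a confidence interval, the code below builds an approximate 95% interval for a mean using the normal approximation (z = 1.96). The sample is hypothetical; for a sample this small, a t critical value would usually be preferred.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% CI for the mean (normal approximation, z = 1.96)."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error
    return m - z * se, m + z * se

sample = [9.8, 10.2, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0]  # hypothetical measurements
low, high = mean_confidence_interval(sample)
```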

7
Q

What is Classification in Machine Learning?

A

Summary: A type of machine learning where the goal is to predict categories or labels (e.g., spam or not spam).

Key Algorithms:
* **Decision Trees:** A model that splits the data into branches to make decisions based on feature values.
* **Random Forests:** An ensemble method that builds multiple decision trees and merges their results for better accuracy.
* Support Vector Machines (SVM): A model that finds the best boundary (hyperplane) to separate different classes in the data.

7
Q

What is the Difference Between Correlation and Causation?

A

Summary: Correlation measures the relationship between two variables, while causation implies that one variable causes a change in another.

Key Terms:
**Correlation Coefficient:** A value ranging from -1 to 1 that indicates the strength and direction of the linear relationship between two variables; 1 means perfect positive correlation, -1 means perfect negative correlation, and 0 means no linear correlation.
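
A short sketch of the Pearson correlation coefficient on made-up data. The last example shows why a coefficient of 0 only rules out a linear relationship: y depends entirely on x, yet r is 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])      # perfectly increasing line
r_down = pearson_r(xs, [10, 8, 6, 4, 2])    # perfectly decreasing line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x^2
```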

8
Q

What is Dimensionality Reduction?

A

Summary: Reducing the number of variables (features) in a dataset while retaining its essential information.

Goal:
- reducing storage
- reducing computational time
- removing redundant features

Key Techniques:
* Principal Component Analysis (PCA): Transforms the data into a set of linearly uncorrelated components, ordered by the amount of variance they capture.
* **t-SNE (t-Distributed Stochastic Neighbor Embedding):** A technique for reducing the dimensions of data while preserving its local neighborhood structure, often used for visualizing high-dimensional data.

8
Q

What is Cross-Validation?

A

Summary: A technique to assess how a model generalizes to an independent dataset, often used to detect overfitting.

Key Terms:
**k-Fold Cross-Validation:** The data is split into k subsets, and the model is trained and validated k times, each time using a different subset as the validation set and the remaining subsets as the training set.
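
The k-fold split can be sketched as a generator of index lists. `k_fold_indices` is a hypothetical helper; in practice the data is usually shuffled before splitting, which is omitted here to keep the output predictable.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=6, k=3))
```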

9
Q

What Are Model Evaluation Metrics?

A

Summary: Metrics to evaluate the performance of your models.
Key Metrics:
* Accuracy: The ratio of correctly predicted instances to the total instances.
* Precision: The ratio of correctly predicted positive observations to the total predicted positives.
* Recall: The ratio of correctly predicted positive observations to all the observations in the actual class.
* **F1 Score:** The harmonic mean of precision and recall, providing a balance between them.
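
The four metrics follow directly from the counts of true/false positives and negatives. The labels and predictions below are made up, and the helper returns 0.0 when a ratio is undefined (an assumption; libraries differ on this).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # made-up labels
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # made-up predictions
metrics = classification_metrics(y_true, y_pred)
```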

10
Q

What Are Overfitting and Underfitting?

A

Summary: Overfitting happens when a model is too complex and captures noise instead of the pattern, while underfitting occurs when a model is too simple to capture the pattern in data.
Key Terms:
* Overfitting: When the model performs well on training data but poorly on new data.
* Underfitting: When the model performs poorly on both training data and new data.

11
Q

What is Feature Engineering?

A

Summary: The process of selecting, modifying, or creating features (variables) to improve model performance.

Key Techniques:
* Binning: Grouping continuous values into discrete bins.
* Polynomial Features: Creating new features by raising existing features to a power.
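
Both techniques fit in a few lines. The bin edges, labels, and helper names below are hypothetical choices for illustration.

```python
def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value."""
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]  # values above the last edge fall in the final bin

def polynomial_features(x, degree):
    """Return [x, x^2, ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]

# Hypothetical age bins: up to 17, 18 to 64, 65 and over.
ages = [12, 30, 70]
age_groups = [bin_value(a, edges=[17, 64], labels=["minor", "adult", "senior"]) for a in ages]
cubic = polynomial_features(3, degree=3)
```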

12
Q

What is Data Preprocessing?

A

Summary: Preparing raw data for analysis by cleaning, transforming, and organizing it.

Key Techniques:
* Handling Missing Values: Filling in or removing missing data points.
* **Scaling/Normalizing Data:** Adjusting the range of data to make it consistent (e.g., between 0 and 1).
* **Encoding Categorical Variables:** Converting categories into numerical values (e.g., one-hot encoding).
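
A minimal sketch of the three techniques on toy values. Mean imputation and min-max scaling are just one option each, and `min_max_scale` assumes the values are not all identical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode each category as a 0/1 vector, one position per category."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

filled = fill_missing_with_mean([10, None, 30])   # [10, 20.0, 30]
scaled = min_max_scale(filled)                     # [0.0, 0.5, 1.0]
encoded = one_hot(["red", "green", "red"])         # columns: green, red
```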

13
Q

What is a normal distribution, and why is it important in data science? Can you explain other distributions like binomial and uniform?

A
  • **Normal Distribution:** A symmetric, bell-shaped distribution where most values cluster around the mean. It’s important because many statistical methods assume normality.
  • **Binomial Distribution:** Represents the number of successes in a fixed number of independent trials, with a constant probability of success. Useful for binary outcomes (e.g., flipping a coin).
  • **Uniform Distribution:** All outcomes are equally likely (e.g., rolling a fair die).
  • Importance: Understanding distributions helps in selecting the right statistical methods and making predictions.
14
Q

What is hypothesis testing, and how are p-values and confidence intervals used in this context?

A
  • Hypothesis Testing: A way to make decisions or draw conclusions about a population based on sample data. It starts from an initial assumption (the null hypothesis) and uses the data to assess whether there is enough evidence to reject it.

* **p-Value:** The probability of getting the observed data (or something more extreme) if the null hypothesis is true.
A small p-value (typically less than 0.05) means that what you observed would be unlikely if the null hypothesis were true, so you might consider rejecting the null hypothesis.

* Confidence Intervals: A range of values within which you expect the true population parameter to lie, at a stated confidence level (e.g. 95%).

Importance: These concepts help in making informed decisions based on sample data, assessing the strength of evidence against the null hypothesis.
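
The "observed data or something more extreme" idea can be made concrete with an exact permutation test, one of several ways to obtain a p-value. The two groups below are hypothetical, and the approach is only practical for small samples because it enumerates every relabelling.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    extreme = total = 0
    # Under H0 the group labels are arbitrary, so try every relabelling.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        extreme += diff >= observed - 1e-12  # at least as extreme as observed
        total += 1
    return extreme / total

# Hypothetical measurements for two small groups.
p_value = permutation_p_value([1, 2, 3], [10, 11, 12])
```

Only 2 of the 20 possible relabellings are as extreme as the observed split, so the p-value is 0.1: with three observations per group, even a perfect separation cannot reach the 0.05 threshold.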

15
Q

How does linear regression work, and what are the assumptions behind it? What is the difference between linear and logistic regression?

A

Linear regression is a technique to predict a continuous outcome (e.g., house prices) by drawing the best-fitting straight line through the data points. The goal is to find the line (y = mx + b) that minimizes the sum of squared vertical distances between the actual data points and the line.

Assumptions:
* Linearity: The relationship between the independent variable (e.g., size of the house) and the dependent variable (e.g., price) is linear.
* Independence: Each observation should be independent; one doesn’t affect another.
* Homoscedasticity: The spread of the residuals (differences between actual and predicted values) should be roughly the same across all levels of the independent variable.
* Normality: The residuals should be normally distributed.

**Logistic Regression:** Predicts a probability for binary outcomes (e.g., yes/no, spam/not spam).
Instead of predicting a continuous value, it passes a linear combination of the features through a logistic function (S-shaped curve) to produce a probability between 0 and 1, which is then converted into a binary result by applying a threshold (commonly 0.5).

Difference: Linear regression predicts continuous values, while logistic regression predicts probabilities for binary outcomes.
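
For the linear case, the best-fitting line for a single feature has a closed-form least-squares solution, sketched below. The house data is hypothetical and chosen to lie exactly on a line, so the residuals come out as zero.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data: house size vs. price, both in arbitrary units.
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]   # exactly price = 2 * size + 1
m, b = fit_line(sizes, prices)

# Residuals are what the homoscedasticity and normality assumptions refer to.
residuals = [y - (m * x + b) for x, y in zip(sizes, prices)]
```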

16
Q

Can you explain accuracy, precision, recall, and the F1 score? When would you prioritize one metric over another?

A
  • Accuracy: The ratio of correctly predicted instances to the total instances. Useful when classes are balanced.
  • Precision: The ratio of correctly predicted positive observations to the total predicted positives. Important when false positives are costly (e.g., spam detection).
  • Recall: The ratio of correctly predicted positive observations to all the observations in the actual class. Crucial when false negatives are costly (e.g., disease detection).
  • F1 Score: The harmonic mean of precision and recall. It provides a balance, useful when you need to consider both false positives and false negatives.
17
Q

What are some common classification algorithms (e.g., decision trees, random forests, SVM) and when would you use them?

A
  • Decision Trees: A model that splits the data into branches based on feature values. Useful for interpreting and understanding decision rules.
  • **Random Forests:** An ensemble method that builds multiple decision trees and merges their results. Great for improving accuracy and reducing overfitting.
  • Support Vector Machines (SVM): Finds the best boundary (hyperplane) to separate classes. Effective for high-dimensional data and, with kernel functions, for cases where the classes are not linearly separable.

Usage:
* Use decision trees when you need interpretability and a simple decision-making process.
* Use random forests for better accuracy and when dealing with noisy data.
* Use SVM when you have complex, high-dimensional data and need a robust classifier.

18
Q

What are some common feature engineering techniques, and why is data preprocessing important?

A

Feature Engineering Techniques:

  • Binning: Grouping continuous values into discrete bins to reduce noise and handle non-linearity.
  • **Polynomial Features:** Creating new features by raising existing features to a power to capture non-linear relationships.
  • **Encoding Categorical Variables:** Converting categories into numerical values (e.g., one-hot encoding).

Data Preprocessing Importance:
Handling missing values, scaling, and transforming features are crucial for model performance and ensuring the model interprets the data correctly.

18
Q

How does cross-validation help in model evaluation, and why is it important? What is hyperparameter tuning?

A

**Cross-Validation:** A technique to assess how a model generalizes to an independent dataset by splitting the data into training and validation sets multiple times. A commonly used method is k-fold cross-validation.

Importance: It provides a more reliable estimate of model performance, helping to detect overfitting and select the best model.

**Hyperparameter Tuning:** The process of finding the optimal set of hyperparameters (parameters not learned from data) for a model. This can be done using methods like grid search or random search.
Importance: Proper tuning helps in improving model performance and generalizability.
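
Grid search can be sketched as "try every candidate value and keep the one with the lowest validation error". The example tunes the penalty `lam` of a one-feature ridge model with no intercept; the data, the grid, and the helper names are all hypothetical, and a single validation split stands in for full cross-validation.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)^2) + lam * w^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = [1.0, 2.0, 3.0], [2.5, 4.5, 6.5]   # noisy training data
val_x, val_y = [4.0, 5.0], [8.0, 10.0]                # held-out data, y = 2x

grid = [0.0, 0.1, 1.0, 10.0]   # candidate values of the hyperparameter
scores = {lam: mse(ridge_slope(train_x, train_y, lam), val_x, val_y) for lam in grid}
best_lam = min(scores, key=scores.get)
```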

19
Q

What is propensity score matching, and how does it help in making causal inferences?

A

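**Propensity Score Matching (PSM):** A statistical technique used to create a balanced comparison between a treatment group and a control group. Each unit’s propensity score (its estimated probability of receiving the treatment given its observed characteristics) is computed, and treated units are matched with control units that have similar scores. This helps in reducing bias when estimating the treatment effect.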

How It Helps: PSM helps in isolating the effect of the treatment by ensuring that the treatment and control groups are similar in terms of observable characteristics, making it easier to make causal inferences.

Importance: It allows for more accurate evaluation of the treatment’s impact by controlling for observed confounding variables and reducing selection bias.
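
The matching step can be sketched as greedy one-to-one nearest-neighbour matching with a caliper, which is only one of several matching schemes. The propensity scores below are hypothetical and assumed to have been estimated already (for example with logistic regression); `match_on_propensity` is an illustrative helper, not a library function.

```python
def match_on_propensity(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    `treated` and `control` map unit ids to pre-estimated propensity scores.
    Pairs whose scores differ by more than `caliper` are left unmatched.
    """
    available = dict(control)
    matches = {}
    for unit, score in sorted(treated.items()):
        if not available:
            break
        nearest = min(available, key=lambda c: abs(available[c] - score))
        if abs(available[nearest] - score) <= caliper:
            matches[unit] = nearest
            del available[nearest]  # each control unit is used at most once
    return matches

# Hypothetical propensity scores.
treated = {"t1": 0.80, "t2": 0.33}
control = {"c1": 0.30, "c2": 0.78, "c3": 0.10}
matches = match_on_propensity(treated, control)
```

The treatment effect would then be estimated by comparing outcomes within the matched pairs.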

20
Q

What are the most common statistical tests? Explain when each one should be used

A
  • Chi-Square Test: Use for associations between categorical variables.
  • t-Test: Use for comparing means between two groups.
  • ANOVA: Use for comparing means between three or more groups.
  • Correlation Coefficients (Pearson/Spearman): Use for measuring the strength and direction of relationships between variables.
  • Regression: Use for predicting a dependent variable from one or more independent variables.
  • Mann-Whitney U: Use for comparing two independent groups with non-normal data.
  • Wilcoxon Signed-Rank: Use for comparing two related groups or repeated measures with non-normal data.