General DS questions Flashcards
What are the assumptions required for a linear regression?
- **Linear Relationship** between the independent variable x and the dependent variable y
- Independence: the value of one observation should not depend on or be affected by the value of another observation. For example, different people’s heights and weights are independent: one person’s height does not affect another person’s weight. However, if we measure the same person’s weight multiple times during the day, those measurements will be related.
- Homoscedasticity: the variation in the errors (the difference between the actual and predicted values) is the same no matter what value the independent variable takes. For example, if we are predicting people’s weights, the error should be roughly the same for tall and short people.
- Normality: the errors (the differences between the actual values and the predicted values) should be normally distributed.
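As a rough illustration of the residual-based assumptions above, the sketch below fits a one-variable least-squares line in plain Python and compares the residual spread for shorter and taller people. The height and weight values are invented for the example, and the half-split comparison is only an informal check, not a formal test.

```python
# Hypothetical heights (cm) and weights (kg), invented for illustration.
heights = [150, 155, 160, 165, 170, 175, 180, 185]
weights = [52, 55, 59, 62, 68, 70, 76, 79]

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(heights, weights)

# Residual = actual value minus predicted value.
residuals = [y - (intercept + slope * x) for x, y in zip(heights, weights)]

# Informal homoscedasticity check: compare residual spread in the
# lower half of heights against the upper half.
half = len(residuals) // 2
spread_short = sum(r ** 2 for r in residuals[:half]) / half
spread_tall = sum(r ** 2 for r in residuals[half:]) / half

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"mean squared residual: short={spread_short:.3f}, tall={spread_tall:.3f}")
```

If the two spreads differ by a large factor, or a residual plot shows a funnel shape, homoscedasticity is questionable.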
How do you explain technical aspects of your results to stakeholders with no technical background?
Start with a short answer and then give a more detailed answer
- Know the stakeholder’s background and pitch the explanation at their level
- Use visuals and graphs
- Focus on the results and the implications rather than the methodology
- Provide a summary
- Create room for questions
How can you avoid overfitting the model?
Overfitting: the model fits the training dataset too closely (including its noise) and therefore performs poorly on the validation and test datasets
- Keeping the model simple, using fewer variables and parameters
- Using cross-validation techniques
- Training with more data
- Using data augmentation that increases the number of samples
- Using ensembling (bagging and boosting)
- Using regularization techniques to penalize certain model parameters if they are likely to cause overfitting
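To make the regularization bullet concrete, here is a minimal sketch of an L2 (ridge) penalty for a single predictor. For centered data, minimizing the squared error plus `penalty * slope**2` gives `slope = Sxy / (Sxx + penalty)`. The function name `ridge_slope` and the data are assumptions made for this example.

```python
def ridge_slope(x, y, penalty):
    """Slope of a one-variable ridge regression on centered data.

    Minimizes sum((y - b * x) ** 2) + penalty * b ** 2,
    whose solution is b = Sxy / (Sxx + penalty).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / (sxx + penalty)

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

# A larger penalty shrinks the slope toward zero.
for penalty in (0, 1, 10, 100):
    print(penalty, round(ridge_slope(x, y, penalty), 3))
```

With `penalty=0` this reduces to ordinary least squares. Increasing the penalty trades some fit on the training data for a simpler, more stable model.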
List different types of relationships in SQL
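- One to one: each row in one table matches at most one row in another table, e.g. one table holds EmployeeID and names, another holds EmployeeID and job descriptions
- One to many: one row in a table relates to many rows in another, e.g. a table with departments and a table with the employees in each department
- Many to many: a student can enroll in multiple courses and a course can have multiple students
- Self referencing: a table has a foreign key that points back to its own primary key, e.g. an employee whose manager is also an employee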
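The four relationship types above can be sketched with Python's built-in `sqlite3` module. The table and column names below are hypothetical, chosen to match the examples in the list; a real schema would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        -- One to many: many employees point at one department.
        department_id INTEGER REFERENCES departments(id),
        -- Self referencing: a manager is another row in the same table.
        manager_id INTEGER REFERENCES employees(id)
    );

    -- One to one: the primary key is also a foreign key,
    -- so each employee has at most one job description row.
    CREATE TABLE job_descriptions (
        employee_id INTEGER PRIMARY KEY REFERENCES employees(id),
        description TEXT
    );

    -- Many to many: a junction table links students and courses.
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE enrollments (
        student_id INTEGER REFERENCES students(id),
        course_id INTEGER REFERENCES courses(id),
        PRIMARY KEY (student_id, course_id)
    );
""")

conn.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany("INSERT INTO courses VALUES (?, ?)", [(1, "SQL"), (2, "Statistics")])
conn.executemany("INSERT INTO enrollments VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])

# Resolve the many-to-many relationship through the junction table.
rows = conn.execute("""
    SELECT s.name, c.title
    FROM enrollments e
    JOIN students s ON s.id = e.student_id
    JOIN courses c ON c.id = e.course_id
    ORDER BY s.name, c.title
""").fetchall()
print(rows)
```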
What is the goal of A/B testing?
Summary: A statistical method to compare two versions (e.g., of a web page) to see which one performs better.
Key Terms:
* Control Group: The group that does not receive the treatment.
* Treatment Group: The group that receives the treatment being tested.
A/B testing eliminates guesswork and supports data-driven decisions to optimize the product or website.
Randomized experiments are conducted to compare two or more versions of a variable.
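One common way to compare a control and a treatment group on a conversion rate is a two-proportion z-test. The sketch below is one possible approach, not the only valid one; the function name and the visitor and conversion counts are invented for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF written with the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical counts: control (A) vs treatment (B).
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z={z:.2f}, p-value={p_value:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely to be due to chance alone, provided users were assigned to groups at random.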
What is Probability and What Are Distributions?
Summary: Probability is the study of randomness and uncertainty. Distributions show how data points are spread out.
Key Terms:
* Normal Distribution (Bell Curve): A symmetric distribution where most values cluster around a central mean, with fewer values toward the extremes.
* **Binomial Distribution:** A distribution representing the number of successes in a fixed number of independent trials, with a constant probability of success.
* **Uniform Distribution:** A distribution where all outcomes are equally likely.
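As a small worked example of the binomial and uniform definitions above, the following sketch computes binomial probabilities with `math.comb` and the uniform probabilities for a fair die. The coin-flip and die scenarios are assumptions chosen for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Ten fair coin flips: probability of each possible number of heads.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(round(pmf[5], 4))  # 252 / 1024, about 0.2461

# Discrete uniform distribution: a fair six-sided die.
die_pmf = {face: 1 / 6 for face in range(1, 7)}
print(die_pmf[3])
```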
What are Descriptive Statistics?
Summary: These are basic statistics that summarize and describe the features of a dataset.
Key Terms:
Mean: The average value of a dataset.
Median: The middle value when the data is sorted.
Mode: The most frequently occurring value in the dataset.
Variance: The average squared deviation of the values from the mean.
Standard Deviation: The square root of the variance; it indicates how spread out the data is, in the same units as the data.
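These terms map directly onto Python's built-in `statistics` module. The dataset below is made up so that the results come out as round numbers.

```python
import statistics

# Hypothetical dataset, invented for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # 5
median = statistics.median(data)       # 4.5 (average of the two middle values)
mode = statistics.mode(data)           # 4
variance = statistics.pvariance(data)  # population variance: 32 / 8 = 4
std_dev = statistics.pstdev(data)      # square root of the variance: 2

print(mean, median, mode, variance, std_dev)

# Sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(data)  # 32 / 7
```

Use `pvariance` and `pstdev` when the data is the whole population, and `variance` and `stdev` when it is a sample from a larger population.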
What is Regression Analysis?
Summary: A method to model and analyze relationships between variables, often used for prediction.
Key Types:
* Linear Regression: Predicts a continuous outcome by finding the best-fitting line through the data points.
* **Logistic Regression:** Predicts a binary outcome (e.g., yes/no) using a logistic function to model the probability of a particular class.
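The difference between the two regression types can be seen in how each turns an input into a prediction. In this sketch the intercept and slope are hypothetical values, not fitted coefficients, and the "hours studied" framing is invented for illustration.

```python
import math

def linear_prediction(x, intercept, slope):
    """Linear regression output: an unbounded continuous value."""
    return intercept + slope * x

def logistic_probability(x, intercept, slope):
    """Logistic regression output: the linear score squashed into (0, 1)."""
    score = linear_prediction(x, intercept, slope)
    return 1 / (1 + math.exp(-score))

# Hypothetical coefficients: probability of passing an exam by hours studied.
intercept, slope = -4.0, 0.8
for hours in (2, 5, 8):
    prob = logistic_probability(hours, intercept, slope)
    label = "yes" if prob >= 0.5 else "no"
    print(hours, round(prob, 3), label)
```

The 0.5 threshold used to turn a probability into a yes/no label is a common default, not a requirement.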
What is Clustering?
Summary: Grouping data points into clusters based on similarity, often used for exploratory data analysis.
Key Algorithms:
* **k-Means Clustering:** Partitions the data into k clusters where each data point belongs to the cluster with the nearest mean.
* Hierarchical Clustering: Builds a hierarchy of clusters by either merging or splitting them based on similarity.
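A minimal sketch of the k-means loop in plain Python is shown below: assign each point to its nearest centroid, then move each centroid to the mean of its points. The points and starting centroids are invented, and the fixed iteration count is a simplification; library implementations typically stop once assignments no longer change.

```python
def k_means(points, centroids, iterations=10):
    """Basic k-means on 2D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```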
What Are Inferential Statistics?
Summary: Techniques that allow you to make predictions or inferences about a population based on a sample of data.
Key Terms:
* **Hypothesis Testing:** A method for testing a hypothesis about a parameter in a population using data.
* **p-Value:** The probability of observing data at least as extreme as the data actually observed, assuming the null hypothesis is true; a low p-value suggests that the null hypothesis may not be true.
* **Confidence Intervals:** A range of values that is likely to contain the population parameter, providing an estimate with a stated level of confidence (e.g., 95%).
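As an illustration of a confidence interval, the sketch below estimates a 95% interval for a population mean from a sample, using the normal approximation. The sample values are invented, and for a sample this small a t-distribution would usually be more appropriate; the normal version is used here only to keep the example within the standard library.

```python
import math
import statistics

# Hypothetical sample measurements, invented for illustration.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

mean = statistics.mean(sample)
# Standard error of the mean: sample standard deviation / sqrt(n).
std_err = statistics.stdev(sample) / math.sqrt(len(sample))

# Critical value for a two-sided 95% interval (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

lower = mean - z * std_err
upper = mean + z * std_err
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```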
What is Classification in Machine Learning?
Summary: A type of machine learning where the goal is to predict categories or labels (e.g., spam or not spam).
Key Algorithms:
* **Decision Trees:** A model that splits the data into branches to make decisions based on feature values.
* **Random Forests:** An ensemble method that builds multiple decision trees and merges their results for better accuracy.
* Support Vector Machines (SVM): A model that finds the best boundary (hyperplane) to separate different classes in the data.
What is the Difference Between Correlation and Causation?
Summary: Correlation measures the relationship between two variables, while causation implies that one variable causes a change in another.
Key Terms:
**Correlation Coefficient:** A value ranging from -1 to 1 that indicates the strength and direction of the linear relationship between two variables; 1 means perfect positive correlation, -1 means perfect negative correlation, and 0 means no linear correlation.
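The Pearson correlation coefficient can be computed directly from its definition. The datasets below are invented; the last one shows that a coefficient of 0 only rules out a linear relationship, since y is fully determined by x there.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]

r_positive = pearson_r(x, [2 * xi + 1 for xi in x])  # perfect positive: 1
r_negative = pearson_r(x, [-3 * xi for xi in x])     # perfect negative: -1
r_zero = pearson_r(x, [xi ** 2 for xi in x])         # y = x^2: 0, yet y depends on x

print(r_positive, r_negative, r_zero)
```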
What is Dimensionality Reduction?
Summary: Reducing the number of variables (features) in a dataset while retaining its essential information.
Goal:
- reducing storage
- reducing computational time
- removing redundant features
Key Techniques:
* Principal Component Analysis (PCA): Transforms the data into a set of linearly uncorrelated components, ordered by the amount of variance they capture.
* **t-SNE (t-Distributed Stochastic Neighbor Embedding):** A technique for reducing the dimensions of data while preserving its local neighborhood structure, often used for visualizing high-dimensional data.
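For two features, PCA can be worked out by hand from the 2x2 covariance matrix, which makes the idea easy to see without any libraries. The sketch below is limited to the two-dimensional case and uses invented data; real use would normally rely on a numerical library.

```python
import math

def pca_2d(points):
    """First principal component of 2D data via the 2x2 covariance matrix."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    mid = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eig1, eig2 = mid + radius, mid - radius

    # Direction of the largest-variance axis.
    theta = 0.5 * math.atan2(2 * b, a - c)
    direction = (math.cos(theta), math.sin(theta))

    # Reduce 2D to 1D by projecting each point onto that axis.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    explained = eig1 / (eig1 + eig2)
    return direction, projected, eig1, explained

# Hypothetical, strongly correlated two-feature data.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
direction, projected, eig1, explained = pca_2d(points)
print(f"direction={direction}, variance explained={explained:.3f}")
```

Because the two features move together, a single component keeps most of the variance, which is the sense in which a redundant feature is removed.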
What is Cross-Validation?
Summary: A technique to assess how a model generalizes to an independent dataset, often used to detect overfitting and to compare models.
Key Terms:
**k-Fold Cross-Validation:** The data is split into k subsets, and the model is trained and validated k times, each time using a different subset as the validation set and the remaining subsets as the training set.
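The splitting step of k-fold cross-validation can be sketched as an index generator. The function name `k_fold_indices` is made up for this example, and the rows are not shuffled here; in practice the data is usually shuffled first unless order matters (e.g., time series).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample appears in exactly one validation set, so every data point is used for validation once and for training k - 1 times.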
What Are Model Evaluation Metrics?
Summary: Metrics to evaluate the performance of your models.
Key Metrics:
* Accuracy: The ratio of correctly predicted instances to the total instances.
* Precision: The ratio of correctly predicted positive observations to the total predicted positives.
* Recall: The ratio of correctly predicted positive observations to all actual positive observations.
* **F1 Score:** The harmonic mean of precision and recall, providing a balance between them.
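The four metrics above follow from the counts of true and false positives and negatives. The labels in this sketch are invented, with 1 as the positive class; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical labels: 3 true positives, 1 false negative,
# 1 false positive, 5 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```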