Data Scientist Interview Flashcards

Question 1

Q

What is variance?

Answer

A

Variance measures data spread by averaging squared differences from the mean. Higher variance means more spread.

Variance is a key concept in statistics that helps in understanding the dispersion of data points in a dataset.
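As a minimal sketch of the definition above, the snippet below computes variance by hand and compares it with Python's `statistics` module. The sample values are made up for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample values

mean = statistics.fmean(data)

# Population variance: average of squared deviations from the mean.
pop_var = sum((x - mean) ** 2 for x in data) / len(data)

# Sample variance divides by n - 1 instead of n (Bessel's correction).
sample_var = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

print(pop_var, statistics.pvariance(data))
print(sample_var, statistics.variance(data))
```

In an interview it is worth stating which version you mean: the population formula divides by n, the sample formula by n - 1.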

Question 2

Q

Difference between mean and median?

Answer

A

Mean is the average. Median is the middle value. Median is less affected by outliers.

The mean can be skewed by extreme values, while the median provides a better measure of central tendency in such cases.
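A small illustration of that sensitivity, using invented salary figures (in thousands): adding one extreme value moves the mean a lot and the median very little.

```python
import statistics

salaries = [40, 42, 45, 47, 50]    # hypothetical values
with_outlier = salaries + [1000]   # add one extreme value

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # mean and median are close
print(mean_after, median_after)    # mean jumps, median barely moves
```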

Question 3

Q

How do you detect outliers?

Answer

A

Use Z-scores (|z| > 3), the IQR rule (points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR), or visualize with boxplots and histograms.

Outlier detection is crucial for data cleaning and ensuring the accuracy of statistical analyses.
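The two numeric rules can be sketched with the standard library alone. The function names, thresholds as parameters, and the sample data are illustrative assumptions, and quartile conventions differ between tools, so the exact cutoffs may vary slightly.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [x for x in values if abs(x - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]  # hypothetical
print(zscore_outliers(data))
print(iqr_outliers(data))
```

On small samples the z-score rule is weak because a single extreme point inflates the standard deviation, which is one reason the IQR rule is often preferred.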

Question 4

Q

Explain Bayes’ Theorem

Answer

A

It updates probabilities based on new evidence: P(A|B) = P(B|A) * P(A) / P(B).

Bayes’ Theorem is fundamental in statistics and machine learning for making inferences based on prior knowledge.
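A worked example can help here. The sketch below applies the formula to a hypothetical diagnostic test; the prevalence, sensitivity, and false positive rate are invented numbers, and P(B) is expanded with the law of total probability.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical: 1% prevalence, 95% sensitivity, 5% false positive rate.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))
```

Even with a fairly accurate test, the posterior stays well below 50% because the condition is rare, which is the usual point of this interview question.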

Question 5

Q

When to use T-test vs. Chi-square vs. ANOVA?

Answer

A

T-test: 2 means, Chi-square: categorical independence, ANOVA: 3+ means.

These tests are used to compare means or distributions in various scenarios, depending on the number of groups and data types.

Question 6

Q

What’s a p-value?

Answer

A

The probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true. By convention, p < 0.05 is treated as statistically significant.

P-values are used in hypothesis testing to determine the strength of the evidence against the null hypothesis.

Question 7

Q

What is Multicollinearity, and how to detect it?

Answer

A

When features are highly correlated. Detect using VIF (>5) or correlation matrices.

Multicollinearity can affect the stability of regression coefficients and complicate interpretations.

Question 8

Q

How do you handle imbalanced datasets?

Answer

A

Resampling, class weighting, better metrics like F1-score, and tree-based models.

Addressing imbalance is essential for improving model performance and ensuring fair predictions.

Question 9

Q

What is a window function?

Answer

A

A function that computes a value for each row over a related set of rows (the window) without collapsing them into one row, like RANK or LEAD/LAG.

Question 10

Q

Difference between RANK, DENSE_RANK, and ROW_NUMBER?

Answer

A

RANK skips ranks on ties, DENSE_RANK does not, ROW_NUMBER gives unique numbers.
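To make the tie behavior concrete, here is a plain-Python sketch that mimics the three functions over a list of scores ordered descending. `rank_variants` is a hypothetical helper, not a SQL or library API.

```python
def rank_variants(scores):
    """Return (score, row_number, rank, dense_rank) tuples, highest score first."""
    ordered = sorted(scores, reverse=True)
    rows = []
    rank = dense = 0
    prev = None
    for row_number, score in enumerate(ordered, start=1):
        if score != prev:
            rank = row_number  # RANK jumps to the row position after a tie
            dense += 1         # DENSE_RANK always increases by exactly one
            prev = score
        rows.append((score, row_number, rank, dense))
    return rows

for row in rank_variants([90, 100, 90, 80]):
    print(row)
```

With two rows tied at 90, RANK gives 1, 2, 2, 4, DENSE_RANK gives 1, 2, 2, 3, and ROW_NUMBER gives 1, 2, 3, 4.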

Question 11

Q

What are CTEs and subqueries?

Answer

A

CTEs (Common Table Expressions) are named temporary result sets defined with WITH; they improve readability and can be recursive. Subqueries are queries nested inside another query.

Question 12

Q

Difference between INNER JOIN and LEFT JOIN?

Answer

A

INNER JOIN keeps only matching rows, LEFT JOIN keeps all left table rows.
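The difference can be sketched in plain Python over two small lists of tuples. The table contents and the helper names `inner_join` and `left_join` are made up for illustration.

```python
customers = [(1, "Ana"), (2, "Ben"), (3, "Cy")]      # (customer_id, name)
orders = [(1, "book"), (1, "pen"), (3, "lamp")]      # (customer_id, item)

def inner_join(left, right):
    """Keep only rows whose key appears on both sides."""
    return [(lid, lval, rval)
            for lid, lval in left
            for rid, rval in right
            if lid == rid]

def left_join(left, right):
    """Keep every left row; fill with None when there is no match."""
    result = []
    for lid, lval in left:
        matches = [rval for rid, rval in right if rid == lid]
        if matches:
            result.extend((lid, lval, rval) for rval in matches)
        else:
            result.append((lid, lval, None))
    return result

print(inner_join(customers, orders))
print(left_join(customers, orders))
```

Ben has no orders, so he disappears from the inner join but appears in the left join with a None (NULL) item.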

Question 13

Q

What is vectorization in Python?

Answer

A

Using NumPy or pandas operations instead of loops for faster execution.

Question 14

Q

What is the difference between a list and a tuple?

Answer

A

Lists are mutable; tuples are immutable, hashable when their elements are hashable (so usable as dict keys), and slightly faster to create.

Question 15

Q

What is a hash table?

Answer

A

A data structure that maps keys to values using a hash function for fast lookup.
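A minimal sketch of the idea, using separate chaining to handle collisions. The class name, the fixed capacity, and the lack of resizing are simplifying assumptions; Python's built-in `dict` is the production-grade version.

```python
class HashTable:
    """Toy hash table with separate chaining and a fixed number of buckets."""

    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]

    def _bucket(self, key):
        # The hash function maps a key to a bucket index.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

table = HashTable(capacity=2)  # tiny capacity forces collisions
table.put("apple", 3)
table.put("banana", 2)
table.put("orange", 1)
table.put("apple", 4)
print(table.get("apple"), table.get("kiwi"))
```

Lookups are O(1) on average as long as the buckets stay short, which is why real implementations resize as they fill up.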

Question 16

Q

Difference between BFS and DFS?

Answer

A

BFS (Breadth First Search) explores level by level, DFS (Depth First Search) goes deep first.
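The only structural difference is the container: BFS uses a queue, DFS uses a stack. A short sketch on a hypothetical adjacency-list graph:

```python
from collections import deque

graph = {  # hypothetical tree-shaped graph
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def bfs(graph, start):
    """Visit nodes level by level using a FIFO queue."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return order

def dfs(graph, start):
    """Follow one branch as deep as possible using a LIFO stack."""
    visited, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Reverse so the first neighbor is popped first.
        stack.extend(reversed(graph[node]))
    return order

print(bfs(graph, "A"))
print(dfs(graph, "A"))
```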

Question 17

Q

What is overfitting?

Answer

A

A model that learns noise instead of patterns, performing poorly on new data.

Question 18

Q

Difference between supervised and unsupervised learning?

Answer

A

Supervised uses labeled data, unsupervised finds patterns in unlabeled data.

Question 19

Q

What is the difference between bagging and boosting?

Answer

A

Bagging reduces variance (Random Forest), boosting corrects errors (XGBoost).

Question 20

Q

Difference between logistic regression and SVM?

Answer

A

Logistic regression is a simpler linear model that outputs class probabilities; SVM maximizes the margin and, with kernels, handles more complex decision boundaries.

Question 21

Q

Why use ReLU in neural networks?

Answer

A

Its gradient is 1 for positive inputs, which mitigates vanishing gradients, and it is cheap to compute, which speeds up training.

Question 22

Q

What is backpropagation?

Answer

A

A method to update weights using gradients of the loss function.

Question 23

Q

Difference between CNNs and RNNs?

Answer

A

CNNs handle images, RNNs handle sequential data like text or time series.

Question 24

Q

What is dropout in neural networks?

Answer

A

Randomly deactivates neurons during training to prevent overfitting.
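A minimal sketch of inverted dropout on a plain list of activations, assuming the common convention of rescaling the kept units by 1 / (1 - rate) so that nothing needs to change at inference time. The function name and signature are illustrative, not a framework API.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` during training."""
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0:
        return list(activations)  # dropout is a no-op at inference time
    keep = 1.0 - rate
    # Kept units are scaled up so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded for reproducibility
out = dropout([1.0] * 1000, rate=0.5, rng=rng)
print(out.count(0.0), "of", len(out), "units dropped")
```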

Question 25

Q

Difference between batch normalization and layer normalization?

Answer

A

Batch norm normalizes each feature across the samples in a batch (works well for CNNs and dense networks with reasonably large batches)

Layer norm normalizes across the features within each sample (RNNs, transformers, and small batch sizes)

Both stabilize and speed up training in deep neural networks by normalizing activations.

Question 26

Q

What is MapReduce?

Answer

A

A framework for processing big data by splitting tasks into map and reduce steps.

Map Phase: Convert words into key-value pairs [(‘apple’, 1), (‘banana’, 1), (‘apple’, 1), (‘banana’, 1), (‘orange’, 1), (‘apple’, 1)]

Shuffle & Sort: Group by word {‘apple’: [1, 1, 1], ‘banana’: [1, 1], ‘orange’: [1]}

Reduce Phase: Sum up values for each word. {‘apple’: 3, ‘banana’: 2, ‘orange’: 1}
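The three phases above can be sketched on a single machine in plain Python. This only imitates the data flow; a real MapReduce framework distributes each phase across workers.

```python
from collections import defaultdict

words = ["apple", "banana", "apple", "banana", "orange", "apple"]

# Map phase: emit a (key, value) pair for every word.
mapped = [(word, 1) for word in words]

# Shuffle & sort: group the values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: combine the values for each key.
reduced = {key: sum(values) for key, values in grouped.items()}

print(dict(grouped))
print(reduced)
```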

Question 27

Q

What is a data pipeline?

Answer

A

A sequence of data processing steps, often automated.

Question 28

Q

Difference between ETL and ELT?

Answer

A

ETL transforms data before loading, ELT loads first, then transforms.

Question 29

Q

What is a NoSQL database?

Answer

A

A non-relational database like MongoDB that stores data in flexible formats.

Question 30

Q

What metrics would you track for a new product?

Answer

A

Retention, churn, conversion rate, and user engagement.

Question 31

Q

How do you improve a recommendation system?

Answer

A

Use collaborative filtering, content-based filtering, or hybrid approaches.

Question 32

Q

Difference between DAU, MAU, and retention rate?

Answer

A

DAU: daily active users, MAU: monthly active users, retention: returning users.

Question 33

Q

How do you measure A/B test success?

Answer

A

Low p-value

Significant lift in the key metric (e.g., conversion rate, revenue, or engagement).
- Lift is the improvement in the test group (B, the new version with the changes being tested) relative to the control group (A, the original version).
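One common way to get both numbers for a conversion-rate experiment is a two-proportion z-test. The sketch below is one reasonable choice rather than the only valid test, and the counts are invented.

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Return (relative lift, z statistic, two-sided p-value) for B vs. A."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    lift = (rate_b - rate_a) / rate_a
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return lift, z, p_value

# Hypothetical: 100/1000 conversions in control, 130/1000 in test.
lift, z, p_value = ab_test(100, 1000, 130, 1000)
print(f"lift={lift:.1%}, z={z:.2f}, p={p_value:.4f}")
```

A low p-value alone is not enough: the lift should also be large enough to matter for the business, and the sample size should be fixed before the test starts.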

Question 34

Q

How do you find the second-highest salary in SQL?

Answer

A

SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);
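To check the query against a tie at the top, it can be run on an in-memory SQLite database from Python. The table layout and rows below are made up; only the query itself comes from the card.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 90000), ("Ben", 120000), ("Cy", 120000), ("Di", 75000)],
)

# The inner query finds the top salary; the outer MAX ignores all rows at that value.
second_highest = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]
conn.close()

print(second_highest)
```

Because the filter is a strict inequality, two employees sharing the highest salary do not hide the second-highest value.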

Question 35

Q

How do you count the number of customers per country in SQL?

Answer

A

SELECT country, COUNT(*) FROM customers GROUP BY country;

Question 36

Q

How do you find customers who placed more than 2 orders in SQL?

Answer

A

SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(order_id) > 2;

Question 37

Q

How do you calculate the rolling average of sales in the last 3 months in SQL?

Answer

A

SELECT month, sales, AVG(sales) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) FROM sales_data;
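The frame `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` means "this row plus up to two rows before it". A plain-Python sketch of the same calculation, with invented monthly sales and a hypothetical helper name:

```python
def rolling_mean(values, window=3):
    """Average each value with up to (window - 1) preceding values."""
    out = []
    for i in range(len(values)):
        # The first rows have fewer than `window` values available,
        # just like the SQL frame at the start of the ordering.
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

sales = [100, 200, 300, 400]  # hypothetical monthly sales, already sorted by month
print(rolling_mean(sales))
```

Note that the frame counts rows, not calendar months, so a missing month in the table would silently widen the window.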

Question 38

Q

How do you find the top 3 highest salaries per department in SQL?

Answer

A

SELECT employee_name, department, salary, rank_salary FROM (SELECT employee_name, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank_salary FROM employees) AS ranked WHERE rank_salary <= 3;

Question 39

Q

How do you count missing values in each column using Python & Pandas?

Answer

A

df.isnull().sum()

Question 40

Q

How do you drop duplicate rows in Python & Pandas?

Answer

A

df.drop_duplicates(inplace=True)

Question 41

Q

How do you get the top 5 most common values in a column using Python & Pandas?

Answer

A

df['category'].value_counts().head(5)

Question 42

Q

How do you group by and compute the mean in Python & Pandas?

Answer

A

df.groupby('category')['sales'].mean()

Question 43

Q

How do you apply a function to clean a text column in Python & Pandas?

Answer

A

df['clean_text'] = df['text'].apply(lambda x: x.lower().strip())

Question 44

Q

How do you merge two DataFrames in Python & Pandas?

Answer

A

merged_df = pd.merge(df1, df2, on='customer_id', how='left')

Question 45

Q

What is variance?

Answer

A

Variance measures how spread out data is from the mean.

Question 46

Q

What is the difference between mean and median?

Answer

A

Mean is the average, median is the middle value.

Question 47

Q

How do you detect outliers?

Answer

A

Using IQR, Z-scores, or visualization (boxplot, histogram).

Question 48

Q

Explain Bayes’ Theorem

Answer

A

It calculates conditional probability: P(A|B) = P(B|A) * P(A) / P(B).

Question 49

Q

When should you use T-test vs. Chi-square vs. ANOVA?

Answer

A

T-test compares two means, Chi-square tests categorical data, ANOVA compares multiple means.

Question 50

Q

What’s a p-value?

Answer

A

The probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true.

Question 51

Q

What is multicollinearity, and how can you detect it?

Answer

A

High correlation between features, detected using VIF scores.

Question 52

Q

How do you handle imbalanced datasets?

Answer

A

Use oversampling, undersampling, SMOTE, or adjust class weights.

Question 53

Q

What is the difference between Linear & Logistic Regression?

Answer

A

Linear predicts continuous values, logistic predicts probabilities.

Question 54

Q

What is overfitting and how can you prevent it?

Answer

A

Overfitting happens when a model learns noise; prevent it using regularization, dropout, or more data.

Question 55

Q

Explain Precision, Recall, F1-score

Answer

A

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 balances both.
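The formulas translate directly into code. In this sketch the confusion-matrix counts are invented, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 3))
```

The harmonic mean pulls F1 toward the lower of the two scores, so a model cannot get a high F1 by maximizing only one of them.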

Question 56

Q

What is the difference between L1 (Lasso) & L2 (Ridge) Regularization?

Answer

A

L1 can shrink some weights exactly to zero (implicit feature selection); L2 penalizes large weights and shrinks them toward zero without eliminating them.

Question 57

Q

What’s the use of Cross-Validation?

Answer

A

Estimates how well a model generalizes by training and evaluating on different subsets of the data, which helps detect overfitting.

Question 58

Q

Why is Feature Scaling important?

Answer

A

Puts features on comparable scales so large-valued features do not dominate distance-based or regularized models, and helps gradient descent converge faster.

Question 59

Q

How does PCA work for dimensionality reduction?

Answer

A

It projects data onto new axes to capture maximum variance.

Question 60

Q

Explain Random Forest vs. Gradient Boosting

Answer

A

Random Forest reduces variance, boosting corrects model errors iteratively.

Question 61

Q

How would you detect fake reviews?

Answer

A

Use NLP, sentiment analysis, and user behavior patterns.

Question 62

Q

How would you improve a recommendation system?

Answer

A

Hybrid filtering, better embeddings, A/B testing models.

Question 63

Q

How do you analyze a drop in user engagement?

Answer

A

Check retention, session data, A/B tests, and feature changes.

Question 64

Q

How would you optimize a customer churn model?

Answer

A

Use classification models, analyze key churn indicators, improve retention strategies.

Answer 65

A

Cleaning Messy Data:
1. Remove duplicate rows
2. Handle missing values
3. Convert data types (e.g., parse dates into datetime format)
4. Standardize text data (.str.lower().str.strip())
5. Handle outliers
6. Normalize or scale features (min-max scaling or z-score standardization)
7. Fix inconsistent categories (e.g., some labels as NY and others as New York)
8. Convert categories to numbers (label encoding or one-hot encoding)
9. Create derived variables (new variables based on the original ones to help model performance, e.g., parts of a date, or age from birthdate)
10. Verify data consistency (ensure all data follows the same rules, e.g., consistent formats and currencies, no impossible ages or dates, realistic value ranges)

Answer 66

A

Discuss the problem, your approach, and impact.

Answer 67

A

Use visuals, storytelling, and focus on business impact.

Answer 68

A

Situation, Task, Action, Result