Data Scientist Interview Flashcards

1
Q

What is variance?

A

Variance measures data spread by averaging squared differences from the mean. Higher variance means more spread.

Variance is a key concept in statistics that helps in understanding the dispersion of data points in a dataset.
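
A minimal sketch of the calculation, using made-up values (the list `data` is an assumption for illustration). It computes the population variance by hand and, for comparison, the sample variance that divides by n - 1.

```python
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, not from the flashcards

mu = mean(data)
squared_diffs = [(x - mu) ** 2 for x in data]

# Population variance: average of the squared differences from the mean.
pop_var = sum(squared_diffs) / len(data)

# Sample variance: divide by n - 1 instead of n (Bessel's correction).
sample_var = sum(squared_diffs) / (len(data) - 1)

# The standard library gives the same results.
print(pop_var, pvariance(data))
print(sample_var, variance(data))
```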

2
Q

Difference between mean and median?

A

Mean is the average. Median is the middle value. Median is less affected by outliers.

The mean can be skewed by extreme values, while the median provides a better measure of central tendency in such cases.
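
A small illustration with hypothetical salary figures (in thousands). Adding one extreme value moves the mean from 44.8 to 204, while the median only moves from 45 to 46.

```python
from statistics import mean, median

salaries = [40, 42, 45, 47, 50]    # hypothetical values
with_outlier = salaries + [1000]   # one extreme value added

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)
print(mean_after, median_after)
```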

3
Q

How do you detect outliers?

A

Flag points whose absolute Z-score exceeds 3, apply the IQR rule (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR), or visualize with boxplots and histograms.

Outlier detection is crucial for data cleaning and ensuring the accuracy of statistical analyses.
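
Both rules can be sketched with the standard library. The helper names `iqr_outliers` and `zscore_outliers` and the sample data are assumptions for illustration; quartile conventions differ between tools, so cut-offs may vary slightly.

```python
from statistics import mean, pstdev, quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]  # made-up sample
print(iqr_outliers(data))
print(zscore_outliers(data))
```

On small samples the Z-score rule is weak because the outlier itself inflates the standard deviation, which is one reason the IQR rule is often preferred.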

4
Q

Explain Bayes’ Theorem

A

It updates probabilities based on new evidence: P(A|B) = P(B|A) * P(A) / P(B).

Bayes’ Theorem is fundamental in statistics and machine learning for making inferences based on prior knowledge.
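
A worked sketch of the formula for a hypothetical diagnostic test. The numbers (1% prevalence, 95% sensitivity, 5% false positive rate) and the function name `posterior` are assumptions. P(B) is expanded with the law of total probability.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B), where
    P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)."""
    p_b = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_b

# A = has the condition, B = test is positive (hypothetical numbers).
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 3))  # roughly 0.161: most positives are false positives
```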

5
Q

When to use T-test vs. Chi-square vs. ANOVA?

A

T-test: 2 means, Chi-square: categorical independence, ANOVA: 3+ means.

These tests are used to compare means or distributions in various scenarios, depending on the number of groups and data types.

6
Q

What’s a p-value?

A

The probability of getting a result at least as extreme as the observed one, assuming the null hypothesis is true. By convention, p < 0.05 is often treated as statistically significant.

P-values are used in hypothesis testing to determine the strength of the evidence against the null hypothesis.
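
As a concrete sketch, consider an exact binomial test of whether a coin is fair. The scenario (9 heads in 10 flips) and the helper names are assumptions for illustration. The two-sided p-value adds up the probabilities of every outcome that is no more likely than the observed one under the null hypothesis.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_p_value(k, n, p_null=0.5):
    """Sum the probabilities of outcomes at least as extreme as k,
    meaning no more likely than k under the null hypothesis."""
    observed = binom_pmf(k, n, p_null)
    return sum(
        binom_pmf(i, n, p_null)
        for i in range(n + 1)
        if binom_pmf(i, n, p_null) <= observed * (1 + 1e-9)
    )

p_value = two_sided_p_value(k=9, n=10)
print(p_value)  # 22/1024, about 0.021, below the usual 0.05 threshold
```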

7
Q

What is Multicollinearity, and how to detect it?

A

It occurs when features are highly correlated with each other. Detect it using VIF (values above 5 are a common warning sign) or a correlation matrix.

Multicollinearity can affect the stability of regression coefficients and complicate interpretations.

8
Q

How do you handle imbalanced datasets?

A

Resampling, class weighting, better metrics like F1-score, and tree-based models.

Addressing imbalance is essential for improving model performance and ensuring fair predictions.

9
Q

What is a window function?

A

A function that performs a calculation across a set of rows related to the current row (the window) without collapsing them into one row, like RANK or LEAD/LAG.

10
Q

Difference between RANK, DENSE_RANK, and ROW_NUMBER?

A

RANK skips ranks on ties, DENSE_RANK does not, ROW_NUMBER gives unique numbers.
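
The tie behavior can be reproduced in plain Python. This is a sketch of the semantics, not how a database implements them; the function name `rank_variants` and the scores are assumptions.

```python
def rank_variants(scores):
    """Return (score, row_number, rank, dense_rank) tuples, highest score first."""
    rows = []
    rank = dense_rank = 0
    previous = None
    for row_number, score in enumerate(sorted(scores, reverse=True), start=1):
        if score != previous:
            rank = row_number   # RANK jumps to the row position after a tie
            dense_rank += 1     # DENSE_RANK advances by exactly one
            previous = score
        rows.append((score, row_number, rank, dense_rank))
    return rows

for row in rank_variants([100, 90, 90, 80]):
    print(row)
```

For the scores 100, 90, 90, 80, the value 80 gets ROW_NUMBER 4, RANK 4 (rank 3 is skipped), and DENSE_RANK 3.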

11
Q

What are CTEs and subqueries?

A

CTEs (Common Table Expressions) are named temporary result sets defined with WITH; they improve readability and can be recursive. Subqueries are queries nested inside another query.

12
Q

Difference between INNER JOIN and LEFT JOIN?

A

INNER JOIN keeps only rows that match in both tables. LEFT JOIN keeps all left table rows and fills the right table's columns with NULL where there is no match.
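
A runnable sketch using Python's built-in sqlite3 module with an in-memory database. The rows are made up; the third customer has no orders, so it appears only in the LEFT JOIN result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

query = """
    SELECT c.name, o.order_id
    FROM customers c
    {join_type} JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
"""
inner_rows = conn.execute(query.format(join_type="INNER")).fetchall()
left_rows = conn.execute(query.format(join_type="LEFT")).fetchall()

print(inner_rows)  # only customers with at least one order
print(left_rows)   # every customer; missing orders show up as None (NULL)
```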

13
Q

What is vectorization in Python?

A

Using NumPy or pandas operations instead of loops for faster execution.

14
Q

What is the difference between a list and a tuple?

A

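Lists are mutable. Tuples are immutable, usually slightly faster to create, and hashable when their elements are, so they can be used as dictionary keys.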
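
A short illustration of the mutability difference. The variable names are arbitrary.

```python
point_list = [1, 2]
point_list[0] = 10          # fine: lists can be changed in place

point_tuple = (1, 2)
try:
    point_tuple[0] = 10     # tuples do not support item assignment
    tuple_was_mutated = True
except TypeError:
    tuple_was_mutated = False

# Tuples of hashable items can serve as dictionary keys; lists cannot.
lookup = {point_tuple: "origin-ish"}

print(point_list, tuple_was_mutated, lookup[(1, 2)])
```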

15
Q

What is a hash table?

A

A data structure that maps keys to values using a hash function for fast lookup.
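
A minimal sketch of the idea using separate chaining. The class name `HashTable` and the fixed capacity are assumptions; real implementations such as Python's dict also resize and use more refined collision handling.

```python
class HashTable:
    """Toy hash table: a list of buckets, each holding (key, value) pairs."""

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]

    def _bucket(self, key):
        # The hash function picks which bucket a key belongs to.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

table = HashTable()
table.put("apple", 3)
table.put("banana", 2)
table.put("apple", 4)
print(table.get("apple"), table.get("banana"), table.get("cherry"))
```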

16
Q

Difference between BFS and DFS?

A

BFS (Breadth First Search) explores level by level, DFS (Depth First Search) goes deep first.
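
The two traversal orders can be compared on a small tree. The adjacency-list graph below is a made-up example.

```python
from collections import deque

graph = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def bfs(graph, start):
    """Visit nodes level by level using a FIFO queue."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

def dfs(graph, start, visited=None):
    """Follow each branch to the end before backtracking (recursive)."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbor in graph[start]:
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited

print(bfs(graph, "A"))
print(dfs(graph, "A"))
```

BFS visits A, B, C before any of the leaves; DFS reaches D through B before it ever visits C.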

17
Q

What is overfitting?

A

When a model learns noise in the training data instead of general patterns, so it performs well on training data but poorly on new data.

18
Q

Difference between supervised and unsupervised learning?

A

Supervised uses labeled data, unsupervised finds patterns in unlabeled data.

19
Q

What is the difference between bagging and boosting?

A

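Bagging trains models independently on bootstrap samples and averages them to reduce variance (e.g., Random Forest). Boosting trains models sequentially, each one correcting the errors of the previous ones (e.g., XGBoost).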

20
Q

Difference between logistic regression and SVM?

A

Logistic regression fits a linear decision boundary and outputs probabilities. SVM maximizes the margin between classes and, with kernels, can model complex nonlinear boundaries.

21
Q

Why use ReLU in neural networks?

A

It mitigates vanishing gradients (its gradient is 1 for positive inputs) and is cheap to compute, which speeds up training.

22
Q

What is backpropagation?

A

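An algorithm that computes the gradient of the loss with respect to each weight by applying the chain rule backward through the network. An optimizer such as gradient descent then uses those gradients to update the weights.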

23
Q

Difference between CNNs and RNNs?

A

CNNs handle images, RNNs handle sequential data like text or time series.

24
Q

What is dropout in neural networks?

A

Randomly deactivates neurons during training to prevent overfitting.

25
Q

Difference between batch normalization and layer normalization?

A

Batch norm normalizes each feature across the samples in a batch (better for CNNs and dense networks with reasonably large batches).

Layer norm normalizes across the features of each individual sample (better for RNNs, transformers, and small batch sizes).

Both stabilize and speed up training in deep neural networks by normalizing activations.

26
Q

What is MapReduce?

A

A framework for processing big data by splitting tasks into map and reduce steps.

Map Phase: Convert words into key-value pairs [('apple', 1), ('banana', 1), ('apple', 1), ('banana', 1), ('orange', 1), ('apple', 1)]

Shuffle & Sort: Group by word {'apple': [1, 1, 1], 'banana': [1, 1], 'orange': [1]}

Reduce Phase: Sum up values for each word. {'apple': 3, 'banana': 2, 'orange': 1}
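
The three phases of this word count can be sketched in a single Python process. This only imitates the data flow; a real MapReduce framework distributes each phase across many machines. The input list `words` is inferred from the pairs in the example.

```python
from collections import defaultdict

words = ["apple", "banana", "apple", "banana", "orange", "apple"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for word in words]

# Shuffle & sort: group all values that share a key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: sum the values for each key.
reduced = {key: sum(values) for key, values in grouped.items()}

print(dict(grouped))
print(reduced)
```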

27
Q

What is a data pipeline?

A

A sequence of data processing steps, often automated.

28
Q

Difference between ETL and ELT?

A

ETL transforms data before loading, ELT loads first, then transforms.

29
Q

What is a NoSQL database?

A

A non-relational database like MongoDB that stores data in flexible formats.

30
Q

What metrics would you track for a new product?

A

Retention, churn, conversion rate, and user engagement.

31
Q

How do you improve a recommendation system?

A

Use collaborative filtering, content-based filtering, or hybrid approaches.

32
Q

Difference between DAU, MAU, and retention rate?

A

DAU: daily active users. MAU: monthly active users. Retention rate: the share of users who come back after a given period.

33
Q

How do you measure A/B test success?

A

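A low p-value (commonly below 0.05), showing the difference is unlikely to be due to chance.

A significant lift in the key metric (e.g., conversion rate, revenue, or engagement).
- Lift is the improvement in the test group (B, the new version with the changes being tested) compared to the control group (A, the original version).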
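
A sketch of both checks for a conversion-rate test, using a pooled two-proportion z-test. The counts and the function name `ab_test` are assumptions; other tests (for example chi-square or a t-test on revenue) may suit other metrics better.

```python
from math import erf, sqrt

def ab_test(conv_a, n_a, conv_b, n_b):
    """Return (relative lift, z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = (p_b - p_a) / p_a  # relative improvement of B over A

    # Pooled standard error under the null hypothesis of equal rates.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se

    # Two-sided p-value from the standard normal CDF.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return lift, z, p_value

# Hypothetical results: 200/2000 conversions for A, 250/2000 for B.
lift, z, p_value = ab_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"lift={lift:.1%}, z={z:.2f}, p={p_value:.4f}")
```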

34
Q

How do you find the second-highest salary in SQL?

A

SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);

35
Q

How do you count the number of customers per country in SQL?

A

SELECT country, COUNT(*) FROM customers GROUP BY country;

36
Q

How do you find customers who placed more than 2 orders in SQL?

A

SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(order_id) > 2;

37
Q

How do you calculate the rolling average of sales in the last 3 months in SQL?

A

SELECT month, sales, AVG(sales) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg_3m FROM sales_data;

38
Q

How do you find the top 3 highest salaries per department in SQL?

A

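SELECT employee_name, department, salary, rank_salary FROM (SELECT employee_name, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank_salary FROM employees) AS ranked WHERE rank_salary <= 3;

A window function alias cannot be filtered in the same query's WHERE clause, so the ranking goes in a subquery. RANK() can return more than three rows when salaries tie; use DENSE_RANK() for the three highest distinct salaries.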

39
Q

How do you count missing values in each column using Python & Pandas?

A

df.isnull().sum()

40
Q

How do you drop duplicate rows in Python & Pandas?

A

df.drop_duplicates(inplace=True)

41
Q

How do you get the top 5 most common values in a column using Python & Pandas?

A

df['category'].value_counts().head(5)

42
Q

How do you group by and compute the mean in Python & Pandas?

A

df.groupby('category')['sales'].mean()

43
Q

How do you apply a function to clean a text column in Python & Pandas?

A

df['clean_text'] = df['text'].apply(lambda x: x.lower().strip())

44
Q

How do you merge two DataFrames in Python & Pandas?

A

merged_df = pd.merge(df1, df2, on='customer_id', how='left')

45
Q

What is variance?

A

Variance measures how spread out data is from the mean.

46
Q

What is the difference between mean and median?

A

Mean is the average, median is the middle value.

47
Q

How do you detect outliers?

A

Using IQR, Z-scores, or visualization (boxplot, histogram).

48
Q

Explain Bayes’ Theorem

A

It calculates conditional probability: P(A|B) = P(B|A) * P(A) / P(B).

49
Q

When should you use T-test vs. Chi-square vs. ANOVA?

A

T-test compares two means, Chi-square tests categorical data, ANOVA compares multiple means.

50
Q

What’s a p-value?

A

The probability of observing results at least as extreme as the ones obtained, assuming the null hypothesis is true.

51
Q

What is multicollinearity, and how can you detect it?

A

High correlation between features, detected using VIF scores.

52
Q

How do you handle imbalanced datasets?

A

Use oversampling, undersampling, SMOTE, or adjust class weights.

53
Q

What is the difference between Linear & Logistic Regression?

A

Linear regression predicts continuous values. Logistic regression predicts the probability of a categorical (usually binary) outcome.

54
Q

What is overfitting and how can you prevent it?

A

Overfitting happens when a model learns noise; prevent it using regularization, dropout, or more data.

55
Q

Explain Precision, Recall, F1-score

A

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 * Precision * Recall / (Precision + Recall), the harmonic mean of the two.
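
A minimal sketch computing the three metrics from confusion-matrix counts. The counts and the function name are made up for illustration, and the function assumes the denominators are non-zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical classifier: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 3))
```

Here precision is 0.8 but recall is only 0.5, and F1 (about 0.615) sits closer to the weaker of the two.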

56
Q

What is the difference between L1 (Lasso) & L2 (Ridge) Regularization?

A

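L1 can shrink some weights exactly to zero, which acts as feature selection. L2 penalizes large weights and shrinks them toward zero without eliminating them.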

57
Q

What’s the use of Cross-Validation?

A

It gives a more reliable estimate of how a model generalizes by training and evaluating it on different subsets of the data, which helps detect overfitting.

58
Q

Why is Feature Scaling important?

A

It puts features on comparable scales so that no feature dominates distance-based or gradient-based methods, and it helps gradient descent converge faster.

59
Q

How does PCA work for dimensionality reduction?

A

It projects data onto new orthogonal axes (principal components) ordered by how much variance they capture, then keeps only the first few.

60
Q

Explain Random Forest vs. Gradient Boosting

A

Random Forest reduces variance, boosting corrects model errors iteratively.

61
Q

How would you detect fake reviews?

A

Use NLP, sentiment analysis, and user behavior patterns.

62
Q

How would you improve a recommendation system?

A

Hybrid filtering, better embeddings, A/B testing models.

63
Q

How do you analyze a drop in user engagement?

A

Check retention, session data, A/B tests, and feature changes.

64
Q

How would you optimize a customer churn model?

A

Use classification models, analyze key churn indicators, improve retention strategies.

65
Q

Describe a time you worked with messy data.

A

Cleaning Messy Data:
1. Remove duplicate rows
2. Handle missing values
3. Convert data types (e.g., put dates in datetime format)
4. Standardize text data (lowercase and strip whitespace, e.g., .str.lower().str.strip() in pandas)
5. Handle outliers
6. Normalize or scale features (min-max scaling or z-score normalization)
7. Fix inconsistent categories (e.g., some labels as NY and others as New York)
8. Convert categories to numbers (label encoding or one-hot encoding)
9. Create derived variables (new variables based on the original ones to help model performance, e.g., parts of a date, or age from birthdate)
10. Verify data consistency (ensure all data follows the same rules: consistent formats and currency, no impossible ages or dates, realistic ranges)
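
A few of the steps above can be sketched with the standard library alone. The records, the alias table, and the 0 to 120 age range are assumptions for illustration; in practice the same steps are usually done with pandas.

```python
from datetime import datetime

raw_rows = [  # made-up records
    {"city": " NY ", "signup": "2024-01-05", "age": "34"},
    {"city": "new york", "signup": "2024-01-05", "age": "34"},
    {"city": "Boston", "signup": "2024-02-10", "age": ""},
    {"city": "boston", "signup": "2024-03-01", "age": "250"},
]
CITY_ALIASES = {"ny": "new york"}  # hypothetical mapping for inconsistent labels

def clean(rows):
    seen, cleaned = set(), []
    for row in rows:
        # Standardize text, then fix inconsistent categories.
        city = row["city"].strip().lower()
        city = CITY_ALIASES.get(city, city)
        # Convert data types.
        signup = datetime.strptime(row["signup"], "%Y-%m-%d").date()
        # Handle missing values and impossible ages.
        age = int(row["age"]) if row["age"] else None
        if age is not None and not 0 <= age <= 120:
            age = None
        # Remove rows that are duplicates after standardization.
        key = (city, signup, age)
        if key in seen:
            continue
        seen.add(key)
        # Create a derived variable from the date.
        cleaned.append({"city": city, "signup": signup, "age": age,
                        "signup_month": signup.month})
    return cleaned

cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Deduplicating after standardization matters here: the first two records only become identical once " NY " and "new york" are mapped to the same label.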

66
Q

Describe a challenging data project and how you solved it.

A

Discuss the problem, your approach, and impact.

67
Q

How do you communicate insights to non-technical stakeholders?

A

Use visuals, storytelling, and focus on business impact.

68
Q

What does the acronym STAR stand for in the STAR interview method?

A

Situation, Task, Action, Result