1 Flashcards

1
Q

What is the difference between INNER JOIN and LEFT JOIN?

A

INNER JOIN returns only matching rows from both tables, while LEFT JOIN returns all rows from the left table and matching rows from the right table, with NULLs where there are no matches.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is a GROUP BY clause used for?

A

GROUP BY is used to group rows that have the same values in specified columns into summary rows, like COUNT, SUM, AVG, etc.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

How does the HAVING clause differ from WHERE?

A

WHERE filters rows before grouping, while HAVING filters rows after grouping.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is a Subquery in SQL?

A

A subquery is a query nested inside another query, often used to return a single value or a set of values to be used in the main query.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a Common Table Expression (CTE)?

A

A CTE is a temporary result set defined within the execution scope of a SELECT, INSERT, UPDATE, or DELETE statement.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are window functions in SQL? Provide examples.

A

Window functions perform calculations across a set of table rows related to the current row. Examples: ROW_NUMBER(), RANK(), LEAD(), LAG().

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is a UNION vs. UNION ALL?

A

UNION combines results from multiple queries and removes duplicates, while UNION ALL combines all results, including duplicates.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the difference between COUNT(*) and COUNT(column_name)?

A

COUNT(*) counts all rows, including NULLs, while COUNT(column_name) counts only non-NULL values in the specified column.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Explain JOIN types: INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN.

A
  • INNER JOIN: Returns matching rows from both tables. * LEFT JOIN: Returns all rows from the left table and matched rows from the right. * RIGHT JOIN: Returns all rows from the right table and matched rows from the left. * FULL OUTER JOIN: Returns rows when there is a match in either left or right table.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How do you find duplicate records in SQL?

A

Use GROUP BY and HAVING COUNT(*) > 1 to identify duplicate records.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is a primary key and a foreign key?

A

A primary key uniquely identifies each record in a table, while a foreign key is a field in one table that links to the primary key of another table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How do you perform a self join in SQL?

A

A self join is when a table is joined with itself. You must use table aliases to distinguish the two instances of the table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is a cross join? Provide an example.

A

A cross join returns the Cartesian product of two tables, combining every row from the first table with every row from the second table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Explain the difference between CHAR and VARCHAR data types.

A

CHAR stores fixed-length strings, while VARCHAR stores variable-length strings.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is Normalization in database design?

A

Normalization is the process of organizing a database to reduce redundancy and dependency by dividing tables and ensuring relationships between them.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is Denormalization in database design?

A

Denormalization is the process of combining tables or adding redundant data to optimize query performance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Explain Indexing in SQL and how it improves query performance.

A

Indexing improves the speed of data retrieval operations by creating a structure that allows quicker searches, typically used on columns frequently involved in WHERE, JOIN, or ORDER BY clauses.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What is a transaction in SQL and how is it used?

A

A transaction is a set of SQL commands that are executed as a single unit, ensuring atomicity. Transactions are controlled with BEGIN, COMMIT, and ROLLBACK.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is the DISTINCT keyword used for in SQL?

A

The DISTINCT keyword is used to return only unique values from a column or set of columns.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What is GROUP_CONCAT() function in MySQL or STRING_AGG() in SQL Server?

A

These functions are used to concatenate values from multiple rows into a single string.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What is the difference between mean and median?

A

The mean is the average of all values, while the median is the middle value in a sorted data set.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What is standard deviation, and how is it calculated?

A

Standard deviation measures the amount of variation or dispersion in a set of values. It’s the square root of the variance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Explain variance and its relation to standard deviation.

A

Variance is the average squared deviation from the mean, and standard deviation is the square root of variance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What is a normal distribution?

A

A normal distribution is a bell-shaped probability distribution where most of the values cluster around the mean.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
What is a p-value and what does it signify in hypothesis testing?
A p-value measures the probability of obtaining results as extreme as the observed results under the null hypothesis. A small p-value (<0.05) suggests rejecting the null hypothesis.
26
What is a confidence interval?
A confidence interval is a range of values that likely contains the true population parameter with a given level of confidence (e.g., 95%).
27
What is Bayes’ Theorem?
Bayes' Theorem describes the probability of an event based on prior knowledge of conditions related to the event.
28
What is the difference between correlation and causation?
Correlation means there is a statistical relationship between two variables, while causation means one variable directly affects the other.
29
What is the Central Limit Theorem?
The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the original data distribution.
30
What is a z-score and how is it interpreted?
A z-score measures how many standard deviations a data point is from the mean. A z-score of 1 means the data point is one standard deviation above the mean.
31
What is hypothesis testing? Provide an example.
Hypothesis testing is used to determine if there is enough evidence to reject a null hypothesis. Example: Testing if a new drug is more effective than a placebo.
32
What is ANOVA (Analysis of Variance)?
ANOVA is a statistical method used to compare the means of three or more groups to see if at least one differs significantly.
33
What is the difference between parametric and non-parametric tests?
Parametric tests assume the data follows a known distribution, while non-parametric tests do not assume any specific distribution.
34
What is Type I and Type II error?
Type I error is rejecting a true null hypothesis (false positive), while Type II error is failing to reject a false null hypothesis (false negative).
35
What is a Chi-square test?
The Chi-square test is used to determine if there is a significant association between two categorical variables.
36
How do you create a DataFrame in Pandas?
Use `pd.DataFrame()` to create a DataFrame. Example: `df = pd.DataFrame(data)`.
37
What is the difference between a Series and a DataFrame in Pandas?
A Series is a one-dimensional array, while a DataFrame is a two-dimensional table with rows and columns.
38
How do you handle missing data in Pandas?
Use `.isnull()` to detect missing values and `.fillna()` or `.dropna()` to handle them.
39
How do you merge two DataFrames in Pandas?
Use `.merge()` to combine DataFrames, similar to SQL joins.
40
What is the purpose of `.apply()` in Pandas?
.apply() is used to apply a function along the axis (rows or columns) of a DataFrame.
41
How do you convert a column to a specific data type in Pandas?
Use `.astype()` to convert a column to a specific data type. Example: `df['column'] = df['column'].astype(int)`.
42
How do you filter rows based on a condition in Pandas?
Use boolean indexing. Example: `df[df['column'] > 10]`.
43
How do you handle duplicates in Pandas?
Use `.drop_duplicates()` to remove duplicate rows.
44
What is the purpose of `.groupby()` in Pandas?
.groupby() is used to group data based on a column(s) and apply an aggregation function, such as `sum()`, `mean()`, etc.
45
How do you concatenate two DataFrames vertically in Pandas?
Use `pd.concat()` with `axis=0` to concatenate vertically.
46
What is a `SELF JOIN` and when is it useful?
A self join is a table joined with itself. It’s useful when you need to compare rows within the same table, like hierarchical data (e.g., employee-manager relationships).
47
What is the purpose of `PARTITION BY` in window functions?
`PARTITION BY` divides the result set into partitions and performs the calculation on each partition independently, allowing for windowed aggregations.
48
What is `CROSS APPLY` in SQL Server, and how is it different from `OUTER APPLY`?
`CROSS APPLY` returns only rows where the table-valued function produces results, while `OUTER APPLY` returns all rows from the left table, with NULLs where the function produces no result.
49
What is the difference between `RANK()` and `DENSE_RANK()`?
`RANK()` skips rank values when there are ties (e.g., two rows have rank 1, the next rank will be 3), while `DENSE_RANK()` does not skip ranks in the event of ties.
50
Explain `ROLLUP` and `CUBE` in SQL.
`ROLLUP` provides subtotals and grand totals for a result set, while `CUBE` generates subtotals for every combination of grouped columns.
51
What is `INTERSECT` in SQL and how does it work?
`INTERSECT` returns the common rows from two queries, removing duplicates, similar to the intersection of two sets.
52
What is the difference between `UNION` and `UNION ALL` in terms of performance?
`UNION` performs an additional step of removing duplicates, which makes it slower than `UNION ALL`, which doesn’t remove duplicates.
53
How would you implement pagination in SQL?
Use `LIMIT` and `OFFSET` in SQL (or `ROW_NUMBER()` in complex queries) to paginate results.
54
How does `EXPLAIN` or `EXPLAIN PLAN` help in query optimization?
`EXPLAIN` shows the query execution plan, revealing how SQL queries are executed, helping identify performance bottlenecks and optimize indexes.
55
What is a `materialized view` in SQL?
A materialized view stores the result of a query physically, and is periodically refreshed. It is useful for performance in large datasets.
56
Explain the difference between a **parametric** and **non-parametric** hypothesis test.
Parametric tests assume underlying data distributions (e.g., t-tests for normal data), while non-parametric tests do not assume any distribution (e.g., Wilcoxon signed-rank test).
57
What is **maximum likelihood estimation (MLE)**?
MLE is a method for estimating the parameters of a statistical model by maximizing the likelihood that the observed data was generated by the model.
58
What is a **log-normal distribution**?
A log-normal distribution is a probability distribution of a random variable whose logarithm is normally distributed.
59
What is **bootstrap sampling** and how is it used?
Bootstrap sampling is a technique that involves repeatedly sampling from a dataset with replacement to estimate the distribution of a statistic.
60
Explain the **Kurtosis** of a distribution.
Kurtosis measures the 'tailedness' of a distribution. High kurtosis means more extreme outliers, while low kurtosis indicates a more uniform distribution.
61
What is the **F-test** used for?
The F-test is used to compare two variances and determine if they are significantly different. It's often used in ANOVA to compare the variances between group means.
62
What is **Cohen’s d** used to measure?
Cohen’s d is a measure of effect size, indicating the standardized difference between two group means. It’s commonly used in hypothesis testing to assess the practical significance of results.
63
What is **Multicollinearity**, and why is it a problem in regression?
Multicollinearity occurs when predictor variables in a regression model are highly correlated, making it difficult to determine the individual effect of each variable on the outcome.
64
What is a **Chi-square goodness of fit test**?
A Chi-square goodness of fit test determines whether observed categorical data fits an expected distribution.
65
What is the **Gini coefficient**, and how is it used in statistics?
The Gini coefficient is a measure of inequality in a distribution, ranging from 0 (perfect equality) to 1 (perfect inequality), often used in economics and classification models.
66
How do you apply custom functions across rows or columns in Pandas?
Use `.apply()` with a custom function and specify `axis=0` for columns or `axis=1` for rows.
67
What is the difference between `.iloc[]` and `.loc[]` in Pandas?
.iloc[] is used for integer-location based indexing, while .loc[] is label-based indexing.
68
What is `np.vectorize()` in NumPy, and how does it improve performance?
`np.vectorize()` is a convenience function that applies a function element-wise over an array, improving readability and performance compared to a Python loop.
69
How do you merge two DataFrames on multiple columns in Pandas?
Use `pd.merge(df1, df2, on=['col1', 'col2'])` to merge two DataFrames on multiple columns.
70
What is a **pivot table** in Pandas?
A pivot table aggregates data in a DataFrame based on column(s) and row(s), allowing for easy summary and comparison of values.
71
How do you perform a time series analysis in Pandas?
Use `.resample()` for date-based grouping, and `.rolling()` or `.expanding()` for calculating rolling or cumulative statistics.
72
What is the difference between `applymap()` and `apply()` in Pandas?
.apply() is used for applying a function along rows or columns, while .applymap() is used for applying a function element-wise to all DataFrame elements.
73
How do you handle categorical variables in Pandas?
Use `.get_dummies()` to create one-hot encoding for categorical variables or `.astype('category')` to convert columns to categorical types.
74
How do you calculate the **rolling mean** for a time series in Pandas?
Use `.rolling(window).mean()` to compute the rolling mean for a time series.
75
What is **broadcasting** in NumPy?
Broadcasting allows NumPy to perform arithmetic operations on arrays of different shapes by automatically expanding the smaller array to match the larger one’s shape.
76
What is **cross-validation**, and why is it used?
Cross-validation is a technique used to evaluate a model’s performance by splitting data into training and validation sets multiple times, helping to reduce overfitting and provide a more accurate performance estimate.
77
What is **bias-variance tradeoff** in machine learning?
The bias-variance tradeoff refers to the balance between a model’s ability to generalize (low bias) and its sensitivity to training data (low variance). A high bias leads to underfitting, while a high variance leads to overfitting.
78
What is the difference between **L1 regularization** and **L2 regularization**?
L1 regularization (Lasso) adds the absolute value of coefficients as a penalty to the cost function, promoting sparsity. L2 regularization (Ridge) adds the squared value of coefficients, helping to prevent large coefficients but not promoting sparsity.
79
What is the **confusion matrix**, and how is it interpreted?
A confusion matrix is a table that visualizes the performance of a classification model by showing the actual vs predicted classifications. It includes metrics like precision, recall, F1 score, and accuracy.
80
What is **precision** and **recall** in classification?
Precision is the ratio of true positive predictions to all positive predictions, while recall is the ratio of true positive predictions to all actual positives in the dataset.
81
Explain **Gradient Descent** and its variants.
Gradient Descent is an optimization algorithm used to minimize the loss function by adjusting model parameters in the opposite direction of the gradient. Variants include batch, stochastic, and mini-batch gradient descent.
82
What is **overfitting** and how can it be prevented in machine learning models?
Overfitting occurs when a model learns the noise in the training data instead of the underlying patterns, leading to poor performance on new data. It can be prevented by using regularization, cross-validation, and pruning decision trees.
83
What is **K-fold cross-validation**?
K-fold cross-validation splits the data into K subsets, trains the model K times, each time with a different training set and testing on the corresponding validation set, ensuring a more generalized performance evaluation.
84
What is a **decision tree**, and how does it work?
A decision tree is a tree-like structure used for classification or regression. It splits the dataset into subsets based on the feature that maximizes information gain or minimizes impurity.
85
What is the **curse of dimensionality**, and how does it affect machine learning?
The curse of dimensionality refers to the exponential increase in complexity as the number of features increases, leading to sparsity in data and reducing the model's ability to generalize.
86
What is the difference between `INNER JOIN`, `LEFT JOIN`, and `RIGHT JOIN` in SQL?
`INNER JOIN`: Returns rows when there is a match in both tables. `LEFT JOIN`: Returns all rows from the left table and matched rows from the right table. `RIGHT JOIN`: Returns all rows from the right table and matched rows from the left table.
87
What is a CTE (Common Table Expression) in SQL and how is it different from subqueries?
A CTE is a temporary result set defined within the execution scope of a `SELECT`, `INSERT`, `UPDATE`, or `DELETE` statement. It is more readable and reusable compared to subqueries.
88
Explain the difference between `HAVING` and `WHERE` clauses in SQL.
`WHERE` is used to filter records before any grouping is done. `HAVING` is used to filter groups after the `GROUP BY` operation.
89
What is the purpose of `INDEX` in SQL?
An index improves the speed of data retrieval operations on a database table at the cost of additional space and time required for `INSERT`, `UPDATE`, and `DELETE` operations.
90
What is normalization in database design?
Normalization is the process of organizing data in a database to avoid redundancy and dependency by dividing large tables into smaller, manageable ones.
91
What is the difference between a `clustered index` and a `non-clustered index`?
`Clustered Index`: The table's data is physically ordered on the disk according to the index. There can only be one clustered index per table. `Non-clustered Index`: The index is stored separately from the table's data, and it points to the data in the table.
92
What is the difference between `UNION` and `UNION ALL` in SQL?
`UNION`: Combines the results of two queries and removes duplicates. `UNION ALL`: Combines the results and does not remove duplicates.
93
What is a window function in SQL? Provide an example.
A window function performs a calculation across a set of table rows related to the current row. Example: `ROW_NUMBER()`, `RANK()`, `SUM() OVER (PARTITION BY ...)`.
94
How would you handle missing data in a dataset using pandas?
- Use `df.fillna()` to fill missing values with a constant or statistical value (mean, median). - Use `df.dropna()` to remove rows or columns with missing data.
95
What is the difference between `pivot()` and `pivot_table()` in pandas?
`pivot()` is used when you have a simple reshaping requirement without aggregation. `pivot_table()` allows for aggregation (e.g., sum, mean) during reshaping.
96
What is the difference between `groupby()` and `agg()` in pandas?
`groupby()`: Splits the data into groups based on certain columns, but requires a follow-up operation (like `sum()`, `mean()`). `agg()`: Aggregates multiple columns at once with different functions.
97
How do you merge dataframes in pandas with multiple conditions?
Use the `merge()` function and specify multiple columns in the `on` parameter: ```python pd.merge(df1, df2, on=['col1', 'col2']) ```
98
Explain the difference between `classification` and `regression` problems in machine learning.
`Classification`: Predicts discrete labels (e.g., spam or not spam). `Regression`: Predicts continuous numerical values (e.g., house price).
99
What is a confusion matrix in classification?
A confusion matrix is a table that describes the performance of a classification model, showing the actual vs predicted values. It includes metrics like accuracy, precision, recall, and F1 score.
100
What is the purpose of cross-validation in machine learning?
Cross-validation is used to assess how the results of a statistical analysis generalize to an independent dataset, helping to mitigate overfitting.
101
What is the difference between `L1` and `L2` regularization in machine learning?
`L1 regularization` (Lasso) adds the absolute value of coefficients as a penalty to the loss function. `L2 regularization` (Ridge) adds the squared value of coefficients as a penalty.
102
What is a decision tree in machine learning?
A decision tree is a supervised learning model used for classification and regression, which splits the data based on feature values to predict an outcome.
103
Explain the term `bias-variance tradeoff` in machine learning.
`Bias`: Error due to overly simplistic models (underfitting). `Variance`: Error due to overly complex models (overfitting). Balancing them ensures the model generalizes well.
104
What is the difference between `bagging` and `boosting`?
`Bagging` (e.g., Random Forest) combines multiple models trained independently to reduce variance. `Boosting` (e.g., AdaBoost, XGBoost) sequentially trains models, focusing on correcting errors of previous models to reduce bias.
105
What is PCA (Principal Component Analysis) and why is it used?
PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional form while preserving as much variance as possible.
106
Write a query to create an index on the `customer_id` column of the `orders` table.
CREATE INDEX idx_customer_id ON orders(customer_id);
107
Write a query to create a CTE (Common Table Expression) that selects the average order value for each customer from the `orders` table.
WITH avg_order_value AS ( SELECT customer_id, AVG(order_value) AS avg_value FROM orders GROUP BY customer_id ) SELECT * FROM avg_order_value;
108
Write a query to find all employees who have been with the company for more than 5 years using a `DATE` column.
SELECT employee_id, name, hire_date FROM employees WHERE hire_date < DATE_SUB(CURDATE(), INTERVAL 5 YEAR);
109
Write a query to find all orders from `orders` table where the `order_date` is in the last 30 days.
SELECT * FROM orders WHERE order_date >= CURDATE() - INTERVAL 30 DAY;
110
Write Python code to create a Pandas DataFrame from a dictionary of lists.
import pandas as pd data = { 'customer_id': [1, 2, 3], 'order_value': [100, 150, 200], 'order_date': ['2023-01-01', '2023-01-02', '2023-01-03'] } df = pd.DataFrame(data) print(df)
111
Write Python code to filter rows in a Pandas DataFrame where `order_value` is greater than 150.
filtered_df = df[df['order_value'] > 150] print(filtered_df)
112
Write a query to calculate the total sales for each month from the `sales` table.
SELECT MONTH(sales_date) AS month, SUM(sale_amount) AS total_sales FROM sales GROUP BY MONTH(sales_date);
113
Write a query to join the `orders` table with the `customers` table on `customer_id`, showing `customer_name` and `order_value`.
SELECT c.customer_name, o.order_value FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
114
Write Python code to calculate the mean and standard deviation of a numeric column (`order_value`) in a Pandas DataFrame.
mean_value = df['order_value'].mean() std_value = df['order_value'].std() print(f'Mean: {mean_value}, Standard Deviation: {std_value}')
115
Write Python code to convert a `order_date` column in a Pandas DataFrame to datetime format and extract the year.
df['order_date'] = pd.to_datetime(df['order_date']) df['order_year'] = df['order_date'].dt.year print(df)
116
Write a query to use `GROUP BY` to calculate the total sales for each `customer_id` from the `orders` table.
SELECT customer_id, SUM(order_value) AS total_sales FROM orders GROUP BY customer_id;
117
Write a query to find the top 5 customers with the highest total sales using `LIMIT`.
SELECT customer_id, SUM(order_value) AS total_sales FROM orders GROUP BY customer_id ORDER BY total_sales DESC LIMIT 5;
118
Write Python code to create a new column in a Pandas DataFrame that contains the logarithm of `order_value`.
import numpy as np df['log_order_value'] = np.log(df['order_value']) print(df)
119
Write Python code to sort a Pandas DataFrame by the `order_value` column in descending order.
df_sorted = df.sort_values(by='order_value', ascending=False) print(df_sorted)
120
Write a query to update the `order_value` of an order in the `orders` table where `order_id` is 101.
UPDATE orders SET order_value = 250 WHERE order_id = 101;
121
Write a query to delete all records from the `orders` table where the `order_date` is older than 1 year.
DELETE FROM orders WHERE order_date < CURDATE() - INTERVAL 1 YEAR;
122
Write Python code to create a boxplot of `order_value` in a Pandas DataFrame.
import matplotlib.pyplot as plt df['order_value'].plot(kind='box') plt.show()
123
Write Python code to merge two DataFrames `df1` and `df2` on the `customer_id` column.
merged_df = pd.merge(df1, df2, on='customer_id') print(merged_df)
124
Write a query to find the second highest sale amount from the `sales` table.
SELECT MAX(sale_amount) AS second_highest FROM sales WHERE sale_amount < (SELECT MAX(sale_amount) FROM sales);
125
Write a query to find the employees with the highest salary in each department using a `JOIN` on `employees` and `departments` tables.
SELECT e.department_id, e.employee_id, e.salary FROM employees e JOIN departments d ON e.department_id = d.department_id WHERE e.salary = (SELECT MAX(salary) FROM employees WHERE department_id = e.department_id);