1 Flashcards

Question

What is a p-value and what does it signify in hypothesis testing?

Answer 1

A p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. A small p-value (conventionally below 0.05) is taken as evidence against the null hypothesis and leads to rejecting it.
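As a rough illustration of the definition above, the sketch below computes a two-sided p-value for a z-test using the standard library's `statistics.NormalDist`. The helper name `two_sided_p_value` and the test statistic 1.96 are assumptions chosen for illustration.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability, under a standard normal null, of a statistic at least as extreme as z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_sided_p_value(1.96)
print(round(p, 3))  # close to the conventional 0.05 threshold
```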

Answer 2

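A confidence interval is a range of values, computed from sample data, that is likely to contain the true population parameter. The confidence level (e.g., 95%) is the proportion of such intervals that would contain the parameter over repeated sampling.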

Answer 3

Bayes' Theorem describes the probability of an event based on prior knowledge of conditions related to the event.
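A small worked example may help here. The sketch below applies Bayes' Theorem to a hypothetical diagnostic test; the prior, sensitivity, and false positive rate are made-up numbers, and `posterior` is an illustrative helper name.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' Theorem."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Assumed numbers: 1% prevalence, 95% sensitivity, 5% false positive rate.
p_condition_given_positive = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p_condition_given_positive, 3))
```

Even with an accurate test, the low prior keeps the posterior well below the sensitivity.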

Answer 4

Correlation means there is a statistical relationship between two variables, while causation means one variable directly affects the other.

Answer 5

The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the original data distribution, provided the observations are independent and the population has finite variance.
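The theorem can be checked informally with a simulation. This sketch, using only the standard library, draws repeated samples from a skewed exponential distribution and compares the spread of the sample means with the theoretical value sigma / sqrt(n). The sample size, number of samples, and seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
sample_size = 50
n_samples = 2000

# Exponential distribution with rate 1: mean 1, standard deviation 1, strongly skewed.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
expected_spread = 1 / sample_size ** 0.5  # sigma / sqrt(n)
print(mean_of_means, spread_of_means, expected_spread)
```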

Answer 6

A z-score measures how many standard deviations a data point is from the mean. A z-score of 1 means the data point is one standard deviation above the mean.
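A minimal sketch of the calculation, using the standard library and a made-up dataset whose mean is 5 and whose population standard deviation is 2:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.fmean(data)   # 5.0
std = statistics.pstdev(data)   # 2.0 (population standard deviation)

def z_score(x):
    """Number of standard deviations between x and the mean."""
    return (x - mean) / std

print(z_score(7))  # 1.0, one standard deviation above the mean
```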

Answer 7

Hypothesis testing is used to determine if there is enough evidence to reject a null hypothesis. Example: Testing if a new drug is more effective than a placebo.

Answer 8

ANOVA is a statistical method used to compare the means of three or more groups to see if at least one differs significantly.

Answer 9

Parametric tests assume the data follows a known distribution, while non-parametric tests do not assume any specific distribution.

Answer 10

Type I error is rejecting a true null hypothesis (false positive), while Type II error is failing to reject a false null hypothesis (false negative).

Answer 11

The Chi-square test is used to determine if there is a significant association between two categorical variables.

Answer 12

Use `pd.DataFrame()` to create a DataFrame. Example: `df = pd.DataFrame(data)`.

Answer 13

A Series is a one-dimensional array, while a DataFrame is a two-dimensional table with rows and columns.

Answer 14

Use `.isnull()` to detect missing values and `.fillna()` or `.dropna()` to handle them.

Answer 15

Use `.merge()` to combine DataFrames, similar to SQL joins.

Answer 16

`.apply()` is used to apply a function along the axis (rows or columns) of a DataFrame.

Answer 17

Use `.astype()` to convert a column to a specific data type. Example: `df['column'] = df['column'].astype(int)`.

Answer 18

Use boolean indexing. Example: `df[df['column'] > 10]`.

Answer 19

Use `.drop_duplicates()` to remove duplicate rows.

Answer 20

`.groupby()` is used to group data based on one or more columns and apply an aggregation function, such as `sum()`, `mean()`, etc.

Answer 21

Use `pd.concat()` with `axis=0` to concatenate vertically.

Answer 22

A self join is a table joined with itself. It’s useful when you need to compare rows within the same table, like hierarchical data (e.g., employee-manager relationships).

Answer 23

`PARTITION BY` divides the result set into partitions and performs the calculation on each partition independently, allowing for windowed aggregations.

Answer 24

`CROSS APPLY` returns only rows where the table-valued function produces results, while `OUTER APPLY` returns all rows from the left table, with NULLs where the function produces no result.

Answer 25

`RANK()` skips rank values when there are ties (e.g., two rows have rank 1, the next rank will be 3), while `DENSE_RANK()` does not skip ranks in the event of ties.
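The difference can be mimicked in plain Python. The sketch below is an illustration of the ranking rules, not how a database computes them; `rank_and_dense_rank` is a hypothetical helper and the scores are made up.

```python
def rank_and_dense_rank(values):
    """Return (value, rank, dense_rank) tuples, ordered from highest to lowest value."""
    rows = []
    rank = dense = 0
    previous = None
    for position, value in enumerate(sorted(values, reverse=True), start=1):
        if value != previous:
            rank = position   # RANK jumps to the row position after ties
            dense += 1        # DENSE_RANK only ever advances by one
            previous = value
        rows.append((value, rank, dense))
    return rows

for row in rank_and_dense_rank([100, 90, 100, 80]):
    print(row)
```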

Answer 26

`ROLLUP` provides subtotals and grand totals for a result set, while `CUBE` generates subtotals for every combination of grouped columns.

Answer 27

`INTERSECT` returns the common rows from two queries, removing duplicates, similar to the intersection of two sets.

Answer 28

`UNION` performs an additional step of removing duplicates, which makes it slower than `UNION ALL`, which doesn’t remove duplicates.

Answer 29

Use `LIMIT` and `OFFSET` in SQL (or `ROW_NUMBER()` in complex queries) to paginate results.

Answer 30

`EXPLAIN` shows the query execution plan, revealing how SQL queries are executed, helping identify performance bottlenecks and optimize indexes.

Answer 31

A materialized view stores the result of a query physically, and is periodically refreshed. It is useful for performance in large datasets.

Answer 32

Parametric tests assume underlying data distributions (e.g., t-tests for normal data), while non-parametric tests do not assume any distribution (e.g., Wilcoxon signed-rank test).

Answer 33

MLE is a method for estimating the parameters of a statistical model by maximizing the likelihood that the observed data was generated by the model.

Answer 34

A log-normal distribution is a probability distribution of a random variable whose logarithm is normally distributed.

Answer 35

Bootstrap sampling is a technique that involves repeatedly sampling from a dataset with replacement to estimate the distribution of a statistic.
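As a sketch of the idea, the function below estimates a percentile interval for the mean by resampling with replacement. It uses only the standard library; the function name, the sample values, the number of resamples, and the seed are all assumptions for illustration.

```python
import random
import statistics

def bootstrap_mean_interval(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n))  # draw n values with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

sample = [12, 15, 9, 20, 14, 11, 18, 16, 13, 17]
lower, upper = bootstrap_mean_interval(sample)
print(lower, statistics.fmean(sample), upper)
```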

Answer 36

Kurtosis measures the 'tailedness' of a distribution. High kurtosis means heavier tails and more extreme outliers, while low kurtosis means lighter tails and fewer extreme outliers than a normal distribution.

Answer 37

The F-test compares two variances to determine whether they differ significantly. In ANOVA, it compares the variance between group means to the variance within groups.

Answer 38

Cohen’s d is a measure of effect size, indicating the standardized difference between two group means. It’s commonly used in hypothesis testing to assess the practical significance of results.
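One common formulation divides the difference in means by the pooled standard deviation. The sketch below assumes that formulation and uses made-up groups; `cohens_d` is an illustrative helper name.

```python
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd

d = cohens_d([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
print(round(d, 3))
```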

Answer 39

Multicollinearity occurs when predictor variables in a regression model are highly correlated, making it difficult to determine the individual effect of each variable on the outcome.

Answer 40

A Chi-square goodness of fit test determines whether observed categorical data fits an expected distribution.

Answer 41

The Gini coefficient is a measure of inequality in a distribution, ranging from 0 (perfect equality) to 1 (perfect inequality), often used in economics and classification models.

Answer 42

Use `.apply()` with a custom function and specify `axis=0` for columns or `axis=1` for rows.

Answer 43

`.iloc[]` is used for integer-location based indexing, while `.loc[]` is label-based indexing.
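A short sketch of the two indexers on a made-up DataFrame with string labels, which also shows that label slices include the end label while position slices do not:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

by_label = df.loc["b", "value"]   # row labelled "b"
by_position = df.iloc[1, 0]       # second row, first column

label_slice = df.loc["a":"b"]     # label slices include the end label
position_slice = df.iloc[0:1]     # position slices exclude the end position
print(by_label, by_position, len(label_slice), len(position_slice))
```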

Answer 44

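`np.vectorize()` is a convenience function that wraps a Python function so it can be applied element-wise over arrays with broadcasting. It improves readability, but it is implemented essentially as a Python loop, so it does not give the speed of true vectorized NumPy operations.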

Answer 45

Use `pd.merge(df1, df2, on=['col1', 'col2'])` to merge two DataFrames on multiple columns.

Answer 46

A pivot table aggregates data in a DataFrame based on column(s) and row(s), allowing for easy summary and comparison of values.

Answer 47

Use `.resample()` for date-based grouping, and `.rolling()` or `.expanding()` for calculating rolling or cumulative statistics.

Answer 48

`.apply()` applies a function along rows or columns, while `.applymap()` applies a function element-wise to every element of a DataFrame. In pandas 2.1 and later, `.applymap()` is deprecated in favor of `DataFrame.map()`.

Answer 49

Use `pd.get_dummies()` to create one-hot encoding for categorical variables or `.astype('category')` to convert columns to categorical types.

Answer 50

Use `.rolling(window).mean()` to compute the rolling mean for a time series.

Answer 51

Broadcasting allows NumPy to perform arithmetic operations on arrays of different shapes by automatically expanding the smaller array to match the larger one’s shape.
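A minimal sketch with made-up arrays, showing a one-dimensional row and a single-column array each being stretched to match a 2 x 3 matrix:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)
column = np.array([[100], [200]])    # shape (2, 1)

print(matrix + row)      # row is stretched across both rows
print(matrix + column)   # column is stretched across all three columns
```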

Answer 52

Cross-validation is a technique used to evaluate a model’s performance by splitting data into training and validation sets multiple times, helping to detect overfitting and provide a more reliable performance estimate.

Answer 53

The bias-variance tradeoff refers to the balance between bias (error from overly simple assumptions about the data) and variance (sensitivity to fluctuations in the training data). High bias leads to underfitting, while high variance leads to overfitting.

Answer 54

L1 regularization (Lasso) adds the absolute value of coefficients as a penalty to the cost function, promoting sparsity. L2 regularization (Ridge) adds the squared value of coefficients, helping to prevent large coefficients but not promoting sparsity.

Answer 55

A confusion matrix is a table that visualizes the performance of a classification model by showing actual vs predicted classifications. Metrics such as precision, recall, F1 score, and accuracy can be computed from it.

Answer 56

Precision is the ratio of true positive predictions to all positive predictions, while recall is the ratio of true positive predictions to all actual positives in the dataset.
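Both ratios can be computed directly from labels. The sketch below uses plain Python and made-up binary labels; `precision_recall` is an illustrative helper name.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.6 0.75
```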

Answer 57

Gradient Descent is an optimization algorithm used to minimize the loss function by adjusting model parameters in the opposite direction of the gradient. Variants include batch, stochastic, and mini-batch gradient descent.
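A minimal sketch of batch gradient descent on a one-dimensional quadratic loss. The loss function, learning rate, and step count are arbitrary choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(n_steps):
        x -= learning_rate * gradient(x)
    return x

# Loss f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```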

Answer 58

Overfitting occurs when a model learns the noise in the training data instead of the underlying patterns, leading to poor performance on new data. It can be prevented by using regularization, cross-validation, and pruning decision trees.

Answer 59

K-fold cross-validation splits the data into K subsets (folds) and trains the model K times, each time holding out a different fold for validation and training on the remaining K-1 folds. Averaging the K scores gives a more reliable performance estimate.
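The splitting step can be sketched in plain Python. This is a simplified, unshuffled version for illustration; `k_fold_indices` is a hypothetical helper, not a library function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(validation, train)
```

Every sample lands in exactly one validation fold, so each one is used for validation once and for training K-1 times.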

Answer 60

A decision tree is a tree-like structure used for classification or regression. It splits the dataset into subsets based on the feature that maximizes information gain or minimizes impurity.

Answer 61

The curse of dimensionality refers to the exponential increase in complexity as the number of features increases, leading to sparsity in data and reducing the model's ability to generalize.

Answer 62

`INNER JOIN`: Returns rows when there is a match in both tables. `LEFT JOIN`: Returns all rows from the left table and matched rows from the right table, with NULLs where there is no match. `RIGHT JOIN`: Returns all rows from the right table and matched rows from the left table, with NULLs where there is no match.

Answer 63

A CTE is a temporary result set defined within the execution scope of a `SELECT`, `INSERT`, `UPDATE`, or `DELETE` statement. It is more readable and reusable compared to subqueries.

Answer 64

`WHERE` is used to filter records before any grouping is done. `HAVING` is used to filter groups after the `GROUP BY` operation.
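To see the two filters side by side, the sketch below runs both against an in-memory SQLite database through Python's `sqlite3` module. The `orders` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_value REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100), (1, 300), (2, 50), (2, 60), (3, 500)],
)

# WHERE filters individual rows before grouping.
where_rows = conn.execute(
    "SELECT customer_id, order_value FROM orders "
    "WHERE order_value > 90 ORDER BY customer_id, order_value"
).fetchall()

# HAVING filters whole groups after GROUP BY has aggregated them.
having_rows = conn.execute(
    "SELECT customer_id, SUM(order_value) AS total FROM orders "
    "GROUP BY customer_id HAVING SUM(order_value) > 200 ORDER BY customer_id"
).fetchall()

print(where_rows)
print(having_rows)
conn.close()
```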

Answer 65

An index improves the speed of data retrieval operations on a database table at the cost of additional space and time required for `INSERT`, `UPDATE`, and `DELETE` operations.

Answer 66

Normalization is the process of organizing data in a database to reduce redundancy and undesirable dependencies by dividing large tables into smaller, related ones.

Answer 67

`Clustered Index`: The table's data is physically ordered on the disk according to the index. There can only be one clustered index per table. `Non-clustered Index`: The index is stored separately from the table's data, and it points to the data in the table.

Answer 68

`UNION`: Combines the results of two queries and removes duplicates. `UNION ALL`: Combines the results and does not remove duplicates.

Answer 69

A window function performs a calculation across a set of table rows related to the current row. Example: `ROW_NUMBER()`, `RANK()`, `SUM() OVER (PARTITION BY ...)`.

Answer 70

- Use `df.fillna()` to fill missing values with a constant or a statistic (mean, median).
- Use `df.dropna()` to remove rows or columns with missing data.

Answer 71

`pivot()` is used when you have a simple reshaping requirement without aggregation. `pivot_table()` allows for aggregation (e.g., sum, mean) during reshaping.
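A short sketch of the difference on a made-up sales table that contains a duplicated region/product pair, which `pivot()` cannot reshape but `pivot_table()` can aggregate:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "product": ["A", "A", "A", "B"],
    "amount": [10, 5, 7, 3],
})

# pivot() cannot reshape here because ("N", "A") appears twice.
try:
    sales.pivot(index="region", columns="product", values="amount")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead.
table = sales.pivot_table(index="region", columns="product", values="amount", aggfunc="sum")
print(pivot_failed)
print(table)
```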

Answer 72

`groupby()`: Splits the data into groups based on certain columns, but requires a follow-up operation (like `sum()`, `mean()`). `agg()`: Aggregates multiple columns at once with different functions.
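As an illustration, the sketch below groups a made-up orders table and then aggregates it two ways: with a single follow-up `sum()`, and with `agg()` using named aggregations. The column and result names are assumptions.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "order_value": [100, 300, 50, 60, 70],
})

# A single follow-up aggregation on the grouped column.
totals = orders.groupby("customer_id")["order_value"].sum()

# Several aggregations at once, each given an output column name.
summary = orders.groupby("customer_id").agg(
    total=("order_value", "sum"),
    average=("order_value", "mean"),
    n_orders=("order_value", "count"),
)
print(totals)
print(summary)
```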

Answer 73

Use the `merge()` function and specify multiple columns in the `on` parameter:
```python
pd.merge(df1, df2, on=['col1', 'col2'])
```

Answer 74

`Classification`: Predicts discrete labels (e.g., spam or not spam). `Regression`: Predicts continuous numerical values (e.g., house price).

Answer 75

A confusion matrix is a table that describes the performance of a classification model by showing actual vs predicted values. Metrics such as accuracy, precision, recall, and F1 score can be computed from it.

Answer 76

Cross-validation is used to assess how the results of a statistical analysis generalize to an independent dataset, helping to mitigate overfitting.

Answer 77

`L1 regularization` (Lasso) adds the absolute value of coefficients as a penalty to the loss function. `L2 regularization` (Ridge) adds the squared value of coefficients as a penalty.

Answer 78

A decision tree is a supervised learning model used for classification and regression, which splits the data based on feature values to predict an outcome.

Answer 79

`Bias`: Error due to overly simplistic models (underfitting). `Variance`: Error due to overly complex models (overfitting). Balancing them ensures the model generalizes well.

Answer 80

`Bagging` (e.g., Random Forest) combines multiple models trained independently to reduce variance. `Boosting` (e.g., AdaBoost, XGBoost) sequentially trains models, focusing on correcting errors of previous models to reduce bias.

Answer 81

PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional form while preserving as much variance as possible.

Answer 82

CREATE INDEX idx_customer_id ON orders(customer_id);

Answer 83

WITH avg_order_value AS (
    SELECT customer_id, AVG(order_value) AS avg_value
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM avg_order_value;

Answer 84

SELECT employee_id, name, hire_date
FROM employees
WHERE hire_date < DATE_SUB(CURDATE(), INTERVAL 5 YEAR);

Answer 85

SELECT * FROM orders WHERE order_date >= CURDATE() - INTERVAL 30 DAY;

Answer 86

import pandas as pd
# Build a small orders table from a dict of columns
data = {
    'customer_id': [1, 2, 3],
    'order_value': [100, 150, 200],
    'order_date': ['2023-01-01', '2023-01-02', '2023-01-03']
}
df = pd.DataFrame(data)
print(df)

Answer 87

# Keep only rows whose order_value exceeds 150
filtered_df = df[df['order_value'] > 150]
print(filtered_df)

Answer 88

SELECT MONTH(sales_date) AS month, SUM(sale_amount) AS total_sales
FROM sales
GROUP BY MONTH(sales_date);

Answer 89

SELECT c.customer_name, o.order_value
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

Answer 90

# Mean and sample standard deviation of the order values
mean_value = df['order_value'].mean()
std_value = df['order_value'].std()
print(f'Mean: {mean_value}, Standard Deviation: {std_value}')

Answer 91

# Parse the date strings, then extract the year into a new column
df['order_date'] = pd.to_datetime(df['order_date'])
df['order_year'] = df['order_date'].dt.year
print(df)

Answer 92

SELECT customer_id, SUM(order_value) AS total_sales
FROM orders
GROUP BY customer_id;

Answer 93

SELECT customer_id, SUM(order_value) AS total_sales
FROM orders
GROUP BY customer_id
ORDER BY total_sales DESC
LIMIT 5;

Answer 94

import numpy as np
# Natural log of each order value (values must be positive)
df['log_order_value'] = np.log(df['order_value'])
print(df)

Answer 95

# Sort rows from highest to lowest order_value
df_sorted = df.sort_values(by='order_value', ascending=False)
print(df_sorted)

Answer 96

UPDATE orders SET order_value = 250 WHERE order_id = 101;

Answer 97

DELETE FROM orders WHERE order_date < CURDATE() - INTERVAL 1 YEAR;

Answer 98

import matplotlib.pyplot as plt
# Box plot of the order values
df['order_value'].plot(kind='box')
plt.show()

Answer 99

# Inner join (the default) of df1 and df2 on customer_id
merged_df = pd.merge(df1, df2, on='customer_id')
print(merged_df)

Answer 100

SELECT MAX(sale_amount) AS second_highest
FROM sales
WHERE sale_amount < (SELECT MAX(sale_amount) FROM sales);

Answer 101

SELECT e.department_id, e.employee_id, e.salary
FROM employees e
JOIN departments d ON e.department_id = d.department_id
WHERE e.salary = (
    SELECT MAX(salary)
    FROM employees
    WHERE department_id = e.department_id
);