intro and EDA IMP stuff Flashcards

Question 1

Q

What are the three main types of data?

Answer

A

Structured Data: Organized and easily searchable (e.g., relational databases).
Unstructured Data: No predefined schema or data model (e.g., emails, social media posts).
Semi-structured Data: Partially organized (e.g., XML, JSON).

Question 2

Q

What is the difference between qualitative and quantitative data?

Answer

A

Qualitative Data: Descriptive, non-numerical (e.g., gender, country).
Quantitative Data: Measurable, numerical (e.g., height, weight).

Question 3

Q

What is the difference between raw and processed data?

Answer

A

Raw Data: Original, unprocessed, and often unsuitable for analysis.
Processed Data: Cleaned, transformed, and organized for analysis.

Question 4

Q

What are the five measures of data quality?

Answer

A

Accuracy: Values are correct and match the real-world facts they describe.
Completeness: All necessary data present.
Consistency: No conflicting values.
Timeliness: Data is up-to-date.
Believability: Data is trustworthy.

Question 5

Q

What are common data quality issues?

Answer

A

Inconsistency: Different formats for the same data across systems.
Noisy Data: Contains errors or outliers.
Missing Data: Values are unrecorded or deleted.

Question 6

Q

What techniques are used to handle noisy data?

Answer

A

Binning: Smooth data by grouping into bins.
Regression: Smooth data by fitting it to a regression function.
Clustering: Group similar values; values that fall outside every cluster are treated as outliers and removed.
Hybrid Methods: Combine automated detection with manual validation.
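
A minimal sketch of the binning technique above, assuming equal-frequency bins smoothed by bin means (one common variant; the card does not say which variant is intended). The function name and the sample values are made up for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample data
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Each value is pulled toward its neighbors, which dampens small random fluctuations at the cost of losing detail inside each bin.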

Question 7

Q

What are the steps in the data cleaning process?

Answer

A

Detect discrepancies.
Scrub data (fix errors).
Audit data (validate fixes).
Migrate and integrate data.
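
The first three steps above could be sketched as follows. This is an illustration only: the records, the validity rules (age between 0 and 120), and the country alias table are all assumptions, and the migrate/integrate step is left out because it depends on the target system.

```python
# Hypothetical records with typical problems: an alias, an impossible age, a missing age.
records = [
    {"id": 1, "country": "US", "age": 34},
    {"id": 2, "country": "usa", "age": -5},
    {"id": 3, "country": "DE", "age": None},
]
COUNTRY_ALIASES = {"usa": "US"}  # assumed mapping of nonstandard codes


def detect(rows):
    """Step 1: list discrepancies without changing anything."""
    problems = []
    for row in rows:
        if row["age"] is None:
            problems.append((row["id"], "missing age"))
        elif not 0 <= row["age"] <= 120:
            problems.append((row["id"], "age out of range"))
        if row["country"].lower() in COUNTRY_ALIASES:
            problems.append((row["id"], "nonstandard country code"))
    return problems


def scrub(rows):
    """Step 2: fix what can be fixed safely, on a copy of the data."""
    cleaned = []
    for row in rows:
        fixed = dict(row)
        fixed["country"] = COUNTRY_ALIASES.get(fixed["country"].lower(), fixed["country"])
        if fixed["age"] is not None and not 0 <= fixed["age"] <= 120:
            fixed["age"] = None  # mark impossible ages as missing rather than guess
        cleaned.append(fixed)
    return cleaned


before = detect(records)
cleaned = scrub(records)
after = detect(cleaned)  # Step 3: audit by re-running detection on the scrubbed data
print(before)
print(after)
```

Re-running the same checks after scrubbing shows which problems were fixed and which (here, missing ages) still need a decision.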

Question 8

Q

What is a codebook, and why is it important?

Answer

A

A codebook describes variables, their meanings, and study designs. It is essential for data sharing and reproducibility.

Question 9

Q

What is the goal of Exploratory Data Analysis (EDA)?

Answer

A

To understand the data and identify patterns, trends, and outliers before applying advanced techniques.

Question 10

Q

What are the three measures of central tendency?

Answer

A

Mean: Average, sensitive to outliers.
Median: Middle value, robust to outliers.
Mode: Most frequent value.
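
A short example of the three measures, using Python's standard `statistics` module. The salary figures are made up, with one extreme value included to show how the mean and median react differently.

```python
import statistics

salaries = [30, 32, 35, 35, 38, 40, 300]  # hypothetical values; 300 is an outlier

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle of the sorted values
mode_salary = statistics.mode(salaries)      # most frequent value

print(round(mean_salary, 2), median_salary, mode_salary)
# 72.86 35 35
```

The single outlier more than doubles the mean, while the median and mode stay at 35.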

Question 11

Q

What are the measures of variability?

Answer

A

Variance: Average squared deviation from the mean.
Standard Deviation: Square root of the variance; measures spread around the mean in the data's original units.
Interquartile Range (IQR): Difference between Q3 and Q1.
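
A small sketch of the three measures with the standard `statistics` module. It assumes the population formulas (dividing by n), which matches "average squared deviation"; sample statistics divide by n - 1 instead. The scores are made up, and quartile values depend on the interpolation method the library uses.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5

variance = statistics.pvariance(scores)  # population variance: mean of squared deviations
std_dev = statistics.pstdev(scores)      # square root of the population variance

# quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print(variance, std_dev, iqr)
```

For sample data, `statistics.variance` and `statistics.stdev` apply the n - 1 correction.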

Question 12

Q

What are common visualizations used in EDA?

Answer

A

Histograms: Show frequency distributions.
Box Plots: Visualize 5-number summaries and outliers.
Scatter Plots: Identify relationships between two variables.
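
The numbers behind a box plot can be computed directly. The sketch below assumes the common 1.5 x IQR rule for flagging outliers (a convention, not something the card specifies) and uses made-up data; the function names are illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot draws."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Flag points beyond k * IQR from the quartiles (k=1.5 is an assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]  # hypothetical values
print(five_number_summary(data))
print(iqr_outliers(data))
```

Here 40 lies far above the upper fence, so a box plot would draw it as a separate point beyond the whisker.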

Question 13

Q

What is the bias-variance tradeoff?

Answer

A

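Bias: Error from overly simple models; high bias leads to underfitting.
Variance: Error from overly complex models that are sensitive to the training data; high variance leads to overfitting.
Tradeoff: Reducing one tends to increase the other, so the goal is a model complexity that balances both.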

Question 14

Q

What are the principles of good models?

Answer

A

Probabilistic: Reflect uncertainty.
Feedback: Update predictions with new data.
Consensus: Combine results across models for reliability.

Question 15

Q

What is the purpose of hypothesis testing?

Answer

A

To use sample data to decide whether there is enough evidence to reject an assumption about a population parameter (e.g., mean, proportion).

Question 16

Q

What is the difference between a null and alternative hypothesis?

Answer

A

Null Hypothesis (H0): Assumes no effect or status quo.
Alternative Hypothesis (H1): Suggests an effect or deviation from the null.

Question 17

Q

What are Type I and Type II errors?

Answer

A

Type I: False positive (rejecting a true null hypothesis).
Type II: False negative (failing to reject a false null hypothesis).
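
Both error types can be estimated by simulation. The sketch below assumes a two-sided z-test with known standard deviation; the sample size, effect size (0.3), trial count, and seed are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist


def rejection_rate(true_mean, null_mean=0.0, sigma=1.0, n=25,
                   alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated samples in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = (sample_mean - null_mean) / (sigma / n ** 0.5)
        if abs(z) > critical:
            rejections += 1
    return rejections / trials


# H0 is true (true mean equals the null mean): every rejection is a Type I error.
type_i_rate = rejection_rate(true_mean=0.0)

# H0 is false (true mean is 0.3): every failure to reject is a Type II error.
type_ii_rate = 1 - rejection_rate(true_mean=0.3)

print(type_i_rate, type_ii_rate)
```

The Type I rate should land near the chosen α, while the Type II rate depends on the effect size and sample size.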

Question 18

Q

What is the rejection region in hypothesis testing?

Answer

A

The range of test-statistic values that leads to rejecting the null hypothesis, determined by the significance level (α).
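
As a worked illustration, the sketch below finds the rejection region for a two-sided z-test, assuming a known population standard deviation. The function name and all numbers are made up for the example.

```python
from statistics import NormalDist


def two_sided_z_test(sample_mean, null_mean, sigma, n, alpha=0.05):
    """Return the z statistic, the critical value, and whether H0 is rejected."""
    z = (sample_mean - null_mean) / (sigma / n ** 0.5)
    # Rejection region: |z| greater than the (1 - alpha/2) quantile of N(0, 1).
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return z, critical, abs(z) > critical


z, critical, reject = two_sided_z_test(sample_mean=52.0, null_mean=50.0, sigma=5.0, n=36)
print(round(z, 2), round(critical, 2), reject)
# 2.4 1.96 True
```

With α = 0.05 the rejection region is roughly |z| > 1.96; the observed z of 2.4 falls inside it, so the null hypothesis is rejected.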

(18 cards)