Intro and EDA: Important Concepts Flashcards

1
Q

What are the three main types of data?

A

Structured Data: Organized and easily searchable (e.g., relational databases).
Unstructured Data: Unorganized formats (e.g., emails, social media posts).
Semi-structured Data: Partially organized (e.g., XML, JSON).

2
Q

What is the difference between qualitative and quantitative data?

A

Qualitative Data: Descriptive, non-numerical (e.g., gender, country).
Quantitative Data: Measurable, numerical (e.g., height, weight).

3
Q

What is the difference between raw and processed data?

A

Raw Data: Original, unprocessed, and often unsuitable for analysis.
Processed Data: Cleaned, transformed, and organized for analysis.

4
Q

What are the five measures of data quality?

A

Accuracy: Recorded values are correct.
Completeness: All necessary data present.
Consistency: No conflicting values.
Timeliness: Data is up-to-date.
Believability: Data is trustworthy.

5
Q

What are common data quality issues?

A

Inconsistency: Different formats for the same data across systems.
Noisy Data: Contains errors or outliers.
Missing Data: Values are unrecorded or deleted.

6
Q

What techniques are used to handle noisy data?

A

Binning: Smooth data by grouping into bins.
Regression: Smooth data by fitting it to a regression function.
Clustering: Group similar values; points that fall outside every cluster are treated as outliers.
Hybrid Methods: Combine automated detection with manual validation.
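As an illustration of binning, here is a minimal sketch of smoothing by bin means with equal-frequency bins. The function name `smooth_by_bin_means`, the bin size, and the sample values are assumptions made for this example, not part of the original card.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = mean(bin_values)
        # Each value in the bin is replaced by the same bin mean.
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical sample data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, bin_size=3))
```

With a bin size of 3, the three bins have means 9, 22, and 29, so local fluctuations inside each bin are flattened. Smoothing by bin medians or bin boundaries would follow the same pattern with a different replacement value.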

7
Q

What are the steps in the data cleaning process?

A

Detect discrepancies.
Scrub data (fix errors).
Audit data (validate fixes).
Migrate and integrate data.

8
Q

What is a codebook, and why is it important?

A

A codebook describes variables, their meanings, and study designs. It is essential for data sharing and reproducibility.

9
Q

What is the goal of Exploratory Data Analysis (EDA)?

A

To understand the data, identify patterns, trends, and outliers before applying advanced techniques.

10
Q

What are the three measures of central tendency?

A

Mean: Average, sensitive to outliers.
Median: Middle value, robust to outliers.
Mode: Most frequent value.
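A short sketch of how the three measures react to an outlier, using Python's standard `statistics` module. The salary figures are hypothetical values chosen for the example.

```python
from statistics import mean, median, mode

# Hypothetical salaries in thousands.
salaries = [30, 32, 32, 35, 38]
with_outlier = salaries + [500]  # add one extreme value

print(mean(salaries), median(salaries), mode(salaries))
print(round(mean(with_outlier), 2), median(with_outlier), mode(with_outlier))
```

Adding the single value 500 moves the mean from 33.4 to about 111.17, while the median only shifts from 32 to 33.5 and the mode stays at 32.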

11
Q

What are the measures of variability?

A

Variance: Average squared deviation from the mean.
Standard Deviation: Square root of the variance; spread of data around the mean, in the data's original units.
Interquartile Range (IQR): Difference between Q3 and Q1.
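The three measures can be computed with the standard `statistics` module, as in the sketch below. The data set is hypothetical. Two assumptions are worth noting: the population formulas (`pvariance`, `pstdev`) are used to match the "average squared deviation" definition, whereas sample variance divides by n - 1; and quartiles are computed with the "inclusive" method, one of several conventions that can give slightly different Q1 and Q3.

```python
from statistics import pvariance, pstdev, quantiles

# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = pvariance(data)  # mean of squared deviations from the mean
std_dev = pstdev(data)      # square root of the population variance

# Quartiles under the "inclusive" convention (an assumption of this sketch).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(variance, std_dev, iqr)
```

For this data the mean is 5, the squared deviations sum to 32, so the variance is 4 and the standard deviation is 2. The quartiles come out as Q1 = 4.0 and Q3 = 5.5, giving an IQR of 1.5.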

12
Q

What are common visualizations used in EDA?

A

Histograms: Show frequency distributions.
Box Plots: Visualize 5-number summaries and outliers.
Scatter Plots: Identify relationships between two variables.
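A box plot is drawn from numbers that can be computed directly. The sketch below derives the 5-number summary and flags outliers with the common 1.5 × IQR rule. The helper names, the data, the choice of the 1.5 multiplier, and the "inclusive" quartile method are all assumptions of this example; plotting libraries may use slightly different quartile conventions.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data with one extreme value.
data = [2, 4, 4, 4, 5, 5, 7, 9, 30]
print(five_number_summary(data))
print(iqr_outliers(data))
```

Here Q1 = 4 and Q3 = 7, so the fences sit at -0.5 and 11.5, and only the value 30 is flagged as an outlier.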

13
Q

What is the bias-variance tradeoff?

A

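Bias: Error from overly simplistic models that miss real patterns (underfitting).
Variance: Error from overly complex models that are too sensitive to the training data (overfitting).
Tradeoff: Reducing one tends to increase the other; the goal is a model that balances both.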

14
Q

What are the principles of good models?

A

Probabilistic: Reflect uncertainty.
Feedback: Update predictions with new data.
Consensus: Combine results across models for reliability.

15
Q

What is the purpose of hypothesis testing?

A

To use sample data to test a claim about a population parameter (e.g., mean, proportion).

16
Q

What is the difference between a null and alternative hypothesis?

A

Null Hypothesis (H0): Assumes no effect or status quo.
Alternative Hypothesis (H1): Suggests an effect or deviation from the null.

17
Q

What are Type I and Type II errors?

A

Type I: False positive (rejecting a true null hypothesis).
Type II: False negative (failing to reject a false null hypothesis).
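Both error rates can be estimated by simulation. The sketch below assumes a two-sided z-test with known standard deviation; the function names, sample size, number of trials, and the alternative mean of 0.5 are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist, mean

def z_test_rejects(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test: True if H0 (population mean == mu0) is rejected."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical

def rejection_rate(true_mu, mu0=0.0, sigma=1.0, n=25, trials=2000, seed=0):
    """Fraction of simulated samples for which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        rejections += z_test_rejects(sample, mu0, sigma)
    return rejections / trials

# H0 is true: every rejection is a Type I error (rate should be near alpha).
type_i_rate = rejection_rate(true_mu=0.0)

# H0 is false: every failure to reject is a Type II error.
type_ii_rate = 1 - rejection_rate(true_mu=0.5)

print(type_i_rate, type_ii_rate)
```

When the null hypothesis is true, the Type I rate should land close to the chosen α of 0.05. The Type II rate depends on how far the true mean is from the null value and on the sample size; a larger effect or more data makes it smaller.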

18
Q

What is the rejection region in hypothesis testing?

A

The range of test-statistic values that leads to rejecting the null hypothesis, determined by the significance level (α).
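For a z-test, the rejection region follows directly from α. The sketch below computes the critical values from the standard normal distribution; the function names and the `tail` parameter are assumptions of this example, and other tests (t, chi-square) would use their own distributions.

```python
import math
from statistics import NormalDist

def critical_values(alpha=0.05, tail="two-sided"):
    """Return (lower, upper): reject H0 when z < lower or z > upper."""
    std_normal = NormalDist()
    if tail == "two-sided":
        c = std_normal.inv_cdf(1 - alpha / 2)  # alpha split across both tails
        return -c, c
    if tail == "right":
        return -math.inf, std_normal.inv_cdf(1 - alpha)
    if tail == "left":
        return std_normal.inv_cdf(alpha), math.inf
    raise ValueError("tail must be 'two-sided', 'right', or 'left'")

def in_rejection_region(z, alpha=0.05, tail="two-sided"):
    """True if the test statistic z falls in the rejection region."""
    lower, upper = critical_values(alpha, tail)
    return z < lower or z > upper

print(critical_values(0.05))          # roughly (-1.96, 1.96)
print(critical_values(0.05, "right")) # upper bound roughly 1.645
print(in_rejection_region(2.1))       # True: 2.1 is beyond 1.96
```

A smaller α pushes the critical values further out, which shrinks the rejection region and makes the null hypothesis harder to reject.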