Intro and EDA: Important Flashcards
What are the three main types of data?
Structured Data: Organized and easily searchable (e.g., relational databases).
Unstructured Data: No predefined format or schema (e.g., emails, social media posts).
Semi-structured Data: Partially organized (e.g., XML, JSON).
What is the difference between qualitative and quantitative data?
Qualitative Data: Descriptive, non-numerical (e.g., gender, country).
Quantitative Data: Measurable, numerical (e.g., height, weight).
What is the difference between raw and processed data?
Raw Data: Original, unprocessed, and often unsuitable for analysis.
Processed Data: Cleaned, transformed, and organized for analysis.
What are the five measures of data quality?
Accuracy: Values are correct, i.e., they match the real-world facts they describe.
Completeness: All necessary data present.
Consistency: No conflicting values.
Timeliness: Data is up-to-date.
Believability: Data is trustworthy.
What are common data quality issues?
Inconsistency: Different formats for the same data across systems.
Noisy Data: Contains errors or outliers.
Missing Data: Values are unrecorded or deleted.
What techniques are used to handle noisy data?
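Binning: Sort the values, group them into bins, and smooth each value using its bin (e.g., replace it with the bin mean or median).
Regression: Smooth data by fitting it to a regression function.
Clustering: Group similar values into clusters; values that fall outside every cluster are treated as outliers and can be removed.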
Hybrid Methods: Combine automated detection with manual validation.
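Example (binning): a minimal sketch of smoothing by bin means, assuming equal-size bins over the sorted values. The function name smooth_by_bin_means, the bin size, and the sample prices are hypothetical and chosen only for illustration.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-size bins, and replace
    each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = mean(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample data
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)  # the three bin means are 9, 22, and 29
```

Smoothing by bin medians or bin boundaries would follow the same pattern with a different replacement rule.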
What are the steps in the data cleaning process?
Detect discrepancies.
Scrub data (fix errors).
Audit data (validate fixes).
Migrate and integrate data.
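Example (cleaning steps): a rough sketch of the detect, scrub, and audit steps on a few hypothetical records. The records, the CANONICAL_COUNTRY mapping, and the function names are assumptions made for this example; the migrate and integrate step is not shown.

```python
# Hypothetical records: inconsistent country spellings and a missing age.
records = [
    {"name": "Ana", "country": "USA", "age": 34},
    {"name": "Ben", "country": "U.S.A.", "age": None},
    {"name": "Cy", "country": "usa", "age": 29},
]
CANONICAL_COUNTRY = {"usa": "USA", "u.s.a.": "USA"}  # assumed mapping

def detect(rows):
    """Step 1: return (row index, field) pairs that look wrong."""
    issues = []
    for i, row in enumerate(rows):
        if row["age"] is None:
            issues.append((i, "age"))
        if row["country"] not in CANONICAL_COUNTRY.values():
            issues.append((i, "country"))
    return issues

def scrub(rows):
    """Step 2: fix what can be fixed automatically (country spellings)."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy so the raw data stays untouched
        key = row["country"].lower()
        fixed["country"] = CANONICAL_COUNTRY.get(key, row["country"])
        cleaned.append(fixed)
    return cleaned

issues_before = detect(records)
cleaned_records = scrub(records)
issues_after = detect(cleaned_records)  # Step 3: audit by re-running detection
print(issues_before)
print(issues_after)
```

The audit still reports the missing age, which this sketch leaves for a manual decision (drop the row, impute a value, or look it up).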
What is a codebook, and why is it important?
A codebook describes variables, their meanings, and study designs. It is essential for data sharing and reproducibility.
What is the goal of Exploratory Data Analysis (EDA)?
To understand the data and identify patterns, trends, and outliers before applying advanced techniques.
What are the three measures of central tendency?
Mean: Average, sensitive to outliers.
Median: Middle value, robust to outliers.
Mode: Most frequent value.
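Example (central tendency): a small sketch using Python's standard statistics module. The salary figures are hypothetical, with one extreme value included to show how differently the mean and the median react to it.

```python
from statistics import mean, median, mode

salaries = [30, 32, 32, 35, 38, 40, 400]  # hypothetical, one extreme value
typical = salaries[:-1]                   # same data without the extreme value

mean_with, mean_without = mean(salaries), mean(typical)
median_with, median_without = median(salaries), median(typical)
most_common = mode(salaries)

print("mean:  ", mean_with, "vs", mean_without)
print("median:", median_with, "vs", median_without)
print("mode:  ", most_common)
```

The single extreme value moves the mean from 34.5 to roughly 86.7, while the median only moves from 33.5 to 35.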
What are the measures of variability?
Variance: Average squared deviation from the mean.
Standard Deviation: Square root of the variance; spread of data around the mean, in the same units as the data.
Interquartile Range (IQR): Difference between Q3 and Q1.
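Example (variability): a minimal sketch using the standard statistics module. It assumes the population formulas (divide by n), which match the "average squared deviation" definition above; sample variance divides by n - 1 instead. The data values are hypothetical, and quartile conventions differ between tools, so other software may report slightly different Q1 and Q3 values.

```python
from statistics import pvariance, pstdev, quantiles

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

variance = pvariance(data)        # average squared deviation from the mean
std_dev = pstdev(data)            # square root of the variance
q1, q2, q3 = quantiles(data, n=4) # quartiles (default "exclusive" method)
iqr = q3 - q1

print("variance:", variance)
print("std dev: ", std_dev)
print("IQR:     ", iqr)
```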
What are common visualizations used in EDA?
Histograms: Show frequency distributions.
Box Plots: Visualize the five-number summary (minimum, Q1, median, Q3, maximum) and outliers.
Scatter Plots: Identify relationships between two variables.
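Example (what a box plot draws): a sketch that computes the five-number summary and flags outliers without plotting anything. The 1.5 x IQR fence is a common box plot convention and is an assumption here, not something these cards specify; the data values are hypothetical.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]  # hypothetical, one extreme value

q1, med, q3 = quantiles(values, n=4)
summary = (min(values), q1, med, q3, max(values))

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("five-number summary:", summary)
print("outliers:", outliers)
```

Here the box spans 3 to 9, the fences sit at -6 and 18, and only the value 50 falls outside them.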
What is the bias-variance tradeoff?
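Bias: Error from overly simplistic models; high bias leads to underfitting.
Variance: Error from overly complex models that are sensitive to the training data; high variance leads to overfitting.
Tradeoff: Reducing one typically increases the other, so the goal is a model that balances both.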
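Example (symptoms of bias and variance): a toy sketch, not a formal decomposition. It compares a model that is too simple (always predicts the training mean) with one that is too flexible (memorizes the nearest training point). The data, model names, and the mse helper are all hypothetical.

```python
from statistics import mean

# Toy data (assumed): y is roughly 2x, with noise only in the training set.
train = [(0, 0.5), (1, 1.5), (2, 4.5), (3, 5.5), (4, 8.5)]
test = [(0.4, 0.8), (1.4, 2.8), (2.4, 4.8), (3.4, 6.8)]

def constant_model(x):
    """Too simple: always predicts the mean training target."""
    return mean(y for _, y in train)

def memorizer_model(x):
    """Too flexible: returns the target of the nearest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return mean((y - model(x)) ** 2 for x, y in data)

for name, model in [("constant", constant_model), ("memorizer", memorizer_model)]:
    print(name, "train MSE:", mse(model, train), "test MSE:", mse(model, test))
```

The constant model has a large error on both sets (a sign of underfitting), while the memorizer has zero training error but a nonzero test error (the train/test gap typical of overfitting).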
What are the principles of good models?
Probabilistic: Reflect uncertainty.
Feedback: Update predictions with new data.
Consensus: Combine results across models for reliability.
What is the purpose of hypothesis testing?
To use sample data to test a claim about a population parameter (e.g., mean, proportion).
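Example (testing a claim about a mean): a minimal sketch of a two-sided one-sample z-test, assuming the population standard deviation is known and the sample mean is approximately normal. The function name one_sample_z_test, the heights, the claimed mean of 170, the sigma of 6, and the 0.05 significance level are all hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: population mean == mu0, with known sigma."""
    standard_error = sigma / sqrt(len(sample))
    z = (mean(sample) - mu0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

heights = [172, 175, 169, 180, 178, 174, 176, 171, 177]  # hypothetical sample
z, p_value = one_sample_z_test(heights, mu0=170, sigma=6)

alpha = 0.05  # assumed significance level
print("z =", round(z, 3), "p =", round(p_value, 4))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

When sigma is unknown and the sample is small, a t-test is the usual choice instead; it is not shown here because the t-distribution is not in the standard library.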