Data Analysis Flashcards

Question 1

Q

Exploratory Data Analysis

Answer

A

The process of examining a dataset to find insights and patterns.

Question 2

Q

Data Cleaning

Answer

A

EDA helps in identifying errors, missing values, and outliers that need to be addressed before deeper analysis.
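As a rough illustration of spotting missing values and outliers, here is a minimal sketch using only the Python standard library. The records, the `age` field, and the helper names `find_missing` and `iqr_outliers` are hypothetical, and the 1.5 x IQR rule is one common convention rather than the only way to flag outliers.

```python
from statistics import quantiles

def find_missing(rows, field):
    """Return indexes of rows where `field` is absent, None, or blank."""
    return [i for i, row in enumerate(rows)
            if row.get(field) is None or str(row.get(field)).strip() == ""]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical records: two of them have no usable age.
rows = [{"age": 34}, {"age": None}, {"age": 29}, {"age": ""}, {"age": 31}]
missing_rows = find_missing(rows, "age")

# Hypothetical ages: 240 is almost certainly a data entry error.
ages = [34, 29, 31, 30, 33, 28, 32, 240]
outliers = iqr_outliers(ages)
```

Flagged rows and values are candidates for review, not automatic deletions; whether to drop, correct, or keep them depends on the analysis.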

Question 3

Q

Insight Generation

Answer

A

It provides a first look at the data, allowing analysts to develop insights and hypotheses for testing.

Question 4

Q

Model Preparation

Answer

A

By understanding relationships and patterns, EDA informs feature selection and model building strategies.

Question 5

Q

Assumption Validation

Answer

A

Many statistical models and machine learning algorithms require assumptions about data distribution or variable relationships. EDA validates these assumptions upfront.

Question 6

Q

Communication

Answer

A

EDA techniques generate visualizations and summaries that communicate findings effectively to stakeholders.

Question 7

Q

Common Data formats

Answer

A

CSV/TSV: Tabular data, easy to work with.
JSON: Nested dictionary format, ideal for hierarchical data.
XML/HTML: Nested markup, often encountered in web scraping.
Log data: Unstructured text, parsed with regular expressions.
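The sketch below shows how three of these formats might be read with the Python standard library. All sample strings, field names, and the log layout are invented for illustration; real logs vary widely and usually need a pattern tailored to them. XML/HTML parsing is left out to keep the example short.

```python
import csv
import io
import json
import re

# CSV: each line is one record; DictReader maps header names to values.
csv_text = "id,city\n1,Berkeley\n2,Oakland\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: nested dictionaries and lists map directly onto Python objects.
json_text = '{"station": {"name": "Downtown", "sensors": [{"id": "a1"}]}}'
record = json.loads(json_text)
first_sensor = record["station"]["sensors"][0]["id"]

# Log data: a regular expression pulls fields out of unstructured text.
# Assumed layout: date, time, level, then a free-form message.
log_line = "2024-03-01 12:00:05 ERROR disk full on /dev/sda1"
pattern = re.compile(r"^(\S+) (\S+) (\w+) (.*)$")
date_part, time_part, level, message = pattern.match(log_line).groups()
```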

Question 8

Q

Granularity

Answer

A

The level of detail represented by each record in a dataset.

Question 9

Q

Scope

Answer

A

The coverage of the dataset in relation to what we are interested in analyzing.

Question 10

Q

Temporality

Answer

A

Refers to the timing aspects of data, which are crucial for understanding when events occurred.

Question 11

Q

Two different types of scope

Answer

A

Temporal scope: the dataset spans a limited period.
Geographical scope: the dataset is geographically focused on certain hotspots.

Question 12

Q

Four aspects of temporality

Answer

A

Date and time fields - In the calls and stops datasets, datetime fields mark when police interactions were reported.
Timezone awareness - Consider time zones and daylight saving time for accurate temporal analysis.
Date format - The US date format (MM/DD/YYYY) is used; knowing this is essential for correct interpretation.
Placeholder dates - Watch for default timestamps that may indicate missing values.
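A minimal sketch of these checks, using only the standard library, might look like the following. The timestamp layout, the set of placeholder dates, and the helper name `parse_us_timestamp` are assumptions for illustration. A fixed UTC offset is used for simplicity, which ignores daylight saving time; a real analysis would more likely use a named time zone (for example via the standard `zoneinfo` module).

```python
from datetime import datetime, timedelta, timezone

US_FORMAT = "%m/%d/%Y %H:%M"  # month first, as in MM/DD/YYYY

# Assumed placeholder values; the real defaults depend on the source system.
PLACEHOLDERS = {datetime(1900, 1, 1), datetime(1970, 1, 1)}

def parse_us_timestamp(text, utc_offset_hours):
    """Parse MM/DD/YYYY HH:MM text; return None for placeholder dates."""
    naive = datetime.strptime(text, US_FORMAT)
    if naive in PLACEHOLDERS:
        return None  # treat the default timestamp as a missing value
    return naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))

# "03/04/2021" is March 4 under the US convention, not April 3.
parsed = parse_us_timestamp("03/04/2021 14:30", utc_offset_hours=-8)
as_utc = parsed.astimezone(timezone.utc)
placeholder = parse_us_timestamp("01/01/1900 00:00", utc_offset_hours=-8)
```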

Question 13

Q

Faithfulness

Answer

A

A dataset is considered faithful if it accurately reflects reality, which is crucial for reliable analysis.

Question 14

Q

Common faithfulness issues

Answer

A

Unrealistic values - Future dates, nonexistent locations, negative counts, or significant outliers.
Dependency violations - Mismatches between related fields, such as age and birthdate.
Manual data entry errors - Prone to spelling mistakes and inconsistencies.
Data falsification indicators - Repeated unusual names or email addresses, suggesting fabricated entries.
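Some of these issues can be checked mechanically. The sketch below covers future dates, negative counts, and an age/birthdate mismatch. The record layout, the field names, and the one-year tolerance are hypothetical choices made for the example.

```python
from datetime import date

def faithfulness_issues(record, today):
    """Return a list of human-readable problems found in one record."""
    issues = []
    if record["event_date"] > today:
        issues.append("future date")
    if record["count"] < 0:
        issues.append("negative count")
    # Dependency check: the stated age should match the birthdate.
    born = record["birthdate"]
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    implied_age = today.year - born.year - (0 if had_birthday else 1)
    if abs(implied_age - record["age"]) > 1:  # assumed tolerance of one year
        issues.append("age does not match birthdate")
    return issues

today = date(2024, 6, 1)  # fixed reference date so the example is repeatable
good = {"event_date": date(2024, 5, 1), "count": 3,
        "birthdate": date(1990, 1, 15), "age": 34}
bad = {"event_date": date(2030, 1, 1), "count": -2,
       "birthdate": date(1990, 1, 15), "age": 12}
good_issues = faithfulness_issues(good, today)
bad_issues = faithfulness_issues(bad, today)
```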

Question 15

Q

Quantitative data plots

Answer

A

Histograms, box plots, scatter plots

Question 16

Q

Qualitative data plots

Answer

A

Bar plots, dot plots, mosaic plots

Question 17

Q

Mixed data type plots

Answer

A

Overlaid density curves, side-by-side box plots

Question 18

Q

Rug plots

Answer

A

These plots show individual data points as marks along an axis, like threads on a rug. They’re helpful for a small number of observations but become cluttered with larger datasets.

Question 19

Q

Two quantitative features in a relationship

Answer

A

Use scatter plots to explore relationships, looking for linear or nonlinear patterns.

Question 20

Q

One qualitative and one quantitative variable in a relationship

Answer

A

Divide data into groups based on the qualitative feature and compare the quantitative distribution across these groups.
Density curves, box plots, and violin plots help compare distributions, highlighting differences in spread, central tendency, and outliers.
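The same comparison can be made numerically before plotting. This is a minimal sketch, assuming the data arrives as (group label, value) pairs; the labels, values, and the helper name `summarize_by_group` are made up for illustration. It reports the median (central tendency) and interquartile range (spread) for each group.

```python
from collections import defaultdict
from statistics import median, quantiles

def summarize_by_group(pairs):
    """Group values by label and report median and IQR per group."""
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    summary = {}
    for label, values in groups.items():
        q1, _, q3 = quantiles(values, n=4)
        summary[label] = {"median": median(values), "iqr": q3 - q1}
    return summary

# Hypothetical measurements for two categories.
measurements = [
    ("small", 10), ("small", 12), ("small", 11), ("small", 13),
    ("large", 30), ("large", 45), ("large", 38), ("large", 52),
]
summary = summarize_by_group(measurements)
```

Here the "large" group has both a higher median and a wider spread than the "small" group, which is the kind of difference side-by-side box plots would make visible.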

Question 21

Q

Two qualitative features in a relationship

Answer

A

Compare the distribution of one feature across subgroups defined by another feature, focusing on proportions.
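One way to sketch this comparison is a table of within-group proportions. The example below assumes the data is a list of (first feature, second feature) pairs; the category names and the helper name `row_proportions` are hypothetical.

```python
from collections import Counter, defaultdict

def row_proportions(pairs):
    """For each value of the first feature, return the share of each
    value of the second feature within that subgroup."""
    counts = defaultdict(Counter)
    for first, second in pairs:
        counts[first][second] += 1
    return {
        first: {second: n / sum(c.values()) for second, n in c.items()}
        for first, c in counts.items()
    }

# Hypothetical observations: day type and travel mode.
observations = [
    ("weekday", "car"), ("weekday", "car"), ("weekday", "bike"), ("weekday", "car"),
    ("weekend", "bike"), ("weekend", "bike"), ("weekend", "car"), ("weekend", "bike"),
]
proportions = row_proportions(observations)
```

Comparing proportions rather than raw counts keeps subgroups of different sizes on the same footing.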

Question 22

Q

Scatter matrix

Answer

A

Shows a scatter plot for every pair of variables, giving a quick view of how each variable relates to the others. It can also help reveal nonlinear relationships.
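A scatter matrix is normally drawn with a plotting library. As a numeric companion, the sketch below computes the Pearson correlation for every pair of columns using only the standard library. The column names and values are invented, and `pearson` is a hand-written helper. Pearson correlation measures only linear association, so a clear nonlinear pattern can still score near zero, which is one reason to look at the plots as well.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical columns: one linear and one quadratic function of x.
columns = {
    "x": [-2, -1, 0, 1, 2],
    "linear": [-4, -2, 0, 2, 4],
    "squared": [4, 1, 0, 1, 4],
}
pairwise = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}
```

In this toy data, `x` and `linear` are perfectly correlated, while `x` and `squared` have zero correlation even though one is an exact function of the other.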

(22 cards)