Data Analysis Flashcards

1
Q

Exploratory Data Analysis

A

The process of examining a dataset to find insights and patterns.

2
Q

Data Cleaning

A

EDA helps in identifying errors, missing values, and outliers that need to be addressed before deeper analysis.

3
Q

Insight Generation

A

EDA provides a first look at the data, allowing analysts to develop insights and hypotheses for testing.

4
Q

Model Preparation

A

By understanding relationships and patterns, EDA informs feature selection and model building strategies.

5
Q

Assumption Validation

A

Many statistical models and machine learning algorithms require assumptions about data distribution or variable relationships. EDA validates these assumptions upfront.

6
Q

Communication

A

EDA techniques generate visualizations and summaries that communicate findings effectively to stakeholders.

7
Q

Common Data formats

A

CSV/TSV: tabular data, easy to work with.
JSON: nested dictionary format, ideal for hierarchical data.
XML/HTML: nested data, commonly obtained through web scraping.
Log data: unstructured text, parsed with regular expressions.
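
As a rough illustration of how each format is typically read, here is a minimal sketch using only Python's standard library. The sample strings, field names, and the log pattern are all invented for this example and are not taken from any real dataset.

```python
import csv
import io
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for illustration only.
csv_text = "id,city\n1,Oakland\n2,Berkeley\n"
json_text = '{"user": {"name": "Ada", "tags": ["a", "b"]}}'
xml_text = "<stop><officer id='7'>Lee</officer></stop>"
log_line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200'

# CSV: each row becomes a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: nested objects map onto nested dicts and lists.
record = json.loads(json_text)
first_tag = record["user"]["tags"][0]

# XML: nested elements are navigated as a tree.
root = ET.fromstring(xml_text)
officer_id = root.find("officer").get("id")

# Log data: a regular expression pulls named fields out of free text.
pattern = re.compile(r'\[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>\S+)')
match = pattern.search(log_line)
fields = match.groupdict() if match else {}
```

Note that every CSV value comes back as a string, so numeric columns still need converting before analysis.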

8
Q

Granularity

A

The level of detail represented by each record in a dataset.
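
One way to probe granularity is to ask which field, if any, uniquely identifies a record. The sketch below assumes a hypothetical list of call records; the field names `call_id` and `date` are made up for illustration.

```python
from collections import Counter

# Hypothetical call records: each row is meant to represent one call.
calls = [
    {"call_id": 1, "date": "2023-01-01"},
    {"call_id": 2, "date": "2023-01-01"},
    {"call_id": 3, "date": "2023-01-02"},
]

def is_unique_key(records, key):
    """Return True if every record has a distinct value for `key`."""
    values = [r[key] for r in records]
    return len(values) == len(set(values))

# If call_id is unique but date is not, each record is a call, not a day.
per_call = is_unique_key(calls, "call_id")
per_day = is_unique_key(calls, "date")

# Aggregating gives a coarser granularity: one entry per day.
calls_per_day = Counter(r["date"] for r in calls)
```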

9
Q

Scope

A

The coverage of the dataset in relation to what we are interested in analyzing.

10
Q

Temporality

A

Refers to the timing aspects of the data, which are crucial for understanding when events occurred.

11
Q

Two different types of scope

A

Temporal scope: the dataset spans a limited time period.
Geographical scope: the dataset is geographically focused on certain hotspots.

12
Q

Four aspects of temporality

A

Date and time fields - in the calls and stops datasets, datetime fields mark when police interactions were reported.
Timezone awareness - it is important to consider the time zone and daylight saving time for accurate temporal analysis.
Date format - the US date format (MM/DD/YYYY) is used; knowing this is essential for correct interpretation.
Placeholder dates - watch for default timestamps that may indicate missing values.
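
The points above can be sketched with the standard `datetime` module. The raw strings, the set of placeholder dates, and the fixed UTC-8 offset are all assumptions made for this example. A fixed offset ignores daylight saving time, so real data would call for a proper time zone database.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw timestamps in US order (MM/DD/YYYY).
raw = ["03/04/2023 14:30", "01/01/1900 00:00", "12/25/2023 08:15"]

# Date format: state the format explicitly instead of letting a parser guess.
US_FORMAT = "%m/%d/%Y %H:%M"
parsed = [datetime.strptime(s, US_FORMAT) for s in raw]

# The same string read day-first lands on a different date (April 3, not March 4).
day_first = datetime.strptime(raw[0], "%d/%m/%Y %H:%M")

# Placeholder dates: these two defaults are assumed here; check your own source.
PLACEHOLDERS = {datetime(1900, 1, 1), datetime(1970, 1, 1)}
is_placeholder = [d in PLACEHOLDERS for d in parsed]

# Timezone awareness: attach the (assumed) local offset, then convert to UTC.
LOCAL = timezone(timedelta(hours=-8))  # fixed offset, does not model DST
as_utc = parsed[0].replace(tzinfo=LOCAL).astimezone(timezone.utc)
```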

13
Q

Faithfulness

A

A dataset is considered faithful if it accurately reflects reality, which is crucial for reliable analysis.

14
Q

Common faithfulness issues

A

Unrealistic values - Future dates, nonexistent locations, negative counts, or significant outliers.
Dependency violations - mismatch between related fields, like age and birthdate.
Manual data entry errors - manually entered data is prone to spelling mistakes and inconsistencies.
Data falsification indicators - repeated unusual names or email addresses, suggesting fabricated entries.
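
Some of these checks are easy to automate. The sketch below covers unrealistic values and a dependency violation between age and birthdate; the records, field names, and fixed reference date are hypothetical and chosen only to make the example reproducible.

```python
from datetime import date

TODAY = date(2024, 1, 1)  # assumed reference date, fixed for reproducibility

# Hypothetical records, invented for illustration.
records = [
    {"name": "A", "birthdate": date(1990, 5, 1), "age": 33, "visits": 4},
    {"name": "B", "birthdate": date(2030, 1, 1), "age": 20, "visits": 2},
    {"name": "C", "birthdate": date(1980, 3, 15), "age": 25, "visits": -1},
]

def age_on(birthdate, today):
    """Whole years elapsed between birthdate and today."""
    years = today.year - birthdate.year
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1  # birthday has not happened yet this year
    return years

def faithfulness_issues(rec, today=TODAY):
    """List the faithfulness problems found in a single record."""
    issues = []
    if rec["birthdate"] > today:
        issues.append("future birthdate")            # unrealistic value
    elif age_on(rec["birthdate"], today) != rec["age"]:
        issues.append("age does not match birthdate")  # dependency violation
    if rec["visits"] < 0:
        issues.append("negative count")              # unrealistic value
    return issues

report = {r["name"]: faithfulness_issues(r) for r in records}
```

Spelling inconsistencies and fabricated entries usually need more judgment, such as reviewing frequency counts of names by hand.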

15
Q

Quantitative data plots

A

Histograms, box plots, scatter plots

16
Q

Qualitative data plots

A

Bar plots, dot plots, mosaic plots

17
Q

Mixed data type plots

A

Overlaid density curves, side-by-side box plots

18
Q

Rug plots

A

These plots show individual data points as marks along an axis, like threads on a rug. They’re helpful for a small number of observations but become cluttered with larger datasets.

19
Q

Two quantitative features in a relationship

A

Use scatterplots to explore relationships, looking for linear or nonlinear patterns.
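
A small numeric sketch suggests why looking at the plot matters. The Pearson correlation coefficient measures only linear association, so a strong curved relationship can score near zero. The data and the helper `pearson_r` are written for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data: one straight-line pattern, one parabola.
xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]
curved = [x ** 2 for x in xs]

r_linear = pearson_r(xs, linear)  # close to 1: strong linear pattern
r_curved = pearson_r(xs, curved)  # close to 0, yet y depends entirely on x
```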

20
Q

One qualitative and one quantitative variable in a relationship

A

Divide data into groups based on the qualitative feature and compare the quantitative distribution across these groups.
Density curves, box plots, and violin plots help compare distributions, highlighting differences in spread, central tendency, and outliers.
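
The grouping step can be sketched numerically before any plotting. This example splits hypothetical `(group, value)` pairs by the qualitative label and compares a measure of centre and a measure of spread. The data are invented.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (qualitative label, quantitative value) pairs.
rows = [("A", 10), ("A", 12), ("A", 14), ("B", 20), ("B", 21), ("B", 40)]

# Divide the data into groups based on the qualitative feature.
groups = defaultdict(list)
for label, value in rows:
    groups[label].append(value)

# Compare the quantitative distribution across groups.
summary = {
    label: {
        "n": len(values),
        "median": median(values),            # central tendency
        "range": max(values) - min(values),  # spread
    }
    for label, values in groups.items()
}
```

Here group B has both a higher median and a much wider range, driven by the value 40, which a box plot would likely flag as an outlier.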

21
Q

Two qualitative features in a relationship

A

Compare the distribution of one feature across subgroups defined by another feature, focusing on proportions.
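
A minimal sketch of this comparison counts each combination of categories, then divides by the subgroup total so that groups of different sizes are comparable. The regions and answers are hypothetical.

```python
from collections import Counter

# Hypothetical (region, answer) pairs: two qualitative features.
pairs = [
    ("north", "yes"), ("north", "yes"), ("north", "no"),
    ("south", "no"), ("south", "no"), ("south", "yes"), ("south", "no"),
]

counts = Counter(pairs)                         # count of each combination
totals = Counter(region for region, _ in pairs) # size of each subgroup

# Proportion of each answer within its region.
proportions = {
    (region, answer): n / totals[region]
    for (region, answer), n in counts.items()
}
```

Comparing raw counts alone would be misleading here, because the south subgroup has more records than the north one.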

22
Q

Scatter matrix

A

A grid of scatter plots, one for each pair of variables, showing how the variables relate to one another. It can also help reveal nonlinear relationships.