VA Session 6 ! Flashcards

Question 1

Q

Variable types

Answer

A

Numerical = numbers, ordered set & magnitude known
Continuous = infinitely many possible values, e.g. temperature, age
Discrete = countable values, e.g. shoe size, number of children
Categorical = words (also discrete)
Nominal = no order / hierarchy, e.g. gender (M, F), hair color
Ordinal = order / hierarchy, but magnitude between successive values not known, e.g. clothing size (XS, S, M)

Question 2

Q

Formats to store data

Answer

A

Wide = each column represents a different variable
Long = one column holds the variable names & another holds the value for each variable
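
A minimal sketch of the wide-to-long conversion, using only plain Python lists and dictionaries. The records, the column names and the helper `wide_to_long` are made up for illustration; with pandas the same reshaping is usually done with `melt`.

```python
# Hypothetical wide-format records: one row per city, one column per variable.
wide_rows = [
    {"city": "A", "temperature": 21.5, "humidity": 40},
    {"city": "B", "temperature": 18.0, "humidity": 55},
]


def wide_to_long(rows, id_key):
    """Turn every non-id column into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name == id_key:
                continue  # the identifier is repeated on each long row instead
            long_rows.append({id_key: row[id_key], "variable": name, "value": value})
    return long_rows


long_rows = wide_to_long(wide_rows, "city")
for row in long_rows:
    print(row)
```

Two wide rows with two variables each become four long rows, one per (city, variable) pair.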

Question 3

Q

Exploratory Data Analysis

Answer

A

Summary statistics
Visual exploration (important: e.g. the dragon data, where datasets with the same summary statistics look very different when visualised)

Question 4

Q

Summary Statistics

Answer

A

Quick overview & formulate new hypotheses
Calculating statistics: mean, median, mode, variance, standard deviation, range
Check datatypes
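
The statistics listed above can be sketched with Python's built-in `statistics` module. The `ages` values are made up; note that `variance` and `stdev` here are the sample versions (dividing by n - 1), which is an assumption about which variant is wanted.

```python
import statistics

ages = [23, 25, 25, 31, 38, 44]  # made-up example values

summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "variance": statistics.variance(ages),  # sample variance (n - 1)
    "std": statistics.stdev(ages),          # sample standard deviation
    "range": max(ages) - min(ages),
}

# Checking datatypes: a single unexpected type (e.g. str) would show up here.
types_found = {type(value).__name__ for value in ages}

for name, value in summary.items():
    print(f"{name}: {value}")
print("types:", types_found)
```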

Question 5

Q

Matplotlib

Answer

A

“standard” plotting library for Python
Pro: visualizations are extremely customizable
Con: syntax is a little complicated (many lines of code)

Question 6

Q

Seaborn

Answer

A

higher-level library based on Matplotlib (prettier plots with fewer lines of code)
Works with pandas DataFrames, NumPy arrays & standard Python structures (e.g. lists & dictionaries)
can access columns of DataFrame just by column name

Question 7

Q

Scatter plots

Answer

A

for inspecting relationship btw two variables (bivariate data), reveal e.g. clusters or correlations between variables
e.g. Ice Cream Consumption & temperature of day

Question 8

Q

Histograms

Answer

A

understand distribution (frequency) of variables
continuous (not finite) variable & quantitative data (frequency)
Con: bin size (how finely the data is divided) can significantly affect interpretation
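
To see why bin size matters, here is a small sketch that counts the same made-up values with two different bin widths. The helper `histogram_counts` is hypothetical and only mimics what a plotting library does before drawing the bars.

```python
def histogram_counts(values, bin_width, start=0):
    """Count how many values fall into each half-open bin [lo, lo + bin_width)."""
    counts = {}
    for value in values:
        lo = start + ((value - start) // bin_width) * bin_width
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))


temps = [1, 2, 3, 11, 12, 13, 14, 21, 22, 29]  # made-up measurements

narrow = histogram_counts(temps, 10)  # {0: 3, 10: 4, 20: 3}
wide = histogram_counts(temps, 15)    # {0: 7, 15: 3}
print(narrow)
print(wide)
```

With a width of 10 the data looks like three similar groups; with a width of 15 it looks like one large group and one small one.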

Question 9

Q

Density Plots

Answer

A

understand distribution of variables
continuous (not finite) variable & density
Pro: no issue of bin size
Con: more difficult to interpret

Question 10

Q

Line Charts

Answer

A

good to visualize time series data, easy to see trends
If “jumps”: look closer, could be a special occasion or a measurement mistake
plot 95% confidence interval & mean

Question 11

Q

Bar Chart

Answer

A

- discrete variable (countable, finite) & categorical data
- e.g. countries & body_mass_index

Question 12

Q

Boxplot

Answer

A

comparing distributions in categorical data
shows outliers
shows max, min
box = middle 50% of data points (25th to 75th percentile)
line in the middle = median
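
The numbers behind a boxplot can be sketched with the standard library. The data is made up, and the 1.5 x IQR rule for flagging outliers is a common convention assumed here, not something the cards state; plotting libraries may also compute percentiles slightly differently.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # made-up data; 30 is an obvious outlier

# 25th percentile, median and 75th percentile (the box and its middle line)
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]
inliers = [v for v in values if lower_fence <= v <= upper_fence]

# Min and max of the remaining points, i.e. where the whiskers end
whisker_min, whisker_max = min(inliers), max(inliers)

print("box:", q1, median, q3)
print("whiskers:", whisker_min, whisker_max)
print("outliers:", outliers)
```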

Question 13

Q

Heatmaps

Answer

A

see relationship between 2 variables & their different values
color = magnitude of some measurement
e.g. number of flights by month & year, smaller numbers = darker

Question 14

Q

Cleaning & preprocessing data - Common issues

Answer

A

Missing values
Faulty data
Wrong data type
Duplicates (found during exploration)

Question 15

Q

Data Cleaning

Answer

A

Structuring data
identifying & handling missing values (Remove rows, columns)
Drop unnecessary variables (columns)
checking for outliers
Convert fields with text -> numeric
Scaling or normalizing data
Creating new variables
Renaming variables (remove capitalization, spaces)
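
Two of the steps above, renaming variables and converting text to numeric, could look roughly like this. The column names, the raw values and both helpers (`clean_name`, `to_number`) are made up for illustration.

```python
def clean_name(name):
    """Remove surrounding spaces and capitalization; replace inner spaces."""
    return name.strip().lower().replace(" ", "_")


def to_number(text):
    """Convert a text field to float; return None if it is not a number."""
    try:
        return float(text)
    except ValueError:
        return None


raw_columns = ["Body Mass Index", " Country", "Shoe Size "]
raw_bmi = ["22.5", "27.1", "n/a"]

columns = [clean_name(name) for name in raw_columns]
bmi = [to_number(text) for text in raw_bmi]

print(columns)
print(bmi)
```

Values that cannot be converted end up as `None`, so they can be handled later as missing values.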

Question 16

Q

Removing values vs imputing

Answer

A

Depends on situation

Question 17

Q

Removing values

Answer

A

Less risky
But: could make dataset too small or imbalanced
Remove columns with lots of missing values rather than rows
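
A small sketch of why dropping a mostly-empty column first can keep more data than dropping rows straight away. The records are made up, `None` marks a missing value, and the 50% threshold `MAX_MISSING` is an arbitrary assumption.

```python
rows = [
    {"age": 31, "income": None, "city": "A"},
    {"age": None, "income": None, "city": "B"},
    {"age": 45, "income": 52000, "city": "C"},
    {"age": 28, "income": None, "city": "D"},
]


def missing_fraction(rows, column):
    """Share of rows in which the given column is missing."""
    return sum(row[column] is None for row in rows) / len(rows)


def is_complete(row):
    return all(value is not None for value in row.values())


# Option 1: drop every row that has any missing value.
rows_only = [row for row in rows if is_complete(row)]

# Option 2: first drop columns that are mostly missing, then drop rows.
MAX_MISSING = 0.5  # hypothetical threshold
kept_columns = [c for c in rows[0] if missing_fraction(rows, c) <= MAX_MISSING]
reduced = [{c: row[c] for c in kept_columns} for row in rows]
columns_first = [row for row in reduced if is_complete(row)]

print(len(rows_only), "row(s) left when only rows are dropped")
print(len(columns_first), "row(s) left when sparse columns are dropped first")
```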

Question 18

Q

Imputing

Answer

A

replacing missing data with substituted values
use non-missing values of same feature (column) to give substitute, e.g. mean, mode, median
Univariate imputation: use one column
Multivariate imputation: use (entire) set of features (columns)
Nearest-neighbor imputation: missing value is replaced by a value obtained from related cases in the whole set of records
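
Univariate imputation can be sketched in a few lines: the substitute is computed from the non-missing values of the same column. The `ages` column and the helper `impute` are made up; multivariate and nearest-neighbor imputation need more than one column and are not shown.

```python
import statistics

ages = [22, None, 35, 29, None, 29]  # made-up column; None = missing


def impute(values, strategy=statistics.mean):
    """Replace missing entries with a statistic of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


by_mean = impute(ages)                       # fills with 28.75
by_median = impute(ages, statistics.median)  # fills with 29.0
by_mode = impute(ages, statistics.mode)      # fills with 29

print(by_mean)
print(by_median)
print(by_mode)
```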

Question 19

Q

Problem & Methods for scale of data

Answer

A

Problem: some statistical methods & machine learning algorithms are sensitive to scale or assume normality, e.g. one variable ranges over 0-1, another over 100-10,000 (plots could also look very strange)
Normalization
Standardization
Log transformation
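
A minimal sketch of the three methods on one made-up variable. "Normalization" is assumed here to mean min-max scaling to the range 0-1, and standardization uses the population standard deviation; other definitions exist.

```python
import math
import statistics

values = [100, 1000, 10000]  # made-up variable with a wide range


def min_max_normalize(xs):
    """Rescale to the range 0-1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """Rescale to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def log_transform(xs):
    """Compress large values; only defined for positive numbers."""
    return [math.log10(x) for x in xs]


normalized = min_max_normalize(values)
standardized = standardize(values)
logged = log_transform(values)

print(normalized)
print(standardized)
print(logged)
```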