VA Session 6 - Flashcards
1
Q
Variable types
A
- Numerical = numbers; values are ordered & the magnitude of differences is known
- Continuous = infinitely many possible values within a range, e.g. temperature, age
- Discrete = countable values, e.g. shoe size, nr of children
- Categorical = words / labels (also discrete)
- Nominal = no order / hierarchy, e.g. gender (M, F), hair color
- Ordinal = order / hierarchy, but magnitude btw successive values not known, e.g. clothing size (XS, S, M)
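A minimal sketch of how the nominal / ordinal distinction can be expressed in pandas. The example values are made up for illustration; only the category names (XS, S, M, hair color) come from the card.

```python
import pandas as pd

# Nominal: categories without any order.
hair = pd.Series(["brown", "blond", "black", "brown"], dtype="category")

# Ordinal: categories with an explicit order, but no known distance between them.
size_type = pd.CategoricalDtype(categories=["XS", "S", "M"], ordered=True)
sizes = pd.Series(["M", "XS", "S", "M"], dtype=size_type)

print(hair.cat.ordered)          # False
print(sizes.cat.ordered)         # True
print(sizes.min(), sizes.max())  # XS M

# Order comparisons are only meaningful for the ordinal variable.
larger_than_xs = (sizes > "XS").tolist()
print(larger_than_xs)            # [True, False, True, True]
```

Arithmetic such as a mean is deliberately not defined for either series, which matches the idea that the magnitude between successive values is unknown.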
2
Q
Formats to store data
A
- Wide = each column represents a different variable
- Long = one column holds the variable names & another holds the corresponding values
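One way to move between the two formats is pandas `melt` (wide to long) and `pivot` (long to wide). The table below is hypothetical and only serves to show the reshaping.

```python
import pandas as pd

# Wide: each column is a different variable.
wide = pd.DataFrame({
    "city": ["A", "B"],
    "temperature": [21.5, 18.0],
    "humidity": [40, 65],
})

# Long: one column names the variable, another holds its value.
long = wide.melt(id_vars="city", var_name="variable", value_name="value")
print(long)

# And back to wide again.
back = long.pivot(index="city", columns="variable", values="value").reset_index()
print(back)
```

The long table has one row per (city, variable) pair, so two cities with two variables give four rows.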
3
Q
Exploratory Data Analysis
A
- Summary statistics
- Visual exploration (important: e.g. the dragon data, where datasets with the same summary statistics look very different when plotted)
4
Q
Summary Statistics
A
- Quick overview & formulate new hypotheses
- Calculating statistics: mean, median, mode, variance, standard deviation, range
- Check datatypes
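The statistics listed above can be computed with pandas roughly as follows. The small table is invented for illustration; note that pandas uses the sample variance (ddof=1) by default.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 35, 41, 58],
    "city": ["A", "B", "B", "C", "A"],
})

age = df["age"]
stats = {
    "mean": age.mean(),               # 38.4
    "median": age.median(),           # 35.0
    "mode": age.mode().tolist(),      # [35]
    "variance": age.var(),            # sample variance (ddof=1)
    "std": age.std(),                 # sample standard deviation
    "range": age.max() - age.min(),   # 35
}
print(stats)

# Check datatypes, then get a quick overview of the numeric columns.
print(df.dtypes)
print(df.describe())
```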
5
Q
Matplotlib
A
- “standard” plotting library for Python
- Pro: visualizations are extremely customizable
- Con: syntax is a little complicated (many lines of code)
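A small sketch of what "customizable but many lines" looks like in practice. The data and labels are placeholders; the `Agg` backend is chosen only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots(figsize=(5, 3))
# Almost every visual property can be set explicitly.
ax.plot(x, y, color="tab:blue", linestyle="--", marker="o", label="y = x^2")
ax.set_title("Hypothetical example")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
ax.grid(True, alpha=0.3)
fig.tight_layout()
# fig.savefig("plot.png")  # uncomment to write the figure to disk
```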
6
Q
Seaborn
A
- higher-level library based on Matplotlib (prettier plots with fewer lines)
- Works with Pandas DataFrames, NumPy arrays & standard Python structures (e.g. lists & dictionaries)
- can access columns of DataFrame just by column name
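A short sketch of the points above: with a DataFrame, columns are referred to by name, and plain Python lists can be passed directly. The column names and values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "height_cm": [160, 172, 181, 168, 190],
    "weight_kg": [55, 70, 82, 63, 95],
    "group": ["a", "b", "b", "a", "b"],
})

# DataFrame input: columns are accessed just by their names.
fig, ax = plt.subplots()
sns.scatterplot(data=df, x="height_cm", y="weight_kg", hue="group", ax=ax)

# Plain Python lists work as well.
fig2, ax2 = plt.subplots()
sns.scatterplot(x=[1, 2, 3], y=[2, 4, 6], ax=ax2)
```

Axis labels and the legend are filled in from the column names, which is part of why fewer lines are needed than with Matplotlib alone.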
7
Q
Scatter plots
A
- for inspecting relationship btw two variables (bivariate data), reveal e.g. clusters or correlations between variables
- e.g. Ice Cream Consumption & temperature of day
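The ice cream example could be sketched like this. The numbers are invented, not measured; the Pearson correlation coefficient is added as one possible way to quantify what the scatter plot shows.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical daily observations.
temperature_c = np.array([14, 17, 20, 23, 26, 29, 32])
ice_cream_sold = np.array([120, 150, 210, 260, 330, 390, 450])

# Pearson correlation between the two variables.
r = np.corrcoef(temperature_c, ice_cream_sold)[0, 1]

fig, ax = plt.subplots()
ax.scatter(temperature_c, ice_cream_sold)
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Ice cream sold")
ax.set_title(f"r = {r:.2f}")
```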
8
Q
Histograms
A
- understand distribution (frequency) of variables
- continuous (not finite) variable & quantitative data (frequency)
- Con: bin size (how many intervals the data is divided into) can significantly affect interpretation
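To illustrate the bin-size problem, here is a sketch with a made-up sample: with 2 bins the data looks like one broad block, while 4 bins reveal an empty gap in the middle.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 7, 8, 8, 9])

# Same data, different number of bins.
coarse_counts, coarse_edges = np.histogram(values, bins=2)
fine_counts, fine_edges = np.histogram(values, bins=4)
print(coarse_counts)  # [6 4]
print(fine_counts)    # [3 3 0 4]

# The same comparison as plots, side by side.
fig, (ax_coarse, ax_fine) = plt.subplots(1, 2, figsize=(8, 3))
ax_coarse.hist(values, bins=2)
ax_coarse.set_title("2 bins")
ax_fine.hist(values, bins=4)
ax_fine.set_title("4 bins")
```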
9
Q
Density Plots
A
- understand distribution of variables
- continuous (not finite) variable & density
- Pro: no bin-size issue
- Con: more difficult to interpret
10
Q
Line Charts
A
- good to visualize time series data, easy to see trends
- If “jumps”: look closer, could be a special occasion or a measurement mistake
- plot 95% confidence interval & mean
11
Q
Bar Chart
A
- discrete variable (countable, finite) & categorical data
- e.g. countries & body_mass_index
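A sketch of the countries / body_mass_index example with Matplotlib. The country labels and values are placeholders, not real statistics.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Hypothetical categories and values.
countries = ["Country A", "Country B", "Country C"]
body_mass_index = [24.1, 26.3, 22.8]

fig, ax = plt.subplots()
ax.bar(countries, body_mass_index)  # one bar per category
ax.set_xlabel("country")
ax.set_ylabel("body_mass_index")
```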
12
Q
Boxplot
A
- comparing distributions in categorical data
- points beyond the whiskers = outliers
- whiskers = max, min (excluding outliers)
- box = middle 50% of data points (25th to 75th percentile)
- line inside the box = median
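The parts of a boxplot can be computed by hand as sketched below. The sample is invented, and the 1.5 × IQR rule for flagging outliers is an assumption here: it is Matplotlib's default whisker setting, not something stated on the card.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Box: 25th to 75th percentile, with the median as the line inside.
q1, median, q3 = np.percentile(values, [25, 50, 75])  # 3.25, 5.5, 7.75
iqr = q3 - q1                                         # 4.5

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr                          # -3.5
upper_fence = q3 + 1.5 * iqr                          # 14.5
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)  # [30]

fig, ax = plt.subplots()
result = ax.boxplot(values)  # draws box, median line, whiskers, and outlier points
```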
13
Q
Heatmaps
A
- see relationship between 2 variables & their different values
- color = magnitude of some measurement
- e.g. number of flights by month & year, smaller numbers = darker
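A sketch of the flights example: long-format data is first pivoted into a month × year table, then drawn with seaborn's `heatmap`. The counts are made up. With seaborn's default colormap, lower values should appear darker; pass `cmap` to change that.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical number of flights per month and year (long format).
flights = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "flights": [120, 110, 150, 130, 90, 40],
})

# Rows = months, columns = years, cell = number of flights.
table = flights.pivot(index="month", columns="year", values="flights")
table = table.loc[["Jan", "Feb", "Mar"]]  # restore calendar order

fig, ax = plt.subplots()
sns.heatmap(table, annot=True, fmt="d", ax=ax)  # color encodes the magnitude
```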
14
Q
Cleaning & preprocessing data - Common issues
A
- Missing values
- Faulty data
- Wrong data type
- Duplicates (found during exploration)
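The four issues above can be spotted with a few pandas calls, sketched here on an invented table. Which values count as "faulty" depends on the data; a negative age is used as an assumed example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara"],
    "age": ["34", "41", "41", "-5"],          # numbers stored as text; -5 is faulty
    "height_cm": [170.0, 182.0, 182.0, np.nan],
})

# Missing values per column.
missing_per_column = df.isna().sum()

# Fully duplicated rows (the second "Bob" row).
duplicate_rows = int(df.duplicated().sum())

# Wrong data type: "age" is not numeric.
age_is_numeric = pd.api.types.is_numeric_dtype(df["age"])

# Faulty data: impossible values, checked after converting to numbers.
faulty_age = (pd.to_numeric(df["age"]) < 0).tolist()

print(missing_per_column, duplicate_rows, age_is_numeric, faulty_age)
```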
15
Q
Data Cleaning
A
- Structuring data
- identifying & handling missing values (e.g. remove rows or columns)
- Drop unnecessary variables (columns)
- checking for outliers
- Convert text fields to numeric
- Scaling or normalizing data
- Creating new variables
- Renaming variables (remove capitalization, spaces)
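Several of the steps above, sketched as one small pandas pipeline. The table, the column names, and the choice of min-max scaling are assumptions made for illustration; other ways of handling missing values or scaling are equally possible.

```python
import pandas as pd

raw = pd.DataFrame({
    "Person Name": ["Ann", "Bob", "Bob", "Cara", "Dan"],
    "Height CM": ["170", "182", "182", "n/a", "190"],
    "Weight KG": [60.0, 80.0, 80.0, 55.0, 95.0],
    "Notes": ["", "", "", "", ""],
})

# Rename variables: remove capitalization and spaces.
clean = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# Drop unnecessary variables and duplicate rows.
clean = clean.drop(columns=["notes"]).drop_duplicates()

# Convert text to numeric; unparseable entries become NaN.
clean["height_cm"] = pd.to_numeric(clean["height_cm"], errors="coerce")

# Handle missing values by removing the affected rows.
clean = clean.dropna(subset=["height_cm"]).reset_index(drop=True)

# Create a new variable.
clean["bmi"] = clean["weight_kg"] / (clean["height_cm"] / 100) ** 2

# Scale a variable to the range 0..1 (min-max scaling).
w = clean["weight_kg"]
clean["weight_scaled"] = (w - w.min()) / (w.max() - w.min())

print(clean)
```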