VA Session 6 - Flashcards

1
Q

Variable types

A
  • Numerical = numbers; values are ordered & the magnitude between them is known
  • Continuous = infinitely many possible values, e.g. temperature, age
  • Discrete = finite/countable set of values, e.g. shoe size, number of children
  • Categorical = words/labels (also discrete)
  • Nominal = no order / hierarchy, e.g. gender (M, F), hair color
  • Ordinal = order / hierarchy, but magnitude between successive values not known, e.g. clothing size (XS, S, M)
2
Q

Formats to store data

A
  • Wide = each column represents a different variable
  • Long = one column names the variable & another column gives the value for that variable
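A minimal pandas sketch (hypothetical data and column names) of reshaping wide to long with DataFrame.melt:

    import pandas as pd

    # wide: one column per measured variable
    df_wide = pd.DataFrame({"id": [1, 2], "height": [170, 165], "weight": [70, 60]})

    # long: one column names the variable, another holds its value
    df_long = df_wide.melt(id_vars="id", var_name="variable", value_name="value")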
3
Q

Exploratory Data Analysis

A
  • Summary statistics
  • Visual exploration (important, e.g. the dragon data example, where datasets with the same summary statistics look completely different when visualized)
4
Q

Summary Statistics

A
  • Quick overview & helps formulate new hypotheses
  • Calculate statistics: mean, median, mode, variance, standard deviation, range
  • Check data types
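A quick pandas sketch (hypothetical data) for getting summary statistics and checking data types:

    import pandas as pd

    df = pd.DataFrame({"age": [23, 35, 41, 29], "city": ["A", "B", "A", "C"]})

    print(df.dtypes)      # check data types
    print(df.describe())  # mean, std, min, quartiles, max for numeric columns
    print(df["age"].median(), df["age"].mode()[0], df["age"].var())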
5
Q

Matplotlib

A
  • “standard” plotting library for Python
  • Pro: visualizations are extremely customizable
  • Con: syntax is a little complicated (many lines of code)
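A minimal Matplotlib sketch (made-up data) showing the explicit, line-by-line style:

    import matplotlib.pyplot as plt

    x = [1, 2, 3, 4]
    y = [10, 12, 9, 15]

    fig, ax = plt.subplots()
    ax.plot(x, y, marker="o")   # basic line plot
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title("Minimal Matplotlib example")
    plt.show()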
6
Q

Seaborn

A
  • Higher-level library built on Matplotlib (-> prettier plots with fewer lines of code)
  • Works with Pandas DataFrames, NumPy arrays & standard Python structures (e.g. lists & dictionaries)
  • Can access columns of a DataFrame just by column name
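A short Seaborn sketch using the built-in tips example dataset; columns are referenced simply by name:

    import seaborn as sns

    tips = sns.load_dataset("tips")  # an example DataFrame shipped with Seaborn

    # one line: columns are picked out by their names
    sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")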
7
Q

Scatter plots

A
  • For inspecting the relationship between two variables (bivariate data); can reveal e.g. clusters or correlations between variables
  • e.g. ice cream consumption & temperature of the day
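A sketch of the ice cream example with a hypothetical DataFrame (column names are made up):

    import pandas as pd
    import seaborn as sns

    df = pd.DataFrame({"temperature": [18, 22, 25, 30, 33],
                       "ice_cream_sales": [12, 20, 27, 41, 50]})

    sns.scatterplot(data=df, x="temperature", y="ice_cream_sales")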
8
Q

Histograms

A
  • Understand the distribution (frequency) of a variable
  • Continuous (not finite) variable & quantitative data (frequency)
  • Con: bin size (how finely the data are divided) can significantly affect interpretation
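A Seaborn sketch (built-in tips dataset) where changing the number of bins changes the picture:

    import seaborn as sns

    tips = sns.load_dataset("tips")

    sns.histplot(data=tips, x="total_bill", bins=10)  # coarse bins
    sns.histplot(data=tips, x="total_bill", bins=50)  # fine bins, can look quite different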
9
Q

Density Plots

A
  • Understand the distribution of a variable
  • Continuous (not finite) variable & density
    • No bin-size issue
    • Can be more difficult to interpret
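A one-line Seaborn sketch (built-in tips dataset):

    import seaborn as sns

    tips = sns.load_dataset("tips")
    sns.kdeplot(data=tips, x="total_bill")  # smooth density estimate, no bins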
10
Q

Line Charts

A
  • Good for visualizing time series data; easy to see trends
  • If there are “jumps”: look closer, could be a special occasion or a measurement mistake
  • Plot the mean & 95% confidence interval
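A Seaborn sketch with the built-in fmri dataset; by default sns.lineplot draws the mean with a 95% confidence interval when there are several observations per x value:

    import seaborn as sns

    fmri = sns.load_dataset("fmri")
    sns.lineplot(data=fmri, x="timepoint", y="signal")  # mean line + 95% CI band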
11
Q

Bar Chart

A

  • Discrete (countable, finite) variable & categorical data
  • e.g. countries & body_mass_index
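A sketch with a hypothetical DataFrame (the country and BMI values are made up):

    import pandas as pd
    import seaborn as sns

    df = pd.DataFrame({"country": ["A", "B", "C"],
                       "body_mass_index": [24.1, 26.3, 22.8]})

    sns.barplot(data=df, x="country", y="body_mass_index")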

12
Q

Boxplot

A
  • Comparing distributions across categorical data
  • Shows outliers
  • Max, min
  • Box = middle 50% of data points (25th & 75th percentiles)
  • Middle line = median
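A Seaborn sketch (built-in tips dataset) comparing a numeric distribution across categories:

    import seaborn as sns

    tips = sns.load_dataset("tips")
    sns.boxplot(data=tips, x="day", y="total_bill")  # one box per category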
13
Q

Heatmaps

A
  • See the relationship between 2 variables & their different values
  • Color = magnitude of some measurement
  • e.g. number of flights by month & year; smaller numbers = darker
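A sketch of the flights example using Seaborn's built-in dataset (the values column holds passenger counts):

    import seaborn as sns

    flights = sns.load_dataset("flights")
    table = flights.pivot(index="month", columns="year", values="passengers")
    sns.heatmap(table)  # color encodes the magnitude in each month/year cell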
14
Q

Cleaning & preprocessing data - Common issues

A
  • Missing values
  • Faulty data
  • Wrong data type
  • Duplicates (found during exploration)
15
Q

Data Cleaning

A
  • Structuring data
  • Identifying & handling missing values (remove rows or columns)
  • Dropping unnecessary variables (columns)
  • Checking for outliers
  • Converting text fields -> numeric
  • Scaling or normalizing data
  • Creating new variables
  • Renaming variables (remove capitalization, spaces)
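A small pandas sketch (hypothetical columns) of a few of these steps:

    import pandas as pd

    df = pd.DataFrame({"Body Mass": ["70", "60"], "Notes": ["a", "b"]})

    df = df.rename(columns=lambda c: c.lower().replace(" ", "_"))  # remove capitalization & spaces
    df["body_mass"] = df["body_mass"].astype(float)                # text -> numeric
    df = df.drop(columns=["notes"])                                # drop unnecessary variables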
16
Q

Removing values vs imputing

A

Depends on the situation

17
Q

Removing values

A
  • Less risky
  • But: could make the dataset too small or imbalanced
  • Prefer removing columns with lots of missing values rather than rows
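A pandas sketch (made-up data) of dropping columns with many missing values before dropping rows:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, np.nan, 4], "b": [np.nan, np.nan, np.nan, 1]})

    df = df.dropna(axis=1, thresh=len(df) // 2)  # keep only columns with enough non-missing values
    df = df.dropna(axis=0)                       # then drop the remaining incomplete rows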
18
Q

Imputing

A
  • Replacing missing data with substituted values
  • Use the non-missing values of the same feature (column) to derive the substitute, e.g. mean, mode, median
  • Univariate imputation: use only one column
  • Multivariate imputation: use the (entire) set of features (columns)
  • Nearest-neighbor imputation: the missing value is replaced by a value obtained from related cases in the whole set of records
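A scikit-learn sketch (toy array) of univariate and nearest-neighbor imputation:

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

    X_mean = SimpleImputer(strategy="mean").fit_transform(X)  # univariate: fill with the column mean
    X_knn = KNNImputer(n_neighbors=1).fit_transform(X)        # fill with values from the most similar row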
19
Q

Problem & Methods for scale of data

A
  • Some statistical methods & machine learning algorithms are sensitive to scale or assume normality, e.g. one variable ranges from 0-1, another from 100-10,000 (plots could look very strange)
  • Normalization
  • Standardization
  • Log transformation
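A small sketch (made-up values) of a log transformation; normalization and standardization are shown on the cards below:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"income": [1_000, 5_000, 100_000, 1_000_000]})
    df["log_income"] = np.log(df["income"])  # compresses a wide-ranging, skewed variable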
20
Q

Creating new variables

A

Sometimes good to combine features into one feature by creating a ratio of two,
e.g. combining weight & height into BMI
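A one-line pandas sketch (hypothetical columns):

    import pandas as pd

    df = pd.DataFrame({"weight_kg": [70, 60], "height_m": [1.80, 1.65]})
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2  # two features combined into one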

21
Q

Normalization

A
  • Rescales values to the same scale, from 0 to 1
  • sklearn.preprocessing.MinMaxScaler()
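A minimal scikit-learn sketch (toy array):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[1.0], [5.0], [10.0]])
    X_scaled = MinMaxScaler().fit_transform(X)  # all values now lie between 0 and 1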
22
Q

Standardization

A
  • Data points expressed as standard deviations from the mean
  • sklearn.preprocessing.StandardScaler()
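A minimal scikit-learn sketch (toy array):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0], [5.0], [10.0]])
    X_std = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1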