Week 5: Data Exploring + Pre-Processing Flashcards
Mean
Average of all numbers
Median
Middle number in a sorted sequence
Mode
Number that occurs most often within a set
Range
Difference between highest and lowest values
Standard Deviation is a measure used to
quantify the amount of variation of data values
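As a quick check on the five summary statistics above, here is a minimal sketch using Python's standard `statistics` module. The sample list `values` is made up for illustration, and the population form of standard deviation (`pstdev`) is an assumption, since the card does not say whether population or sample is meant.

```python
import statistics

# Hypothetical sample data, chosen only for illustration
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)            # average of all numbers
median = statistics.median(values)        # middle number once sorted
mode = statistics.mode(values)            # most frequent number
value_range = max(values) - min(values)   # highest minus lowest
std_dev = statistics.pstdev(values)       # population standard deviation (assumed)

print(mean, median, mode, value_range, round(std_dev, 2))
```

Use `statistics.stdev` instead of `pstdev` if the data is a sample rather than the whole population.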
Histogram (2 points)
- is similar to
- gives a rough sense of
- a bar chart but groups numbers into ranges (bins)
- density
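To see what "groups numbers into ranges (bins)" means in practice, the sketch below bins a small list and prints a text histogram. The data and the bin width of 5 are assumptions made for the example.

```python
from collections import Counter

# Hypothetical data and bin width, chosen only for illustration
values = [1, 3, 4, 7, 8, 8, 12, 15, 18, 19]
bin_width = 5

# Map each value to the lower edge of its bin: 0-4, 5-9, 10-14, 15-19
counts = Counter((v // bin_width) * bin_width for v in values)

# One row per bin; the length of the bar shows how dense that range is
for lower in sorted(counts):
    upper = lower + bin_width - 1
    print(f"{lower:>2}-{upper:<2} {'#' * counts[lower]}")
```

A narrower bin width shows more detail but a noisier shape; a wider one smooths the shape but hides detail.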
Name the distribution
normal
Name the distribution
right skewed (where tail goes)
Name type of distribution
Multimodal
Draw
- positive linear association
- negative linear association
- non-linear association
- no association
Scatter plots show…
how one variable changes in relation to another
Correlations show
how strongly pairs of variables are related
What is the measure of correlation?
correlation coefficient r
1 is perfect positive correlation
0 is no correlation
-1 is perfect negative correlation
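The three landmark values of r can be reproduced with a short sketch of the Pearson correlation coefficient. The helper name `pearson_r` and the sample lists are hypothetical, chosen so that each case lands exactly on 1, -1, or 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of r)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (denominator of r)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
r_pos = pearson_r(xs, [2, 4, 6, 8, 10])   # y rises with x
r_neg = pearson_r(xs, [10, 8, 6, 4, 2])   # y falls as x rises
r_none = pearson_r(xs, [1, 3, 2, 3, 1])   # no linear trend

print(r_pos, r_neg, r_none)
```

Note that r only measures linear association, so a strong curved (non-linear) pattern can still give r close to 0.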
An outlier is
an observation that lies an abnormal distance from other values in a random sample
How do you identify outliers? (BPRD)
- box plot
- probability plot
- Rosner's test
- Dixon's test
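Of the four methods, the box plot is the easiest to sketch in code. The example below assumes the common 1.5 × IQR whisker rule, which the card does not state, and uses made-up data. `statistics.quantiles` needs Python 3.8 or later.

```python
import statistics

# Hypothetical measurements with one suspicious value
values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

# Quartiles as a box plot would draw them
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the box is an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(outliers)
```

Rosner's and Dixon's tests are formal hypothesis tests and are not covered by this sketch.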
Handling missing values on a small scale (< 5%)
Drop or omit
Methods for handling missing values on a larger scale (MKFS)
- Mean
- K-nearest neighbors
- fuzzy K-means
- singular value decomposition
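The simplest of these methods, mean imputation, can be sketched as follows. The `ages` list is hypothetical, with `None` standing in for a missing value; the other three methods need more machinery and are not shown.

```python
import statistics

# Hypothetical column with missing entries marked as None
ages = [23, None, 31, 27, None, 39]

# Option for small-scale gaps: drop the missing rows
observed = [a for a in ages if a is not None]

# Option for larger-scale gaps: fill each hole with the mean of observed values
fill_value = statistics.mean(observed)
imputed = [fill_value if a is None else a for a in ages]

print(observed)
print(imputed)
```

Mean imputation keeps the row count intact but shrinks the column's variance, which is one reason the other methods in the list exist.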
3 types of invalid data
missing data values
invalid values that suggest true values
invalid values that provide no information regarding true values
What is scaling?
scaling features to lie between a given minimum and maximum value
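A minimal sketch of this min-max scaling, assuming a target range of 0 to 1 by default. The function name `min_max_scale` and the sample heights are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they lie between new_min and new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # A constant feature has no spread; map everything to new_min
        return [new_min for _ in values]
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

# Hypothetical feature column
heights_cm = [150, 160, 170, 180, 200]
scaled = min_max_scale(heights_cm)
print(scaled)
```

The smallest value always maps to the new minimum and the largest to the new maximum, so a single extreme outlier can squeeze every other value into a narrow band.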
Transformation is…
converting data from one format or structure into another format or structure
Feature selection is…
the process of selecting a subset of relevant features for use in model construction
4 Reasons for using feature selection (REIR)
reduces the complexity of a model
enables the machine learning algorithm to train faster
improves the accuracy if the right subset is chosen
reduces overfitting
5 methods for dimensionality reduction
- Decision Tree
- Random forest
- high correlation
- factor analysis
- principal component analysis
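As one illustration from this list, a "high correlation" filter can be sketched as: drop any feature that is almost a copy of one already kept. The threshold of 0.9, the helper names, and the feature data are all assumptions made for the example, not values from the course.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs)
    sy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sx * sy)

# Hypothetical dataset: height appears twice, in centimetres and in inches
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "exam_score": [70, 85, 60, 90, 75],
}

threshold = 0.9  # assumed cut-off for "high" correlation
kept = []
for name, column in features.items():
    # Keep a feature only if it is not highly correlated with any kept one
    if all(abs(pearson_r(column, features[k])) < threshold for k in kept):
        kept.append(name)

print(kept)
```

Here `height_in` is dropped because it carries almost the same information as `height_cm`, leaving fewer dimensions with little loss.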
Dimensionality reduction…
creates new combinations of attributes