Session 3 Flashcards
what are the two most common processes for data mining?
- CRISP-DM
- SEMMA
what is CRISP-DM? (6 steps)
- business understanding
- data understanding
- data preparation
- model building
- testing and evaluation
- deployment
what is important in the business and data understanding stages?
- nature of data may differ (different sources)
- data is securely uploaded, stored and analysed
- it is important to understand the data, to be able to see what is likely impacted by what and what each variable means (means, medians, distributions etc.)
Business: understand content and business needs and applications
Data: find patterns, trends and knowledge without complex models
what four steps are happening in the data preparation step?
- data consolidation
- data cleaning
- data transformation
- data reduction
what happens during data consolidation?
collect, select and integrate data (same formats for variables)
what happens during data cleaning?
impute missing values (take averages)
reduce noise (errors, outliers etc.)
eliminate duplicates
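A minimal sketch of two of the cleaning steps above (dropping duplicates, then imputing missing values with the average), using only the Python standard library. The records, the `price` field and both helper functions are made up for illustration. Duplicates are removed first here so that a repeated row does not skew the average.

```python
from statistics import mean

# Hypothetical records; None marks a missing price.
rows = [
    {"id": 1, "price": 100.0},
    {"id": 2, "price": None},
    {"id": 3, "price": 140.0},
    {"id": 3, "price": 140.0},  # exact duplicate of the previous row
]

def drop_duplicates(records):
    """Keep only the first occurrence of each identical record."""
    seen, unique = set(), []
    for r in records:
        fingerprint = tuple(sorted(r.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(r)
    return unique

def impute_mean(records, key):
    """Replace missing values of `key` with the mean of the observed values."""
    observed = [r[key] for r in records if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in records]

cleaned = impute_mean(drop_duplicates(rows), "price")
for r in cleaned:
    print(r)
```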
what happens during data transformation?
normalize data (one variable ranges 0-10000, others 0-4; match scales so the large-range variable does not dominate)
discretize data (data from 0-100, make 4 buckets instead -> we did this with price)
create new attributes (ratio, amenities etc.)
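A small sketch of normalizing and discretizing, assuming min-max scaling to the range 0-1 and four equal-width buckets over 0-100. The sample values and the function names `min_max` and `bucket` are hypothetical; other scaling or bucketing rules would work just as well.

```python
def min_max(values):
    """Rescale values to the range 0-1 (assumes not all values are equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def bucket(value, lo=0, hi=100, n_buckets=4):
    """Return the index (0 to n_buckets - 1) of the equal-width bucket for value."""
    width = (hi - lo) / n_buckets
    index = int((value - lo) // width)
    return min(index, n_buckets - 1)  # the top edge falls into the last bucket

incomes = [0, 2500, 10000]   # variable on a 0-10000 scale
ratings = [0, 1, 4]          # variable on a 0-4 scale
prices = [10, 25, 60, 100]   # values between 0 and 100

scaled_incomes = min_max(incomes)
scaled_ratings = min_max(ratings)
price_buckets = [bucket(p) for p in prices]

print(scaled_incomes, scaled_ratings, price_buckets)
```

After scaling, both variables sit on the same 0-1 scale, so neither dominates just because of its units.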
what happens during data reduction?
reduce dimensions of data (states: US/ regions)
reduce volume (less important as computers become more capable)
balance data (not all zeros or 1s, we did this with TVs or Bathrooms)
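One possible way to balance data is random undersampling of the larger class, sketched below. This is an assumption for illustration, not necessarily the method used in the session. The `has_tv` field and the `undersample` helper are hypothetical.

```python
import random

def undersample(records, label_key, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for r in records:
        by_class.setdefault(r[label_key], []).append(r)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    return balanced

# Hypothetical data: 8 listings with a TV, 2 without.
listings = [{"has_tv": 1}] * 8 + [{"has_tv": 0}] * 2
balanced = undersample(listings, "has_tv")

print(len(balanced), sum(r["has_tv"] for r in balanced))
```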
what are the three types of patterns that data mining is supposed to find?
- prediction
- association
- segmentation
what is prediction? (what three types can be done)
clear trends to allow prediction:
what are the two learning types?
supervised
un-supervised
what is supervised learning?
known classes -> the desired output is known and the algorithm is trained on a labelled historical data set (input data has corresponding output labels / target labels)
-> trained model should minimize differences between predicted and true outputs
what is unsupervised learning?
no known classes / labels -> algorithm tries to find groupings on its own based on patterns and structures in the data (e.g., clustering)
what is a regression?
based on historical data in a scatterplot, we estimate a line that minimizes the deviation between observation and this regression line
-> regression line can then be used as a prediction
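A minimal sketch of fitting such a line with ordinary least squares for one independent variable, then using it to predict. The data points are invented and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

amenities = [1, 2, 3, 4]       # independent variable (x)
prices = [60, 80, 100, 120]    # dependent variable (y)

slope, intercept = fit_line(amenities, prices)
predicted = intercept + slope * 5  # predicted price for 5 amenities

print(slope, intercept, predicted)
```

For this made-up data the points lie exactly on a line, so the fit gives slope 20 and intercept 40.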
what are the two variables in a regression?
dependent: left (y axis), e.g., price
independent: right (x axis), e.g., number of amenities
what is a classification?
- used to analyse historical data to automatically generate a model that can predict future behaviour
- in contrast to regression: here the output variable is categorical (nominal or ordinal), so not a continuous value like a price, but e.g. either 0 or 1
what are the four classification techniques?
- logistic regression
- decision trees (ID3)
- artificial neural networks
- support vector machines, Bayesian classifiers
what is important terminology in data science/ mining?
features/attributes
target variable/ attribute/ label
Bias=intercept
what is the confusion matrix?
tells us how good the prediction is by comparing predicted classes with actual classes (true/false positives and negatives)
what is accuracy?
% of all predictions that are correct: (true positives + true negatives) / all predictions
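A short sketch of how a confusion matrix and accuracy can be computed for a 0/1 classifier. The label lists are made up and the function names are hypothetical. Accuracy here is (true positives + true negatives) divided by all predictions.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts

def accuracy(counts):
    """Share of all predictions that were correct."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
acc = accuracy(matrix)
print(matrix, acc)
```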