Data Mining - Chapter 2 Flashcards
What is classification?
Examining data and deciding which class or category each record falls into.
--> Trying to predict a class.
What is prediction?
Trying to predict the value of a numerical variable.
--> The term "prediction" is also used more loosely for both continuous and categorical data.
What are association rules?
Rules designed to find general association patterns between items in a large database. The rules apply to an entire population rather than to an individual.
What is collaborative filtering?
Making rules for an individual user, as opposed to the general public, based on that user's history as well as the history of others.
What is data reduction?
The process of consolidating a large number of records into a smaller set.
--> You do this because data mining algorithms often perform better when a large number of records is grouped into a smaller number of homogeneous groups.
--> Often done by clustering.
What is dimension reduction?
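Reducing the number of variables (instead of the number of rows). The performance of data mining algorithms is often improved when the number of variables is limited.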
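To make the difference between the two cards concrete, here is a minimal pandas sketch. The column names are made up, and grouping on a known `segment` column is only a stand-in for clustering; dropping a nearly duplicate variable is just one simple way to reduce dimension.

```python
import pandas as pd

# Hypothetical data: height is recorded twice, in cm and in inches.
df = pd.DataFrame({
    "segment": ["a", "a", "b", "b", "b"],
    "height_cm": [170.0, 180.0, 160.0, 165.0, 175.0],
    "height_in": [66.9, 70.9, 63.0, 65.0, 68.9],
    "weight_kg": [70.0, 82.0, 55.0, 60.0, 74.0],
})

# Data reduction: fewer rows. Each group of records is consolidated
# into one record holding the group means.
reduced_rows = df.groupby("segment").mean()

# Dimension reduction: fewer columns. The two height variables carry
# almost the same information, so one of them can be dropped.
corr = df["height_cm"].corr(df["height_in"])
reduced_cols = df.drop(columns=["height_in"])

print(reduced_rows.shape, reduced_cols.shape)
```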
What is data visualization?
Data exploration through creating charts and dashboards.
What are supervised learning algorithms?
Algorithms that predict numerical values or classes and are trained, tuned and evaluated using training, validation and test data.
For the training data, the value of the outcome of interest is already known. You can therefore see how well the algorithm performs, tune it with validation data, and compare it against other algorithms.
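A minimal sketch of partitioning records into training, validation and test sets, using only the standard library. The function name `partition` and the 60/20/20 split are assumptions chosen for illustration, not something the flashcards prescribe.

```python
import random


def partition(records, train_frac=0.6, valid_frac=0.2, seed=1):
    """Shuffle the records and split them into three lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility

    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever is left over
    return train, valid, test


train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before splitting avoids a partition that simply follows the order in which the records were stored.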
What are unsupervised learning algorithms?
Algorithms used when there is no outcome variable to predict or classify.
Examples: association rules, dimension reduction methods and clustering techniques.
What are the 10 steps of data mining?
- Develop an understanding of the purpose of the data mining project.
- Obtain the dataset to be used in the analysis.
- Explore, clean, and preprocess the data.
- Reduce the data dimension, if necessary.
- Determine the data mining task.
- Partition the data.
- Choose the data mining techniques to be used.
- Use algorithms to perform the task.
- Interpret the results of the algorithms.
- Deploy the model.
What is SEMMA?
A data mining methodology developed by the software company SAS. It encompasses the 10 steps above.
- Sample: Take a sample. Partition into training/testing.
- Explore: Examine the data set statistically and graphically.
- Modify: Transform variables and fill in missing values.
- Model: Fit predictive models.
- Assess: Compare models using a validation dataset.
What is a slice of data?
A slice returns an object usually containing a portion of a sequence, such as a subset of rows and columns from a data frame.
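A short sketch of slicing on a plain Python list; the list contents are made up. The same `start:stop` notation is what pandas accepts when selecting rows and columns.

```python
values = [10, 20, 30, 40, 50]

first_three = values[0:3]   # [10, 20, 30], the stop index is excluded
every_other = values[::2]   # [10, 30, 50], a step of 2
last_two = values[-2:]      # [40, 50], negative indices count from the end

# A slice can also be stored as an object and reused.
rows = slice(1, 4)
middle = values[rows]       # [20, 30, 40]
```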
Which two techniques does pandas use to access rows in a data frame?
- loc: More general, allows accessing rows using labels.
- iloc: Less general, only allows accessing rows using integer positions.
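A small sketch of the difference, on a made-up data frame with string row labels. Note that a label slice with `loc` includes its end point, while a position slice with `iloc` excludes it.

```python
import pandas as pd

# Hypothetical data frame with row labels "a", "b", "c".
df = pd.DataFrame(
    {"price": [250, 310, 180], "rooms": [3, 4, 2]},
    index=["a", "b", "c"],
)

# loc: select by label. The slice "a":"b" includes row "b".
by_label = df.loc["a":"b", ["price"]]

# iloc: select by integer position. The slice 0:2 excludes position 2.
by_position = df.iloc[0:2, 0:1]

# Both selections return rows "a" and "b" of the "price" column.
print(by_label.equals(by_position))
```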
What is oversampling?
Weighting your sampling procedure so that the rare class is overrepresented relative to the majority class. Otherwise your model might not learn to identify records that belong to the rare class.
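One simple way to do this is to sample the rare class with replacement until it matches the majority class. The sketch below assumes pandas; the data and the column names `amount` and `fraud` are made up, and equal class sizes are just one possible target.

```python
import pandas as pd

# Hypothetical data: 8 ordinary records and 2 records of the rare class.
df = pd.DataFrame({
    "amount": [12, 15, 11, 14, 13, 16, 10, 18, 950, 870],
    "fraud": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

rare = df[df["fraud"] == 1]
common = df[df["fraud"] == 0]

# Draw rare records with replacement until both classes are the same size.
rare_upsampled = rare.sample(n=len(common), replace=True, random_state=1)

# Combine both classes and shuffle the rows.
balanced = (
    pd.concat([common, rare_upsampled])
    .sample(frac=1, random_state=1)
    .reset_index(drop=True)
)

print(balanced["fraud"].value_counts())
```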
Which types of variables are there?
- Numerical (Continuous, integer & date)
- Text
- Categorical (numerical/text)
  - Nominal
  - Ordinal
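A sketch of how these variable types might be represented in pandas. The data frame and its column names are made up; the point is only the contrast between a nominal (unordered) and an ordinal (ordered) categorical column.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [41.5, 62.0, 38.2],                   # numerical, continuous
    "children": [0, 2, 1],                          # numerical, integer
    "comment": ["ok", "late delivery", "great"],    # text
    "region": ["north", "south", "north"],          # categorical, nominal
    "shirt_size": ["small", "large", "medium"],     # categorical, ordinal
})

# Nominal: categories without a meaningful order.
df["region"] = df["region"].astype("category")

# Ordinal: categories with an explicit order, so comparisons make sense.
df["shirt_size"] = pd.Categorical(
    df["shirt_size"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(df.dtypes)
print(df["shirt_size"].max())
```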