Week 1 Flashcards
Data Mining
Difference between statisticians and machine learners
Statisticians tend to start by making modelling assumptions about how the data is generated. Generally
these assumptions then give a mathematical framework in which to answer specific questions.
Machine learners tend to treat the mechanism that generates the data as unknown (or
unknowable) and are happy to use any algorithmic model that gets the job done.
Key steps in data mining
- Collect data (or get given it).
- Wrangle the data into shape.
- Train models (the more the better!).
- Choose the best model.
- Use the best model for prediction.
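The steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the built-in mtcars data stands in for "given" data, the two linear models and the 24/8 train/test split are arbitrary choices, and held-out RMSE is just one possible way to pick the "best" model.

```r
# Hypothetical walk-through of the five steps on R's built-in mtcars data.
set.seed(1)

# 1. Collect data: here we are simply "given" mtcars.
cars <- mtcars

# 2. Wrangle: keep the columns we need and split into train/test sets.
cars <- cars[, c("mpg", "wt", "hp")]
train_rows <- sample(nrow(cars), size = 24)
train <- cars[train_rows, ]
test <- cars[-train_rows, ]

# 3. Train several candidate models (assumed candidates, for illustration).
models <- list(
  wt_only = lm(mpg ~ wt, data = train),
  wt_and_hp = lm(mpg ~ wt + hp, data = train)
)

# 4. Choose the best model by root mean squared error on the held-out rows.
rmse <- sapply(models, function(m) {
  sqrt(mean((test$mpg - predict(m, newdata = test))^2))
})
best <- models[[which.min(rmse)]]

# 5. Use the best model for prediction on a new, made-up observation.
new_car <- data.frame(wt = 3.0, hp = 120)
prediction <- predict(best, newdata = new_car)
```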
Data wrangling
Data wrangling consists of doing everything necessary to get datasets 'tidy' and ready for
modelling.
To choose variables (columns) by name.
select()
To choose observations (rows) by value.
filter()
To add new variables based on existing variables.
mutate()
To reduce multiple values down to a single summary.
summarise()
To change the order of rows.
arrange()
To rename a column while keeping all the other columns.
rename()
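A minimal sketch of the verbs covered so far, assuming the dplyr package is installed. The `scores` data frame and its column names are invented purely for illustration.

```r
library(dplyr)

# Hypothetical toy data, made up for this example.
scores <- data.frame(
  student = c("Ana", "Ben", "Cho", "Dev"),
  group = c("a", "a", "b", "b"),
  mark = c(72, 55, 90, 64)
)

chosen <- select(scores, student, mark)            # columns by name
passed <- filter(scores, mark >= 60)               # rows by value
flagged <- mutate(scores, above_70 = mark > 70)    # new column from an existing one
overall <- summarise(scores, mean_mark = mean(mark))  # many values down to one row
ordered <- arrange(scores, desc(mark))             # reorder rows, highest mark first
renamed <- rename(scores, score = mark)            # new_name = old_name, other columns kept
```

Each verb takes a data frame as its first argument and returns a new data frame, so `scores` itself is never modified.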
To remove grouping from a grouped data frame.
ungroup()
To add a count column to the data instead of collapsing it to a summary.
add_count()
To find the top (or bottom) few entries.
slice_max() for the top entries and slice_min() for the bottom entries
To take a random sample of rows.
slice_sample()
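A short sketch of the grouping, counting and slicing helpers, assuming the dplyr package is installed. The `scores` data frame is invented for illustration, and `group_by()` and `count()` are real dplyr functions that do not appear on the cards; they are used here only to set up the contrast.

```r
library(dplyr)

# Hypothetical toy data, made up for this example.
scores <- data.frame(
  student = c("Ana", "Ben", "Cho", "Dev"),
  group = c("a", "a", "b", "b"),
  mark = c(72, 55, 90, 64)
)

# add_count() keeps every row and appends a column n with the group size;
# count() collapses the data to one row per group.
with_n <- add_count(scores, group)
collapsed <- count(scores, group)

# slice_max() / slice_min() pick rows with the largest / smallest values.
top_two <- slice_max(scores, mark, n = 2)
bottom_one <- slice_min(scores, mark, n = 1)

# slice_sample() picks rows at random.
set.seed(1)
random_two <- slice_sample(scores, n = 2)

# On grouped data the slice functions work within each group;
# ungroup() then drops the grouping again.
by_group <- group_by(scores, group)
top_per_group <- ungroup(slice_max(by_group, mark, n = 1))
```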