Data Analytics Flashcards
What is data analytics?
Data analytics is the process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
In short: extracting actionable but non-obvious information from data.
What is statistics?
Statistics is about hypothesis testing. You assume a relation, propose a model, collect data to test the model, perform statistical analysis and evaluate the results.
As such, we are backing up an assumed relation with data.
What is machine learning?
Machine learning is the science of teaching machines how to learn from data, without being explicitly programmed to do so.
What is the difference between statistics and machine learning?
Statistics starts from a proposed model, whereas machine learning builds a model from data. Statistical tests typically assume a data distribution (often normal) in order to validate results. Machine learning does not always rely on the distribution characteristics of data.
Statistics has implicit validation via the significance level. Machine learning performs explicit validation by counting errors using labeled cases.
What are the advantages of statistics?
- Quantification of effects (estimations for intercept and slope).
- Implicit testing of significance (likelihood of finding a pattern by coincidence)
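As a rough illustration of "quantification of effects", the sketch below estimates an intercept and slope with ordinary least squares in plain Python. The function name `fit_line` and the data are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data that lies exactly on the line y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)  # 1.0 2.0
```

The slope quantifies the effect: in this made-up data, each extra unit of x goes with two extra units of y.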
What are the disadvantages of statistics?
- Starts from a proposed model (hypothesis) (confirmatory analysis)
- Makes assumptions on data distribution (otherwise no correct estimation of significance)
- Choice of significance level is not straightforward.
- Significance level too high: greater risk that the conclusion that the pattern exists is wrong (false positive).
- Significance level too low: greater risk that the conclusion that the pattern does not exist is wrong (false negative).
What are the advantages of machine learning?
Does not always rely on the distribution of your data. Derives a model from your data, instead of proving a model with data. Does explicit validation by counting errors.
What are the disadvantages of machine learning?
Requires labeled data to perform explicit validation. There is a risk of overfitting.
What are the essential points of statistics?
. . . . .
What is significance?
The probability that the pattern found by my model is due to chance, calculated from the distribution of your data. The data is assumed to be normally distributed; if it is not, the significance is calculated incorrectly.
Low significance means there is a high probability that your pattern arose by chance. The result is therefore not to be trusted.
High significance means there is a low probability that your pattern arose by chance. The result is therefore more trustworthy.
How do you calculate precision?
TP/(TP+FP)
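A minimal sketch of the formula in Python. The function name and the counts are made up for illustration, and returning `None` when there are no positive predictions is a choice of this sketch, since the formula would divide by zero.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # No positive predictions at all: the formula is 0/0, so undefined
        return None
    return tp / predicted_positive

# Hypothetical counts: 40 true positives and 10 false positives
print(precision(40, 10))  # 0.8
```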
What are the essential points of machine learning?
- Derive model from data.
- Explicit validation by counting errors.
- Beware of overfitting.
What is a model?
A combination of formulas that transforms input data into an output (a classification or a prediction).
How do you detect/check for overfitting?
By using a test set. If the model performs much better on the training set than on the test set, it is overfitting.
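The check can be sketched as follows, assuming a deliberately overfitted toy "model" that simply memorizes its training examples. The data, the `memorizer` function and the default answer of 0 are all hypothetical.

```python
def accuracy(predict, data):
    """Share of (x, label) pairs that predict() gets right."""
    correct = sum(1 for x, label in data if predict(x) == label)
    return correct / len(data)

# Hypothetical labeled data, split into a training set and a held-out test set
train = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (10, 1)]
test = [(4, 0), (5, 0), (6, 1), (7, 1)]

# The toy model memorizes the training examples and answers 0
# for any input it has never seen.
memory = dict(train)

def memorizer(x):
    return memory.get(x, 0)

train_acc = accuracy(memorizer, train)
test_acc = accuracy(memorizer, test)
print(train_acc, test_acc)  # 1.0 0.5
```

A perfect score on the training set combined with a much lower score on the test set is the signal to look for.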
What is meant by training set?
This is the dataset that is used to train the model.
How do you validate a model?
Using a test or validation set. You calculate the performance by counting the errors your model makes on it. From these counts you can derive useful metrics such as precision, recall and accuracy.
What is a confusion matrix?
A confusion matrix is a matrix that shows the types of errors a model makes.
It shows true positives, false positives, true negatives and false negatives. These can be used to calculate performance metrics.
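A small sketch of how the four cells could be counted for a binary classifier, assuming labels are encoded as 1 (positive) and 0 (negative). The function name and the example labels are illustrative only.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN and FN for binary labels encoded as 1 and 0."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1  # predicted positive, actually positive
        elif p == 1 and a == 0:
            fp += 1  # predicted positive, actually negative
        elif p == 0 and a == 0:
            tn += 1  # predicted negative, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Hypothetical true labels and model predictions
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```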
How do you interpret a confusion matrix?
A confusion matrix summarizes the performance of the model. It shows the correct classifications on the main diagonal and the incorrect classifications in the off-diagonal cells.
This can give us metrics such as accuracy, precision and recall.
What can you learn from a confusion matrix?
How well a model performs and what types of errors it makes.
What is accuracy? How do you interpret it? What can you learn from it?
Accuracy is the proportion of predictions a model gets right. Be cautious when using accuracy on unbalanced datasets: a simple model that always predicts the majority class will also score very well on this metric.
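The majority-class pitfall can be illustrated with a made-up unbalanced test set of 95 negatives and 5 positives, and a "model" that always predicts negative. All numbers here are assumptions for the example.

```python
# Hypothetical unbalanced test set: 95 negatives (0) and 5 positives (1)
actual = [0] * 95 + [1] * 5

# A trivial model that always predicts the majority class
predicted = [0] * 100

accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)
positives_found = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(accuracy)         # 0.95
print(positives_found)  # 0
```

The accuracy looks excellent even though the model never finds a single positive case.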
What is precision? How do you interpret it? What can you learn from it?
Precision is the share of predicted positives that are actually positive. Higher is better: the higher the number, the fewer negative cases are misclassified as positive.
What is recall? How do you interpret it? What can you learn from it?
Recall is the share of actual positive cases that the model identifies. A high recall means it is very good at identifying positive cases; a low recall means it misses many of them.
It tells you how good your model is at finding the positive cases.
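For comparison with precision, recall uses the standard formula TP/(TP+FN). A minimal sketch, with a hypothetical function name and made-up counts:

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    actual_positive = tp + fn
    if actual_positive == 0:
        # No positive cases in the data: recall is undefined
        return None
    return tp / actual_positive

# Hypothetical counts: 30 positives found, 20 positives missed
print(recall(30, 20))  # 0.6
```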
What kind of problems can you solve with machine learning?
Regression. Classification. Clustering. Association Rule Discovery.
Why is data analytics relevant for managers?
Money, money, money. Because it will help you make faster, better decisions. It will help you reduce costs. It will lead you to new products and services.
How can value be created from data analytics (multiple ways)?
Marketing: churn prediction, sentiment analysis.
Banking & Insurance: fraud detection, credit scoring.
Retail: recommender systems, shop behaviour.
Production: maintenance optimization.
Logistics: replenishment planning.
HR: CV matching.
Health: imaging, diabetes control, air quality monitoring.
Security: intelligence, smart cameras, crowd monitoring.
What is so new about data & analytics?
Nothing specifically. What’s new is that the mass availability of data and computing power at low prices has made it practical for today’s market.
What is meant by the trade-off between precision and recall?
It’s typically hard to get both good precision and good recall. Making the model stricter about saying yes tends to raise precision and lower recall; making it more lenient tends to do the opposite.
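One way to see the trade-off is to vary the decision threshold of a model that outputs a score per case. The scores, labels and thresholds below are invented for illustration, and the helper name is hypothetical.

```python
def precision_recall_at(scores, labels, threshold):
    """Predict positive when score >= threshold, then compute both metrics."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall

# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

strict = precision_recall_at(scores, labels, 0.85)   # few, confident positives
middle = precision_recall_at(scores, labels, 0.5)
lenient = precision_recall_at(scores, labels, 0.25)  # many positives

print(strict)  # (1.0, 0.5)
print(middle)  # (0.6, 0.75)
```

In this made-up data, lowering the threshold raises recall from 0.5 to 1.0 while precision falls from 1.0 to 4/7.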
What is the precision and recall if the model always says yes in a binary classification model?
In a binary classification model where the model always says yes, the recall will be perfect, and the precision will equal the share of actual positives in the data (so it is low when positives are rare).
(it will predict many false positives, but no false negatives)
What is the precision and recall if the model always says no in a binary classification model?
In a binary classification model where the model always says no, the recall will be 0 and the precision will be undefined: with no positive predictions, TP + FP = 0 and the formula becomes 0/0.
(it will always say no, so there will be no false positives, but there will be many false negatives)
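Both extreme cases can be checked on a small made-up dataset with 2 positives out of 10. In this sketch a metric whose denominator is zero is reported as `None`, which is an assumption of the example and not a universal convention.

```python
def precision_recall(actual, predicted):
    """Precision and recall for binary labels encoded as 1 and 0."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    # A zero denominator means the metric is undefined (0/0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall

# Hypothetical data: 2 positives and 8 negatives
actual = [1] * 2 + [0] * 8
always_yes = [1] * 10
always_no = [0] * 10

yes_result = precision_recall(actual, always_yes)
no_result = precision_recall(actual, always_no)

print(yes_result)  # (0.2, 1.0)
print(no_result)   # (None, 0.0)
```

For the always-yes model, precision equals the share of positives in the data (2 out of 10 here). For the always-no model, TP + FP is zero, so precision cannot be computed.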