Data Analytics Flashcards

Question 1

Q

What is data analytics?

Answer

A

Data analytics is the process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

Extract actionable, but non-obvious information from data.

Question 2

Q

What is statistics?

Answer

A

Statistics is about hypothesis testing. You assume a relation, propose a model, collect data to test the model, perform statistical analysis and evaluate the results.

As such, we are backing up an assumed relation with data.

Question 3

Q

What is machine learning?

Answer

A

Machine learning is the science of teaching machines how to learn from data, without being explicitly programmed to do so.

Question 4

Q

What is the difference between statistics and machine learning?

Answer

A

Statistics starts from a proposed model, whereas machine learning builds a model from data. Many classical statistical tests assume normally distributed data in order to validate results. Machine learning does not always rely on the distribution characteristics of the data.

Statistics has implicit validation via the significance level. Machine learning performs explicit validation by counting errors using labeled cases.

Question 5

Q

What are the advantages of statistics?

Answer

A

Quantification of effects (estimations for intercept and slope).
Implicit testing of significance (the likelihood of finding a pattern by coincidence).
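
As a minimal sketch of what "quantification of effects" means, the following computes the least-squares intercept and slope for a handful of made-up points. The function name `fit_line` and the data are illustrative assumptions, not part of the course material.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data that lies exactly on the line y = 1 + 2x
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)
```

The slope quantifies the effect: here, each extra unit of x goes with two extra units of y.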

Question 6

Q

What are the disadvantages of statistics?

Answer

A

Starts from a proposed model (hypothesis) (confirmatory analysis)
Makes assumptions on data distribution (otherwise no correct estimation of significance)
Choice of significance level is not straightforward.
- A significance level that is too high (too lenient) increases the risk of concluding that a pattern exists when it does not.
- A significance level that is too low (too strict) increases the risk of concluding that a pattern does not exist when it does.

Question 7

Q

What are the advantages of machine learning?

Answer

A

Does not always rely on the distribution of your data. Derives a model from your data, instead of proving a model with data. Does explicit validation by counting errors.

Question 8

Q

What are the disadvantages of machine learning?

Answer

A

Requires labeled data to perform explicit validation. There is a risk of overfitting.

Question 9

Q

What are the essential points of statistics?

Answer

A

. . . . .

Question 10

Q

What is significance?

Answer

A

Significance expresses how likely it is that the pattern found by my model is due to chance, calculated on the basis of the distribution of your data. The data must be normally distributed. If the data is not normally distributed, the significance is calculated incorrectly.

Low significance means that the probability that your pattern arose by chance is large. The result is therefore not to be trusted.

High significance means that the probability that your pattern arose by chance is small. The result is therefore more trustworthy.

Question 11

Q

How do you calculate precision?

Answer

A

TP/(TP+FP)

Question 12

Q

What are the essential points of machine learning?

Answer

A

Derive model from data.
Explicit validation by counting errors.
Beware of overfitting.

Question 13

Q

What is a model?

Answer

A

A combination of formulas that transforms input data into output (a classification or a prediction).

Question 14

Q

How do you detect/check for overfitting?

Answer

A

By using a test set.
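
A minimal sketch of this check, assuming a deliberately overfitting model (a 1-nearest-neighbour classifier that memorises its training data) and made-up data. The function names are hypothetical. A large gap between training and test performance is the warning sign.

```python
def nearest_neighbour_predict(train, x):
    """Predict the label of x by copying the label of the closest training point."""
    closest = min(train, key=lambda point: abs(point[0] - x))
    return closest[1]

def accuracy(train, data):
    """Fraction of points in data whose label the model predicts correctly."""
    correct = sum(1 for x, label in data
                  if nearest_neighbour_predict(train, x) == label)
    return correct / len(data)

# Made-up data: the true rule is "label 1 when x > 5", but the
# training labels at x=3 and x=7 are noisy.
train_set = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
test_set = [(1.5, 0), (2.6, 0), (3.4, 0), (6.6, 1), (7.4, 1), (8.5, 1)]

train_accuracy = accuracy(train_set, train_set)  # cases the model has memorised
test_accuracy = accuracy(train_set, test_set)    # cases the model has never seen
print(train_accuracy, test_accuracy)
```

The model scores perfectly on the training set because it has memorised the noise, and clearly worse on the unseen test set.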

Question 15

Q

What is meant by training set?

Answer

A

This is the dataset that is used to train the model.

Question 16

Q

How do you validate a model?

Answer

A

Using a test or validation set. You calculate the performance by counting the errors your model has made. From these counts you can derive useful metrics like precision, recall and accuracy.

Question 17

Q

What is a confusion matrix?

Answer

A

A confusion matrix is a matrix that shows the types of errors a model makes.

It shows true positives, false positives, true negatives and false negatives. These can be used to calculate performance metrics.

Question 18

Q

How do you interpret a confusion matrix?

Answer

A

A confusion matrix tells us the performance of the model. It shows the correct classifications on the main diagonal and the incorrect classifications in the off-diagonal cells.

This can give us metrics such as accuracy, precision and recall.
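
As an illustration, the sketch below counts the four cells of a binary confusion matrix and derives the three metrics from them. The labels are made up and the helper name `confusion_counts` is an assumption for this example.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN) for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Made-up labels: 4 actual positives, 6 actual negatives
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + tn + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # correct positives / predicted positives
recall = tp / (tp + fn)                     # correct positives / actual positives
print(tp, fp, tn, fn, accuracy, precision, recall)
```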

Question 19

Q

What can you learn from a confusion matrix?

Answer

A

How well a model performs and what types of errors it makes.

Question 20

Q

What is accuracy? How do you interpret it? What can you learn from it?

Answer

A

Accuracy is the proportion of predictions that the model gets right. Take care when using accuracy on unbalanced datasets: a trivial model that always predicts the majority class will also score very well on this metric.
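
A small sketch of this pitfall, using a made-up dataset with 5 positive and 95 negative cases and a model that always predicts the majority class:

```python
# Made-up unbalanced dataset: 5 positives, 95 negatives
actual = [1] * 5 + [0] * 95
# Trivial model: always predict the majority class (0)
predicted = [0] * 100

accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)
positives_found = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
print(accuracy, positives_found)
```

The accuracy looks excellent, yet the model does not find a single positive case.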

Question 21

Q

What is precision? How do you interpret it? What can you learn from it?

Answer

A

Precision is the fraction of cases predicted as positive that really are positive. The higher the number, the better: a higher precision means fewer negative cases are misclassified as positive.

Question 22

Q

What is recall? How do you interpret it? What can you learn from it?

Answer

A

Recall is the fraction of actual positive cases that the model identifies: TP/(TP+FN). A high recall means the model is very good at identifying positive cases; a low recall means it misses many of them.

It tells you how many of the positive cases your model actually finds.

Question 23

Q

What kind of problems can you solve with machine learning?

Answer

A

Regression. Classification. Clustering. Association Rule Discovery.

Question 24

Q

Why is data analytics relevant for managers?

Answer

A

Money, money, money. Because it will help you make faster, better decisions. It will help you reduce costs. It will lead you to new products and services.

Question 25

Q

How can value be created from data analytics (multiple ways)?

Answer

A

Marketing: churn prediction, sentiment analysis.
Banking & Insurance: fraud detection, credit scoring.
Retail: recommender systems, shop behaviour.
Production: maintenance optimization.
Logistics: replenishment planning.
HR: CV matching.
Health: imaging, diabetes control, air quality monitoring.
Security: intelligence, smart cameras, crowd monitoring.

Question 26

Q

What is so new about data & analytics?

Answer

A

Nothing specifically. What’s new is that the mass availability of data and computing power at low prices has made it feasible for the modern market.

Question 27

Q

What is meant by the trade-off between precision and recall?

Answer

A

It’s typically hard to get both good precision and good recall. It’s usually one or the other: the higher your precision gets, the lower your recall becomes.
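
The trade-off can be sketched by moving the decision threshold of a model that outputs scores. The scores, labels and thresholds below are made up, and `precision_recall` is a hypothetical helper for this example.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

# Made-up model scores with the true labels of the same cases
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

# A strict, a moderate and a lenient threshold
results = {t: precision_recall(scores, labels, t) for t in (0.85, 0.5, 0.1)}
for threshold, (precision, recall) in results.items():
    print(threshold, precision, recall)
```

In this example, lowering the threshold raises recall from 0.4 to 1.0 while precision drops from 1.0 to 0.625.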

Question 28

Q

What is the precision and recall if the model always says yes in a binary classification model?

Answer

A

In a binary classification model where the model always says yes, the recall will be perfect, but the precision will only equal the share of actual positives in the data, which is typically low.

(it will predict many false positives, but no false negatives)

Question 29

Q

What is the precision and recall if the model always says no in a binary classification model?

Answer

A

In a binary classification model where the model always says no, the precision is undefined (0/0, because the model makes no positive predictions at all), and the recall will be 0.

(it will always say no, so there will be no false positives, but there will be many false negatives)
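
A minimal sketch of the two constant models (always yes and always no) on made-up data with 2 positives and 8 negatives. The helper returns `None` when a metric would require dividing by zero; that convention is an assumption of this example.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall); a value is None when its denominator is zero."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall

# Made-up data: 2 positives, 8 negatives
actual = [1, 1] + [0] * 8

always_yes = precision_recall(actual, [1] * 10)
always_no = precision_recall(actual, [0] * 10)
print(always_yes, always_no)
```

Always saying yes gives perfect recall with a precision equal to the share of positives (0.2 here). Always saying no gives a recall of 0 and no defined precision.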

Question 30

Q

What is meant by creative destruction?

Answer

A

The process of industrial mutation that continuously revolutionizes the economic structure from within, incessantly destroying the old one, incessantly creating a new one.

Question 31

Q

What is disruptive technology?

Answer

A

Innovations that significantly alter the ways that consumers, industries and businesses operate. A disruptive technology sweeps away the systems or habits it replaces because it has attributes that are recognizably superior.

Question 32

Q

What is the danger of disruptive technology (from the perspective of society / from the perspective of a company)?

Answer

A

As entire industries are shaken to their core and established giants lose their upper hand, many people become unemployed. Society might not be able to adapt to disruptive technologies as quickly as they appear, so those people may have little to no chance of finding another job. People who have worked their entire lives in industries that suddenly become irrelevant might not have the opportunity to move into the new technologies.

Companies that are established in these traditional technologies and that can’t or won’t change will succumb.

Question 33

Q

What is the opportunity of disruptive technology (from the perspective of society / from the perspective of a company)?

Answer

A

We continuously improve ourselves and the world around us. Society as a whole benefits as technology improves significantly, removing the inferior qualities of prior products and services. Companies also benefit from this as they unlock more opportunities to monetize and create value from these new disruptive technologies.

Question 34

Q

What are the characteristics of disruptive technology?

Answer

A

Radical new products, services, business models
Shake the market, reset the rules
Fast growing new entrants challenging incumbents
Level the playing field

Question 35

Q

What is meant by the data analytical cycle?

Answer

A

Data analytics is a continuous improvement cycle.

Question 36

Q

What are the typical steps of the data analytical cycle, what do we mean with these steps, and what are typical activities within these steps?

Answer

A

Business Case: questions, threats, opportunities, optimizations, new products, new services.
Data Selection & Collection:
- Selection: sources, natural experiment, ..
- Collection: experiment, enterprise information system, external data/ databases, web crawling/ scraping, web services/APIs
Data Preparation: cleaning, outlier removal, missing values, wrangling, reduction, feature selection, feature extraction.
Explorative Analytics: visual exploration.
- unsupervised machine learning: clustering, association rule mining.
Predictive Modelling: supervised machine learning.
Interpretation & Action: insights, decisions, operational deployment.

Question 37

Q

What is meant by the art of data analytics?

Answer

A

Training a model is easy. But, garbage in is garbage out. The selection of training data and the selection of models and parameters is crucial. This requires a more creative approach.

Question 38

Q

Why is data science not only science but also art?

Answer

A

Training a model is easy. But, garbage in is garbage out. The selection of training data and the selection of models and parameters is crucial. This requires a more creative approach.

Question 39

Q

Why can you use more techniques to find patterns with machine learning compared to statistics?

Answer

A

Because machine learning is not always dependent on the data being normally distributed.

Question 40

Q

What kind of problems can you solve with machine learning?

Answer

A

Regression, Classification, Clustering, Association Rule Discovery

Question 41

Q

Which types of problems are supervised?

Answer

A

Regression/ Classification

Question 42

Q

Which types of problems are unsupervised?

Answer

A

Clustering/ Association Rule Discovery

Question 43

Q

How do you decide to choose for supervised or unsupervised techniques?

Answer

A

It depends on the business case and on the result we’re trying to achieve.

Question 44

Q

What is clustering and when to use it?

Answer

A

Clustering is a technique used to segment your data into groups. You use it when you don’t know the labels of your data, or when you’re trying to identify groups…

Question 45

Q

What is clustering and when to use it?

Answer

A

Split or group cases/observations.

Question 46

Q

What is association rule discovery and when to use it?

Answer

A

Discover events that happen together. e.g. recommendation systems, market basket analysis.

Question 47

Q

What is estimation/prediction and when to use it?

Answer

A

Predict a continuous value: e.g. a house price.

Question 48

Q

What is classification and when to use it?

Answer

A

Predict a categorical value: e.g. male/female.

Question 49

Q

What is a recommender system?

Answer

A

Recommender systems are used to determine which items a customer would be interested in, based on items they previously liked or frequently buy. For example, Netflix uses this to recommend movies you might enjoy, to keep you on its platform for longer.

Question 50

Q

What is market basket analysis?

Answer

A

This is a technique used by retailers to uncover associations between items.

It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.
Association rules are widely used to analyze retail basket or transaction data. They are intended to identify strong rules in that data using measures of interestingness.
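
As a sketch, the example below counts how often pairs of items occur together and scores them with support and confidence. Using these two measures is an assumption here (they are common choices of interestingness measure), and the baskets are made up.

```python
from collections import Counter
from itertools import combinations

# Made-up transaction data: one set of items per basket
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

# How often each item and each pair of items appears
item_counts = Counter(item for basket in transactions for item in basket)
pair_counts = Counter(
    frozenset(pair)
    for basket in transactions
    for pair in combinations(sorted(basket), 2)
)

def support(a, b):
    """Fraction of all baskets that contain both a and b."""
    return pair_counts[frozenset((a, b))] / len(transactions)

def confidence(a, b):
    """Of the baskets that contain a, the fraction that also contain b."""
    return pair_counts[frozenset((a, b))] / item_counts[a]

print(support("bread", "butter"))
print(confidence("butter", "bread"))
print(confidence("bread", "butter"))
```

In this data, everyone who buys butter also buys bread, but only three out of four bread buyers also buy butter, so the rule "butter implies bread" is the stronger one.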

Question 51

Q

Why is parallel processing the solution to the performance problems in big data?

Answer

A

Because we’re reaching the limits of how much we can optimize the hardware. Individual computers and processors aren’t getting much faster or smaller anymore. As such, we need new ways of improving performance, which can be achieved through parallel processing.

Question 52

Q

Why do we need to adapt software to make use of parallel processing?

Answer

A

Because software needs to be explicitly programmed to deal with the intricacies of parallel programming. Instructions need to be laid out specifically, and data needs to be managed more carefully.

Most software requires explicit changes to deal with parallel processing.

Question 53

Q

How can we make use of parallel processing in data analytics without having to adapt software?

Answer

A

We can train multiple models side-by-side.

We can train multiple models with different parameters side-by-side. This reduces the lead time of one task. (e.g. K-Means clustering with different numbers of clusters and different start positions.)

We can split our data in multiple parts and train separate models on separate datasets. For Classification problems, these resulting models can be recombined. This is impossible to do for clustering, where we might discover different clusters on different data subsets.
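
A minimal sketch of the "different parameters side-by-side" idea: a tiny 1D K-Means is run for several numbers of clusters as independent tasks. The data, the `kmeans_sse` helper and the evenly spread starting positions are assumptions for this example. Threads are used to keep it short; for CPU-heavy training in Python one would typically use `ProcessPoolExecutor` from the same module instead.

```python
from concurrent.futures import ThreadPoolExecutor

POINTS = [1, 2, 3, 10, 11, 12, 20, 21, 22]  # made-up 1D data

def kmeans_sse(k, points=POINTS, iterations=10):
    """Run a tiny 1D K-Means and return the sum of squared distances to the centres."""
    lo, hi = min(points), max(points)
    # Spread the initial centres evenly over the data range
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centres) for p in points)

candidate_ks = [1, 2, 3]
with ThreadPoolExecutor() as executor:
    # Each value of k is an independent task, so the runs can be scheduled side by side
    scores = dict(zip(candidate_ks, executor.map(kmeans_sse, candidate_ks)))
print(scores)
```

The K-Means code itself is unchanged and sequential; only the outer loop over parameter values is handed to the executor.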

Question 55

Q

Why will a program adapted for parallelization not run N times faster on N processors?

Answer

A

Because of the computing overhead of splitting the data, moving/copying data across different processes, reassembling the data, etc.

Question 56

Q

Why will a computer that is 100 times faster not solve all performance problems?

Answer

A

Because some problems scale worse than linearly with the amount of data. For example, for a quadratic algorithm a 10-fold increase in data means 100 times as much work. In these cases, better software/algorithms are the answer.
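
A small worked example of this reasoning, assuming the amount of work grows as a power of the data size n. The function `data_growth` is a hypothetical helper that answers: with a computer that is `speedup` times faster, how much more data can be processed in the same time?

```python
def data_growth(speedup, exponent):
    """Factor by which the data can grow in the same run time
    when the work grows as n ** exponent."""
    return speedup ** (1 / exponent)

linear = data_growth(100, 1)     # work grows as n
quadratic = data_growth(100, 2)  # work grows as n^2
cubic = data_growth(100, 3)      # work grows as n^3
print(linear, quadratic, cubic)
```

A computer that is 100 times faster handles 100 times more data for a linear algorithm, but only 10 times more for a quadratic one and fewer than 5 times more for a cubic one.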

Question 57

Q

What is meant by selection bias?

Answer

A

Drawing conclusions from the wrong/unrepresentative data.

Example: you could say that all hospitals are inherently dangerous because most people who die, die at hospitals. This is not true, it’s just that most people who are injured or ill go to a hospital.

Question 58

Q

What kind of role can managers play in the data analytics cycle?

Answer

A

Providing the business case, aiding in the data selection and assisting in the interpretation and decision-making steps. Being the translation step between the technology side and the business side.

Question 59

Q

What is meant by spurious correlation?

Answer

A

A correlation that is caused by random chance or by a third (unseen) factor.

Question 60

Q

What is process mining?

Answer

A

Process mining takes existing data records as a starting point, extracts different variations of the process and automatically turns them into understandable visualizations.

This can show remarkable deviations, unnecessary rework and the real bottlenecks.

Question 61

Q

What kind of data does process mining use?

Answer

A

Existing data records/event logs.

Question 62

Q

What can you do with discovery process mining?

Answer

A

Reverse Engineering: derive process model from event log.

- Decision Mining: check how decisions are made.
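
One simple building block for deriving a process model from an event log is to count which activity directly follows which. The sketch below does that for a made-up log; the case identifiers and activity names are assumptions, and real discovery algorithms do considerably more than this.

```python
from collections import Counter

# Made-up event log: for each case, the activities in the order they happened
event_log = {
    "case-1": ["receive", "check", "approve"],
    "case-2": ["receive", "approve"],
    "case-3": ["receive", "check", "check", "approve"],
}

# Count every pair (activity, next activity) across all cases
directly_follows = Counter()
for activities in event_log.values():
    for current, following in zip(activities, activities[1:]):
        directly_follows[(current, following)] += 1

for (current, following), count in sorted(directly_follows.items()):
    print(f"{current} -> {following}: {count}")
```

The counts already reveal variations of the process: one case skips the check, and one case repeats it (rework).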

Question 63

Q

What can you do with conformance checking process mining?

Answer

A

Auditing/Testing: compare real process flows with intended process flows.

Question 64

Q

What can you do with performance mining?

Answer

A

Optimization: add additional data (e.g. waiting and process times) to get additional insights (e.g. bottlenecks).

Answer 65

A

Network mining is analysing the connections between people and/or objects to uncover their relations.

Answer 66

A

Contacts (emails, phone calls, Facebook)
Transactions (financial, trade, fraud)
Citations (scientific, …)
Co-occurrence
Collaboration

Answer 67

A

Gain insights in communities and networks. This can unravel patterns, weaknesses, optimizations, etc.

Answer 68

A

You can discover the most crucial points in your network. For example, in social networks you could identify the social influencers which can be targeted for your marketing campaigns. If you can convince these people, they will convince the people in their networks.
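
One simple way to look for such crucial points is to count how many connections each person has (degree centrality). The sketch below assumes a made-up contact list; the names and the choice of this measure are illustrative only.

```python
from collections import Counter

# Made-up contact network: each pair is one connection between two people
contacts = [
    ("ann", "bob"),
    ("ann", "cara"),
    ("ann", "dev"),
    ("bob", "cara"),
    ("dev", "eli"),
]

# Degree = number of connections per person
degree = Counter()
for a, b in contacts:
    degree[a] += 1
    degree[b] += 1

most_connected, connections = degree.most_common(1)[0]
print(most_connected, connections)
```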

Answer 69

A

You can discover which kinds of groups you should be targeting, how these groups interact internally, how they interact with other groups, …

Answer 70

A

Marketing: segmentation/communities, influencers.
Bottlenecks and load balancing: physical networks, processes, people.
Fraud detection.
Anti-terrorism, anti-espionage, crime.
Disease control.
Collaboration.
Behaviour analysis: humans, animals, plants.

Answer 71

A