Danny's Zenne Stof Flashcards

Question 1

Q

What is data analytics in exploratory analytics?

Answer

A

It’s about the extraction of useful information and knowledge from large volumes of data, in order to improve decision making.

Question 2

Q

Why do we do data exploration?

Answer

A

We explore our data in order to understand it better.

Question 3

Q

How do you get started on data exploration?

Answer

A

Import the data in the right format.
Understand the meaning of the variables.
Understand their typical values.
Understand how values interact with each other.
Understand how to combine different datasets.
Understand the data types.
Are there missing values?
Are there outliers?
What is the overall quality of our data?

Question 4

Q

Exploring data is twofold: explain.

Answer

A

Descriptive statistics: distributions, relationships, …

Visualizations: scatterplots, histograms, barplots, …

Data visualization is arguably the most important. The information conveyed via visuals can be very quickly absorbed by the human brain.

Question 5

Q

You cannot prepare the data without understanding the data. Explain?

Answer

A

You need to know the quality of your data to know what to do in your preparation steps. Are there outliers, missing values, etc.?

Question 6

Q

A good start is half the battle… why?

Answer

A

You can’t begin working on your project unless you know and understand your data.

Question 7

Q

Datasets typically consist of rows and columns. What do they mean?

Answer

A

Rows are the observations/data points/entities.

Columns are the attributes/features/variables of your observations.

Question 8

Q

Which kinds of data sources are there?

Answer

A

Internal and External.

Question 9

Q

Give some examples of internal data sources.

Answer

A

Company Website
Customer Information: make sure to contact the person responsible for privacy before working with Personally Identifiable Information! (GDPR)
Operations/Logistics data
Financial data

Question 10

Q

Give some examples of external data sources

Answer

A

APIs (e.g. tweets)
Public Records (open source data, available to anyone, e.g. government)
Manually Labelled (e.g. reCaptcha, labeled customer reviews, …)

Question 11

Q

What kinds of data storage exist?

Answer

A

Servers on Premise (small- to medium-sized datasets)
Cloud (any kind of dataset)

e.g. Amazon AWS, Google Cloud, Azure Cloud, …

Question 12

Q

What kinds of data do you have? Give some examples.

Answer

A

Structured:

Tabular data
Customer information
Transactional data

Unstructured:

Text
Email
Video
Audio
Web Pages
Social Media

Question 13

Q

What kind of databases are used for structured data?

Answer

A

Relational databases.

Question 14

Q

What kind of databases are used for unstructured data?

Answer

A

Document databases.

Question 15

Q

What query language is used to access document databases?

Question 16

Q

What query language is used to access relational databases?

Question 17

Q

How can we turn unstructured data into structured data?

Answer

A

By means of feature extraction.

Question 18

Q

What is character encoding and why is it important?

Answer

A

Character encoding is used to tell the software how to interpret the bytes of your data. This is important so that your data is interpreted correctly.

Common encodings include UTF-8 and Latin-1. Latin-1 cannot represent Kanji, for example.
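
A minimal sketch of why the encoding matters. The sample string is made up for illustration; the point is only that the same bytes read with a different encoding give different text, and that Latin-1 has no way to store Kanji.

```python
# Hypothetical sample text containing an accented letter and two Kanji.
text = "café 漢字"

# The bytes that would end up in a UTF-8 encoded file.
utf8_bytes = text.encode("utf-8")

# Reading those bytes as Latin-1 does not raise an error,
# it silently produces the wrong characters.
garbled = utf8_bytes.decode("latin-1")


def can_encode(s: str, encoding: str) -> bool:
    """Return True if every character of s can be stored in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(utf8_bytes.decode("utf-8") == text)  # True: correct encoding round-trips
print(garbled == text)                     # False: wrong encoding garbles the text
print(can_encode("café", "latin-1"))       # True
print(can_encode("漢字", "latin-1"))        # False
```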

Question 19

Q

What are missing values?

Answer

A

Missing values are values that are missing from your dataset.

Question 20

Q

What are some important steps to consider when importing data?

Answer

A

Are we using the correct character encoding?

Are there any missing values?

Question 21

Q

What types of data are there?

Answer

A

Categorical:

Nominal (unranked)
Ordinal (ranked)

Numerical:

Discrete (counted, not measured)
Continuous (measured, not counted)

Question 22

Q

What is nominal data and give some examples.

Answer

A

Categorical data that does not indicate an order between the values.

Male/Female
Colours (red, green, blue)

Question 23

Q

What is ordinal data and give some examples.

Answer

A

Categorical data that does have some kind of order.

Small, Medium, Large
First Class, Second Class, Third Class
Temperature labeled as “cold, mild, hot”

Question 24

Q

What is continuous data and give some examples.

Answer

A

Continuous data is data that can be measured, but not counted.

Length
Weight

Question 25

Q

What is discrete data and give some examples.

Answer

A

Discrete data can be counted, but not measured.

Number of students
Number of pens in the box.
Number of chickens that walked out of the chicken coop.

Question 26

Q

What kind of statistics can you do with nominal data?

Answer

A

You can count the frequencies.

You can count the proportions.
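
A small sketch of both counts on a made-up nominal attribute (the colour values are hypothetical):

```python
from collections import Counter

# Hypothetical nominal attribute.
colours = ["red", "green", "red", "blue", "red", "green"]

# Frequency: how often each value occurs.
frequencies = Counter(colours)

# Proportion: the frequency divided by the number of observations.
proportions = {value: count / len(colours) for value, count in frequencies.items()}

print(frequencies)
print(proportions)
```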

Question 27

Q

How can you visualize nominal data?

Answer

A

Barcharts and piecharts.

Question 28

Q

What kind of statistics can you do with ordinal data?

Answer

A

Frequencies, proportions.

Percentiles and median.

Question 29

Q

What kind of statistics can you do with continuous and discrete data?

Answer

A

You can summarize your data using percentiles, median, mean, standard deviation, range …
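
As a rough illustration, these summaries can be computed with Python's standard library. The income values are invented and include one deliberately extreme value to show how the mean and the median react differently.

```python
import statistics

# Hypothetical annual incomes in thousands, with one extreme value.
incomes = [28, 31, 35, 35, 40, 42, 120]

summary = {
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "range": max(incomes) - min(incomes),
    "stdev": statistics.stdev(incomes),  # sample standard deviation
}

for name, value in summary.items():
    print(name, round(value, 2))
```

The mean is pulled towards the extreme value while the median is not, which is one reason to look at both.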

Question 30

Q

How can you visualize numeric data?

Answer

A

Histograms.

Boxplots.

Question 31

Q

Which type of plot can show outliers? Histograms or Boxplots?

Answer

A

Boxplots. Histograms only show tendencies of your data, not individual outliers.

Question 32

Q

What do you call a variable that identifies a sample?

Answer

A

An object identifier.

Question 33

Q

Give some examples of object identifiers.

Answer

A

Row indexes, names, database ids.

Question 34

Q

What kind of information does a histogram give you?

Answer

A

The general tendencies of your data.

Question 35

Q

What are descriptive statistics?

Answer

A

Descriptive statistics give you insights by summarizing the data.

Question 36

Q

Give some examples of descriptive statistics.

Answer

A

Average of the annual income.
Median home prices in the neighbourhood.
Range of credit scores of a population.

Question 37

Q

What is univariate exploration?

Answer

A

This is the analysis of one attribute at a time.

Question 38

Q

What is the mean?

Answer

A

This is the average of all observations in a dataset for a certain variable.

Question 39

Q

What is the median?

Answer

A

This is the value of the central point in the distribution of the dataset for a certain variable.

Question 40

Q

What is variability?

Answer

A

Variability describes how widely the values of a variable are spread. For instance, two datasets with similar means and medians can have vastly different variabilities if their minimums and maximums are different.

Question 41

Q

What is range?

Answer

A

Range is the difference between the minimum and maximum value.

The range is very susceptible to the presence of outliers and fails to consider the distribution of all data points in the attribute.

Question 42

Q

What is spread?

Answer

A

Spread is quantified by the standard deviation and the variance.

Question 43

Q

What is deviation?

Answer

A

The difference between an observation and the mean of the variable.

Question 44

Q

What is variance?

Answer

A

Variance is the average of the squared deviations of a variable from its mean.

Question 45

Q

What is standard deviation?

Answer

A

The square root of the variance.
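
A worked sketch of how deviation, variance and standard deviation build on each other. The values are made up, and the sketch assumes the population variance (dividing by n); the sample variance divides by n - 1 instead.

```python
import math

# Hypothetical observations of one variable.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Deviation: each observation minus the mean.
deviations = [x - mean for x in values]

# Variance: the average of the squared deviations (population form).
variance = sum(d ** 2 for d in deviations) / len(values)

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```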

Question 46

Q

What does it mean when an attribute has a high standard deviation?

Answer

A

The data points are spread widely around the central point.

Question 47

Q

What does it mean when an attribute has a low standard deviation?

Answer

A

It means that the datapoints are spread closely around the central point.

Question 48

Q

What is multivariate exploration?

Answer

A

It means that we study more than one attribute simultaneously.

Question 49

Q

What is correlation?

Answer

A

Correlation measures the statistical relationship between two attributes.

Question 50

Q

What is spurious correlation?

Answer

A

A correlation that happens by accident, or because of an (unseen) third factor.

It’s a correlation that’s not causal.

Question 51

Q

What is the Pearson correlation coefficient?

Answer

A

A value (r) that can be between -1 and 1. It describes how strongly correlated two variables are.

-1 : strongly negatively correlated
1 : strongly positively correlated
0 : no linear correlation at all
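
A minimal sketch of the calculation, written out by hand rather than taken from a library. The function name and the example values are chosen for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations (covariance without the 1/n factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))          # perfectly positive
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))          # perfectly negative
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x², no linear relation
```

The last example shows that r only captures linear relationships: y depends entirely on x, yet r is 0.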

Question 52

Q

Pearson’s correlation coefficient is sensitive to outliers. Correct?

Question 53

Q

What do we use scatterplots for?

Answer

A

We use scatterplots to compare 2 numerical attributes. We can compare more attributes by using colours, shapes, etc. to plot a third attribute.

Question 54

Q

What is a histogram? What do you use it for?

Answer

A

A histogram can be used to visualize the distribution of data by plotting the frequency of occurrence in a range.

Question 55

Q

What’s the optimal number of bins or binwidth in a histogram?

Answer

A

There is no optimal number, it depends on the data.

Question 56

Q

How can we compare the histograms of a categorical third factor?

Answer

A

By using colours. This could be useful to see how the X and Y attributes compare for various values of a third categorical variable.

Question 57

Q

What is a boxplot?

Answer

A

A boxplot is a simple but powerful visual way of showing the distribution of a numerical variable. A boxplot shows useful information like outliers and the interquartile range.

Question 58

Q

What makes boxplots interesting?

Answer

A

You can compare them easily.

Question 59

Q

What is Q1, Q2 and Q3 in a boxplot?

Answer

A

Q1 and Q3 indicate the edges of the box. Q2 indicates the median of the distribution.
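
A small sketch of the quartiles behind a boxplot. The data is invented, and the 1.5 × IQR rule for flagging outliers is a common plotting convention assumed here, not something stated on the card.

```python
import statistics

# Hypothetical numeric attribute with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]

# Q1, Q2 and Q3 cut the sorted data into four parts; Q2 is the median.
q1, q2, q3 = statistics.quantiles(values, n=4)

# Interquartile range: the height of the box.
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, q2, q3, iqr, outliers)
```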

Question 60

Q

What is R²?

Answer

A

The model fit. A higher number indicates a better model fit.

Question 61

Q

Where are the data samples located on a linear regression between two highly correlated numerical variables?

Answer

A

Very close to the linear regression line.

Question 62

Q

Where are the data samples located on a linear regression between two lowly correlated numerical variables?

Answer

A

Very scattered and not along the linear regression line.

Question 63

Q

Do outliers strongly influence the linear regression calculation?

Question 64

Q

What is a scatter matrix?

Answer

A

It’s annoying to calculate scatterplots for each numerical attribute in datasets with many numerical features.

You can use a scatter matrix to quickly show comparisons for all of them.

A scatter matrix will show scatter plots for each pair of attributes below the main diagonal.

The main diagonal will show histograms of the attribute it represents.

Above the main diagonal will be the r-value that shows how correlated it is.

Answer 61

A

In what way does type X differ from type Y.

Which classes have the highest numerical values for attribute A?

How do classes X and Y compare for attribute B and C?

Answer 62

A

Import & Organize
Data Quality

Check the data quality. Are there missing or incorrect values? How will you deal with it?

This is an iterative process with data scientists & the business.

Values might have to be imputed.

Univariate Statistics

Calculate mean and median for each numerical attribute and the class label.

If they are very different, it may indicate the presence of an outlier or a non-normal distribution for the attribute.

Calculate the standard deviation and spread. Compare the standard deviation with the mean to understand the data.

Univariate Visualizations

Display the histogram and distribution plots for each attribute. Repeat for class-stratified histograms. Use colour coding for each class to make comparisons.

Compare with and without outliers.

Multivariate Statistics

Calculate correlation between attributes and develop a correlation matrix. Notice what attributes are dependent on each other. Investigate why they are dependent. Ask your business for help to explain these results.

Multivariate Visualizations

Plot a scattermatrix to show correlation between multiple attributes at once.

Remember to stratify by class if applicable.

High Dimensional Visualization

Create parallel charts to observe the class differences exhibited by each attribute.

Group box plots to compare them for each attribute.

Answer 63

A

Each decision tree has a root node at the top of the decision tree.

Answer 64

A

A node that has child nodes.

Answer 65

A

A node without child nodes.

Answer 66

A

Exploration. Gain insight into a large number of candidate input variables.

Classification & Estimation. Easily understandable rules for predicting the most likely class or the value of a continuous variable.

Answer 67

A

Because they provide insight into their decision making. They are a white-box approach.

Answer 68

A

Each decision tree starts with a root node and asks a question: for example “sex = male?”.

Each node contains 3 rows of information.
Row 1: has 1 number, the majority class of the class we’re trying to predict.
Row 2: two numbers, the proportion of observations belonging to each class.
Row 3: the % of total observations in this node.

Example from the Titanic case:

0

Answer 69

A

This indicates how pure a node is. A node is pure if it contains many of one class, but not many of the other class.

Answer 70

A

By using purity. A decision tree automatically tries to find the best splits by feature and value of the feature to make its splits as pure as possible.

This means that each split contains one dominant class.

Answer 71

A

It’s an exhaustive algorithm because it tries all possibilities to make the best mathematical decision of purity.

Answer 72

A

An algorithm that will continue to apply itself again and again until it completes its task.

Answer 73

A

Using GINI, Entropy or Information Gain Ratio.

Answer 74

A

Splits are evaluated based on the effect on the node purity in terms of the target variable.

This means that the choice of an appropriate splitting criterion depends on the type of the target variable.

With a categorical -> GINI is OK.

Continuous/Numeric -> other tests.

Answer 75

A

GINI,
Entropy,
Information Gain Ratio

Answer 76

A

GINI is the sum of squares of the proportions of the classes.

Answer 77

A

(5/13)² + (8/13)² = 0.527

Answer 78

A

GINI1 = (6/7)² + (1/7)² = 0.755
GINI2 = (2/6)² + (4/6)² = 0.556

GINI(split) = (7/13) × 0.755 + (6/13) × 0.556 = 0.663
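
The arithmetic above can be checked with a short sketch. It follows the definition used on these cards (sum of squared class proportions, so a higher score means a purer node); many libraries instead report 1 minus this value as the Gini impurity. The function names are made up for illustration.

```python
def gini_purity(counts):
    """Sum of squared class proportions for one node."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def split_purity(children):
    """Purity of a split: each child's score weighted by its share of the observations."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini_purity(child) for child in children)


parent = gini_purity([5, 8])             # node with 13 observations
left = gini_purity([6, 1])               # child with 7 observations
right = gini_purity([2, 4])              # child with 6 observations
split = split_purity([[6, 1], [2, 4]])

print(round(parent, 3), round(left, 3), round(right, 3), round(split, 3))
# 0.527 0.755 0.556 0.663
```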

Answer 79

A

If it can no longer split the data, otherwise it will keep growing.

Answer 80

A

Eliminating unstable splits by merging smaller splits.

Answer 81

A

The subtle patterns in the training set. They do not generalize well.

Answer 82

A

Because they make lumpy estimations.

Answer 83

A

You can, but in practice it might not give good results. They are better suited for binary classifications.

Answer 84

A

Mathematical function of a set of numeric attributes.

Answer 85

A

These are models that learn by tweaking a set of parameters that are closely tied to the attributes that are used for learning.

For example, y = a + bx1 + cx2. In this case, x1 and x2 are the chosen attributes of our training set. The parameters a, b and c are chosen by the model in such a way that the function gives the best result on the training set (the lowest cost).

Answer 86

A

This is the line you can draw through a dataset in order to determine the target class.

This is also a mathematical function, like:

Class(x) = 400 - 5 * size (where 400 and 5 were determined by the model).

This can then say that if the result is < 0, it’s the negative class and if it’s > 0 it’s the positive class.
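
A tiny sketch of that discriminant function. The coefficients 400 and 5 come from the card; how to treat a result of exactly 0 is not specified there, so labelling it "boundary" is an assumption.

```python
def discriminant(size):
    """Linear discriminant from the example: 400 - 5 * size."""
    return 400 - 5 * size


def classify(size):
    score = discriminant(size)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    # Assumption: a point exactly on the line is reported as such.
    return "boundary"


for size in (50, 80, 100):
    print(size, discriminant(size), classify(size))
```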

Answer 87

A

The discriminant is a line.

Answer 88

A

Hyperplane.

Answer 89

A

No, because it has more than 3 dimensions.

Answer 90

A

These are the algorithms you use when building a parametric model.

Example: SVM, Logistic Regression, Linear Regression, …

Answer 91

A

Determining the probability that an observation belongs to a class.

Answer 92

A

It classifies observations based on the linear function of the features.

Answer 93

A

The data points that are closest to the separating line.

Answer 94

A

Fitting the fattest line between the classes.

Answer 95

A

Wider margin.

Answer 96

A

It penalizes the model for them. Distance increases penalty. Further away from the line and on the wrong side => bigger penalty.

Answer 97

A

Accuracy.
Works well on small, clean datasets.
Robust against overfitting.

Answer 98

A

Computational power for large datasets.

Answer 99

A

Root Mean Squared Error. It is used in regression to estimate how well your model performs. You want your model to fit its parameters in such a way that minimizes the RMSE on the training and test sets.
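
A minimal sketch of the RMSE calculation. The function name and the example values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(rmse([3, 5, 7], [3, 5, 7]))  # perfect predictions give 0.0
print(rmse([1, 2, 3], [2, 3, 4]))  # every prediction is off by 1, so 1.0
```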

Answer 100

A

Easy to fit and apply.

Less prone to overfitting.

Interpretable.

Answer 101

A

They can only express linear and additive relationships.

Prone to collinearity: when input variables are partially correlated.

Sensitive to outliers.

Answer 102

A

That the model isn’t very good, not really much better than guessing the answer.

Answer 103

A

That the model is very good at predicting the value. It’s a good fit for the data.

Answer 104

A

x = y line runs roughly through the center points

Answer 105

A

Points are deviating from the x = y line.

> need to model more complex relationships
> not all necessary variables are included (model is too simple)

Answer 106

A

It shows the goodness of fit: the distances between the real values and the values predicted by the model. Fitting minimizes the sum of squared errors; the smaller this value, the better the model.

Answer 107

A

A model that indicates the probability that an observation belongs to the class of interest.

Answer 108

A

It needs numerical input variables and a categorical output variable.

Answer 109

A

It’s the confidence interval??

Answer 110

A

They’re not sensitive to outliers. Only points around the classification boundary (the middle) have a large influence on the model.

Answer 111

A

Yes, but it doesn’t tell us much as it’s very sensitive to outliers, so we preferably don’t do it.

Answer 112

A

Example: if you cancel your cable TV account.

Answer 113

A

Example: you stop shopping at Delhaize and go to Colruyt instead.

Answer 114

A

When the user decides to stop doing business.

Example: doesn’t use their TV, so doesn’t want to pay anymore.

Answer 115

A

When the company decides to stop doing business with the customer.

Example: customer violates terms of service and company bans the customer.

Answer 116

A

If the cost of making false positives is high: for example, if you want to make an effort to target churners and the cost of targeting a non-churner is high.

Example: you could offer better benefits to potential churners, but if you start handing these out willy-nilly, it costs your company a lot of money.

Answer 117

A

If the cost of not identifying your positive class is high.

Example: if losing customers you didn’t think would churn is more expensive than acquiring new ones.

You don’t want to misclassify “real” churners.

Answer 118

A

True Positive Rate plotted against False Positive Rate.

Answer 119

A

The one that’s closest to the top left corner, the model with the biggest area-under-curve (AUC).

Answer 120

A

This is the combined value of precision and recall. It’s the harmonic mean of these two values.

A high F1-score means you have a very well performing model, even with unbalanced classes.
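
A short sketch of how precision, recall and the F1-score follow from the counts of a confusion matrix. The counts are invented for a churn-style example.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)  # how many predicted positives were correct
    recall = tp / (tp + fn)     # how many real positives were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 40 churners found, 10 false alarms, 40 churners missed.
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=40)
print(precision, recall, round(f1, 3))  # 0.8 0.5 0.615
```

The harmonic mean stays close to the lower of the two values, so a model cannot get a high F1-score by doing well on only one of them.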

Answer 121

A

In parallel.

Answer 122

A

By building on the results of the previous classifier sequentially.

Answer 123

A

Most machine learning algorithms make assumptions about the distribution of your data.

Variables should be on the same scale (= standardization)

Variables should be of numeric types.

Answer 124

A

Models that use linear objective functions.

Answer 125

A

You use binary or one-hot encoding.
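
A minimal sketch of one-hot encoding by hand. The function name and the colour values are made up, and ordering the categories alphabetically is an arbitrary choice.

```python
def one_hot(values):
    """Return the category order and one 0/1 row per value."""
    # Assumption: categories are ordered alphabetically.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


categories, rows = one_hot(["red", "green", "blue", "red"])
print(categories)  # ['blue', 'green', 'red']
for row in rows:
    print(row)
```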

Answer 126

A

Dropping unnecessary features.

Dropping highly correlated features.

Answer 127

A

Use stratified sampling.

Answer 128

A

Create new features based on existing features.

Creates insights into relationships between features.

Should consult with business and subject matter experts.

Answer 129

A

We mean: how similar or dissimilar are two data points?

Answer 130

A

d(i, j) = (p - m) / p

m = number of matches
p = total number of attributes describing the object

Answer 131

A

1 - d(i, j)
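
A sketch of the matching-based dissimilarity d(i, j) = (p - m) / p and the similarity derived from it. The two objects and their attribute values are hypothetical.

```python
def nominal_dissimilarity(a, b):
    """(p - m) / p, with p attributes in total and m attributes that match."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


def nominal_similarity(a, b):
    return 1 - nominal_dissimilarity(a, b)


# Hypothetical objects described by colour, size and shape.
obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")

print(nominal_dissimilarity(obj_i, obj_j))  # 1 of 3 attributes differs
print(nominal_similarity(obj_i, obj_j))     # 2 of 3 attributes match
```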

Answer 132

A

Use the Euclidean distance:

d(i,j) = sqrt( (xi1 - xj1)² + (xi2 - xj2)² + … + (xin - xjn)² )

We want to minimise dissimilarity => more similar!
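
The formula written out as a small function (the name and the example points are chosen for illustration):

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean((0, 0), (3, 4)))        # 5.0
print(euclidean((1, 2, 3), (1, 2, 3)))  # 0.0: identical points
```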

Answer 133

A

Because large numbers swamp small numbers making them very unimportant.

Answer 134

A

Its mean is 0 and its standard deviation is 1.
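
A minimal sketch of standardizing one attribute, assuming the population standard deviation is used. The values are made up.

```python
import statistics


def standardize(values):
    """Rescale values so that their mean is 0 and their standard deviation is 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / std for v in values]


z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```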

Answer 135

A

1-Nearest Neighbours.

Answer 136

A

Test the accuracy for multiple values of K and use the best result. Alternatively, plot the test and training errors on a graph. When the test error starts going up again while the training error still goes down, the model is overfitting.
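
A bare-bones nearest-neighbours sketch for trying several values of K. The training points, labels and function name are hypothetical, and plain majority voting on Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict the label of query from its k nearest training observations."""
    # Sort the training observations by distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the neighbours' labels.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (features, label).
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"), ((8, 8), "b"), ((9, 8), "b")]

for k in (1, 3, 5):
    print(k, knn_predict(train, (8, 9), k))
```

With K equal to the size of the training set, every query simply gets the majority class, which is why very large values of K stop being useful.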

Answer 137

A

Euclidean distance

Answer 138

A

Every new observation will receive the label of the majority class.

Answer 139

A

It overfits very hard.

Answer 140

A

Very little time to build.

Can handle missing values in new observation.

Answer 141

A

Storage.
No model description.
Curse of dimensionality. There could be many irrelevant attributes that might pose a problem for predicting a neighbour.

Answer 142

A

Instance-based learning.

Answer 143

A

We might want to find groups within our data, without looking for any specific classification.

Answer 144

A

Unsupervised segmentation.

Answer 145

A

Number of clusters not known ahead of time.

Answer 146

A

To indicate hierarchies in clusters.

Answer 147

A

Each data point.

Answer 148

A

The entire dataset.

Answer 149

A

First, it groups the two data points that are closest together. Then it groups the next two closest points or groups. It keeps doing that until everything has been merged into one large group.

It groups based on distances.
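
A rough sketch of this bottom-up merging. It assumes single linkage (the distance between two groups is the distance between their closest members), which the card does not specify, and the one-dimensional points are made up.

```python
import math
from itertools import combinations


def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the merge distances in order."""
    clusters = [[p] for p in points]  # start: every data point is its own cluster
    merge_distances = []
    while len(clusters) > 1:
        def linkage(pair):
            # Assumption: single linkage, the distance of the closest pair of members.
            a, b = pair
            return min(math.dist(p, q) for p in clusters[a] for q in clusters[b])

        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        merge_distances.append(linkage((a, b)))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merge_distances


# Hypothetical one-dimensional data points.
distances = agglomerate([(0,), (1,), (5,), (6,), (20,)])
print(distances)  # [1.0, 1.0, 4.0, 14.0]
```

The growing merge distances are what a dendrogram draws as the height of each join.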

Answer 150

A

It creates a collection of ways to group points. You can cut at a certain horizontal point (which indicates the number of groups).

Answer 151

A

It allows the data scientist to see the groups, the landscape of data similarity, before deciding on the number of groups to select.

Answer 152

A

The center of a cluster.

Answer 153

A

Breaks observations into pre-defined number of clusters. K = number of clusters.

Answer 154

A

The cluster centroids and the clustered datapoints.

Answer 155

A

Randomly assign each observation to one of the K clusters.
Then, calculate the center of each cluster. The center is the average position of all observations of that cluster.
Then, we have the first iteration. Each observation is assigned to the nearest cluster center.
Then, recalculate the center of the cluster.
Keep repeating until the center stops changing.

Stopping can also be done early by specifying how many times it can change the center.
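
The steps above can be sketched as follows. This is a simplified illustration, not a production implementation: the function name, the fixed random seed, the `max_iter` limit and the way an empty cluster is re-seeded are all assumptions.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; return the centers and each point's cluster index."""
    rng = random.Random(seed)
    # Randomly assign each observation to one of the k clusters.
    assignment = [rng.randrange(k) for _ in points]
    centers = []
    for _ in range(max_iter):  # early stopping: limit the number of updates
        # The center is the average position of the cluster's observations.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                # Assumption: an empty cluster is re-seeded with a random observation.
                new_centers.append(rng.choice(points))
        # Stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
        # Assign each observation to its nearest center.
        assignment = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
    return centers, assignment


# Hypothetical data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = k_means(points, k=2)
print(centers)
print(assignment)
```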