Lectures (week 1) Flashcards

Question 1

Q

What are the 3 types of data?

Answer

A

numerical
ordinal: Ordinal data is comparable and can thus be used for
sorting. It cannot be added or averaged.
* The best way to use ordinal data is to convert it into
numerical data. Using domain knowledge, choose a
translation that preserves the information.
categorical: Categorical data uses labels to split the datapoints into distinct
categories. Sometimes there are only a few labels (e.g. gender),
sometimes there are many.
While most labels can be sorted alphabetically, you should not
treat categorical data as ordinal unless domain experts say that there is a
meaningful order.
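
A minimal Python sketch of the ordinal-to-numerical conversion described above. The labels and the numbers in `SIZE_TO_NUMBER` are hypothetical; in practice the translation should come from domain knowledge.

```python
# Hypothetical ordinal scale; the numeric values are an assumption,
# chosen only so that the order small < medium < large is preserved.
SIZE_TO_NUMBER = {"small": 1.0, "medium": 2.0, "large": 3.0}


def ordinal_to_numerical(labels, mapping):
    """Translate ordinal labels into numbers using a domain-specific mapping."""
    return [mapping[label] for label in labels]


sizes = ["medium", "small", "large", "small"]
encoded = ordinal_to_numerical(sizes, SIZE_TO_NUMBER)
print(encoded)
```

Once encoded, the values can be sorted and averaged like any other numerical column.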

Question 2

Q

How do you translate categorical data into numerical data using one-hot/dummy encoding?

Answer

A

One-hot encoding (or dummy encoding)
is the best way to convert categorical
data to numerical data while keeping all
information.
* It introduces a new column for each
distinct label. The column contains 1.0 if
the item falls into the category and 0.0
otherwise.
* It leads to a dimensional explosion, so you
should consider clustering/grouping
before encoding in some cases.
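
As an illustration, one-hot encoding can be sketched in plain Python. The helper name `one_hot_encode` and the colour labels are made up for this example; libraries such as pandas offer ready-made equivalents.

```python
def one_hot_encode(labels):
    """Return (categories, rows) with one 0.0/1.0 column per distinct label."""
    categories = sorted(set(labels))  # fixed column order
    rows = [
        [1.0 if label == category else 0.0 for category in categories]
        for label in labels
    ]
    return categories, rows


colors = ["red", "green", "red", "blue"]
categories, rows = one_hot_encode(colors)
print(categories)
for row in rows:
    print(row)
```

Every row contains exactly one 1.0, and the number of columns equals the number of distinct labels, which is where the dimensional explosion comes from.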

Question 3

Q

What is semi-numerical data?

Answer

A

Semi-numerical data is data
that looks numerical, but
should not be treated as such
according to domain
knowledge.

Question 4

Q

What are the different types of problems?

Answer

A

Problem types
a) Classification: In a classification problem you put the correct
label from a finite set of labels on a datapoint. Decision problems are often binary classification.
b) Regression (‘value prediction’)
c) Clustering
d) Decision
e) Recommendation: In recommendation, you have a large number of items and must select a few top results

Question 5

Q

How is the accuracy of a classification algorithm defined?

Answer

A

The accuracy of a classification algorithm is a percentage, defined as
#<correct> / #<answers>
Answers can be incorrect in two ways:
● False negative: algorithm says 'no', actual answer is 'yes'
● False positive: algorithm says 'yes', actual answer is 'no'
Similarly, there are two types of correct answers:
● True negative: algorithm says 'no', actual answer is 'no'
● True positive: algorithm says 'yes', actual answer is 'yes'
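
The four outcomes and the accuracy formula can be sketched in Python as follows. The function names and the small example data are made up for illustration; `True` stands for 'yes' and `False` for 'no'.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives for yes/no answers."""
    tp = fp = tn = fn = 0
    for said_yes, is_yes in zip(predicted, actual):
        if said_yes and is_yes:
            tp += 1  # true positive
        elif said_yes and not is_yes:
            fp += 1  # false positive
        elif not said_yes and is_yes:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Number of correct answers divided by the number of answers."""
    return (tp + tn) / (tp + fp + tn + fn)


predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]
tp, fp, tn, fn = confusion_counts(predicted, actual)
print(tp, fp, tn, fn, accuracy(tp, fp, tn, fn))
```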

Question 6

Q

When should you use precision, and when recall?

Answer

A

Precision is important if the cost of a false positive is high. This is for instance the case in hiring for
popular positions, or in the case of high-risk treatment in non-urgent cases.
* Recall is important if the cost of a false negative is high, e.g. medical screening.
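
The card does not spell out the formulas, so the sketch below uses the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts are invented for illustration.

```python
def precision(tp, fp):
    """Share of 'yes' answers from the algorithm that were actually 'yes'."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual 'yes' cases that the algorithm found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives
example_precision = precision(tp=8, fp=2)
example_recall = recall(tp=8, fn=8)
print(example_precision, example_recall)
```

In this example the algorithm is usually right when it says 'yes' (high precision) but misses half of the actual 'yes' cases (low recall), which would be a poor trade-off for medical screening.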

Question 7

Q

When to use an F1-score?

Answer

A

The F1-score combines precision and recall (it is their harmonic mean), so it is
only high when both false positives and false negatives are rare. It plays the same role as accuracy, but works much better for unbalanced
decision problems.
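
A small sketch of why this matters for unbalanced problems, using the standard definition of F1 as the harmonic mean of precision and recall, written here in terms of counts. The numbers are invented: 10 actual 'yes' cases among 1000, and a classifier that always says 'no'.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, expressed with counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def accuracy(tp, fp, tn, fn):
    """Number of correct answers divided by the number of answers."""
    return (tp + tn) / (tp + fp + tn + fn)


# Classifier that always answers 'no' on 990 negatives and 10 positives
tp, fp, tn, fn = 0, 0, 990, 10
always_no_accuracy = accuracy(tp, fp, tn, fn)
always_no_f1 = f1_score(tp, fp, fn)
print(always_no_accuracy, always_no_f1)
```

The accuracy looks excellent even though the classifier never finds a single positive case, while the F1-score exposes the problem.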

Question 8

Q

What is the difference between average, median and mode (modus)?

Answer

A

Average: Add all entries and divide by the number of entries. You
can do averages over subgroups and over both dimensions.
* Median: The middle value if you sort by value. The median is less
sensitive to outliers, such as extreme incomes. It is a better way to find the
most typical value.
* Mode (modus): The most common value. The mode is also defined for non-comparable entities, e.g. the most common favourite color or most common
password. There is no average or median password.
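
The three measures can be tried out with Python's standard `statistics` module. The income and colour values below are invented for illustration.

```python
import statistics

# Hypothetical incomes with one extreme outlier
incomes = [30, 32, 35, 38, 1000]
average_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# The mode also works for values that cannot be compared or averaged
favourite_colors = ["blue", "red", "blue", "green"]
most_common_color = statistics.mode(favourite_colors)

print(average_income, median_income, most_common_color)
```

The single outlier pulls the average far above every ordinary income, while the median stays at a typical value.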

(9 cards)