Lectures (week 1) Flashcards

1
Q

What are the 3 types of data?

A
  1. numerical
  2. ordinal: Ordinal data is comparable and can thus be used for
    sorting. It cannot be added or averaged.
    * The best way to use ordinal data is to convert it into
    numerical data. Using domain knowledge, you should choose
    a translation that preserves information.
  3. categorical: Categorical data uses labels to split the datapoints into distinct
    categories. Sometimes there are only a few labels (e.g. gender),
    sometimes there are many.
    While most labels can be sorted alphabetically, you should not
    treat categorical data as ordinal unless domain experts say that there is a
    meaningful order.
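
The translation of ordinal data into numerical data described in this card can be sketched as follows. This is only an illustration: the labels and the numbers assigned to them are hypothetical, and in practice the mapping would come from domain knowledge.

```python
# Hypothetical ordinal scale; the numeric values are an assumption that
# a domain expert would normally supply.
SIZE_TO_NUMBER = {"small": 1.0, "medium": 2.0, "large": 3.0}


def encode_ordinal(values, mapping):
    """Translate ordinal labels into numbers using a fixed mapping."""
    return [mapping[value] for value in values]


sizes = ["medium", "small", "large", "small"]
encoded = encode_ordinal(sizes, SIZE_TO_NUMBER)

# Sorting by the encoded value follows the meaningful order of the labels,
# not their alphabetical order.
sorted_sizes = sorted(sizes, key=SIZE_TO_NUMBER.get)
```

Sorting alphabetically would put "large" first, which is why the mapping, not the label text, has to carry the order.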
2
Q

How do you translate categorical data into numerical data using one-hot (dummy) encoding?

A

One-hot encoding (also called dummy encoding)
is the best way to convert categorical
data to numerical data while keeping all
information.
* It introduces a new column for each
distinct label. The column contains 1.0 if
the item falls into the category and 0.0
otherwise.
* It leads to a dimensional explosion when
there are many labels, so in some cases you
should consider clustering/grouping the
labels before encoding.
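
A minimal sketch of the encoding described above, written in plain Python. The function name `one_hot` and the example labels are assumptions made for illustration; libraries offer ready-made versions of the same idea.

```python
def one_hot(labels):
    """Return (columns, rows) with one 0.0/1.0 column per distinct label."""
    columns = sorted(set(labels))  # one new column for each distinct label
    rows = [
        [1.0 if label == column else 0.0 for column in columns]
        for label in labels
    ]
    return columns, rows


colors = ["red", "green", "red", "blue"]  # hypothetical categorical data
columns, rows = one_hot(colors)
```

Each row contains exactly one 1.0, and the number of columns grows with the number of distinct labels, which is the dimensional explosion mentioned above.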

3
Q

What is semi-numerical data?

A

Semi-numerical data is data
that looks numerical but,
according to domain
knowledge, should not be
treated as such.

4
Q

What are the different types of problems?

A

Problem types
a) Classification: In a classification problem you put the correct
label from a finite set of labels on a datapoint.
b) Regression (‘value prediction’)
c) Clustering
d) Decision: Decision problems are often binary classification.
e) Recommendation: In recommendation, you have a large number of items and must select a few top results.

5
Q

How is the accuracy of a classification algorithm defined?

A

The accuracy of a classification algorithm is a percentage, defined as
#correct / #answers
Answers can be incorrect in two ways:
● False negative: algorithm says ‘no’, actual answer is ‘yes’
● False positive: algorithm says ‘yes’, actual answer is ‘no’
Similarly, there are two types of correct answers:
● True negative: algorithm says ‘no’, actual answer is ‘no’
● True positive: algorithm says ‘yes’, actual answer is ‘yes’
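
The four outcomes and the accuracy formula can be sketched as below. The function names and the example predictions are made up for illustration; True stands for ‘yes’ and False for ‘no’.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives for yes/no answers."""
    tp = fp = tn = fn = 0
    for said_yes, is_yes in zip(predicted, actual):
        if said_yes and is_yes:
            tp += 1  # true positive: says yes, answer is yes
        elif said_yes and not is_yes:
            fp += 1  # false positive: says yes, answer is no
        elif not said_yes and is_yes:
            fn += 1  # false negative: says no, answer is yes
        else:
            tn += 1  # true negative: says no, answer is no
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Accuracy = number of correct answers / number of answers."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical predictions and ground truth.
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]

tp, fp, tn, fn = confusion_counts(predicted, actual)
acc = accuracy(tp, fp, tn, fn)  # 4 correct out of 6 answers
```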

6
Q

When to use precision and when to use recall?

A

* Precision is important if the cost of a false positive is high. This is, for instance, the case in hiring for
popular positions, or in the case of high-risk treatment in non-urgent cases.
* Recall is important if the cost of a false negative is high, e.g. in medical screening.
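
For reference, the standard definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below uses hypothetical screening counts; returning 0.0 when the denominator is zero is a convention assumed here, not something the lecture specifies.

```python
def precision(tp, fp):
    """Share of 'yes' answers that were actually yes."""
    return tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 fallback is an assumption


def recall(tp, fn):
    """Share of actual yes cases that the algorithm found."""
    return tp / (tp + fn) if (tp + fn) else 0.0  # 0.0 fallback is an assumption


# Hypothetical screening test: it flags many people, so few cases are missed.
tp, fp, fn = 8, 12, 2
p = precision(tp, fp)  # 8 of the 20 flagged people are really positive
r = recall(tp, fn)     # 8 of the 10 real positives are found
```

A result like this (low precision, high recall) is acceptable when a missed case is much more costly than a false alarm.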

7
Q

When to use an F1-score?

A

The F1-score combines precision and recall (it is their harmonic mean), so it is
only high when both are high, i.e. when both types of errors are kept low.
It is like accuracy, but works much better for unbalanced
decision problems.
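
The difference from accuracy on an unbalanced problem can be illustrated with a small sketch. The counts are invented: 100 cases, of which only 5 are positive. The formula F1 = 2·TP / (2·TP + FP + FN) is the harmonic mean of precision and recall written in terms of counts; returning 0.0 for an empty denominator is an assumed convention.

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, expressed in counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Classifier A always says 'no': it misses all 5 positives.
acc_a = accuracy(tp=0, fp=0, tn=95, fn=5)
f1_a = f1_score(tp=0, fp=0, fn=5)

# Classifier B finds 4 of the 5 positives at the cost of 6 false alarms.
acc_b = accuracy(tp=4, fp=6, tn=89, fn=1)
f1_b = f1_score(tp=4, fp=6, fn=1)
```

Classifier A has the higher accuracy even though it never finds a positive case, while the F1-score ranks classifier B higher.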

8
Q

What is the difference between average, median and mode (modus)?

A

* Average: Add all entries and divide by the number of entries added. You
can take averages over subgroups and over both dimensions (e.g. the rows and columns of a table).
* Median: The middle value when the entries are sorted by value. The median is less
sensitive to outliers, such as extreme incomes. It is a better way to find the
most typical value.
* Mode (modus): The most common value. The mode is also defined for non-comparable entities, e.g. the most common favourite color or the most common
password. There is no average or median password.
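
The three measures can be compared with Python's standard `statistics` module. The income and color lists are made-up examples chosen to show the effect of one outlier.

```python
from statistics import mean, median, mode

# Hypothetical incomes with one extreme outlier.
incomes = [20, 22, 25, 27, 1000]
average_income = mean(incomes)   # pulled far upwards by the outlier
median_income = median(incomes)  # middle value after sorting

# The mode also works for labels that cannot be added or sorted meaningfully.
favourite_colors = ["blue", "red", "blue", "green"]
most_common_color = mode(favourite_colors)
```

Here the average is far above what four of the five people earn, while the median stays at a typical value.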
