Lecture 2 - Data Understanding I Flashcards

1
Q

What are the goals of data understanding?

A
  • Gain general insight
  • Check the assumptions
  • Check the specific domain knowledge
  • Suitability of data for goals
2
Q

What are the rows and columns called?

A

Rows: Instances, records, samples, data objects
Columns: Features, attributes, variables

3
Q

What are the types of attributes?

A
  • Categorical (nominal): Classes or categories, e.g. {Finnish, Swedish}
  • Ordinal: Categories with a linear order, e.g. {BSc, MSc, PhD}
  • Numerical: Values are numbers
  • Discrete: Categorical attribute, or numerical attribute whose domain is a subset of the integers
  • Interval scale: The definition of the value 0 is arbitrary, so ratios are meaningless (e.g. Celsius)
  • Ratio scale: 0 has a canonical meaning, so ratios make sense (e.g. distance)
  • Absolute scale: e.g. number of people
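
A small worked example of why ratios are meaningless on an interval scale but fine on a ratio scale. The temperatures and distances are made-up illustration values.

```python
# Interval scale: Celsius has an arbitrary zero, so a ratio changes with the unit.
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

a_c, b_c = 10.0, 20.0
ratio_celsius = b_c / a_c  # 2.0, but "twice as warm" is not a real statement
ratio_fahrenheit = celsius_to_fahrenheit(b_c) / celsius_to_fahrenheit(a_c)  # 68 / 50

# Ratio scale: distance has a true zero, so the ratio is the same in any unit.
km_a, km_b = 5.0, 10.0
ratio_km = km_b / km_a
ratio_m = (km_b * 1000) / (km_a * 1000)

print(ratio_celsius, ratio_fahrenheit)  # the two ratios disagree
print(ratio_km, ratio_m)                # the two ratios agree
```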
4
Q

What does the level of granularity mean?

A

How detailed the information is, for example:
Drinks -> water -> producer -> size 0.33 l
A finer level of detail carries more information, but might not help discover general associations (like wine and cheese being bought together).

5
Q

What can be a problem for categorical attributes?

A

Dynamic domains: the set of possible values, such as product names, can change over time (for example when a new product goes on sale).

6
Q

What is a categorical variable with two possible values?

A

Binary attribute

7
Q

Does the coding order of a binary attribute matter?

A

No, the coding can be either 1 = FALSE, 0 = TRUE or 1 = TRUE, 0 = FALSE.

8
Q

How can you encode categorical attributes? Example of {Finland, Sweden, Denmark, Norway}.

A

One-hot encoding.
The values can’t be coded as 1, 2, 3, 4 because there is no order.
Transform each value into a binary variable:
isFIN = 0/1, isSWE = 0/1 etc.
A categorical feature with k possible values can be transformed into k binary features:
Fin = 1000, Swe = 0100, Den = 0010, Nor = 0001
Then people like Björn, who is both Finnish and Swedish, can be coded as 1100.
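
A minimal sketch of this encoding in plain Python. The function names `one_hot` and `multi_hot` are hypothetical helpers for illustration, not part of the lecture.

```python
COUNTRIES = ["Finland", "Sweden", "Denmark", "Norway"]

def one_hot(value, categories):
    """One 0/1 flag per category, with exactly one flag set."""
    if value not in categories:
        raise ValueError(f"{value!r} is not in the attribute's domain")
    return [1 if value == c else 0 for c in categories]

def multi_hot(values, categories):
    """Same layout, but several flags may be set (e.g. dual nationality)."""
    return [1 if c in values else 0 for c in categories]

print(one_hot("Sweden", COUNTRIES))                  # [0, 1, 0, 0]
print(multi_hot({"Finland", "Sweden"}, COUNTRIES))   # [1, 1, 0, 0]
```

Strictly speaking, a code like 1100 is no longer "one-hot", because more than one flag is set. Separate binary variables allow this, whereas a single 1-4 code would not.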

9
Q

How can you encode ordinal attributes?

A

There is an order, so {BSc, MSc, PhD} can be coded as {1, 2, 3}.
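
A short sketch of such a mapping. The names `DEGREE_ORDER` and `encode_degree` are made up for this example.

```python
# The list position defines the order; ranks start at 1 as on the card.
DEGREE_ORDER = ["BSc", "MSc", "PhD"]
DEGREE_RANK = {degree: rank for rank, degree in enumerate(DEGREE_ORDER, start=1)}

def encode_degree(degree):
    return DEGREE_RANK[degree]

print([encode_degree(d) for d in ["PhD", "BSc", "MSc"]])  # [3, 1, 2]
# Comparisons are meaningful, but differences are not:
# nothing says the step BSc -> MSc equals the step MSc -> PhD.
print(encode_degree("BSc") < encode_degree("PhD"))  # True
```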

10
Q

What is syntactic accuracy?

A

Syntactic accuracy means that an entry belongs to the domain of the attribute. For example, the entry “fmale” for sex violates syntactic accuracy. A text entry “one” in place of the number 1 in numerical data also violates it.
Can be checked quite easily.
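
A sketch of how such a check might look. The domain `{"female", "male"}` and the function names are assumptions for illustration; a real dataset defines its own domains.

```python
SEX_DOMAIN = {"female", "male"}  # assumed domain of the attribute

def is_valid_sex(entry):
    """Syntactic check: is the entry one of the allowed values?"""
    return entry in SEX_DOMAIN

def is_valid_number(entry):
    """Syntactic check: can the entry be parsed as a number?"""
    try:
        float(entry)
    except (TypeError, ValueError):
        return False
    return True

print(is_valid_sex("fmale"))    # False
print(is_valid_number("1"))     # True
print(is_valid_number("one"))   # False
```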

11
Q

What is semantic accuracy?

A

Semantic accuracy means that an entry is actually correct, not just that it belongs to the domain of the attribute. For example, PhD and 10 are each valid entries for degree and age, but a 10-year-old cannot realistically hold a PhD.
Hard to check.
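
Some semantic errors can be caught with cross-attribute plausibility rules. A minimal sketch follows; the age thresholds are hypothetical, and such rules only flag implausible combinations, they cannot confirm that an entry is correct.

```python
# Hypothetical minimum plausible ages per degree (illustration only).
MIN_PLAUSIBLE_AGE = {"BSc": 18, "MSc": 20, "PhD": 22}

def is_plausible(record):
    """Flag records whose degree and age contradict each other."""
    min_age = MIN_PLAUSIBLE_AGE.get(record["degree"])
    return min_age is None or record["age"] >= min_age

print(is_plausible({"degree": "PhD", "age": 10}))  # False
print(is_plausible({"degree": "PhD", "age": 35}))  # True
```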

12
Q

What does completeness of the data mean?

A

It means that there are no missing entries, for example null entries.
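
Completeness can be measured per attribute as the fraction of missing entries. A minimal sketch, assuming records are dictionaries and a missing entry is stored as `None` (both are assumptions for this example):

```python
def missing_fraction(records, attribute):
    """Share of records in which the attribute is missing (None or absent)."""
    missing = sum(1 for r in records if r.get(attribute) is None)
    return missing / len(records)

records = [
    {"age": 31, "income": 42000},
    {"age": None, "income": 38000},
    {"age": 45, "income": None},
    {"age": 27},  # income not recorded at all
]
print(missing_fraction(records, "age"))     # 0.25
print(missing_fraction(records, "income"))  # 0.5
```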

13
Q

What is unbiased and representative data?

A

The data reflects all the inherent patterns and rules of the process or population it describes.

14
Q

Why is it sometimes hard to get unbiased and representative data?

A

Machine monitoring: a nuclear power plant runs normally almost all the time, and you can’t deliberately make it run badly to collect data.
Natural disasters: how do you predict an earthquake in a place where there have been no earthquakes?
Mortgage/insurance: certain types of customers are missing from the data.

15
Q

What causes missing values?

A
  • Broken sensors
  • Refusal to answer
  • Test not done for the person
  • Data combined from databases that record different variables
  • Irrelevant attribute (e.g. fuel consumption for a Tesla)
16
Q

What are the types of missing values?

A

MCAR (missing completely at random, also called OAR, observed at random), MAR (missing at random) and nonignorable.

17
Q

What does MCAR mean?

A

Missing completely at random. The probability that a value for X is missing does not depend on the true value of X or on other variables.

For example, a sensor’s battery runs out and is replaced at random times.

18
Q

What is the probability of a value being missing at each instance under MCAR?

A

It is the same for every instance, so the missing values follow the same distribution as the observed values.

19
Q

What methods can be used for MCAR values?

A

Imputing or ignoring the missing values will not bias the analysis, but MCAR is often not a realistic assumption.

20
Q

What is MAR?

A

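Missing at random. The probability that a value for X is missing depends on another attribute Y, but not on the true value of X itself. For example, a sensor’s battery runs out and is only replaced at specific times, so whether a value is missing depends on the time of day. Or the likelihood of reporting income depends on another recorded attribute, not on the income itself (if it depended on the income itself, the values would be nonignorable).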

21
Q

What distribution do MAR values follow?

A

Not the same as the distribution for the observed values, so the analysis can be biased.

22
Q

What methods can be used for MAR values?

A

The missing values can be estimated with the help of Y, for example using the time of day when values from the night are missing. The problem: can we be sure that Y provides all the required information?

23
Q

What is a nonignorable missing value?

A

The probability that a value for X is missing depends on the true value of X.

For example, a temperature sensor doesn’t work below zero, or a person with an income over 1M€ does not report it.

24
Q

Can nonignorable values be estimated?

A

No. For example, it is impossible to make any statements about temperatures below 0 if the sensor never works below 0.
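
A small simulation of why MCAR values can be ignored safely while nonignorable ones cannot. The temperature distribution, the 30% missing rate and the seed are arbitrary assumptions for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)
true_temps = [rng.gauss(0, 10) for _ in range(10_000)]

# MCAR: every value has the same 30% chance of being lost.
observed_mcar = [t for t in true_temps if rng.random() >= 0.3]

# Nonignorable: the sensor never reports temperatures below 0.
observed_nonignorable = [t for t in true_temps if t >= 0]

true_mean = mean(true_temps)
mcar_mean = mean(observed_mcar)
nonignorable_mean = mean(observed_nonignorable)

print(f"true mean:          {true_mean:.2f}")
print(f"MCAR observed mean: {mcar_mean:.2f}")          # close to the true mean
print(f"nonignorable mean:  {nonignorable_mean:.2f}")  # clearly too high
```

Under MCAR the observed values are a random subsample, so their mean stays close to the true mean. In the nonignorable case the observed data contains nothing below 0, so no estimate built from it can recover the missing part.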

25
Q

What strategy can be applied if domain knowledge doesn’t tell us what kind of missing values to expect?

A
  1. Turn the considered attribute X into a binary attribute (all measured values to YES and all missing values to NO)
  2. Build a classifier with the binary attribute X as the target, using the other attributes to predict YES/NO
  3. Determine the misclassification rate.
    - MCAR: the other attributes should not provide any information, so the misclassification rate should not differ from pure guessing.
    - If the misclassification rate is much better than pure guessing, the other attributes carry information about whether X is missing, so it is not MCAR. However, nonignorable missingness cannot be told apart from MCAR or MAR this way.
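
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the data is simulated, the attribute names `temp` and `period` are made up, and the "classifier" is a deliberately simple majority vote per value of one other attribute, standing in for whatever real classifier would be used.

```python
import random

def missingness_errors(records, target, predictor):
    """Return (baseline_error, classifier_error) for predicting measured/missing."""
    # Step 1: binary attribute, True = measured (YES), False = missing (NO).
    pairs = [(r[predictor], r[target] is not None) for r in records]
    train, test = pairs[0::2], pairs[1::2]  # simple holdout split

    # Step 2: majority label per predictor value, learned on the training half.
    counts = {}
    for g, y in train:
        yes, total = counts.get(g, (0, 0))
        counts[g] = (yes + int(y), total + 1)
    majority = 2 * sum(y for _, y in train) >= len(train)  # "pure guessing"
    model = {g: 2 * yes >= total for g, (yes, total) in counts.items()}

    # Step 3: misclassification rates on the held-out half.
    baseline_error = sum(y != majority for _, y in test) / len(test)
    classifier_error = sum(y != model.get(g, majority) for g, y in test) / len(test)
    return baseline_error, classifier_error

def simulate(p_missing_day, p_missing_night, n=4000, seed=0):
    """Temperature readings that go missing with a period-dependent probability."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        period = rng.choice(["day", "night"])
        p = p_missing_day if period == "day" else p_missing_night
        temp = None if rng.random() < p else rng.gauss(10, 5)
        records.append({"period": period, "temp": temp})
    return records

# MCAR-like data: missingness ignores the period, so the classifier gains nothing.
mcar_base, mcar_clf = missingness_errors(simulate(0.3, 0.3), "temp", "period")
# MAR-like data: values go missing mostly at night, so the classifier wins clearly.
mar_base, mar_clf = missingness_errors(simulate(0.1, 0.9), "temp", "period")
print(mcar_base, mcar_clf)
print(mar_base, mar_clf)
```

"Pure guessing" is taken here to mean always predicting the more frequent class, which is one common choice of baseline.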
26
Q

Why is data visualisation important for data understanding and data quality evaluation?

A

Visualisations reveal patterns or exceptions that indicate there is “something” in the dataset.

27
Q

What causes outliers?

A
  • Data quality problems, like erroneous data from wrong measurements
  • Exceptional or unusual situations
28
Q

What should be done to outliers?

A

Erroneous outliers should be excluded from the analysis. Correct outliers (i.e. exceptional situations) are sometimes also worth excluding.

29
Q

Can outliers be useful?

A

Yes, for example “whales” in mobile games.