Lecture 2 - Data Understanding I Flashcards by Santeri Isotalo

What are the goals of data understanding?

Gain general insight
Check the assumptions
Check the specific domain knowledge
Assess the suitability of the data for the goals

What are the rows and columns called?

Rows: Instances, records, samples, data objects
Columns: Features, attributes, variables

What are the types of attributes?

Categorical (nominal): Classes or categories, e.g. {Finnish, Swedish}
Ordinal: Categories with a linear order, e.g. {BSc, MSc, PhD}
Numerical: Values are numbers
Discrete: A categorical attribute, or a numerical attribute whose domain is a subset of the integers
Interval scale: The value 0 is defined arbitrarily, so ratios are meaningless (e.g. temperature in Celsius)
Ratio scale: 0 has a canonical meaning, so ratios make sense (e.g. distance)
Absolute scale: Counts with a natural unit (e.g. the number of people)

What does the level of granularity mean?

How detailed the recorded information is, for example:
Drinks -> water -> producer -> size 0.33 l
Finer granularity carries more information, but may hide general associations (such as wine and cheese being bought together)

What can be a problem for categorical attributes?

Dynamic domains: the possible values, like product names or the sales figures for a new product, can change over time

What is a categorical variable with two possible values?

Binary attribute

Does the order of a binary attribute's values matter?

No, the coding can be 1 = TRUE, 0 = FALSE or 1 = FALSE, 0 = TRUE, as long as it is used consistently

How can you encode categorical attributes? Example of {Finland, Sweden, Denmark, Norway}.

One-hot encoding.
The values can't be coded as 1, 2, 3, 4, because there is no order.
Transform each value into its own binary variable.
isFIN = 0/1, isSWE = 0/1 etc.
A categorical feature with k possible values is transformed into k binary features:
Fin = 1000, Swe = 0100, Den = 0010, Norway = 0001
A person like Björn, who is both Finnish and Swedish, can then be 1100 (strictly speaking a multi-hot rather than a one-hot vector)
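
The encoding above can be sketched in a few lines of plain Python. This is only an illustration: the function names one_hot and multi_hot and the fixed category order are assumptions made for the example, not something from the lecture.

```python
# Category order is an assumption; it fixes which position means which country.
COUNTRIES = ["Finland", "Sweden", "Denmark", "Norway"]


def one_hot(value, categories=COUNTRIES):
    """Return one 0/1 indicator per category, in the order of `categories`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def multi_hot(values, categories=COUNTRIES):
    """Like one_hot, but several categories may be set at once."""
    unknown = set(values) - set(categories)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in values else 0 for category in categories]


print(one_hot("Sweden"))                 # [0, 1, 0, 0]
print(multi_hot({"Finland", "Sweden"}))  # [1, 1, 0, 0]
```

Raising an error for unknown values is one possible design choice; it makes a changed domain visible instead of silently producing an all-zero vector.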

How can you encode ordinal attributes?

There is an order, so the values can be mapped to integers: {BSc, MSc, PhD} = {1, 2, 3}.
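
A minimal sketch of such a mapping, assuming the degree names from the card; DEGREE_RANK and encode_degree are hypothetical names. The integers preserve the order, but the gaps between them (for example 3 - 2) do not necessarily mean anything.

```python
# Hypothetical lookup table: a higher number means a higher degree.
DEGREE_RANK = {"BSc": 1, "MSc": 2, "PhD": 3}


def encode_degree(degree):
    """Map an ordinal degree label to its integer rank."""
    return DEGREE_RANK[degree]


degrees = ["MSc", "BSc", "PhD"]
encoded = [encode_degree(d) for d in degrees]
print(encoded)                             # [2, 1, 3]
print(sorted(degrees, key=encode_degree))  # ['BSc', 'MSc', 'PhD']
```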

What is syntactic accuracy?

Syntactic accuracy means that the entry belongs to the domain of the attribute. For example, the entry "fmale" for sex violates syntactic accuracy. The text entry "one" in place of the number 1 in numerical data also violates it.
This can be checked quite easily.
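
As a rough illustration of why this check is easy, the sketch below compares entries against an allowed domain and tests whether a text entry parses as a number. The domain ALLOWED_SEX, the record layout and the function names are assumptions made for the example.

```python
# Assumed domain for the attribute; a real dataset defines its own.
ALLOWED_SEX = {"female", "male"}


def syntactic_violations(records, attribute, allowed):
    """Return (row index, entry) pairs whose entry lies outside the domain."""
    return [
        (index, record[attribute])
        for index, record in enumerate(records)
        if record[attribute] not in allowed
    ]


def is_numeric(entry):
    """True if the entry can be parsed as a number."""
    try:
        float(entry)
    except (TypeError, ValueError):
        return False
    return True


records = [{"sex": "female"}, {"sex": "fmale"}, {"sex": "male"}]
print(syntactic_violations(records, "sex", ALLOWED_SEX))  # [(1, 'fmale')]
print(is_numeric("1"), is_numeric("one"))                 # True False
```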

What is semantic accuracy?

Semantic accuracy means that the entry is actually correct, not merely a valid member of the attribute's domain. For example, "PhD" is a valid entry for studies and 10 is a valid entry for age, but it is not possible to hold a PhD at the age of 10.
This is hard to check.
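
One partial remedy is a hand-written plausibility rule across attributes. The sketch below encodes only the PhD/age example; the threshold MIN_PLAUSIBLE_PHD_AGE is a made-up value, and such rules can only flag suspicious records, not prove that the remaining ones are correct.

```python
# Hypothetical threshold chosen for illustration only.
MIN_PLAUSIBLE_PHD_AGE = 20


def semantically_suspicious(record):
    """Flag records that are syntactically valid but implausible together."""
    return record["degree"] == "PhD" and record["age"] < MIN_PLAUSIBLE_PHD_AGE


people = [
    {"degree": "PhD", "age": 10},
    {"degree": "PhD", "age": 34},
    {"degree": "BSc", "age": 10},
]
flags = [semantically_suspicious(person) for person in people]
print(flags)  # [True, False, False]
```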

What does completeness of the data mean?

It means that there are no missing entries, for example null entries.
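
A small sketch of a completeness check, assuming that missing entries are represented as None in a list of dictionaries; missing_counts is a hypothetical helper name.

```python
def missing_counts(records, attributes):
    """Count the missing (None or absent) entries for each attribute."""
    return {
        attribute: sum(1 for record in records if record.get(attribute) is None)
        for attribute in attributes
    }


records = [
    {"age": 31, "income": None},
    {"age": None, "income": 42000},
    {"age": 25, "income": None},
]
counts = missing_counts(records, ["age", "income"])
print(counts)                                   # {'age': 1, 'income': 2}
print(all(n == 0 for n in counts.values()))     # False: the data is incomplete
```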

What is unbiased and representative data?

The data contains all the information about the inherent patterns and rules of the process or population it describes.

Why is it sometimes hard to get unbiased and representative data?

Machine monitoring: a nuclear power plant runs normally almost all the time, and you can't deliberately make it run badly.
Natural disasters: how do you predict an earthquake in a place where there have been no earthquakes?
Mortgage/insurance: certain types of customers are missing from the data.

What causes missing values?

Broken sensors
Refusal to answer
A test that was not done for a person
Data combined from databases that record different variables
Attribute that does not apply (e.g. fuel consumption for a Tesla)

What are the types of missing values?

MCAR (Missing completely at random, aka OAR observed at random), MAR (Missing at random) and nonignorable

What does MCAR mean?

Missing completely at random. The probability that a value for X is missing does not depend on the true value of X or on other variables.

For example, the battery of a sensor runs out and is replaced at random times.

What is the probability of a value being missing at each instance under MCAR?

It is the same for every instance, so the missing values follow the same distribution as the observed values.

What methods can be used for MCAR values?

Imputing the missing values or ignoring them will not bias the analysis, but MCAR is often not a realistic assumption.
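
The two options can be sketched as follows, assuming missing readings are stored as None. Mean imputation is used here only as one simple imputation method, not necessarily the one meant in the lecture, and the helper names are hypothetical. Note that mean imputation keeps the mean unchanged but makes the data look less spread out than it really is.

```python
from statistics import mean


def drop_missing(values):
    """Ignore missing entries: keep only the observed values."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Replace each missing entry with the mean of the observed values."""
    observed_mean = mean(drop_missing(values))
    return [observed_mean if v is None else v for v in values]


readings = [20.0, None, 22.0, 21.0, None]
print(drop_missing(readings))  # [20.0, 22.0, 21.0]
print(impute_mean(readings))   # [20.0, 21.0, 22.0, 21.0, 21.0]
```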

What is MAR?

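Missing at random. The probability that a value for X is missing depends on another attribute Y, but not on the true value of X itself. For example, the battery of a sensor runs out and is only replaced at specific times, so the chance of a missing reading depends on the time. The example of higher-earning people being less likely to report their income only counts as MAR if the non-response depends on another observed attribute (such as occupation); if it depends on the income itself, the value is nonignorable.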

What distribution do MAR values follow?

They do not follow the same distribution as the observed values, so the analysis can be biased.

What methods can be used for MAR values?

The missing values can be estimated with the help of Y, for example by taking the time of day into account when readings from the night are missing. The problem is: can we be sure that Y provides all the required information?

What is a nonignorable missing value?

The probability that a value for X is missing depends on the true value of X.

For example, a temperature sensor does not work below zero, or a person with an income over 1M€ does not report their income.

Can nonignorable values be estimated?

No, it is impossible to make any statements about temperatures below 0 if the sensor never works below 0.
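
The difference between MCAR and nonignorable missingness can be illustrated with a small simulation. Everything here is assumed for the sake of the example: temperatures are drawn from a normal distribution around 0, MCAR loses each reading with probability 0.3, and the nonignorable sensor never reports values below 0.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Simulated "true" temperatures (assumed distribution, for illustration).
true_temps = [random.gauss(0.0, 5.0) for _ in range(10_000)]

# MCAR: every reading has the same 30% chance of being lost.
mcar_observed = [t for t in true_temps if random.random() >= 0.3]

# Nonignorable: the sensor never reports readings below zero.
nonignorable_observed = [t for t in true_temps if t >= 0.0]

true_mean = mean(true_temps)
mcar_mean = mean(mcar_observed)
nonignorable_mean = mean(nonignorable_observed)

print(f"true mean:         {true_mean:.2f}")
print(f"MCAR mean:         {mcar_mean:.2f}")
print(f"nonignorable mean: {nonignorable_mean:.2f}")
```

The mean of the MCAR sample stays close to the true mean, while the nonignorable sample is shifted upwards and contains no information at all about readings below 0.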

What strategy can be applied, if the domain knowledge doesn't tell the kind of missing values we can expect?

1. Turn the considered attribute X into a binary attribute (all measured values become YES, all missing values become NO).
2. Build a classifier with the binary attribute X as the target, using the other attributes to predict YES/NO.
3. Determine the misclassification rate.
- MCAR: the other attributes should not provide any information, so the misclassification rate should not differ from pure guessing.
- If the misclassification rate is much lower than for pure guessing, the other attributes affect whether X is missing, so the values are not MCAR.
- Nonignorable values cannot be distinguished from MCAR or MAR this way.
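
The three steps can be sketched with a very simple classifier. The one below predicts the majority YES/NO label for each value of a single other attribute, which is a stand-in chosen for brevity and not a classifier prescribed by the lecture. The attribute names, the toy data and the function name are assumptions, and the error is measured on the same rows used to build the rule; a real analysis would use held-out data.

```python
from collections import Counter, defaultdict


def missingness_error_rates(rows, target, predictor):
    """Return (guessing error, classifier error) for predicting missingness."""
    # Step 1: binary target, True = measured (YES), False = missing (NO).
    labels = [row[target] is not None for row in rows]

    # Pure guessing: always predict the more common label.
    majority = Counter(labels).most_common(1)[0][0]
    guessing_error = sum(label != majority for label in labels) / len(labels)

    # Step 2: one rule per predictor value, the majority label in that group.
    by_value = defaultdict(Counter)
    for row, label in zip(rows, labels):
        by_value[row[predictor]][label] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in by_value.items()}

    # Step 3: misclassification rate of the rule.
    classifier_error = sum(
        rule[row[predictor]] != label for row, label in zip(rows, labels)
    ) / len(labels)
    return guessing_error, classifier_error


# Toy data: readings are mostly missing at night.
rows = (
    [{"period": "day", "temp": 20.0}] * 10
    + [{"period": "night", "temp": None}] * 8
    + [{"period": "night", "temp": 12.0}] * 2
)
guessing_error, classifier_error = missingness_error_rates(rows, "temp", "period")
print(guessing_error, classifier_error)  # 0.4 0.1
```

Here the rule based on the time period does much better than guessing, which suggests that the missing temperatures in this toy data are not MCAR.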

Why is data visualisation important for data understanding and data quality evaluation?

Visualisations reveal patterns or exceptions, showing that there is "something" in the dataset

What causes outliers?

- Data quality problems, like erroneous data from wrong measurements
- Exceptional or unusual situations

What should be done to outliers?

Erroneous outliers should be excluded from the analysis. Correct outliers (i.e. exceptional situations) are sometimes also worth excluding.

Can outliers be useful?

Yes, for example "whales" in mobile games (the few players who spend far more than everyone else).

(29 cards)