Lecture 2 - Data Understanding I Flashcards
What are the goals of data understanding?
- Gain general insight
- Check the assumptions
- Check the specific domain knowledge
- Suitability of data for goals
What are the rows and columns called?
Rows: Instances, records, samples, data objects
Columns: Features, attributes, variables
What are the types of attributes?
- Categorical (nominal): Classes or categories {Finnish, Swedish}
- Ordinal: Linear order {BSc, MSc, PhD}
- Numerical: Values are numbers
- Discrete: Categorical attribute, or numerical attribute whose domain is a subset of the integers
- Interval scale: the definition of value 0 is arbitrary, ratios are meaningless (like Celsius)
- Ratio scale: 0 has a canonical meaning, ratios make sense (distance)
- Absolute scale (for example number of people)
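A small sketch of why ratios are meaningless on an interval scale but meaningful on a ratio scale. The two temperatures are made-up illustration values; Kelvin is used here as the ratio-scale counterpart of Celsius.

```python
# Interval vs. ratio scale: the same two temperatures in Celsius and Kelvin.
# The values 10 and 20 are illustrative only.
def celsius_to_kelvin(c: float) -> float:
    return c + 273.15

a_c, b_c = 10.0, 20.0

# On the interval scale the ratio is 2.0, but "twice as hot" is not a valid reading,
# because the zero point of Celsius is arbitrary.
ratio_celsius = b_c / a_c

# On the ratio scale (Kelvin) the same pair gives a ratio of roughly 1.035.
ratio_kelvin = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

# Differences are meaningful on both scales and agree.
diff_celsius = b_c - a_c
diff_kelvin = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)
```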
What does the level of granularity mean?
How detailed the recorded information is, for example:
Drinks -> water -> producer -> size 0.33l
More detailed data carries more information, but it might not help to discover general associations (like wine and cheese bought together)
What can be a problem for categorical attributes?
Dynamic domains: the set of possible values can change over time, for example when new product names appear or a new product starts to sell
What is a categorical variable with two possible values?
Binary attribute
Does the order of a binary attribute's values matter?
No, the coding is arbitrary: it can be 1 = FALSE, 0 = TRUE or 1 = TRUE, 0 = FALSE
How can you encode categorical attributes? Example of {Finland, Sweden, Denmark, Norway}.
One-hot encoding.
The values cannot be coded as 1, 2, 3, 4, because there is no order.
Transform each value into its own binary variable:
isFIN = 0/1, isSWE = 0/1 etc.
A categorical feature with k possible values is transformed into k binary features:
Fin = 1000, Swe = 0100, Den = 0010, Nor = 0001
A person like Björn, who is both Finnish and Swedish, can then be 1100 (strictly speaking a multi-hot rather than a one-hot code)
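A minimal sketch of the encoding in plain Python. The helper names `one_hot` and `multi_hot` are hypothetical, chosen only for this illustration.

```python
COUNTRIES = ["Finland", "Sweden", "Denmark", "Norway"]

def one_hot(value, domain):
    """Return one 0/1 flag per value in `domain`; exactly one flag is 1."""
    if value not in domain:
        raise ValueError(f"{value!r} is not in the domain {domain}")
    return [1 if value == d else 0 for d in domain]

def multi_hot(values, domain):
    """Like one_hot, but several flags may be 1 (e.g. dual nationality)."""
    return [1 if d in values else 0 for d in domain]

fin = one_hot("Finland", COUNTRIES)                     # [1, 0, 0, 0]
nor = one_hot("Norway", COUNTRIES)                      # [0, 0, 0, 1]
fin_swe = multi_hot({"Finland", "Sweden"}, COUNTRIES)   # [1, 1, 0, 0]
```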
How can you encode ordinal attributes?
The values have an order, so they can be mapped to integers that preserve it: {BSc, MSc, PhD} = {1, 2, 3}.
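A short sketch of such an order-preserving mapping. The list of sample degrees is made up for illustration.

```python
# The order of this list defines the order of the ordinal attribute.
DEGREE_ORDER = ["BSc", "MSc", "PhD"]
DEGREE_RANK = {name: rank for rank, name in enumerate(DEGREE_ORDER, start=1)}

degrees = ["MSc", "BSc", "PhD", "MSc"]          # illustrative sample
encoded = [DEGREE_RANK[d] for d in degrees]     # [2, 1, 3, 2]
```

Comparisons such as `DEGREE_RANK["BSc"] < DEGREE_RANK["PhD"]` are meaningful after this encoding; differences between the codes are not necessarily so.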
What is syntactic accuracy?
It means that the entry belongs to the domain of the attribute. For example, the entry "fmale" for sex violates syntactic accuracy. The text entry "one" in place of the number 1 in numerical data also violates syntactic accuracy.
It can be checked quite easily.
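A minimal sketch of such a domain check. The domain `{"female", "male"}` and the function names are assumptions made for this illustration, not taken from the lecture.

```python
SEX_DOMAIN = {"female", "male"}  # assumed domain of the attribute

def is_valid_sex(entry: str) -> bool:
    """Syntactic check: the entry must be one of the allowed values."""
    return entry in SEX_DOMAIN

def is_valid_number(entry: str) -> bool:
    """Syntactic check: the entry must parse as a number."""
    try:
        float(entry)
    except ValueError:
        return False
    return True
```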
What is semantic accuracy?
Semantic accuracy means that the entry is actually correct, not merely that it belongs to the domain of the attribute. For example, PhD and 10 are valid values for degree and age, but it is not possible to hold a PhD at the age of 10.
Hard to check.
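A sketch of a cross-attribute plausibility rule. The minimum ages are hypothetical thresholds chosen for illustration. A rule like this can only flag implausible combinations; it cannot confirm that a plausible entry is true, which is why semantic accuracy stays hard to check.

```python
# Hypothetical minimum plausible age per degree (assumed values).
MIN_PLAUSIBLE_AGE = {"BSc": 18, "MSc": 20, "PhD": 22}

def is_plausible(record: dict) -> bool:
    """Flag records whose age is too low for the recorded degree."""
    min_age = MIN_PLAUSIBLE_AGE.get(record["degree"])
    return min_age is None or record["age"] >= min_age
```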
What does completeness of the data mean?
It means that there are no missing entries, for example null entries.
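A minimal sketch of measuring completeness per attribute. The records are toy data invented for this example, and missing entries are assumed to be represented as `None`.

```python
# Toy records; None marks a missing entry (an assumption of this sketch).
records = [
    {"age": 34, "income": 52000, "degree": "MSc"},
    {"age": None, "income": 48000, "degree": "BSc"},
    {"age": 51, "income": None, "degree": None},
]

def missing_fraction(rows, attribute):
    """Share of rows in which `attribute` is missing."""
    missing = sum(1 for row in rows if row.get(attribute) is None)
    return missing / len(rows)

# Completeness per attribute: 1.0 would mean no missing entries.
completeness = {attr: 1 - missing_fraction(records, attr) for attr in records[0]}
```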
What is unbiased and representative data?
The data contains all the information about the patterns and rules inherent in what it is supposed to describe.
Why is it sometimes hard to get unbiased and representative data?
Machine monitoring: a nuclear power plant runs normally almost all the time, and you can't deliberately make it run badly.
Natural disasters: How do you predict an earthquake in a place where there have been no earthquakes?
Mortgage/insurance: Certain types of customers are missing from the data
What causes missing values?
- Broken sensors
- Refusal to answer
- A test was not performed for a person
- Data combined from databases that record different variables
- Irrelevant attribute (fuel consumption for a Tesla)
What are the types of missing values?
MCAR (missing completely at random, also called OAR, observed at random), MAR (missing at random), and nonignorable.
What does MCAR mean?
Missing completely at random. The probability that a value for X is missing does not depend on the true value of X or on other variables.
For example, the battery of a sensor runs out and is replaced at random times.
What is the probability of a value being missing at each instance under MCAR?
The same for every instance, so the missing values follow the same distribution as the observed ones.
What methods can be used for MCAR values?
Imputing the missing values or ignoring them will not bias the analysis, but MCAR is often not a realistic assumption.
What is MAR?
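Missing at random. The probability that a value for X is missing depends on another attribute Y, but not on the true value of X itself. For example, the battery of a sensor runs out and is replaced only at specific times. The income example (higher-earning people are less likely to report their income) fits MAR only if the missingness can be explained by an observed attribute Y; if it depends on the income itself, it is nonignorable.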
What distribution do MAR values follow?
Not the same as the distribution of the observed values, so the analysis can be biased.
What methods can be used for MAR values?
The missing values can be estimated with the help of Y, for example when values are missing depending on whether it is day or night. The problem is: can we be sure that Y provides all the required information?
What is a nonignorable missing value?
The probability that a value for X is missing depends on the true value of X.
For example, a temperature sensor doesn't work below zero, or a person with an income over 1M€ does not report their income.
Can nonignorable values be estimated?
No, it is impossible to make any statements about temperatures below 0 if the sensor never works below 0.
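A small simulation sketch of the three mechanisms. All numbers (day and night temperature distributions, missingness probabilities, the seed) are assumptions chosen for illustration. It compares the mean of the observed values with the true mean under each mechanism.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated readings: (is_night, temperature). Nights are colder than days.
readings = []
for _ in range(20000):
    is_night = rng.random() < 0.5
    temperature = rng.gauss(-2.0 if is_night else 6.0, 3.0)
    readings.append((is_night, temperature))

true_mean = mean(t for _, t in readings)

# MCAR: every reading is lost with the same probability.
mcar = [t for _, t in readings if rng.random() >= 0.3]

# MAR: readings are lost more often at night; missingness depends on Y only.
mar = [(night, t) for night, t in readings
       if rng.random() >= (0.6 if night else 0.1)]

# Nonignorable: the sensor never reports below zero; missingness depends on X.
nonignorable = [t for _, t in readings if t >= 0.0]

mcar_bias = mean(mcar) - true_mean                 # close to zero
mar_bias = mean(t for _, t in mar) - true_mean     # clearly positive
nonignorable_bias = mean(nonignorable) - true_mean # clearly positive

# Under MAR, Y is known for every instance, so weighting the per-group means
# by the true share of nights removes most of the bias.
night_share = sum(1 for night, _ in readings if night) / len(readings)
night_mean = mean(t for night, t in mar if night)
day_mean = mean(t for night, t in mar if not night)
mar_corrected_bias = (night_share * night_mean
                      + (1 - night_share) * day_mean) - true_mean
```

In this setup the MAR bias can be corrected only because Y (day or night) fully explains the missingness. No comparable correction exists for the nonignorable case, since nothing is ever observed below zero.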
What strategy can be applied if domain knowledge doesn't tell us what kind of missing values to expect?
- Turn the considered attribute X into a binary attribute (all measured values to YES and all missing values to NO)
- Build a classifier with the binary attribute X as the target and use the other attributes to predict YES/NO
- Determine the misclassification rate.
- MCAR: the other attributes should not provide any information, so the misclassification rate should not differ from pure guessing.
- If the misclassification rate is a lot better than pure guessing, then the other attributes affect whether X is missing, so it is not MCAR. But MCAR and MAR cannot be distinguished from nonignorable this way.
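The steps above can be sketched on simulated data. This is a rough illustration under stated assumptions. A single other attribute (`is_night`) stands in for "the other attributes", a per-value majority rule stands in for the classifier, and "pure guessing" is taken to mean always predicting the majority class. A real analysis would use a proper classifier on all other attributes.

```python
import random
from collections import Counter, defaultdict

def make_rows(n, p_missing_night, p_missing_day, rng):
    """Simulate (is_night, measured) pairs; measured is the YES/NO target."""
    rows = []
    for _ in range(n):
        is_night = rng.random() < 0.5
        p_missing = p_missing_night if is_night else p_missing_day
        rows.append((is_night, rng.random() >= p_missing))
    return rows

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def misclassification_rates(rows):
    """Return (baseline_error, classifier_error) on a held-out half."""
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]

    # Baseline ("pure guessing"): always predict the majority class.
    guess = majority(m for _, m in train)
    baseline_error = sum(m != guess for _, m in test) / len(test)

    # Simple classifier: majority class per value of the other attribute.
    by_value = defaultdict(list)
    for y, m in train:
        by_value[y].append(m)
    rule = {y: majority(ms) for y, ms in by_value.items()}
    classifier_error = sum(m != rule.get(y, guess) for y, m in test) / len(test)
    return baseline_error, classifier_error

rng = random.Random(1)
# MCAR-like data: missingness is the same by day and by night.
mcar_base, mcar_clf = misclassification_rates(make_rows(10000, 0.45, 0.45, rng))
# Non-MCAR data: values are missing far more often at night.
dep_base, dep_clf = misclassification_rates(make_rows(10000, 0.8, 0.1, rng))
```

With the MCAR-like data the classifier does no better than the baseline. With the night-dependent data its error drops well below the baseline, which rules out MCAR but, as noted above, says nothing about nonignorable missingness.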
Why is data visualisation important for data understanding and data quality evaluation?
Visualisations reveal patterns or exceptions and show that there is "something" in the dataset
What causes outliers?
- Data quality problems, like erroneous data from wrong measurements
- Exceptional or unusual situations
What should be done to outliers?
Erroneous outliers should be excluded from the analysis. Correct outliers (exceptional situations) are sometimes worth excluding as well.
Can outliers be useful?
Yes, for example “whales” in mobile games.
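The notes do not say how outliers are found. As one common heuristic (an assumption of this sketch, not the lecture's method), Tukey's 1.5 × IQR rule can flag candidates for inspection. The purchase amounts are made-up values.

```python
from statistics import quantiles

# Illustrative purchase amounts per player; 95 plays the role of a "whale".
purchases = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]

q1, _, q3 = quantiles(purchases, n=4)   # quartile cut points
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag values outside the fences; whether to exclude them is a separate decision.
outliers = [x for x in purchases if x < lower_fence or x > upper_fence]
```

Flagged values still need a look: an erroneous measurement should be dropped, while a correct but exceptional value may be exactly what the analysis is after.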