Lecture 2 - Data Understanding I Flashcards
What are the goals of data understanding?
- Gain general insight
- Check the assumptions
- Check the specific domain knowledge
- Suitability of data for goals
What are the rows and columns called?
Rows: Instances, records, samples, data objects
Columns: Features, attributes, variables
What are the types of attributes?
- Categorical (nominal): Classes or categories {Finnish, Swedish}
- Ordinal: Linear order {BSc, MSc, PhD}
- Numerical: Values are numbers
- Discrete: Categorical attribute or numerical attribute whose domain is a subset of the integers
- Interval scale: the definition of value 0 is arbitrary, ratios are meaningless (like Celsius)
- Ratio scale: 0 has a canonical meaning, ratios make sense (distance)
- Absolute scale (for example number of people)
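A small sketch of why ratios are meaningless on an interval scale but fine on a ratio scale. It uses Celsius (interval) against Kelvin (ratio, since 0 K is a true zero); the variable names and the two sample temperatures are made up for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


c_low, c_high = 10.0, 20.0

# Ratio on the interval scale: depends on where 0 °C happens to be placed.
celsius_ratio = c_high / c_low

# Ratio on the ratio scale: 20 °C is only slightly "more heat" than 10 °C.
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

# Differences, however, agree on both scales.
celsius_diff = c_high - c_low
kelvin_diff = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

print(celsius_ratio)           # 2.0
print(round(kelvin_ratio, 3))  # about 1.035
print(celsius_diff, round(kelvin_diff, 6))
```

So "20 °C is twice as warm as 10 °C" is not a valid statement, while "20 °C is 10 degrees warmer than 10 °C" is.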
What does the level of granularity mean?
How detailed the recorded values are, for example:
Drinks -> water -> producer -> size 0.33l
A more detailed level carries more information, but might not help discover general associations (like wine and cheese bought together)
What can be a problem for categorical attributes?
Dynamic domains: the possible values, like product names or sales figures for a new product, can change over time
What is a categorical variable with two possible values?
Binary attribute
Does the order of binary attribute matter?
No, it can be 1 = FALSE, 0 = TRUE or 1 = TRUE, 0 = FALSE
How can you encode categorical attributes? Example of {Finland, Sweden, Denmark, Norway}.
One hot encoding.
Can’t be 1,2,3,4 because there is no order.
Transform each into binary variable.
isFIN = 0/1, isSWE = 0/1 etc.
A categorical feature with k possible values can be transformed into k binary features:
Fin = 1000, Swe = 0100, Den = 0010, Norway = 0001
Then people like Björn can be 1100, because they are FinSwe (two bits set, so strictly no longer one-hot)
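A minimal sketch of one-hot encoding for the country example, in plain Python. The function name `one_hot` and the list `COUNTRIES` are made up for illustration; libraries such as pandas or scikit-learn offer their own encoders.

```python
COUNTRIES = ["Finland", "Sweden", "Denmark", "Norway"]


def one_hot(value, categories):
    """Return a list of 0/1 flags with a single 1 at the position of value."""
    if value not in categories:
        raise ValueError(f"{value!r} is not in the domain {categories}")
    return [1 if value == category else 0 for category in categories]


for country in COUNTRIES:
    print(country, one_hot(country, COUNTRIES))
```

Each of the k = 4 values becomes its own binary variable, so no artificial order is introduced.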
How can you encode ordinal attributes?
There is order, so {BSc, MSc, PhD} = {1,2,3}.
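A short sketch of the ordinal mapping, assuming the order BSc < MSc < PhD from the card. The names `DEGREE_ORDER` and `encode_degree` are hypothetical.

```python
DEGREE_ORDER = ["BSc", "MSc", "PhD"]  # listed from lowest to highest

# Map each degree to its rank, starting from 1.
DEGREE_RANK = {degree: rank for rank, degree in enumerate(DEGREE_ORDER, start=1)}


def encode_degree(degree):
    """Return the integer rank of a degree; raises KeyError if unknown."""
    return DEGREE_RANK[degree]


print([encode_degree(d) for d in ["PhD", "BSc", "MSc"]])  # [3, 1, 2]
```

The integers only preserve the order. The gap between 1 and 2 is not claimed to equal the gap between 2 and 3.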
What is syntactic accuracy?
It is whether the entry belongs to the domain of the attribute. For example, the entry fmale for sex violates syntactic accuracy. The text entry “one” in place of the number 1 also violates syntactic accuracy.
Can be checked quite easily
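A sketch of how such a check could look, assuming the domain of sex is the set {"female", "male"} (an assumption for this example) and that numerical entries must be parseable as numbers. The function names are made up.

```python
SEX_DOMAIN = {"female", "male"}  # assumed domain, for illustration only


def is_valid_sex(entry):
    """True if the entry is one of the allowed category values."""
    return entry in SEX_DOMAIN


def is_valid_number(entry):
    """True if the entry can be parsed as a number."""
    try:
        float(entry)
        return True
    except (TypeError, ValueError):
        return False


print(is_valid_sex("fmale"))   # False
print(is_valid_number("one"))  # False
print(is_valid_number("1"))    # True
```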
What is semantic accuracy?
Semantic accuracy means whether the entry is actually correct, given that it belongs to the domain of the attribute. For example, PhD and age 10 are each valid values for studies and age, but it’s not possible to hold a PhD at the age of 10.
Hard to check.
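Some semantic errors can still be caught with hand-written plausibility rules across attributes. A rough sketch follows; the minimum ages in `MIN_PLAUSIBLE_AGE` are invented thresholds for illustration, not values from the lecture.

```python
# Hypothetical minimum ages per degree, chosen only for this example.
MIN_PLAUSIBLE_AGE = {"BSc": 18, "MSc": 20, "PhD": 22}


def is_plausible(record):
    """Check that the age is not below the assumed minimum for the degree."""
    return record["age"] >= MIN_PLAUSIBLE_AGE[record["degree"]]


records = [
    {"degree": "PhD", "age": 10},  # syntactically fine, semantically suspicious
    {"degree": "PhD", "age": 31},
]
suspicious = [r for r in records if not is_plausible(r)]
print(suspicious)
```

A rule like this only flags entries for review. It cannot prove that the remaining entries are correct.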
What does completeness of the data mean?
It means that there are no missing entries, for example null entries.
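A minimal sketch of a completeness check that counts missing entries per attribute. It treats `None` and NaN as missing, which is an assumption; real data may also mark missing values with codes such as -1 or an empty string. The helper names are made up.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_counts(records, attributes):
    """Count missing entries for each attribute over a list of dict records."""
    return {
        attribute: sum(1 for record in records if is_missing(record.get(attribute)))
        for attribute in attributes
    }


data = [
    {"age": 25, "degree": "BSc"},
    {"age": None, "degree": "MSc"},
    {"age": float("nan")},  # degree key absent, counted as missing
]
print(missing_counts(data, ["age", "degree"]))  # {'age': 2, 'degree': 1}
```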
What is unbiased and representative data?
The data contains all the information about the patterns and rules inherent in the underlying domain.
Why is it sometimes hard to get unbiased and representative data?
Machine monitoring: a nuclear power plant runs normally almost all the time, and you can’t deliberately make it run badly to collect failure data.
Natural disasters: How to predict an earthquake in a place where there have not been any earthquakes?
Mortgage/insurance: Certain types of customers are missing from the data
What causes missing values?
- Broken sensors
- Refusal to answer
- Test not done for person
- Combining databases that record different variables
- Irrelevant attribute (gas/l for a Tesla, which uses no gas)