Visualizations and Exploratory Analysis Flashcards
Think about a database. What is an attribute (or dimension, feature, variable)?
An attribute is a data field, representing a characteristic or a feature of a data object.
e.g., customer_ID, name, address
In a table, attributes are essentially the columns.
When talking about a database, the rows are data objects and the columns are the objects' attributes. True or false?
Very true.
Give some examples of the following attribute types:
a. Nominal
b. Binary
c. Ordinal
d. Numeric (quantity, interval, ratio)
Nominal = categories, states, or “names of things”
e.g., hair color, marital status, occupation
Binary
- nominal attribute with only two states (0 and 1)
- symmetric binary = both outcomes are equally important (e.g. gender)
- asymmetric binary = outcomes are not equally important (e.g. medical test, positive or negative)
Ordinal = values have a meaningful order but the magnitude between successive values is unknown
e.g., small, medium, large
Numeric
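- quantity: a measurable value, integer or real-valued
- interval: measured on a scale of equal-sized units, with no true zero point (e.g., temperature in °C, calendar dates)
- ratio: has an inherent zero point, so values can be compared as multiples of one another (e.g., length, monetary quantities)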
What is the difference between discrete and continuous attributes?
Discrete - has only a finite or countably infinite set of values (e.g., zip codes, profession)
Continuous - takes real numbers as values (e.g., temperature, height, weight). In theory such values are unbounded in range and precision, but in practice they are measured and represented with a finite number of digits
Mean = ?
The arithmetic average: the sum of all values divided by the number of values
Median = ?
Middle value of the sorted data if there is an odd number of values, or the average of the middle two values otherwise
Mode = ?
Value that occurs most frequently in the data
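The three measures above can be checked with a minimal sketch using Python's standard `statistics` module. The sample values are made up for illustration and are not part of the original cards.

```python
import statistics

# Made-up sample values, used only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # even count: average of the two middle values
modes = statistics.multimode(values)      # every value tied for most frequent

print(mean_value, median_value, modes)
```

This sample has two values tied for most frequent, which is why the sketch uses `multimode` rather than `mode`.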
In a symmetric, bell-shaped distribution (such as the normal distribution), mean = median = mode. True or false?
True
What does a box-plot graph display?
Minimum, Q1 (first quartile), Median, Q3, Maximum
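As a rough sketch, the five numbers a box plot displays can be computed with the standard library. The helper name `five_number_summary` and the sample values are assumptions made for illustration, and quartile conventions differ between tools, so Q1 and Q3 may not match other software exactly.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" is one of the two conventions statistics.quantiles
    # offers; other quartile definitions can give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Made-up sample values, used only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(values))
```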
What is an outlier?
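A point beyond a specified outlier threshold. A common convention flags values more than 1.5 x IQR below Q1 or above Q3, where IQR = Q3 - Q1. Outliers are often, though not always, easy to spot in graphs such as box plots and scatter plots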
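One widely used threshold, assumed here rather than taken from the card, flags values that lie more than 1.5 x IQR below Q1 or above Q3. A minimal sketch with a hypothetical helper `find_outliers` and made-up data:

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartiles use the "inclusive" convention; this is an assumption,
    # and other conventions can shift the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                      # interquartile range
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up sample values, used only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(find_outliers(values))
```

The multiplier `k` is the "specified threshold": a larger `k` flags fewer points.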
Do you know the properties of a normal distribution curve?
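It is symmetric and bell-shaped, centered on the mean μ (so mean = median = mode), with its spread set by the standard deviation σ. About 68% of values lie within μ ± σ, about 95% within μ ± 2σ, and about 99.7% within μ ± 3σ.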
What is a histogram?
A graph display of tabulated frequencies, shown as bars
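The tabulation behind a histogram can be sketched without a plotting library: group values into equal-width bins, count each bin, and draw one bar per count. The helper `histogram_counts`, the bin width, and the data are assumptions chosen for illustration.

```python
from collections import Counter

def histogram_counts(values, start, width):
    """Map each bin's lower edge to the number of values in [edge, edge + width).

    Assumes every value is >= start.
    """
    counts = Counter((v - start) // width for v in values)
    n_bins = max(counts) + 1           # include empty bins below the last used one
    return {start + i * width: counts.get(i, 0) for i in range(n_bins)}

# Made-up sample values, used only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Text version of the bars: one '#' per value in the bin.
for edge, count in histogram_counts(values, start=30, width=20).items():
    print(f"{edge:>4} to {edge + 20:<4} {'#' * count}")
```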
What is a quantile plot?
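A plot that displays all of the data for a single attribute, allowing the user to assess both overall behavior and unusual occurrences. Each sorted value x_i is plotted against f_i, the approximate fraction of the data at or below x_i. (Comparing two distributions quantile against quantile is a quantile-quantile, or q-q, plot.)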
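For a single attribute, one common construction (assumed here, not stated on the card) sorts the values and pairs the i-th smallest value x_i with f_i = (i - 0.5) / N, the approximate fraction of the data at or below x_i. A sketch that produces the points to plot, with a hypothetical helper `quantile_points`:

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs, where f_i = (i - 0.5) / N for the i-th sorted value."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Made-up sample values, used only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
for f, x in quantile_points(values):
    print(f"{f:.3f}  {x}")
```

Plotting x_i on the vertical axis against f_i on the horizontal axis gives the quantile plot; a point far above the rest at the right-hand end would stand out as an unusual value.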
Why is a scatter plot useful?
It provides a first look at bivariate data, revealing clusters of points, outliers, and possible relationships between the two variables.