Introduction to data types and quality Flashcards
In data science, data means:
a collection of organized observations
There are two types of organization
methodology and shape.
The methodology is
how the data was collected
The most common shape for data is
a spreadsheet or table.
The things we are measuring (variables) are in the columns, and the individual instances (observations) are in the rows.
We can read each column “down” the table (viewing multiple observations), and each row “across” the table (viewing multiple variables).
This isn’t the only way to organize data, but it is the most common.
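As a rough sketch of this table shape, the snippet below stores each observation as one row and each variable as one key. The tree records and the column names other than 'Height (ft)' are made-up illustration values, not data from the census.

```python
# Hypothetical tree-census records (illustrative values only).
# Each dict is one observation (a row); each key is one variable (a column).
trees = [
    {"Tree ID": 1, "Species": "Oak", "Height (ft)": 42.5},
    {"Tree ID": 2, "Species": "Maple", "Height (ft)": 31.0},
    {"Tree ID": 3, "Species": "Pine", "Height (ft)": 55.2},
]

# Reading "down" a column: one variable, viewed over multiple observations.
heights = [row["Height (ft)"] for row in trees]

# Reading "across" a row: multiple variables, viewed for one observation.
first_tree = trees[0]

print(heights)
print(first_tree)
```

A list of dicts is only one possible way to hold a table; a spreadsheet or a dataframe library would express the same rows-and-columns idea.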
We are using Imperial measurements (the American system), so we will be collecting the data in feet and miles.
The shape of data:
Each individual is called
an entity, observation, or instance.
These three terms are used interchangeably.
In a well-organized dataset, the variables describe
a characteristic of our entities.
Good variables measure
only one characteristic and should not be a characteristic themselves.
Variable Types
The difference between measuring and categorizing is so important that the data itself is termed differently:
- Variables that are measured are Numerical variables
- Variables that are categorized are Categorical variables
Numerical variables
Numerical variables are a combination of the measurement and the unit. Without the unit, a numerical variable is just a number.
Imagine I go into a cafe and ask the barista for 3. Three what? ☕? 🍩? 💵? Or my friend asks how far Toledo is and I say 300. 300 miles? Kilometers? Minutes? Without units, numbers don’t mean anything.
There are two ways to get a number: by counting and measuring. Counting gives us whole numbers and discrete variables. Measuring gives us potentially partial values and continuous variables.
In our tree census, we are measuring the height of our trees in feet (indicated in the variable name, ‘Height (ft)’), a continuous variable.
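The two ideas above (a number needs its unit, and counting versus measuring gives discrete versus continuous values) can be sketched as follows. `Measurement` and `number_kind` are hypothetical helpers invented for this illustration, and treating Python ints as counts and floats as measurements is a simplifying assumption.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    """A number paired with its unit (hypothetical helper for illustration)."""
    value: float
    unit: str


# The bare number 300 is the same, but the two measurements are not.
in_miles = Measurement(300, "miles")
in_kilometers = Measurement(300, "kilometers")
same_number = in_miles.value == in_kilometers.value
same_measurement = in_miles == in_kilometers


def number_kind(value):
    """Rule of thumb: counted values are ints, measured values are floats."""
    if isinstance(value, bool):
        raise TypeError("booleans are not counts or measurements")
    if isinstance(value, int):
        return "discrete"
    if isinstance(value, float):
        return "continuous"
    raise TypeError(f"not a number: {value!r}")


print(same_number, same_measurement)
print(number_kind(12))    # a count, such as the number of trees on a street
print(number_kind(42.5))  # a measurement, such as 'Height (ft)'
```

The type check is only a proxy: a measured height that happens to be recorded as a whole number is still a continuous variable.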
Categorical variables
Categorical variables describe characteristics with words or relative values.
A categorical variable that describes a characteristic with words is a nominal variable, which literally means
a named value.
Categorical variables:
Dichotomous variables
have only two logical possibilities: “on/off”, “yes/no”, “true/false”, or “0/1”. There is no middle ground and no third option. If there is a logical third option, it’s not a dichotomous variable.
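A minimal sketch of that test: list the logical possibilities for a variable and check that there are exactly two. `is_dichotomous` is a hypothetical helper name, and the example values are illustrative.

```python
def is_dichotomous(possible_values):
    """Return True when a variable has exactly two distinct logical possibilities."""
    return len(set(possible_values)) == 2


print(is_dichotomous(["yes", "no"]))           # two options
print(is_dichotomous(["yes", "no", "maybe"]))  # a logical third option exists
```

The check should be run on the values a variable could logically take, not only on the values that happen to appear in the data.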
Categorical variables:
Ordinal variables
Let’s say that we wanted to capture how “pretty” we thought each tree was. This isn’t really a thing we can measure, but we can subjectively say, on a scale of 1 to 5, how pretty we think each tree is. The prettiest trees are a 5, and the least pretty trees are a 1.
That ranking is inherently ordered and therefore called an ordinal variable.
Ordinal variables are really popular in survey design: “On a scale of 1-5, how much do you agree with this statement?” This is called
a Likert scale. Ordinal variables also show up in the Olympics and other competitions, where someone wins 1st, 2nd, or 3rd place.
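To illustrate why the order matters, the sketch below gives each answer on a 1 to 5 scale an explicit rank and sorts by it. The five answer labels and the sample responses are assumptions chosen for the example; real surveys word their scales differently.

```python
# Assumed five-point agreement labels, listed from lowest (1) to highest (5).
LIKERT_ORDER = [
    "Strongly disagree",
    "Disagree",
    "Neutral",
    "Agree",
    "Strongly agree",
]

# Map each label to its position on the 1-5 scale.
rank = {label: position for position, label in enumerate(LIKERT_ORDER, start=1)}

# Hypothetical survey responses (illustrative only).
responses = ["Agree", "Strongly disagree", "Neutral", "Agree"]

# Sorting by rank respects the scale; plain alphabetical sorting would not.
by_rank = sorted(responses, key=lambda label: rank[label])
alphabetical = sorted(responses)

print(by_rank)
print(alphabetical)
```

The ranks say which answer is higher, but not by how much: the gap between 1 and 2 is not guaranteed to match the gap between 4 and 5.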