Introduction to data types and quality Flashcards
data in data science means:
a collection of organized observations
There are two types of organization
methodology and shape.
The methodology is
how the data was collected
The most common shape for data is
spreadsheet or table.
The things we are measuring (variables) are in the columns, and the individual instances (observations) are in the rows.
We can read each column “down” the table (viewing multiple observations), and each row “across” the table (viewing multiple variables).
This isn’t the only way to organize data, but it is the most common.
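The table shape above can be sketched in plain Python. This is a minimal, hypothetical illustration (the tree values are made up, not real census data): each dict is a row (one observation) and each key is a column (one variable).

```python
# Each dict is one row (observation); each key is one column (variable).
# Hypothetical tree records for illustration only.
trees = [
    {"id": 1, "species": "Oak",   "height_ft": 45.5},
    {"id": 2, "species": "Maple", "height_ft": 32.0},
    {"id": 3, "species": "Oak",   "height_ft": 51.2},
]

# Reading a column "down" the table: one variable, many observations.
heights = [row["height_ft"] for row in trees]

# Reading a row "across" the table: many variables, one observation.
first_tree = trees[0]

print(heights)                  # [45.5, 32.0, 51.2]
print(first_tree["species"])    # Oak
```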
We are using Imperial measurements (the American system), so we will be collecting the data in feet and miles.
the shape of data:
each individual is called
an entity, observation, or instance
but know that these three terms are used interchangeably.
In a well-organized dataset, the variables describe…
a characteristic of our entities.
Good variables measure
only one characteristic and should not be a characteristic themselves.
Variable Types
The difference between measuring and categorizing is so important that the data itself is termed differently:
- Variables that are measured are Numerical variables
- Variables that are categorized are Categorical variables
Numerical variables
Numerical variables are a combination of the measurement and the unit. Without the unit, a numerical variable is just a number.
Imagine I go into a cafe and ask the barista for 3. Three what? ☕? 🍩? 💵? Or my friend asks how far Toledo is and I say 300. 300 miles? Kilometers? Minutes? Without units, numbers don’t mean anything.
There are two ways to get a number: by counting and measuring. Counting gives us whole numbers and discrete variables. Measuring gives us potentially partial values and continuous variables.
In our tree census, we are measuring the height of our trees in feet (indicated in the variable name, ‘Height (ft)’), a continuous variable.
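The counting-versus-measuring distinction can be sketched with two hypothetical values (the numbers are made up for illustration): a count is always a whole number, while a measurement can take partial values.

```python
# Counting produces discrete, whole-number values.
branch_count = 12       # counted  -> discrete variable

# Measuring produces continuous values that can be partial.
height_ft = 45.5        # measured -> continuous variable, like 'Height (ft)'

print(type(branch_count).__name__, type(height_ft).__name__)  # int float
```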
Categorical variables
Categorical variables describe characteristics with words or relative values.
This kind of categorical variable is a nominal variable, which literally means
a named value.
Categorical variables:
Dichotomous variables
have only 2 logical possibilities, “on/off”, “yes/no”, “true/false”, “0/1”, there’s no middle ground and no 3rd option. If there is a logical third option, it’s not a dichotomous variable.
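A dichotomous variable maps naturally onto a boolean. A minimal sketch, using a hypothetical `is_alive` column:

```python
# A dichotomous variable has exactly two logical values: True or False.
# 'is_alive' is a hypothetical column name for illustration.
is_alive = True

print(is_alive in (True, False))  # True -- no middle ground, no third option
```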
Categorical variables:
ordinal variable
Let’s say that we wanted to capture how “pretty” we thought each tree was. This isn’t really a thing we can measure, but we can subjectively say, on a scale of 1 to 5, how pretty we think each tree is. The prettiest trees are a 5, the least pretty trees are a 1.
That ranking is inherently ordered and therefore called an ordinal variable.
Ordinal variables are really popular in survey design: “on a scale of 1–5, how much do you agree with this statement?” This is called
a Likert scale. They also show up in the Olympics and other competitions where someone wins 1st, 2nd, or 3rd place.
Ordinal variables can get a little confusing because they are often represented as numbers. But they don’t represent measurements or counts, they represent
categories
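The “numbers that are really categories” idea can be sketched like this (the prettiness labels and ratings are hypothetical): the numbers carry order, so sorting and comparing make sense, but each number stands for a category, not a count or a measurement.

```python
# Hypothetical ordinal "prettiness" scale: the integers encode order only.
PRETTINESS = {
    1: "least pretty",
    2: "below average",
    3: "average",
    4: "above average",
    5: "prettiest",
}

ratings = [5, 3, 4, 1]

# Ordering is meaningful: we can sort and find the top rating...
top = max(ratings)
print(top)  # 5

# ...but behind each number is a category label, not a quantity.
print([PRETTINESS[r] for r in sorted(ratings)])
```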
cleaning data involves a lot of
critical thinking, considering the nuances of the dataset you are working with.
Accuracy
is a measure of how well records reflect reality.
essential for accuracy
Standardization is essential for accuracy – but it’s not the only way that accuracy can be compromised.
There are a lot of ways a dataset can have low accuracy, but it all comes down to the question of: “are these measurements (or categorizations) correct?” It requires a critical evaluation of your specific dataset to identify what the issues are, but there are a few ways to think about it.
- First, thinking about the data against expectations and common sense is crucial for spotting issues with accuracy. You can do this by inspecting the distribution and outliers to get clues about what the data looks like.
- Second, critically considering how error could have crept in during the data collection process will help you group and evaluate the data to uncover systematic inconsistencies.
- Finally, identifying ways that duplicate values could have been created goes a long way towards ensuring that reality is only represented once in your data. A useful technique is to distinguish between what was human collected versus programmatically generated and using that distinction to segment the data.
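Two of the checks above — comparing values against common sense and hunting for duplicates — can be sketched in a few lines. The records below are made up for illustration, and the 400-foot cutoff is an assumed common-sense bound, not a rule from the text.

```python
# Hypothetical tree records containing two accuracy problems.
records = [
    {"id": 1, "height_ft": 45.5},
    {"id": 2, "height_ft": 32.0},
    {"id": 2, "height_ft": 32.0},    # the same tree recorded twice
    {"id": 3, "height_ft": 4500.0},  # fails a common-sense height check
]

# 1. Expectation check: flag heights outside an assumed plausible range.
suspect = [r for r in records if not (0 < r["height_ft"] < 400)]

# 2. Duplicate check: each real-world tree should appear only once.
seen, dupes = set(), []
for r in records:
    key = (r["id"], r["height_ft"])
    if key in seen:
        dupes.append(r)
    seen.add(key)

print(len(suspect), len(dupes))  # 1 1
```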
It’s not just typos, mistakes, missing data, poor measurement, and duplicated observations that make a dataset low quality. We also have to make sure that our data actually measures what we think it is measuring.
This is the validity of our dataset.
Validity is a special kind of quality measure because it’s not just about the dataset
it’s about the relationship between the dataset and its purpose. A dataset can be valid for one question and invalid for another.