Introduction to data types and quality Flashcards
data in data science means:
a collection of organized observations
There are two types of organization
methodology and shape.
The methodology is
how the data was collected
The most common shape for data is
spreadsheet or table.
The things we are measuring (variables) are in the columns, and the individual instances (observations) are in the rows.
We can read each column “down” the table (viewing multiple observations), and each row “across” the table (viewing multiple variables).
This isn’t the only way to organize data, but it is the most common.
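The table shape above can be sketched in plain Python. This is a minimal, hypothetical illustration (the tree values are made up, not real census data): each dict is a row (one observation) and each key is a column (one variable).

```python
# Each dict is one row (observation); each key is one column (variable).
# Hypothetical tree records for illustration only.
trees = [
    {"id": 1, "species": "Oak",   "height_ft": 45.5},
    {"id": 2, "species": "Maple", "height_ft": 32.0},
    {"id": 3, "species": "Oak",   "height_ft": 51.2},
]

# Reading a column "down" the table: one variable, many observations.
heights = [row["height_ft"] for row in trees]

# Reading a row "across" the table: many variables, one observation.
first_tree = trees[0]

print(heights)                  # [45.5, 32.0, 51.2]
print(first_tree["species"])    # Oak
```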
We are using Imperial measurements (the American system), so we will be collecting the data in feet and miles.
the shape of data:
each individual is called
an entity, observation, or instance
but know that these three terms are used interchangeably.
In a well-organized dataset, the variables describe…
a characteristic of our entities.
Good variables measure
only one characteristic and should not be a characteristic themselves.
Variable Types
The difference between measuring and categorizing is so important that the data itself is termed differently:
- Variables that are measured are Numerical variables
- Variables that are categorized are Categorical variables
Numerical variables
Numerical variables are a combination of the measurement and the unit. Without the unit, a numerical variable is just a number.
Imagine I go into a cafe and ask the barista for 3. Three what? ☕? 🍩? 💵? Or my friend asks how far Toledo is and I say 300. 300 miles? Kilometers? Minutes? Without units, numbers don’t mean anything.
There are two ways to get a number: by counting and measuring. Counting gives us whole numbers and discrete variables. Measuring gives us potentially partial values and continuous variables.
In our tree census, we are measuring the height of our trees in feet (indicated in the variable name, ‘Height (ft)’), a continuous variable.
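The counting-versus-measuring distinction can be sketched with two hypothetical values (the numbers are made up for illustration): a count is always a whole number, while a measurement can take partial values.

```python
# Counting produces discrete, whole-number values.
branch_count = 12       # counted  -> discrete variable

# Measuring produces continuous values that can be partial.
height_ft = 45.5        # measured -> continuous variable, like 'Height (ft)'

print(type(branch_count).__name__, type(height_ft).__name__)  # int float
```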
Categorical variables
Categorical variables describe characteristics with words or relative values.
This kind of categorical variable is a nominal variable, which literally means
a named value.
Categorical variables:
Dichotomous variables
have only 2 logical possibilities, “on/off”, “yes/no”, “true/false”, “0/1”, there’s no middle ground and no 3rd option. If there is a logical third option, it’s not a dichotomous variable.
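A dichotomous variable maps naturally onto a boolean. A minimal sketch, using a hypothetical `is_alive` column:

```python
# A dichotomous variable has exactly two logical values: True or False.
# 'is_alive' is a hypothetical column name for illustration.
is_alive = True

print(is_alive in (True, False))  # True -- no middle ground, no third option
```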
Categorical variables:
ordinal variable
Let’s say that we wanted to capture how “pretty” we thought each tree was. This isn’t really a thing we can measure, but we can subjectively say, on a scale of 1 to 5, how pretty we think each tree is. The prettiest trees are a 5, the least pretty trees are a 1.
That ranking is inherently ordered and therefore called an ordinal variable.
Ordinal variables are really popular in survey design: “on a scale of 1–5, how much do you agree with this statement?” This is called
a Likert scale. They also show up in the Olympics and other competitions where someone wins 1st, 2nd, or 3rd place.
Ordinal variables can get a little confusing because they are often represented as numbers. But they don’t represent measurements or counts, they represent
categories
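The “numbers that are really categories” idea can be sketched like this (the prettiness labels and ratings are hypothetical): the numbers carry order, so sorting and comparing make sense, but each number stands for a category, not a count or a measurement.

```python
# Hypothetical ordinal "prettiness" scale: the integers encode order only.
PRETTINESS = {
    1: "least pretty",
    2: "below average",
    3: "average",
    4: "above average",
    5: "prettiest",
}

ratings = [5, 3, 4, 1]

# Ordering is meaningful: we can sort and find the top rating...
top = max(ratings)
print(top)  # 5

# ...but behind each number is a category label, not a quantity.
print([PRETTINESS[r] for r in sorted(ratings)])
```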
cleaning data involves a lot of
critical thinking, considering the nuances of the dataset you are working with.
Accuracy
is a measure of how well records reflect reality.
essential for accuracy
Standardization is essential for accuracy – but it’s not the only way that accuracy can be compromised.
There are a lot of ways a dataset can have low accuracy, but it all comes down to the question of: “are these measurements (or categorizations) correct?” It requires a critical evaluation of your specific dataset to identify what the issues are, but there are a few ways to think about it.
- First, thinking about the data against expectations and common sense is crucial for spotting issues with accuracy. You can do this by inspecting the distribution and outliers to get clues about what the data looks like.
- Second, critically considering how error could have crept in during the data collection process will help you group and evaluate the data to uncover systematic inconsistencies.
- Finally, identifying ways that duplicate values could have been created goes a long way towards ensuring that reality is only represented once in your data. A useful technique is to distinguish between what was human collected versus programmatically generated and using that distinction to segment the data.
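Two of the checks above — comparing values against common sense and hunting for duplicates — can be sketched in a few lines. The records below are made up for illustration, and the 400-foot cutoff is an assumed common-sense bound, not a rule from the text.

```python
# Hypothetical tree records containing two accuracy problems.
records = [
    {"id": 1, "height_ft": 45.5},
    {"id": 2, "height_ft": 32.0},
    {"id": 2, "height_ft": 32.0},    # the same tree recorded twice
    {"id": 3, "height_ft": 4500.0},  # fails a common-sense height check
]

# 1. Expectation check: flag heights outside an assumed plausible range.
suspect = [r for r in records if not (0 < r["height_ft"] < 400)]

# 2. Duplicate check: each real-world tree should appear only once.
seen, dupes = set(), []
for r in records:
    key = (r["id"], r["height_ft"])
    if key in seen:
        dupes.append(r)
    seen.add(key)

print(len(suspect), len(dupes))  # 1 1
```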
It’s not just typos, mistakes, missing data, poor measurement, and duplicated observations that make a dataset low quality. We also have to make sure that our data actually measures what we think it is measuring.
This is the validity of our dataset.
Validity is a special kind of quality measure because it’s not just about the dataset
it’s about the relationship between the dataset and its purpose. A dataset can be valid for one question and invalid for another.