Data Science Fundamentals Flashcards

Question 1

Q

What are the two types of variables?

Answer

A

Numerical
Categorical

Question 2

Q

What are the two types of numerical variables?

Answer

A

Numerical variables are a combination of the measurement and the unit. Without the unit, a numerical value is just a number.

There are two types of numerical variable:

Discrete - A variable that we count, so it gives us a whole number
Continuous - A variable that we measure, so it can take a fractional value rather than only a whole number
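
As a rough illustration of the difference, here is a minimal Python sketch. The house records, the field names and the variable_kind helper are all hypothetical, and the whole-number check is only a heuristic: a measured variable can happen to contain whole values.

```python
# Hypothetical house records: rooms are counted, height is measured.
houses = [
    {"rooms": 3, "height_m": 6.2},
    {"rooms": 5, "height_m": 7.85},
]

def variable_kind(values):
    """Heuristic: call a variable discrete if every value is a whole number."""
    if all(float(v).is_integer() for v in values):
        return "discrete"
    return "continuous"

rooms_kind = variable_kind([h["rooms"] for h in houses])
height_kind = variable_kind([h["height_m"] for h in houses])
```

Note that each value keeps its unit in the field name (height_m), in line with the point that a numerical value without a unit is just a number.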

Question 3

Q

What are the types of categorical variable?

Answer

A

A categorical variable has three types:

Nominal (Think of a name)
Dichotomous (Two only as in Di)
Ordinal (Think of an order of something, like a Likert scale)
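
The three types could be sketched in Python as follows. The record, its field names and the five-point scale are assumptions made up for illustration; the point is that only the ordinal variable supports a meaningful ordering.

```python
# Assumed five-point Likert scale, lowest to highest.
LIKERT_ORDER = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical record used only for illustration.
record = {
    "colour": "yellow",       # nominal: a name with no order
    "is_detached": True,      # dichotomous: exactly two options
    "satisfaction": "agree",  # ordinal: ordered categories
}

def likert_rank(answer):
    """Return the position of an answer on the assumed scale."""
    return LIKERT_ORDER.index(answer)

# Ordinal values can be sorted by rank; nominal values such as colours cannot.
answers = ["agree", "neutral", "strongly agree"]
ranked = sorted(answers, key=likert_rank)
```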

Question 4

Q

In the context of missing data, what is Missing at Random data?

Answer

A

When we collect data, some observations can end up with missing values. If we measure the height of each house and find that all the yellow houses don’t have a height measurement, the missing values in one variable (height) are related to another variable that we did observe (colour).

This kind of missingness is called Missing at Random.
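
One way to spot this pattern is to compare how often a value is missing within each group of another variable. The sketch below is a minimal example under assumed data: the house records and the missing_rate_by helper are hypothetical, and a missing height is represented as None.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing height measurement.
houses = [
    {"colour": "yellow", "height_m": None},
    {"colour": "yellow", "height_m": None},
    {"colour": "blue", "height_m": 6.0},
    {"colour": "red", "height_m": 7.5},
    {"colour": "blue", "height_m": 5.5},
]

def missing_rate_by(records, group_key, value_key):
    """Return the share of missing values of value_key within each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        totals[record[group_key]] += 1
        if record[value_key] is None:
            missing[record[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}

rates = missing_rate_by(houses, "colour", "height_m")
```

If the rate differs sharply between groups, as it does for the yellow houses here, the missingness depends on another observed variable rather than on pure chance.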

Question 5

Q

What makes a good variable?

Answer

A

A suitable variable should:
- Measure only one characteristic
- Not be a characteristic itself

Question 6

Q

In the context of missing data, what is Structurally Missing data?

Answer

A

Finally, data can be structurally missing, meaning that we wouldn’t expect a value there to begin with.

An example is collecting data about fruit on our trees. Some trees will have visible fruit. For those trees, we can count how many fruits are visible. If there is no visible fruit, we can’t count how many there are. For those trees, the number of fruits will be structurally missing.
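
The fruit example can be sketched in Python. The tree records, the field names and the missing_kind helper are assumptions for illustration only; the idea is that a missing count is only unexpected when fruit was visible.

```python
# Hypothetical tree records; None marks a missing fruit count.
trees = [
    {"id": 1, "fruit_visible": True, "fruit_count": 12},
    {"id": 2, "fruit_visible": False, "fruit_count": None},
    {"id": 3, "fruit_visible": True, "fruit_count": None},
]

def missing_kind(tree):
    """Classify a fruit count as present, structurally missing or unexpectedly missing."""
    if tree["fruit_count"] is not None:
        return "present"
    if not tree["fruit_visible"]:
        # No visible fruit, so there was never anything to count.
        return "structurally missing"
    return "unexpectedly missing"

kinds = {tree["id"]: missing_kind(tree) for tree in trees}
```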

Question 7

Q

What is data science?

Answer

A

Data science is a collection of organised observations

Question 8

Q

What is the framework underlying statistics?

Answer

A

Statistics follows a four-step framework:

We want to know something about an underlying population
We take a sample that hopefully is a good representation of the whole population
We create some statistics from this
We infer insights from this data
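
The four steps can be sketched with Python's standard library. The population of tree heights below is made up, and the sample size and seed are arbitrary assumptions; in practice the population mean is unknown and is computed here only to show how close the estimate is.

```python
import random
import statistics

# Step 1: the population we want to know about (hypothetical tree heights in metres).
population = [5 + (i % 100) / 10 for i in range(1000)]

# Step 2: take a random sample; the seed only makes the sketch repeatable.
rng = random.Random(42)
sample = rng.sample(population, 100)

# Step 3: create a statistic from the sample.
sample_mean = statistics.mean(sample)

# Step 4: infer something about the population from that statistic.
# The true mean is normally unknown; it is computed here only for comparison.
population_mean = statistics.mean(population)
estimation_error = abs(sample_mean - population_mean)
```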

Question 9

Q

What is a dichotomous variable?

Answer

A

It has only two options. In the example of a growing tree, the tree is either growing by itself or not. We would record this with the question: Is the tree by itself?

There are only two logical options in this scenario:

Single
Not single

For a variable to be dichotomous, it must have only two logical options.

Question 10

Q

In the context of missing data, what is Missing Completely at Random data?

Answer

A

During data collection, a value can go missing simply because of human error. This isn’t intentional, and it isn’t related to any variable. It just means the value is missing completely at random, due to happenstance.

Question 11

Q

What is an ordinal variable?

Answer

A

These are variables we measure using some form of ordered scale that is then represented numerically (this is still a categorical variable). An example is a Likert scale.

Question 12

Q

What types of organisations are there?

Answer

A

Methodology
Shape

Ask more about this question at some point

Question 13

Q

What is the shape of data?

Answer

A

The shape of a distribution is described by its number of peaks, its possession of symmetry, its tendency to skew, or its uniformity. (Distributions that are skewed have more points plotted on one side of the graph than on the other.) PEAKS: Graphs often display peaks, or local maximums.

Think about the collection of data. When we collect data, we want to create variables. If we were counting houses, we could use two variables:

Height
Colour

These things vary from house to house. A house can be called an entity, observation or instance; these terms are used interchangeably. In a well-organised dataset, the variables describe a characteristic of the entity, and they also form the shape of what the data will look like.

Question 14

Q

What is a nominal variable?

Answer

A

This is a variable whose values are names with no inherent order, hence nominal: we’re nominating a name for each value.

Question 15

Q

What is accuracy?

Answer

A

Accuracy is about how well your data matches real-world expectations.

Question 16

Q

What is standardisation when talking about data collection?

Answer

A

Standardisation is a practice that is used to ensure that we collect the data accurately. This revolves around:

Standardising how we measure data - Do we have an agreed-upon measurement framework?
Standardising the units we use - Have we agreed on the same units of measurement?
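
Standardising units can be as simple as converting every measurement to one agreed unit before analysis. The sketch below assumes metres as the agreed unit; the raw measurements and the to_metres helper are hypothetical.

```python
# Conversion factors to the agreed unit (metres, an assumption for this sketch).
TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048}

def to_metres(value, unit):
    """Convert a measurement to metres; raises KeyError for a unit we have not agreed on."""
    return value * TO_METRES[unit]

# Hypothetical raw measurements recorded in mixed units.
raw_heights = [(6.0, "m"), (750, "cm"), (20, "ft")]
heights_m = [to_metres(value, unit) for value, unit in raw_heights]
```

Failing loudly on an unknown unit is a deliberate choice here: it surfaces measurements that fall outside the agreed framework instead of silently mixing them in.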

Question 17

Q

What interchangeable words are used for the thing that variables describe (such as a house)?

Answer

A

Instance, observation and entity

Question 18

Q

How do we assess if a dataset has low accuracy?

Answer

A

Think about how the data compares with common sense and reality. Is the data accurate? We can check this by looking at outliers and the distribution. If a tree is 1,000 m tall, we’d ask questions about how it was measured.
Consider how errors could have crept in during the data collection process.
Identify ways that duplicate values could have crept in; this goes a long way to ensuring that we have valid results. A good way to look into this is to compare programmatically generated data with human-gathered data.

The goal of a data scientist is to tell a surprising result apart from an error.
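
Two of these checks, implausible values and duplicates, can be sketched in Python. The tree records are hypothetical, and the 120 m limit is an assumed threshold set a little above the tallest trees on record; a real check would need a limit chosen for the data at hand.

```python
from collections import Counter

# Assumed plausibility limit in metres; adjust for the data being checked.
MAX_PLAUSIBLE_TREE_HEIGHT_M = 120

# Hypothetical tree records.
trees = [
    {"id": "T1", "height_m": 12.4},
    {"id": "T2", "height_m": 1000.0},
    {"id": "T3", "height_m": 8.1},
    {"id": "T1", "height_m": 12.4},
]

# Flag heights outside the plausible range for a closer look.
implausible_ids = [
    tree["id"]
    for tree in trees
    if not 0 < tree["height_m"] <= MAX_PLAUSIBLE_TREE_HEIGHT_M
]

# Flag identifiers that appear more than once.
id_counts = Counter(tree["id"] for tree in trees)
duplicated_ids = sorted(tree_id for tree_id, count in id_counts.items() if count > 1)
```

A flagged value is only a prompt to ask how it was measured; it may still turn out to be a genuine, surprising result.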

(18 cards)