Data Science Fundamentals Flashcards

1
Q

What are the two types of variables?

A
  • Numerical
  • Categorical
2
Q

What are the two types of numerical variable?

A

A numerical variable combines a measurement with a unit. Without the unit, a numerical value is just a number.

There are two types of numerical variable:

  • Discrete - A variable we count, so its values are whole numbers (for example, the number of rooms in a house)
  • Continuous - A variable we measure, so it can take any value within a range, including fractions (for example, the height of a house)
3
Q

What are the types of categorical variable?

A

There are three types of categorical variable:

  • Nominal (think of a name)
  • Dichotomous (only two options, as in "di")
  • Ordinal (think of an order, such as a Likert scale)
4
Q

In the context of missing data, what is Missing at Random data?

A

Sometimes missing values follow a pattern. If we measure the height of houses and find that all the yellow houses lack a height measurement, the missingness is related to another variable we did observe (colour), so that variable tells us something about which values are missing.

This kind of missing data is called Missing at Random.
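
As a rough sketch of how this pattern might be spotted, the Python snippet below groups a small made-up set of house records by colour and reports the share of missing heights in each group. The records and the helper name `missing_rate_by_group` are hypothetical and exist only for illustration. A missing rate that differs sharply between groups is a hint, not proof, that the missingness depends on the grouping variable.

```python
from collections import defaultdict

# Hypothetical records: height_m is None when no height was recorded.
houses = [
    {"colour": "yellow", "height_m": None},
    {"colour": "yellow", "height_m": None},
    {"colour": "blue", "height_m": 6.5},
    {"colour": "blue", "height_m": 7.0},
    {"colour": "red", "height_m": 5.8},
]


def missing_rate_by_group(rows, group_key, value_key):
    """Return the fraction of rows with a missing value_key in each group."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        total[row[group_key]] += 1
        if row[value_key] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / total[group] for group in total}


rates = missing_rate_by_group(houses, "colour", "height_m")
print(rates)  # {'yellow': 1.0, 'blue': 0.0, 'red': 0.0}
```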

5
Q

What makes a good variable?

A

A suitable variable:
- Measures only one characteristic
- Cannot itself be a characteristic

6
Q

In the context of missing data, what is Structurally Missing data?

A

Data is structurally missing when we wouldn’t expect a value to be there in the first place.

For example, suppose we collect data about the fruit on our trees. For trees with visible fruit, we can count how many fruits are visible. For trees with no visible fruit, there is nothing to count, so the number of fruits is structurally missing.

7
Q

What is data?

A

Data is a collection of organised observations.

8
Q

What is the framework underlying statistics?

A

Statistics follows a four-step framework:

  1. We want to know something about an underlying population
  2. We take a sample that we hope represents the whole population well
  3. We calculate some statistics from the sample
  4. We infer insights about the population from those statistics
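
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of tree heights is simulated, the sample size of 200 is arbitrary, and the interval uses the common normal approximation (mean plus or minus 1.96 standard errors) rather than any method named in these cards.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Step 1: the population we want to learn about (simulated tree heights, metres).
population = [random.gauss(12.0, 3.0) for _ in range(10_000)]

# Step 2: take a simple random sample from the population.
sample = random.sample(population, 200)

# Step 3: calculate statistics from the sample.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Step 4: infer something about the population, here a rough 95% interval
# for the population mean based on the normal approximation.
standard_error = sample_sd / len(sample) ** 0.5
interval = (sample_mean - 1.96 * standard_error,
            sample_mean + 1.96 * standard_error)

print(f"sample mean: {sample_mean:.2f} m")
print(f"approximate 95% interval: {interval[0]:.2f} m to {interval[1]:.2f} m")
```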
9
Q

What is a dichotomous variable?

A

A dichotomous variable has only two options. For example, a tree either grows by itself or it doesn’t, so we could record the answer to the question: is the tree by itself?

There are only two logical options in this scenario:

  • Single
  • Not single

For a variable to be dichotomous, it must have only two logical options.

10
Q

In the context of missing data, what is Missing Completely at Random data?

A

During data collection, a value can go unrecorded purely through human error. This isn’t intentional, and it is unrelated to any variable in the dataset. It just means the value is missing completely at random, by happenstance.

11
Q

What is an ordinal variable?

A

These are variables we measure on an ordered scale, which is often represented numerically (they are still categorical variables). An example is a Likert scale.
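
A minimal sketch of how ordered categories might be given numeric codes is shown below. The five labels, the codes 1 to 5, and the sample responses are assumptions chosen for illustration. Because the order is meaningful, a median is a sensible summary, whereas the gaps between codes are not guaranteed to be equal.

```python
import statistics

# Hypothetical five-point Likert scale, listed from lowest to highest.
LIKERT_LEVELS = [
    "Strongly disagree",
    "Disagree",
    "Neutral",
    "Agree",
    "Strongly agree",
]

# Map each label to its position on the scale (1 to 5).
LIKERT_CODES = {label: code for code, label in enumerate(LIKERT_LEVELS, start=1)}

responses = ["Agree", "Neutral", "Strongly agree", "Disagree", "Agree"]
codes = [LIKERT_CODES[response] for response in responses]

# median_low always returns a value that actually occurs in the data,
# so it can be mapped back to a label.
median_code = statistics.median_low(codes)
median_label = LIKERT_LEVELS[median_code - 1]

print(codes)         # [4, 3, 5, 2, 4]
print(median_label)  # Agree
```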

12
Q

What types of organisations are there?

A
  • Methodology
  • Shape

Ask more about this question at some point

13
Q

What is the shape of data?

A

The shape of a distribution is described by its number of peaks, its possession of symmetry, its tendency to skew, or its uniformity. (Distributions that are skewed have more points plotted on one side of the graph than on the other.) Graphs often display peaks, or local maximums.

Think about the collection of data. When we collect data, we want to create variables. If we were collecting data about houses, we could use two variables:

  • Height
  • Colour

These things vary from house to house. A house can be called an entity, observation or instance; these terms are used interchangeably. In a well-organised dataset, each variable describes a characteristic of the entity, and together the variables form the shape of the data.

14
Q

What is a nominal variable?

A

A nominal variable is one whose values are names (labels) with no inherent order. Think of “nominal” as nominating a name for each value.

15
Q

What is accuracy?

A

Accuracy is about how well your data matches real-world expectations.

16
Q

What is standardisation when talking about data collection?

A

Standardisation is a practice used to ensure that we collect data accurately and consistently. It revolves around:

  1. Standardising how we measure data - Do we have an agreed-upon measurement framework?
  2. Standardising the units we use - Have we agreed on the units of measurement?
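
To illustrate the second point, the sketch below converts heights recorded in mixed units into metres before they are stored. The choice of metres as the standard, the supported units, and the function name `to_metres` are assumptions made for this example.

```python
# Conversion factors to the agreed standard unit (metres).
# The set of supported units is an assumption for this sketch.
TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048}


def to_metres(value, unit):
    """Convert a measurement to metres, rejecting units we have not agreed on."""
    if unit not in TO_METRES:
        raise ValueError(f"Unknown unit: {unit!r}")
    return value * TO_METRES[unit]


# Hypothetical raw measurements as (value, unit) pairs.
measurements = [(650, "cm"), (7.0, "m"), (20, "ft")]
heights_m = [round(to_metres(value, unit), 3) for value, unit in measurements]

print(heights_m)  # [6.5, 7.0, 6.096]
```

Rejecting unknown units, rather than guessing, keeps silent unit mix-ups out of the dataset.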
17
Q

What interchangeable words are used for the thing that variables describe?

A

Instance, observation and entity

18
Q

How do we assess if a dataset has low accuracy?

A
  1. Check the data against common sense and reality. Does it look accurate? We can check this by examining the distribution and looking for outliers. If a tree is recorded as 1000 m tall, we’d ask how it was measured.
  2. Consider how errors could have crept in during the data collection process.
  3. Identify ways that duplicate values could have crept in; this goes a long way towards ensuring valid results. A good starting point is to compare programmatically generated data with human-gathered data.

The goal of a data scientist is to tell a genuinely surprising result apart from an error.
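
Two of these checks, implausible values and duplicates, can be sketched as follows. The tree records and the 120 m plausibility limit are assumptions for illustration; a real limit should come from domain knowledge. Flagged records are candidates for investigation, not automatic deletion, since a flagged value may turn out to be a surprising but genuine result.

```python
from collections import Counter

# Hypothetical records: (tree_id, height in metres).
trees = [
    ("T1", 12.4),
    ("T2", 9.8),
    ("T3", 1000.0),
    ("T2", 9.8),
    ("T4", 15.1),
]

MAX_PLAUSIBLE_HEIGHT_M = 120.0  # assumed sanity limit, not a fact from the cards

# Check 1: values that fail a common-sense range test.
implausible = [
    record for record in trees
    if not 0 < record[1] <= MAX_PLAUSIBLE_HEIGHT_M
]

# Check 2: records that appear more than once.
counts = Counter(trees)
duplicates = [record for record, count in counts.items() if count > 1]

print(implausible)  # [('T3', 1000.0)]
print(duplicates)   # [('T2', 9.8)]
```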