Data Science Fundamentals Flashcards
What are the two types of variables?
- Numerical
- Categorical
What two examples exist of numerical variables?
Numerical variables are a combination of the measurement and the unit. Without the unit, a numerical value is just a number.
There are two ways to get a numerical variable. You can have:
- Discrete - A value we get by counting, so it is always a whole number
- Continuous - A value we get by measuring, so it can fall anywhere between whole numbers
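A minimal Python sketch of the distinction, using made-up house measurements. The variable names, the values and the `is_discrete` helper are assumptions for illustration; checking the Python type is only a rough stand-in for "counted versus measured".

```python
# Hypothetical measurements for one house (values invented for illustration).
rooms = 4          # discrete: counted, always a whole number
height_m = 6.35    # continuous: measured, can take any value in a range

def is_discrete(value):
    """Treat whole-number (int) values as discrete counts."""
    return isinstance(value, int) and not isinstance(value, bool)

# A numerical variable pairs the measurement with its unit.
height = {"value": height_m, "unit": "m"}

print(is_discrete(rooms))     # True
print(is_discrete(height_m))  # False
```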
What examples exist of a categorical variable?
A categorical variable has three types:
- Nominal (Think of a name)
- Dichotomous (Two options only, as in "di")
- Ordinal (Think of an order of something, like a Likert scale)
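The three types can be told apart with two questions: do the categories have an order, and are there exactly two of them? A rough sketch, where `categorical_type` is a hypothetical helper and the `ordered` flag is something the caller has to supply, since order cannot be detected from labels alone:

```python
def categorical_type(categories, ordered=False):
    """Classify a categorical variable from its set of categories.

    `ordered` is supplied by the caller: whether the categories have a
    meaningful order is domain knowledge, not something the labels reveal.
    """
    if ordered:
        return "ordinal"
    if len(set(categories)) == 2:
        return "dichotomous"
    return "nominal"

print(categorical_type(["yellow", "red", "blue"]))                # nominal
print(categorical_type(["single", "not single"]))                 # dichotomous
print(categorical_type(["low", "medium", "high"], ordered=True))  # ordinal
```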
In the context of missing data, what is Missing at Random data?
When we collect data, some observations may be missing a value. If we measure the height of each house and find that all the yellow houses don’t have a height measurement, the missingness is related to another variable we did observe (colour) rather than to the height itself.
This kind of missing data is called Missing at Random.
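One way to spot this pattern is to compare the share of missing values across the groups of another variable. The sketch below uses invented house records and a hypothetical helper, `missing_rate_by`; a rate that differs sharply between colours hints that the missingness depends on colour.

```python
from collections import defaultdict

# Hypothetical survey rows; None marks a missing height.
houses = [
    {"colour": "yellow", "height_m": None},
    {"colour": "yellow", "height_m": None},
    {"colour": "red", "height_m": 6.1},
    {"colour": "red", "height_m": 7.4},
    {"colour": "blue", "height_m": 5.8},
]

def missing_rate_by(rows, group_key, value_key):
    """Return the share of missing `value_key` entries per `group_key`."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[value_key] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}

rates = missing_rate_by(houses, "colour", "height_m")
print(rates)  # {'yellow': 1.0, 'red': 0.0, 'blue': 0.0}
```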
What makes a good variable?
A suitable variable should:
- Measure only one characteristic
- Not be a characteristic itself
In the context of missing data, what is Structurally Missing data?
Data can be structurally missing, meaning that we wouldn’t expect a value there to begin with.
For example, suppose we are collecting data about fruit on our trees. Some trees will have visible fruit, and for those trees we can count how many fruits are visible. If a tree has no visible fruit, there is nothing to count, so the number of fruits is structurally missing.
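The fruit example can be sketched in Python as follows. The record layout, the `has_visible_fruit` flag and the `missing_kind` helper are all assumptions made for illustration; the point is only that a second field tells us whether a value was ever expected.

```python
# Hypothetical tree records; None marks a missing fruit count.
trees = [
    {"id": 1, "has_visible_fruit": True, "fruit_count": 12},
    {"id": 2, "has_visible_fruit": False, "fruit_count": None},
    {"id": 3, "has_visible_fruit": True, "fruit_count": None},
]

def missing_kind(tree):
    """Label a fruit count as present, structurally missing, or missing."""
    if tree["fruit_count"] is not None:
        return "present"
    if not tree["has_visible_fruit"]:
        return "structurally missing"  # no value was ever expected
    return "missing"                   # a value was expected but not recorded

labels = {tree["id"]: missing_kind(tree) for tree in trees}
print(labels)
# {1: 'present', 2: 'structurally missing', 3: 'missing'}
```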
What is data science?
Data is a collection of organised observations. Data science is the practice of drawing insight from that data.
What is the framework underlying statistics?
Statistics follows a four-step framework:
- We want to know something about an underlying population
- We take a sample that hopefully is a good representation of the whole population
- We compute some statistics from the sample
- We use those statistics to infer insights about the population
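The four steps can be sketched with a simulated population. Everything here is an assumption for illustration: the population is generated rather than observed, the sample size of 200 is arbitrary, and the mean is just one possible statistic.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 1. Population: heights we normally could not measure in full (simulated here).
population = [rng.gauss(6.0, 1.0) for _ in range(10_000)]

# 2. Sample: a random subset that we hope represents the population.
sample = rng.sample(population, 200)

# 3. Statistic: a summary computed from the sample.
sample_mean = statistics.mean(sample)

# 4. Inference: use the sample statistic as an estimate of the population value.
population_mean = statistics.mean(population)
print(f"estimate {sample_mean:.2f} vs true {population_mean:.2f}")
```

In practice the last comparison is not available, because the whole population is rarely measured; it is shown here only to make the gap between estimate and truth visible.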
What is a dichotomous variable?
It has only two options. In the example of a tree being grown, the tree is either growing by itself or it is not. We would record this by asking: is the tree by itself?
There are only two logical options in this scenario:
- Single
- Not single
For a variable to be dichotomous, it must have only two logical options.
In the context of missing data, what is Missing Completely at Random data?
During data collection, a value can go missing through plain human error. This isn’t intentional, and it isn’t related to any variable in the data. The value is simply missing completely at random, due to happenstance.
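A small simulation can make the mechanism concrete. In this sketch, `drop_completely_at_random` is a hypothetical helper: every value has the same chance of being lost, and the decision never looks at the value or at any other variable.

```python
import random

def drop_completely_at_random(values, probability, seed=0):
    """Replace each value with None with the same probability,
    regardless of the value itself or any other variable."""
    rng = random.Random(seed)
    return [None if rng.random() < probability else v for v in values]

heights = [6.1, 7.4, 5.8, 6.9, 7.0, 5.5]  # invented house heights in metres
observed = drop_completely_at_random(heights, probability=0.3)
print(observed)
```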
What is an ordinal variable?
These are variables we measure using some form of scale that is then represented numerically (this is still a categorical variable). An example is a Likert scale.
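As a sketch of how a scale becomes numbers, the labels can be mapped to their position in the order. The five labels below are an assumed wording; real surveys vary.

```python
# Assumed five-point Likert labels; the exact wording varies between surveys.
LIKERT = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
RANK = {label: position for position, label in enumerate(LIKERT, start=1)}

responses = ["agree", "neutral", "strongly agree", "disagree"]
codes = [RANK[r] for r in responses]
print(codes)  # [4, 3, 5, 2]

# The codes carry the order of the categories, so sorting is meaningful.
ordered = sorted(responses, key=RANK.get)
print(ordered)  # ['disagree', 'neutral', 'agree', 'strongly agree']
```

The codes say which answer ranks higher, but not by how much, which is why the variable stays categorical even though it is stored as numbers.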
What types of organisations are there?
- Methodology
- Shape
Ask more about this question at some point
What is the shape of data?
The shape of a distribution is described by its number of peaks, its symmetry, its tendency to skew, or its uniformity. Distributions that are skewed have more points plotted on one side of the graph than on the other. Peaks are the local maximums that graphs often display.
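Skew can also be put into a number. The sketch below computes one common measure, the third central moment divided by the variance raised to the power 1.5; the `skewness` helper and both data sets are invented for illustration. A result near zero suggests symmetry, and a positive result suggests a longer tail on the right.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 10]

print(skewness(symmetric))         # 0.0
print(skewness(right_skewed) > 0)  # True
```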
Think about the collection of data. When we collect data, we want to create variables. If we were counting houses, we could use two variables:
- Height
- Colour
These things vary from house to house. A house can be called an entity, observation or instance; these terms are used interchangeably. In a well-organised dataset, each variable describes a characteristic of the entity, and together the variables form the shape of what the data will look like.
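A small sketch of the house dataset, with one record per entity and one key per variable. The values and the `height_m` field name are made up for illustration.

```python
# Hypothetical dataset: each dict is one entity (a house),
# each key is one variable.
houses = [
    {"height_m": 6.1, "colour": "red"},
    {"height_m": 7.4, "colour": "yellow"},
    {"height_m": 5.8, "colour": "blue"},
]

variables = list(houses[0].keys())
shape = (len(houses), len(variables))  # (entities, variables)
print(variables)  # ['height_m', 'colour']
print(shape)      # (3, 2)
```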
What is a nominal variable?
This is a variable whose values are names, hence nominal: we’re nominating a name for each category.
What is accuracy?
Accuracy is about how closely your data matches the real-world values it is meant to represent.