Exploring Data - Topic 2: Data and Graphical Summaries Flashcards

1
Q

What is Data?

A

Data is information about the set of subjects being studied (e.g. road fatalities). Most commonly, data refers to a sample rather than the whole population (unless it comes from a census).

2
Q

What are some examples of the different types / formats of data?

A

Survey data
Spreadsheet type data
MRI image data

3
Q

What is the Initial Data Analysis (IDA)?

A

It is a first general look at the data, without formally answering the research questions.

The purposes of IDA are to ensure that later statistical analysis can be performed efficiently and to minimise the risk of incorrect or misleading results

4
Q

What could an IDA assist with?

A

It could assist with:

Seeing whether the data can answer your research questions

Posing other research questions

Identifying the data’s main qualities and suggesting the population from which the sample derives

5
Q

What steps does the IDA involve?

A

Commonly involves:

Data background: checking the quality and integrity of the data

Data structure: what info has been collected?

Data wrangling: scraping, cleaning, tidying, reshaping, splitting, combining

Data summaries: graphical and numerical

NOTE: Every step of the IDA has to be documented, as this allows the analysis to be reproduced.

6
Q

What is a variable?

A

A variable measures or describes some attribute of the subject. Data with ‘p’ variables is said to have dimension p

7
Q

What is it called when there is only 1 variable involved?

A

Univariate

8
Q

What is it called when there are 2 variables involved?

A

Bivariate

9
Q

What is it called when there are more than 2 variables involved?

A

Multivariate

10
Q

Would an anonymous identifier such as CRASH ID count as a variable?

A

No. It adds no useful information about the subject; it only allows each record to be identified
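
As a rough illustration of the dimension p when an identifier is set aside, here is a minimal Python sketch. The column names (`crash_id`, `age`, `gender`) and the rows are hypothetical, not taken from the course data.

```python
# Hypothetical records: each dict is one subject (one row of the data).
rows = [
    {"crash_id": 101, "age": 23, "gender": "male"},
    {"crash_id": 102, "age": 41, "gender": "female"},
    {"crash_id": 103, "age": 36, "gender": "female"},
]

# Identifier columns only label a row; they are not counted as variables.
ID_COLUMNS = {"crash_id"}

variables = [name for name in rows[0] if name not in ID_COLUMNS]
p = len(variables)  # dimension of the data
n = len(rows)       # number of subjects

print(f"n = {n} subjects, p = {p} variables: {variables}")
```

With two variables left after dropping the identifier, this toy data set would be bivariate.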

11
Q

Is recording raw quantitative or qualitative data preferable?

A

Raw quantitative data if possible, because it can easily be summarised into qualitative data, whereas it is hard to convert qualitative data back into quantitative data
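
A small sketch of why the conversion only goes one way: raw ages can be binned into categories, but the categories cannot be turned back into the ages. The function name `age_group` and the cut-offs are assumptions chosen for illustration.

```python
def age_group(age):
    """Summarise a raw age (in years) into a category.

    The cut-offs 18 and 65 are hypothetical, chosen only for this example.
    """
    if age < 18:
        return "under 18"
    if age < 65:
        return "18 to 64"
    return "65 and over"


ages = [4, 17, 18, 35, 64, 65, 90]
groups = [age_group(a) for a in ages]
print(groups)
```

Going the other way is not possible: knowing only "18 to 64" does not tell you whether the subject was 18 or 64.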

12
Q

What are the two types of variables?

A

Qualitative / Categorical or Quantitative / Numerical

13
Q

What are qualitative / categorical variables/data?

A

Qualitative data is non-numeric and includes information, such as verbal responses to open-ended questions, that cannot be valued numerically.

Categorical data is a form of qualitative data that is grouped into categories rather than measured numerically.

The answers are typically in words. If the answer is in words –> categorical

14
Q

What is an example of categorical data?

A

What is your gender? –> male or female

15
Q
A
16
Q

What are quantitative / numerical variables?

A

Its value will always be in number form.

The answers are typically in numbers

Data expressed in numbers

17
Q

What are examples of numerical data?

A

age and income

18
Q

What are the two types of numerical data?

A

Discrete and Continuous

19
Q

What is discrete data?

A

Discrete data can only take certain separate values. It can be counted and usually comes in the form of whole numbers (integers). The values can’t be broken into smaller parts

20
Q

What is continuous data?

A

This is data that can take any value within a range. The values can be broken into smaller parts, such as fractions and decimal places.

21
Q

Example of discrete data

A

The number of tickets sold in a day

The number of students in your class

22
Q

Example of continuous data

A

Weight of a baby in the first year

Temperature of a room throughout a day

23
Q

What are the two types of categorical data?

A

Ordinal (ordered) and Nominal (non-ordered)

24
Q

What is ordinal data?

A

This is data which can be classified into categories that are ranked in a natural order

25
Q

What is an example of ordinal data?

A

The level of education, the range of income, or the grades

26
Q

What is nominal data?

A

It is qualitative data used to name or label categories, without providing numeric values and without any natural order

27
Q

What is an example of nominal data?

A

Names of people
Nationalities
Hair colour

28
Q

What is the best graphical representation for qualitative data?

A

Bar plots are the best way to represent qualitative data: simple bar plot, double bar plot, stacked bar plot, side-by-side bar plot

A simple bar plot represents one qualitative variable

A double bar plot, stacked bar plot or side-by-side bar plot represents two qualitative variables
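
The numbers behind a bar plot are just category counts. The sketch below tallies them with Python's standard library and prints a crude text bar plot; the records and the helper `text_bar_plot` are hypothetical, and a real plot would normally come from a plotting tool.

```python
from collections import Counter

# Hypothetical records: (gender, road_user) for each subject.
records = [
    ("male", "driver"), ("female", "passenger"), ("male", "driver"),
    ("female", "driver"), ("male", "pedestrian"), ("female", "passenger"),
]

# One qualitative variable: these counts are the bar heights of a simple bar plot.
gender_counts = Counter(gender for gender, _ in records)

# Two qualitative variables: counts per pair of categories, which is what a
# stacked or side-by-side bar plot displays.
pair_counts = Counter(records)


def text_bar_plot(counts):
    """Return one line per category, with one '#' per subject."""
    lines = []
    for category, count in sorted(counts.items()):
        lines.append(f"{category:<10} {'#' * count} ({count})")
    return "\n".join(lines)


print(text_bar_plot(gender_counts))
```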

29
Q

What is big data?

A

It is the massive amounts of data being collected in fields such as genomics, astrophysics, marketing and sociology

It is commonly high dimensional, meaning there are more variables ‘p’ than subjects ‘n’

Big data can be described by many ‘V’s: high volume, high velocity, high variety, high variability, low veracity/validity, high vulnerability, high volatility and high value

Big data requires more complex visualisations

30
Q

What are the 3 common graphical representations of quantitative data?

A

Histograms, box plots, scatter plots

31
Q

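What number is missing data represented by in R?

A

-9, in the data set used here. This is a coding convention of the data rather than of R itself; R’s built-in missing-value marker is NA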
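
A sentinel code such as -9 has to be recoded before any numerical summary, otherwise it is treated as a real value. The sketch below shows the idea in Python rather than R; the function `recode_missing` and the ages are hypothetical.

```python
from statistics import mean

MISSING_CODE = -9  # the code quoted on this card; other data sets use other codes


def recode_missing(values, missing_code=MISSING_CODE):
    """Replace the sentinel code with None so it cannot be mistaken for data."""
    return [None if v == missing_code else v for v in values]


ages = [23, -9, 41, 36, -9]  # hypothetical ages, two of them missing
clean = recode_missing(ages)
observed = [v for v in clean if v is not None]

naive_mean = mean(ages)      # wrongly treats -9 as a real age
clean_mean = mean(observed)  # uses only the observed ages

print(f"naive mean = {naive_mean:.1f}, mean of observed ages = {clean_mean:.1f}")
```

The naive mean is dragged down by the two -9 entries, which is the kind of misleading result an IDA is meant to catch.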

32
Q

What is a histogram? What are its features

A

We use a histogram for quantitative data, to highlight the percentage of data in one class interval compared to another. This can be done with a standard histogram or a density-scale histogram

Features include:

Contains a set of blocks which represents the percentages by area

Area of whole histogram is 100%

The horizontal scale is divided into class intervals

The area of each block represents the % of subjects in that particular class interval

The height of each block represents crowding or density (% per horizontal unit)

33
Q

What are the 3 typical choices we make with histograms?

A

There is no need for a vertical scale to assess the relative areas

We will mostly use the density scale

For continuous data we need to establish an endpoint convention for data points that fall on the border of two class intervals, e.g. [0,18), [18,21), so that a value of exactly 18 falls in the second interval
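
One way to sketch the left-closed, right-open convention in Python is shown below. The helper `class_interval` and the break points are assumptions made for this example.

```python
from bisect import bisect_right


def class_interval(value, breaks):
    """Return the interval [left, right) that contains value.

    breaks is a sorted list of interval endpoints. Intervals are left-closed
    and right-open, so a value on a border goes into the interval to its right.
    """
    if not breaks[0] <= value < breaks[-1]:
        raise ValueError(f"{value} lies outside [{breaks[0]}, {breaks[-1]})")
    i = bisect_right(breaks, value) - 1
    return breaks[i], breaks[i + 1]


breaks = [0, 18, 21, 40]  # hypothetical endpoints: [0,18), [18,21), [21,40)

print(class_interval(17.99, breaks))
print(class_interval(18, breaks))
```

A value of exactly 18 is assigned to [18,21), not to [0,18), so no data point is counted twice or dropped.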

34
Q

What is the formula for density scale?

A

Height of each block = (% of subjects in the block) / (length of the class interval), where the % is the number of subjects in the block as a percentage of the total number of subjects

i.e. height of each block = % per horizontal unit
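
The density-scale formula can be checked with a short Python sketch. The class intervals and counts are made up for illustration, and `density_heights` is a hypothetical helper name.

```python
def density_heights(intervals, counts):
    """Return the block heights on the density scale (% per horizontal unit).

    intervals: list of (left, right) endpoints of each class interval
    counts:    number of subjects in each class interval
    """
    total = sum(counts)
    heights = []
    for (left, right), count in zip(intervals, counts):
        percent = 100 * count / total             # % of subjects in the block
        heights.append(percent / (right - left))  # divide by interval length
    return heights


# Hypothetical distribution table with unequal class intervals.
intervals = [(0, 18), (18, 21), (21, 40), (40, 100)]
counts = [20, 10, 40, 30]

heights = density_heights(intervals, counts)

# The area of each block (height x width) recovers the % in that interval.
areas = [h * (right - left) for h, (left, right) in zip(heights, intervals)]

for (left, right), h, a in zip(intervals, heights, areas):
    print(f"[{left},{right}): height = {h:.2f}% per unit, area = {a:.0f}%")
```

In this toy table the interval [18,21) holds the smallest percentage but has the tallest block, because its subjects are crowded into only 3 units. The areas still add up to 100%.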

35
Q

How do you produce a histogram by hand?

A

Construct the distribution table with columns; class interval, number of subjects in the interval, %, height of block.

Then, draw the horizontal axis and the blocks, using the heights from the table

36
Q

What are 2 common mistakes with histograms?

A

Make the block heights equal to the percentages

Use too many class intervals

37
Q

Explain ‘make the block heights equal to the percentages’ as a common mistake for histograms

A

Here, we wrongly use the % as the heights. Unless the class intervals are all the same size, this makes wider class intervals look like they contain a larger share of the data than they do

38
Q

Explain ‘too many class intervals’ as a common mistake for histograms

A

This can break the data into too many thin blocks, making the histogram jagged and hard to read. As a rule of thumb, use between 10 and 15 class intervals

39
Q

What is a boxplot?

A

It plots the median (the middle data point) and a box covering the middle 50% of the data, and flags any outliers. The length of the box is the interquartile range (IQR).
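
The quantities a boxplot draws can be computed directly. The sketch below makes two assumptions that the card does not state: quartiles are computed with Python's "inclusive" method (software packages differ slightly here), and outliers are flagged with the common 1.5 x IQR rule. The data and the name `boxplot_summary` are hypothetical.

```python
from statistics import quantiles


def boxplot_summary(data):
    """Return the median, quartiles, IQR and flagged outliers of data."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points more than 1.5 x IQR beyond the box are outliers.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical values with one extreme point
summary = boxplot_summary(data)
print(summary)
```

Here the box runs from 3.25 to 7.75, the median is 5.5, and the value 50 is flagged as an outlier.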

It is useful for comparing multiple quantitative data sets

40
Q

When might I use a comparative boxplot?

A

When you want to split a quantitative variable by the categories of a qualitative variable and compare the groups
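
The grouping step behind a comparative boxplot can be sketched as follows: split the quantitative values by category, then summarise each group. The records (road user type and age) are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (road_user, age). One qualitative, one quantitative variable.
records = [
    ("driver", 34), ("driver", 52), ("driver", 41),
    ("passenger", 19), ("passenger", 23), ("passenger", 30),
    ("pedestrian", 70), ("pedestrian", 64), ("pedestrian", 81),
]

# Split the quantitative variable by the categories of the qualitative one.
groups = defaultdict(list)
for category, value in records:
    groups[category].append(value)

# Each group would get its own box; here we compare only the medians.
medians = {category: median(values) for category, values in groups.items()}
print(medians)
```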

41
Q

What is a scatter plot?

A

It examines the relationship between 2 quantitative variables (e.g. age and height)

42
Q

What graphical representation should I use if I have 1 qualitative variable / data set?

A

Simple bar plot

43
Q

What graphical representation should I use if I have 2 qualitative variables / data sets?

A

Double bar plot

44
Q

What graphical representation should I use if I have 1 quantitative variable / data set?

A

Histogram
Box plot

45
Q

What graphical representation should I use if I have 2 quantitative variables / data sets?

A

Scatter plot

46
Q

What graphical representation should I use if I have 1 quantitative variable / data set and 1 qualitative variable / data set?

A

Comparative box plot (possibly also a separate histogram for each category)
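
The last five cards amount to a small lookup table from variable types to plots. A possible Python sketch is below; the function name `suggest_plots` is hypothetical, and the table only covers the combinations listed on these cards.

```python
def suggest_plots(n_qualitative, n_quantitative):
    """Suggest plot types for a given number of qualitative and quantitative variables.

    Returns an empty list for combinations these cards do not cover.
    """
    table = {
        (1, 0): ["simple bar plot"],
        (2, 0): ["double bar plot"],
        (0, 1): ["histogram", "box plot"],
        (0, 2): ["scatter plot"],
        (1, 1): ["comparative box plot"],
    }
    return table.get((n_qualitative, n_quantitative), [])


print(suggest_plots(0, 1))
print(suggest_plots(1, 1))
```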