Exploring Data - Topic 2: Data and Graphical Summaries Flashcards
What is Data?
Data is information about the set of subjects being studied (for example, road fatalities). Most commonly, data refers to a sample rather than the whole population (unless it is a census).
What are some examples of the different types / formats of data?
Survey data
Spreadsheet type data
MRI image data
What is the Initial Data Analysis (IDA)?
It is a first general look at the data, without formally answering the research questions.
The purposes of IDA are to ensure that later statistical analysis can be performed efficiently and to minimise the risk of incorrect or misleading results
What could an IDA assist with?
It could help you to:
See whether the data can answer your research questions
Pose other research questions
Identify the data’s main qualities and suggest the population from which the sample derives
What steps does the IDA involve?
Commonly involves:
Data background: checking the quality and integrity of the data
Data structure: what info has been collected?
Data wrangling: scraping, cleaning, tidying, reshaping, splitting, combining
Data summaries: graphical and numerical
NOTE: EVERY STEP INVOLVED IN THE IDA HAS TO BE DOCUMENTED, AS THIS ALLOWS THE ANALYSIS TO BE REPRODUCED
What is a variable?
A variable measures or describes some attribute of the subject. Data with ‘p’ variables is said to have dimension p.
What is it called when there is only 1 variable involved?
Univariate
What is it called when there are 2 variables involved?
Bivariate
What is it called when there are more than 2 variables involved?
Multivariate
Would an anonymous identifier such as CRASH ID count as a variable?
No. It adds no useful information about the subject; it only allows each record to be identified.
Is recording raw quantitative or qualitative data preferable?
Raw quantitative data, if possible. It can easily be summarised into qualitative data, whereas it is hard to convert qualitative data back into quantitative data.
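To illustrate why raw quantitative data is the safer thing to record, here is a minimal Python sketch. The age bands and the sample ages are made up for illustration and are not from the course material.

```python
def age_group(age):
    """Map a raw age in years to a hypothetical ordinal age band."""
    if age < 18:
        return "under 18"
    if age < 65:
        return "18 to 64"
    return "65 and over"

# Made-up raw quantitative data.
ages = [4, 17, 18, 42, 64, 65, 90]

# Quantitative -> qualitative is a one-line summary.
groups = [age_group(a) for a in ages]
print(groups)
```

Going the other way is not possible: once only "18 to 64" has been recorded, there is no way to tell whether the original age was 18, 42 or 64.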
What are the two types of variables?
Qualitative / Categorical or Quantitative / Numerical
What are qualitative / categorical variables/data?
Qualitative data are non-numeric and include information such as verbal responses to open-ended questions, which cannot be valued numerically.
Categorical data is a form of qualitative data that can be grouped into categories instead of measured numerically
The answers are typically in words. If the answer is in words –> categorical
What is an example of categorical data?
What is your gender? –> male or female
What are quantitative / numerical variables?
Its value will always be in number form.
The answers are typically in numbers
Data expressed in numbers
What are examples of numerical data?
age and income
What are the two types of numerical data?
Discrete and Continuous
What is discrete data?
Data that can only take certain separate values is called discrete data. It is data that can be counted, and it usually comes in the form of whole numbers. The values can’t be broken into smaller parts.
What is continuous data?
This is data that can take any value within a range. The values can be broken into smaller parts, such as fractions and decimal places.
Example of discrete data
The number of tickets sold in a day
The number of students in your class
Example of continuous data
Weight of a baby in the first year
Temp of a room throughout a day
What are the two types of categorical data?
Ordinal (ordered) and Nominal (non-ordered)
What is ordinal data?
This is data which can be classified into categories that are ranked in a natural order
What is an example of ordinal data?
The level of education, the range of income, or the grades
What is nominal data?
It is qualitative data used to name or label categories that have no natural order and no numeric value
What is an example of nominal data?
Names of people
Nationalities
Hair colour
What is the best graphical representation for qualitative data?
Simple bar plot, double bar plot, stacked bar plot, side by side bar plot
BAR PLOTS are the best way to represent qualitative data
A simple bar plot is for one qualitative variable
Double bar plots, stacked bar plots and side-by-side bar plots are all good ways to represent 2 qualitative variables
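A bar plot is just a picture of category counts. As a rough sketch (plain Python, with made-up hair colour data), the counts behind a simple bar plot can be tallied and drawn as text like this:

```python
from collections import Counter

# Made-up nominal data: one qualitative variable.
hair_colours = ["brown", "black", "brown", "blonde", "brown", "black"]

# Tally how many subjects fall in each category.
counts = Counter(hair_colours)

def text_bar_plot(counts):
    """Return one text bar per category, most frequent first."""
    lines = []
    for category, count in counts.most_common():
        lines.append(f"{category:<8}{'#' * count} ({count})")
    return lines

for line in text_bar_plot(counts):
    print(line)
```

Each bar's length is the count for that category, which is all a simple bar plot shows.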
What is big data?
It is the massive amounts of data being collected in fields such as genomics, astrophysics, marketing and sociology
It is commonly high dimensional, meaning there are more variables ‘p’ than subjects ‘n’
Big data can be described by many ‘V’s: high volume, high velocity, high variety, high variability, low veracity/validity, high vulnerability, high volatility and high value
Big data requires more complex visualisations
What are the 3 common graphical representations of quantitative data?
Histograms, box plots, scatter plots
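What is missing data represented by (the number) in R?
-9 (this is a code used within the data set, not by R itself; R’s own marker for a missing value is NA)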
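A numeric missing-value code such as -9 has to be dealt with during data wrangling, otherwise it is treated as a real measurement. The sketch below uses Python rather than R, and the ages are made up; it only illustrates how much the code can distort a summary.

```python
MISSING_CODE = -9  # the sentinel value named in the card above

# Made-up ages, two of which were not recorded.
ages = [34, -9, 51, 19, -9, 72]

# Recode the sentinel to None so it cannot be mistaken for a real age.
cleaned = [None if a == MISSING_CODE else a for a in ages]
observed = [a for a in cleaned if a is not None]

mean_raw = sum(ages) / len(ages)              # wrongly counts -9 as an age
mean_clean = sum(observed) / len(observed)    # uses recorded ages only

print(mean_raw, mean_clean)
```

The mean computed on the raw values is pulled well below the mean of the recorded ages, which is why the recoding step belongs in the IDA.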
What is a histogram? What are its features
We use a histogram for quantitative data, to highlight the percentage of data in one class interval compared to another. This can be a regular (frequency) histogram or a density scale histogram
Features include:
Contains a set of blocks which represents the percentages by area
Area of whole histogram is 100%
The horizontal scale is divided into class intervals
The area of each block represents the % of subjects in that particular class interval
The height of each block represents crowding or density (% per horizontal unit)
What are the 3 typical choices we make with histograms?
There is no need for a vertical scale to assess the relative areas
We will mostly use the density scale
For continuous data we need to establish an endpoint convention for data points that fall on the border of two class intervals, e.g. [0,18), [18,21), where each interval includes its left endpoint but not its right
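The endpoint convention can be sketched in Python as follows. The breaks 0, 18 and 21 come from the card above; the final break at 30 and the function name class_interval are assumptions added for illustration.

```python
from bisect import bisect_right

# Class intervals [0,18), [18,21), [21,30); the last break is hypothetical.
breaks = [0, 18, 21, 30]

def class_interval(value, breaks):
    """Return the left-closed, right-open interval containing value, or None."""
    if value < breaks[0] or value >= breaks[-1]:
        return None  # outside every class interval
    # bisect_right counts the breaks <= value, so a value sitting exactly
    # on a break is placed in the interval that starts at that break.
    i = bisect_right(breaks, value) - 1
    return (breaks[i], breaks[i + 1])

print(class_interval(17.9, breaks))
print(class_interval(18, breaks))
```

A value of exactly 18 lands in [18,21) and not in [0,18), so every data point is counted once and only once.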
What is the formula for density scale?
Height of each block = (% of subjects in the block) / (length of the class interval)
i.e. height of each block = % per horizontal unit
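As a small worked example of the density scale formula, here is a Python sketch. The percentages are made up; only the class intervals [0,18) and [18,21) appear in the cards.

```python
def block_height(percent, lower, upper):
    """Density-scale height: % in the block per horizontal unit."""
    return percent / (upper - lower)

# Hypothetical percentages for two class intervals.
h_wide = block_height(36, 0, 18)     # 36% spread over 18 units
h_narrow = block_height(12, 18, 21)  # 12% spread over 3 units

print(h_wide, h_narrow)
```

The narrow interval holds a smaller percentage but gets the taller block, because its subjects are more crowded per horizontal unit.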
How do you produce a histogram by hand?
Construct the distribution table with columns: class interval, number of subjects in the interval, %, height of block.
Then draw the horizontal axis, and draw each block over its class interval at the height from the table
What are 2 common mistakes with histograms?
Make the block heights equal to the percentages
Use too many class intervals
Explain ‘make the block heights equal to the percentages’ as a common mistake for histograms
Here, we wrongly use the % as the heights. Unless the class intervals are all the same size, this will make larger class intervals look like a larger overall %
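The effect of this mistake can be checked numerically. In the Python sketch below, the distribution table is made up (three unequal class intervals whose percentages sum to 100), and area_shares is a hypothetical helper that reports what share of the drawn area each block occupies.

```python
# Made-up distribution table: ((lower, upper), % of subjects).
table = [((0, 18), 36), ((18, 21), 12), ((21, 71), 52)]

def area_shares(table, use_density):
    """Return each block's share of the total drawn area, in %."""
    areas = []
    for (lower, upper), percent in table:
        width = upper - lower
        # Correct: height is % per unit. Mistake: height is the raw %.
        height = percent / width if use_density else percent
        areas.append(height * width)
    total = sum(areas)
    return [round(100 * a / total, 1) for a in areas]

correct = area_shares(table, use_density=True)
wrong = area_shares(table, use_density=False)

print(correct)
print(wrong)
```

On the density scale each block's area share matches its percentage. With the raw percentages as heights, the widest interval takes up far more of the picture than the 52% of subjects it actually contains.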
Explain ‘too many class intervals’ as a common mistake for histograms
This spreads the data too thinly across the blocks, making the histogram jagged and hard to read. As a rule of thumb, use between 10 and 15 class intervals
What is a boxplot?
It plots the median (the middle data point), a box spanning the middle 50% of the data (the length of the box is the IQR), and any outliers.
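The numbers behind a boxplot can be sketched with Python's statistics module. Two things here are assumptions not stated in the cards: outliers are flagged with the common 1.5 x IQR rule, and quartiles use the "inclusive" method (software packages use slightly different quartile conventions). The data is made up.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Return the median, quartiles, IQR and flagged outliers."""
    # n=4 gives the three cut points Q1, median, Q3.
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # length of the box: the middle 50% of the data
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": med, "q3": q3, "iqr": iqr, "outliers": outliers}

# Made-up data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
summary = boxplot_summary(data)
print(summary)
```

The value 40 sits far above the upper fence, so it would be drawn as a separate point rather than as part of the box or whiskers.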
It is useful for comparing multiple quantitative data sets
When might I use a comparative boxplot?
When you want to compare a quantitative variable across the categories of a qualitative variable. It splits the quantitative data by category and draws one box plot per category
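The splitting step behind a comparative boxplot can be sketched in Python. The records below (a vehicle type paired with an age) are made up for illustration, and medians_by_group is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import median

# Made-up records: (qualitative variable, quantitative variable).
records = [
    ("car", 34), ("car", 51), ("motorcycle", 22),
    ("car", 47), ("motorcycle", 29), ("motorcycle", 25),
]

def medians_by_group(records):
    """Split the quantitative values by category and take each median."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return {category: median(values) for category, values in groups.items()}

group_medians = medians_by_group(records)
print(group_medians)
```

A comparative boxplot draws one box per category from these same groups, so the medians can be compared at a glance.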
What is a scatter plot?
It examines the relationship between 2 quantitative variables (e.g. age and height)
What graphical representation should I use if I have 1 qualitative variable / data set?
Simple bar plot
What graphical representation should I use if I have 2 qualitative variables / data sets?
Double bar plot
What graphical representation should I use if I have 1 quantitative variable / data set?
Histogram
Box plot
What graphical representation should I use if I have 2 quantitative variables / data sets?
Scatter plot
What graphical representation should I use if I have 1 quantitative variable / data set and 1 qualitative variable / data set?
Comparative box plot (histograms drawn separately for each category are another option)
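The last few cards amount to a lookup from variable types to a plot. A minimal Python sketch of that lookup, with suggest_plot as a hypothetical helper name, could be:

```python
def suggest_plot(n_qualitative, n_quantitative):
    """Suggest a plot from the number of variables of each type."""
    choices = {
        (1, 0): "simple bar plot",
        (2, 0): "double bar plot",
        (0, 1): "histogram or box plot",
        (0, 2): "scatter plot",
        (1, 1): "comparative box plot",
    }
    # Combinations not covered by the cards get no suggestion.
    return choices.get((n_qualitative, n_quantitative), "not covered by these cards")

print(suggest_plot(1, 1))
print(suggest_plot(0, 2))
```

This only encodes the choices listed in the cards; it is a memory aid, not a rule for every data set.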