Topic 2: Data & Graphical Summaries Flashcards
What is the process of taking a first look at the data to get a general snapshot, without yet answering the research question?
Initial Data Analysis (IDA)
During the first snapshot of the data, what criteria are used to analyse it?
- Data background (quality + integrity)
- Data structure (characteristics of data including types, size, number)
- Data wrangling (making changes like scraping, cleaning, tidying, reshaping, splitting, combining)
- Data summaries (graphical & numerical)
Why is having the first look at the data critical?
IDA helps to capture the main qualities of the data and to suggest what the population may look like.
It also helps to assess whether the data can answer the research question and to pose follow-up questions.
Which two aspects need to be considered when analyzing the data?
Size: how many variables/subjects (p/n)
Type: quantitative or qualitative
How can size of data be described?
- Univariate (1 variable)
- Bivariate (2 variables)
- Multivariate (2+ variables)
Explain different types of data
- Qualitative (category):
+ Ordinal (in order): binary/3+
+ Nominal (no order): binary/3+
- Quantitative (measurement):
+ Discrete (separated)
+ Continuous (continuum)
Explain data and variable
Data is information about sets of subjects being studied.
Variables are the different measurements or categories that describe attributes of the subjects.
What type of graphical summary can be used for 1/2 qualitative variables?
1 qual variable: simple bar plot (build a frequency table before plotting)
2 qual variables: double bar plot (stacked or side-by-side)
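The "table before plot" step can be sketched in Python. This is only a minimal illustration using the standard library; the `colours` and `sizes` lists are made-up data, not part of the flashcards.

```python
from collections import Counter

# Hypothetical data: one list per qualitative variable, one entry per subject.
colours = ["red", "blue", "red", "green", "blue", "red"]
sizes = ["S", "M", "M", "S", "S", "M"]

# 1 qualitative variable: frequency table (the bar heights of a simple bar plot).
freq_table = Counter(colours)

# 2 qualitative variables: two-way table of counts
# (the bar heights of a stacked or side-by-side bar plot).
two_way_table = Counter(zip(colours, sizes))

print(dict(freq_table))
print(dict(two_way_table))
```

Each count in the table becomes the height of one bar, which is why the table is built first.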
What type of graphical summary can be used for 1/2 quantitative variables?
1 quan variable: histogram/boxplot
2 quan variables: scatter plot
What type of graphical summary can be used for 1 quan and 1 qual variables?
Comparative box plot
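The choices on the last three cards can be collected into one lookup. The function name `suggest_plot` is hypothetical, and the sketch covers only the combinations listed on these cards.

```python
def suggest_plot(n_qual, n_quan):
    """Map (number of qualitative, number of quantitative) variables
    to the graphical summary named on the flashcards."""
    summaries = {
        (1, 0): "simple bar plot",
        (2, 0): "double bar plot (stacked or side-by-side)",
        (0, 1): "histogram or box plot",
        (0, 2): "scatter plot",
        (1, 1): "comparative box plot",
    }
    # Any other combination is outside what these cards cover.
    return summaries.get((n_qual, n_quan), "not covered by these flashcards")


print(suggest_plot(1, 1))
```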
Function of a histogram
Presents the overall distribution of the data set
Highlights the percentage of data in one interval and allows comparison with other intervals
How is block height in a histogram calculated, and what is the scale called?
Density scale
Block height = (% of data in the block) / (interval length)
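A worked sketch of the density scale, assuming made-up values and interval edges, and assuming each interval includes its right end point but not its left one. The helper name `block_heights` is hypothetical.

```python
def block_heights(values, edges):
    """Return density-scale heights (% per unit length) for intervals (a, b]."""
    n = len(values)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        # Count the points falling in (left, right].
        count = sum(1 for v in values if left < v <= right)
        percent = 100 * count / n
        # Block height = % in the block / interval length.
        heights.append(percent / (right - left))
    return heights


# Hypothetical data and interval edges.
values = [1, 2, 3, 4, 5, 6, 8, 10, 15, 20]
edges = [0, 5, 10, 20]

heights = block_heights(values, edges)
widths = [b - a for a, b in zip(edges[:-1], edges[1:])]

# On the density scale the block areas (height x width) add up to 100%.
total_area = sum(h * w for h, w in zip(heights, widths))
print(heights, total_area)
```

Because height is percentage divided by width, a wide interval does not look taller just because it covers more values; the area of a block, not its height, represents the percentage.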
What is the end point convention? Give an example
It decides which interval includes points that fall on a border
e.g. (20,45]: 20 is not included, but 45 is
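The example can be checked with a small sketch. The function `in_interval` and its `closed` parameter are hypothetical names chosen for illustration.

```python
def in_interval(x, left, right, closed="right"):
    """Check whether x belongs to an interval under an end point convention.

    closed="right" means (left, right]; closed="left" means [left, right).
    """
    if closed == "right":
        return left < x <= right
    return left <= x < right


# Right-closed interval (20, 45]: 20 is excluded, 45 is included.
print(in_interval(20, 20, 45), in_interval(45, 20, 45))

# Left-closed interval [20, 45): the border points swap roles.
print(in_interval(20, 20, 45, closed="left"), in_interval(45, 20, 45, closed="left"))
```

Whichever convention is chosen, it should be applied to every interval so that each border point is counted exactly once.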
What information is conveyed from a box plot?
A box plot summarises a data set through “anchor” points: the median, the middle 50% (the box), and outliers. This makes it easy to compare different data sets
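The anchor points can be computed by hand as a rough sketch. Two assumptions here are not on the flashcards: outliers are flagged with the common 1.5 x IQR rule, and quartiles use the default method of Python's `statistics.quantiles` (other software may give slightly different quartiles). The data are made up.

```python
from statistics import median, quantiles

# Hypothetical data set with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles: q1 and q3 bound the middle 50% (the box); q2 is the median.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed outlier rule: points beyond 1.5 x IQR from the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(median(data), (q1, q3), outliers)
```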
In what case can a comparative box plot be used?
1 quan and 1 qual variable
The quan variable is split up by a qualitative variable
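Splitting a quantitative variable by a qualitative one can be sketched as below. The `records` data are made up; each group would get its own box in a comparative box plot, and only the medians are computed here to keep the example short.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (qualitative group, quantitative measurement).
records = [
    ("A", 150), ("B", 160), ("A", 155),
    ("B", 170), ("A", 160), ("B", 165),
]

# Split the quantitative values by the qualitative variable.
groups = defaultdict(list)
for category, value in records:
    groups[category].append(value)

# One summary per group; a comparative box plot draws one box per group.
medians = {category: median(values) for category, values in groups.items()}
print(medians)
```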