Quantitative data analysis Flashcards
What is statistics and what are two types?
The science of collecting and analyzing data for drawing conclusions and making decisions
1. Descriptive statistics
2. Inferential statistics
What is descriptive statistics?
It is a method of organizing, summarizing and presenting data in a convenient and informative way
- For example through graphs or numbers
What is inferential statistics?
- A branch of statistics that allows us to make predictions, estimates or generalizations about a population based on a sample
- Statistical inference is the process of acquiring information about populations from samples
What is the difference between probability and statistics?
Probability is deductive, meaning given what is in the box, you can reason about what is likely to end up in your hand
Statistics is inductive, meaning given what is in your hand, you can infer what is likely in the box
Qualitative (categorical) data representation vs quantitative data representation
Qualitative data representation means data is grouped into non-numerical and descriptive categories, and is then used to compare categories or proportions. Ex: Car colors
Quantitative data representation involves data that includes numbers and measurable quantities, and is used to analyze distributions, patterns or correlations. Ex: Car speeds
Common tools for qualitative data representation
- Summary table
- Bar chart
- Pie chart
- Pareto diagram (contains both bars and a line)
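A summary table and the numbers behind a Pareto diagram can be sketched in a few lines. This is a minimal illustration, assuming the usual Pareto layout (bars sorted from most to least frequent, with a line for the cumulative share); the function name `summary_table` and the car color data are made up for the example.

```python
from collections import Counter

def summary_table(values):
    """Return rows of (category, count, share, cumulative share),
    sorted from most to least frequent (the Pareto ordering)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, count in counts.most_common():
        cumulative += count
        rows.append((category, count, count / total, cumulative / total))
    return rows

# Hypothetical observations of car colors
colors = ["red", "blue", "red", "white", "blue",
          "red", "black", "white", "red", "blue"]

for category, count, share, cum_share in summary_table(colors):
    print(f"{category:<6} {count:>3} {share:>6.0%} {cum_share:>6.0%}")
```

The count column would feed the bars and the cumulative share column would feed the line.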
Common tools for quantitative data representation
- Scatter plot
- Histogram
What are three good practices when presenting data?
- Clearly labeled with title, labeled variables and specified units
- Source of data is identified
- Data have a date
What are 10 good practices when visualizing data?
- Identify target audience
- Make sure the data is clean
- Select the right chart
- Label the chart effectively
- Emphasize the important points
- Choose the best dashboard
- Format your chart for accessibility
- Make use of color
- Ensure data is readable in all formats
- Accept feedback
What are two key concepts in numerical representation of data?
- Measure of location: Describes where the data is centered or positioned. Key measures are mean, median, mode and quartiles
- Measure of variability and dispersion: Describes how spread out the data is around the center. Key measures are range, variance, standard deviation and coefficient of variation; box plots show the spread graphically
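The measures listed above can be computed with Python's standard `statistics` module. This is a small sketch on made-up car speeds. It assumes the sample versions of variance and standard deviation (dividing by n - 1), and note that quartile conventions differ between textbooks and tools.

```python
import statistics

speeds = [42, 48, 50, 50, 53, 57, 61, 90]  # hypothetical car speeds in km/h

# Measures of location
mean = statistics.mean(speeds)
median = statistics.median(speeds)
mode = statistics.mode(speeds)
q1, q2, q3 = statistics.quantiles(speeds, n=4)  # one of several quartile conventions

# Measures of variability
data_range = max(speeds) - min(speeds)
variance = statistics.variance(speeds)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(speeds)      # sample standard deviation
coeff_var = std_dev / mean              # coefficient of variation (unitless)

print(mean, median, mode, (q1, q2, q3))
print(data_range, variance, std_dev, coeff_var)
```

The single large value (90) pulls the mean above the median, which illustrates why the median is often preferred for skewed data.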
What are box plots and why are they useful?
Shows the median, quartiles and outliers (important!)
- A graphical representation of dispersion, skewness, outliers and other prominent features in data using quartiles
They are useful because if the median is closer to the bottom or top of the box, it suggests skewness
How do you compare boxplots?
- Median: Compare the position of the median lines, higher median = higher central tendency (determine how typical values (medians) differ between datasets)
- IQR (box length): Compare the size of the boxes, longer box = greater variability in the middle 50% of data (middle values more spread out -> bigger box)
- Whiskers: Compare length of whiskers, longer whiskers = greater spread in tails (non-outlier values reach further from the box -> longer whiskers)
- Outliers: Compare number and position of outliers, more outliers = presence of extreme values
How to construct boxplots
- Organize the n observations in the data set from smallest to largest
- Separate the smaller half and the larger half
- Find the lower fourth (median of the smaller data half)
- Find the upper fourth (median of the larger data half)
- Find the fourth spread fs = upper fourth - lower fourth. An observation more than 1.5fs from the nearest fourth is an outlier, and more than 3fs from the nearest fourth is an extreme outlier
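The steps above can be sketched as a short function. Two assumptions are not spelled out in the steps: for an odd number of observations the median is included in both halves, and outlier distances are measured from the nearest fourth. The function name `fourth_summary` and the data are hypothetical.

```python
import statistics

def fourth_summary(data):
    """Return the fourths, fourth spread and outliers of a data set."""
    xs = sorted(data)          # order the observations
    n = len(xs)
    half = (n + 1) // 2        # assumption: odd n puts the median in both halves
    lower_fourth = statistics.median(xs[:half])
    upper_fourth = statistics.median(xs[n - half:])
    fs = upper_fourth - lower_fourth

    def distance(x):
        # Distance from x to the nearest fourth (0 if x lies inside the box)
        return max(lower_fourth - x, x - upper_fourth, 0)

    outliers = [x for x in xs if distance(x) > 1.5 * fs]
    extreme = [x for x in xs if distance(x) > 3 * fs]
    return {
        "lower_fourth": lower_fourth,
        "upper_fourth": upper_fourth,
        "fs": fs,
        "outliers": outliers,
        "extreme_outliers": extreme,
    }

summary = fourth_summary([42, 48, 50, 50, 53, 57, 61, 80])
print(summary)
```

In this example the value 80 lies 21 units above the upper fourth, which is more than 1.5fs = 15 but less than 3fs = 30, so it is an outlier but not an extreme one.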
What are main concepts in inferential statistics?
- Sampling: Taking a subset (sample) of a population to estimate the characteristics of the whole population
- Estimation: Using sample data to estimate population parameters like mean, proportion, etc
- Hypothesis testing: Testing assumptions (hypotheses) about a population using sample data
- Confidence intervals: A range of values used to estimate the true value of a population parameter with a given level of confidence (e.g. 95%)
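Sampling, estimation and confidence intervals fit together in one small calculation. The sketch below estimates a population mean from a sample and builds an interval around it using the normal approximation. This is an assumption made for simplicity: for small samples a t-based interval is usually preferred. The function name and the data are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean
    (normal approximation, assumes a reasonably large random sample)."""
    n = len(sample)
    mean = statistics.mean(sample)                     # point estimate
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical sample of car speeds in km/h
sample = [42, 48, 50, 50, 53, 57, 61, 55, 49, 52]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```

Raising the confidence level (for example to 99%) widens the interval, since a wider range is needed to be more confident of covering the true mean.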
What is a sample?
An observed subset of a population