ST102.1 - Data visualisation and descriptive statistics Flashcards
What are the two broad aims of statistical analysis?
- Descriptive statistics: summarise the data which were collected, in order to make them more understandable.
- Statistical inference: use the observed data to draw conclusions about some broader population.
What is the purpose of descriptive statistics?
Descriptive statistics attempt to summarise some key features of the data to make them understandable and easy to communicate. These summaries may be graphical or numerical (tables or individual summary statistics).
What are categorical variables?
Categorical variables (aka, qualitative) take on values that are names or labels. The colour of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be examples of categorical variables.
- Nominal categorical variables have categories that are simply names or labels, with no numerical meaning and no inherent order. The term comes from the Latin "nomen" (meaning name).
- Unordered categories are nominal data.
- EXAMPLE. Region is a nominal variable coded (in alphabetical order) as follows:
- 1 = Africa, 2 = Asia, 3 = Europe, 4 = Latin America, 5 = Northern America, 6 = Oceania.
- Ordinal categorical variables have categories with a set order. However, there is no standard scale on which the differences between categories are measured.
- Ordered categories are ordinal data.
- EXAMPLE. The level of democracy is measured on an 11-point ordinal scale from 0 (lowest level of democracy) to 10 (highest level of democracy).
What are quantitative variables?
Quantitative variables take numerical values which measure or count quantities. By contrast, qualitative data are descriptive and concern phenomena which can be observed but not measured, such as language.
Where is statistical data typically stored?
The statistical data in a sample are typically stored in a data matrix.
- Variables are organised column-wise (x).
- Individual observations (units) are organised row-wise (y).
- The number of units in a dataset is the sample size, typically denoted by n.
- EXAMPLE. Here, n = 155 countries.
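The layout of a data matrix can be sketched in a few lines of Python. This is only an illustration: the variable names and all values below are hypothetical and are not taken from the 155-country dataset.

```python
# Hypothetical data matrix: each row is one unit (a country),
# each column is one variable. All values are made up for illustration.
columns = ["region", "democracy", "gdp_per_capita"]
rows = [
    [1, 3, 1.9],
    [3, 10, 38.5],
    [2, 6, 7.2],
    [4, 8, 11.0],
]

n = len(rows)             # sample size: the number of units (rows)
n_variables = len(columns)  # the number of variables (columns)

# Reading down a column gives all observed values of one variable.
democracy_index = columns.index("democracy")
democracy = [row[democracy_index] for row in rows]

print(f"n = {n} units, {n_variables} variables")
print("democracy:", democracy)
```

Reading across a row gives everything recorded about one unit, while reading down a column gives the sample of values for one variable.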
Why would you need to distinguish between uppercase and lowercase N in a data set?
Capital N denotes population size whereas lowercase n denotes sample size.
What are the 2 different characteristics of variables?
- Continuous variables can, in principle, take any real values within some interval.
- EXAMPLE. GDP per capita is continuous, taking any non-negative value.
- A variable is discrete if it is not continuous, i.e. if it can only take certain values, but not any others.
- EXAMPLE. Region and the level of democracy are discrete, with possible values of 1, 2, ..., 6, and 0, 1, 2, ..., 10, respectively.
- However, a discrete variable can also have an unlimited number of possible values.
- EXAMPLE. The number of visitors to a website in a day: 0, 1, 2, ...
What is the simplest possibility of a discrete variable?
Many discrete variables have only a finite number of possible values. The simplest possibility is a binary, or dichotomous, variable, with just two possible values.
- EXAMPLE. A person’s sex could be recorded as 1 = female and 2 = male. Similarly, the answer to a yes/no question could be recorded as 1 = Yes and 2 = No.
What does the sample distribution of a variable consist of?
The sample distribution of a variable consists of:
- A list of the values of the variable which are observed in the sample.
- The number of times each value occurs (the counts or frequencies of the observed values).
What can we do when the number of different observed values is small in a sample distribution?
When the number of different observed values is small, we can show the whole sample distribution as a frequency table of all the values and their frequencies.
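A frequency table of this kind can be built by counting how often each observed value occurs. The sketch below uses Python's `collections.Counter`; the list of region codes is a hypothetical sample of 10 units, not the real data.

```python
from collections import Counter

# Hypothetical sample of region codes (1 = Africa, ..., 6 = Oceania).
regions = [1, 3, 2, 1, 4, 3, 3, 6, 2, 1]

# Count how many times each value is observed.
counts = Counter(regions)

# The frequency table: (value, frequency) pairs in order of value.
frequency_table = sorted(counts.items())

for value, frequency in frequency_table:
    print(value, frequency)
```

Note that the value 5 does not appear in the table, because the sample distribution lists only the values which are actually observed in the sample.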
What is relative frequency?
Relative frequency (or experimental probability) is the number of times an event happens divided by the total number of trials in an actual experiment. For a sample distribution, it is the frequency of a value divided by the sample size n.
- This is a measure of proportion.
- EXAMPLE. Here ‘%’ is the percentage of countries in a region, out of the 155 countries in the sample.
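As a rough sketch, relative frequencies and percentages can be computed by dividing each count by the sample size. The region codes below are a hypothetical sample of 10 units, used only to show the arithmetic.

```python
from collections import Counter

# Hypothetical sample of region codes (1 = Africa, ..., 6 = Oceania).
regions = [1, 3, 2, 1, 4, 3, 3, 6, 2, 1]
n = len(regions)  # sample size

counts = Counter(regions)

# Relative frequency = frequency / n (a proportion between 0 and 1).
relative = {value: count / n for value, count in sorted(counts.items())}

# The same information expressed as a percentage of the sample.
percent = {value: 100 * count / n for value, count in sorted(counts.items())}

for value in relative:
    print(value, counts[value], relative[value], percent[value])
```

The relative frequencies add up to 1 (and the percentages to 100), apart from small floating-point rounding error.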
What is a bar chart?
A bar chart is the graphical equivalent of a table of frequencies. In the region example, the relative frequency of each region is clearly visible. A bar chart can also display grouped data derived from the raw data.
What is a histogram?
A histogram is like a bar chart, but without gaps between bars, and often uses more bars (intervals of values) than is sensible in a table.
Histograms are usually drawn using statistical software, such as R. You can let the software choose the intervals and the number of bars.
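The counting that sits behind a histogram can be sketched without any plotting library. The function name `histogram_counts`, the interval width and the GDP values below are all hypothetical choices for illustration; statistical software would normally choose the intervals for you.

```python
def histogram_counts(values, start, width, bins):
    """Count values in `bins` equal-width intervals.

    Interval k covers [start + k*width, start + (k+1)*width),
    i.e. the left endpoint is included and the right one is not.
    """
    counts = [0] * bins
    for v in values:
        k = int((v - start) // width)  # index of the interval containing v
        if 0 <= k < bins:              # ignore values outside the range
            counts[k] += 1
    return counts


# Hypothetical GDP per capita values (in thousands), made up for illustration.
gdp = [0.8, 1.9, 3.4, 7.2, 11.0, 12.5, 24.1, 38.5]
counts = histogram_counts(gdp, start=0, width=10, bins=4)

# A crude text histogram: one '#' per observation in each interval.
for k, c in enumerate(counts):
    print(f"[{k * 10}, {(k + 1) * 10}): " + "#" * c)
```

Because the intervals share endpoints, the bars of a histogram touch each other, which is why a histogram has no gaps between bars.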
What is important when you group your data into non-overlapping intervals?
When you group your data into non-overlapping intervals, the intervals must be Mutually Exclusive and Collectively Exhaustive (MECE), meaning that each observation must belong to one and only one of these intervals.
- Mutually Exclusive means that each observation belongs to at most one interval/group.
- Collectively Exhaustive means that each observation belongs to at least one of these intervals/groups.
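One way to check the two conditions is to count, for each observation, how many intervals contain it: the count must be exactly one. The sketch below assumes intervals of the form [lo, hi); the function names, intervals and values are hypothetical.

```python
def intervals_containing(x, intervals):
    """Return the intervals [lo, hi) that contain x."""
    return [(lo, hi) for lo, hi in intervals if lo <= x < hi]


def is_mece(values, intervals):
    """True if every value falls in exactly one interval."""
    return all(len(intervals_containing(x, intervals)) == 1 for x in values)


values = [2, 7, 10, 12, 29]  # hypothetical observations

good = [(0, 10), (10, 20), (20, 30)]         # no overlaps, no gaps
overlapping = [(0, 10), (5, 20), (20, 30)]   # 5 to 10 is covered twice
gappy = [(0, 10), (15, 30)]                  # 10 to 15 is not covered

print(is_mece(values, good))         # every value has exactly one interval
print(is_mece(values, overlapping))  # 7 belongs to two intervals
print(is_mece(values, gappy))        # 10 and 12 belong to no interval
```

The overlapping set fails mutual exclusivity and the gappy set fails collective exhaustiveness, so only the first set of intervals is suitable for a frequency table.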
What is the meaning of different brackets with non-overlapping intervals in frequency tables?
( or ) means that the endpoint is not included in the interval (exclusive), so it corresponds to < or >.
[ or ] means that the endpoint is included in the interval (inclusive), so it corresponds to ≤ or ≥.
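The bracket notation translates directly into comparisons. The helper `in_interval` below is a hypothetical function written for illustration, with the bracket type passed as a string.

```python
def in_interval(x, lo, hi, left="[", right=")"):
    """Check whether x lies in an interval with the given bracket types.

    left  is "[" (lo included, x >= lo) or "(" (lo excluded, x > lo).
    right is "]" (hi included, x <= hi) or ")" (hi excluded, x < hi).
    """
    above = x >= lo if left == "[" else x > lo
    below = x <= hi if right == "]" else x < hi
    return above and below


# The value 10 sits on the boundary between [0, 10) and [10, 20).
print(in_interval(10, 0, 10))              # [0, 10): 10 is excluded
print(in_interval(10, 10, 20))             # [10, 20): 10 is included
print(in_interval(10, 0, 10, right="]"))   # [0, 10]: 10 is included
print(in_interval(0, 0, 10, left="("))     # (0, 10): 0 is excluded
```

Using intervals such as [0, 10) and [10, 20) places a boundary value like 10 in exactly one interval, which is what keeps the intervals non-overlapping.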
How can you better display the sample distribution on a histogram?
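A greater number of intervals on a histogram will generally show the shape of the sample distribution in more detail. However, with too many intervals each bar contains only a few observations, and the overall shape becomes harder to see.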
What is skewness and symmetry used for in data presentation?
Skewness and symmetry are terms used to describe the general shape of a sample distribution.