Descriptive Statistics Fundamentals Flashcards
What is Descriptive Statistics?
Descriptive statistics refers to a set of methods used to summarize and describe the main features of a dataset, such as its central tendency, variability, and distribution.
These methods provide an overview of the data and help identify patterns and relationships.
Do you want to learn the appropriate statistics to use for different tests?
yes - do you know them?
What are the 2 main ways to classify data?
- Types of data
- Measurement levels
What are the 2 ‘Types of Data’ that you can have?
- Categorical
- Numerical
What is an example of Categorical Data?
A. Car brands like Audi, BMW, Mercedes, etc
B. Answers to Yes and No questions
Example - “Are you currently enrolled in a university?” “Do you own a car?”
What is an example of Numerical Data?
Numerical Data represents numbers. It has two subsets: Discrete and Continuous.
Is Numerical Data a subset of Types of Data or of Levels of Measurement?
Types of Data
Types of Data - overview
1. Types of Data
   a. Categorical
   b. Numerical
      i. Discrete
      ii. Continuous
2. Levels of Measurement
What are the ‘3 Types of Data’?
- Categorical
- Numerical - Discrete
- Numerical - Continuous
What are the two subsets of Numerical Data?
- Discrete
- Continuous
What is Discrete Data?
Data that can be counted in a finite manner; you can be sure the value will be a whole number. It is the opposite of continuous data.
Examples:
“How many children do you want?”
Scores on the SAT
Grades at university
Number of objects
Money as bank notes and coins
What is Continuous data?
Continuous Data is ‘infinite’ and impossible to count: it can take on an infinite number of values.
Examples:
Your weight
Height
Area
Distance
Time
A variable represents the weight of a person. What type of data does it represent?
numerical, continuous
A variable represents the gender of a person. What type of data does it represent?
Categorical
What are the 2 “Levels of Measurement”?
- Qualitative
- Quantitative - represented by numbers
What are the two types of Qualitative Data?
- Nominal
- Ordinal
What are examples of Nominal Data?
Categorical data like car brands or like the four seasons (winter, spring, summer, fall)
They are not numbers and cannot be ordered
Definition: (of a role or status) existing in name only.
What are examples of Ordinal Data?
Groups and categories that follow a strict order. Data that can be ordered.
Examples:
Likert Scale
Definition: relating to a thing’s position in a series.
“ordinal position of birth”
What are the two groups of Quantitative Data?
- Interval
- Ratio
What is unique about Ratio?
Ratio variables have a true zero; interval variables don’t
Most things we observe in the world are ratios
What are examples of Ratios?
Number of objects, distance, price and time
What is the most common Interval variable?
Temperature - it doesn’t have a true zero
Celsius and Fahrenheit are interval scales and have no true zero
Temperature in kelvins (K) is a ratio variable and has a true zero
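The difference between an interval scale and a ratio scale can be checked with a little arithmetic. The sketch below is only an illustration (the helper name `celsius_to_kelvin` and the two readings are made up): a ratio of Celsius readings depends on where zero was placed, while the same ratio in kelvins is meaningful.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to kelvins (0 K is the true zero)."""
    return celsius + 273.15

cold_c, warm_c = 10.0, 20.0  # hypothetical readings

# On an interval scale the ratio depends on the arbitrary zero point,
# so "20 °C is twice as hot as 10 °C" is not a valid statement.
naive_ratio = warm_c / cold_c

# On a ratio scale (kelvins) the ratio is meaningful.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 3))  # 1.035
```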
A variable represents the gender of a person. What type of data and level of measurement does it represent?
Categorical, Qualitative - Nominal
Gender is a nominal variable. The possible categories cannot be put in any order.
A variable represents the weight of a person. What type of data and level of measurement does it represent?
Numerical (Continuous), Quantitative - Ratio
Weight is a ratio variable, which means it is a quantitative measure that has a true zero point, signifying the absence of the attribute being measured. In the case of weight, zero signifies a complete lack of weight.
What is the most intuitive way to interpret data?
Visualization
What are some useful ways to visualize categorical variables?
a. Frequency distribution tables
b. Bar Charts
c. Pie Charts
d. Pareto diagrams
What is a Frequency Distribution Table?
A table with two columns: the category and its corresponding frequency.
frequency - the number of occurrences of each item
What is Relative Frequency?
Relative frequency is the percentage of the total frequency for each category
Example: The percentage of cars sold
All relative frequencies add up to 100%
Reveals the share of the total
e.g. market share, which is well represented by a pie chart
What is a Pareto Diagram?
A Pareto diagram is a special type of bar chart, where categories are shown in descending order of frequency
What does Frequency represent?
the number of occurrences of each item
What is Cumulative Frequency?
Cumulative Frequency is the running sum of relative frequencies
It starts as the relative frequency of the first item, then adds the second item, and so on until it reaches 100%
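Frequency, relative frequency, and cumulative frequency can be built in one pass. This is a minimal sketch with made-up sales records (the brand counts are hypothetical); categories are listed in descending order of frequency, the same order a Pareto diagram uses.

```python
from collections import Counter

# Hypothetical sales records: one entry per car sold.
sales = ["Audi"] * 5 + ["BMW"] * 3 + ["Mercedes"] * 2

frequency = Counter(sales)
total = sum(frequency.values())

# most_common() sorts categories by descending frequency (Pareto order).
cumulative = 0.0
table = []
for brand, count in frequency.most_common():
    relative = count / total      # share of the total
    cumulative += relative        # running sum of the shares
    table.append((brand, count, relative, cumulative))

for brand, count, relative, cum in table:
    print(f"{brand:<10}{count:>3}{relative:>8.0%}{cum:>8.0%}")
```

The last row's cumulative value reaches 100%, which is a quick sanity check on the table.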
How do you calculate the interval width?
Subtract the smallest number from the largest number and divide by the number of desired intervals:
interval width = (largest number - smallest number) / number of desired intervals
How many intervals are typically desired?
5-20
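The interval width calculation can be sketched as follows. The dataset and the choice of 5 intervals are assumptions for illustration; the counting loop is one simple way to assign values to intervals, not the only one.

```python
# Hypothetical dataset of numerical observations.
values = [1, 9, 22, 24, 32, 41, 44, 48, 57, 66,
          70, 73, 75, 76, 79, 82, 87, 89, 95, 100]
n_intervals = 5  # assumed choice within the usual 5-20 range

smallest, largest = min(values), max(values)
width = (largest - smallest) / n_intervals

# Interval edges: each interval ends where the next one starts.
edges = [smallest + i * width for i in range(n_intervals + 1)]

# Count the observations per interval; the last interval includes its upper edge.
counts = [0] * n_intervals
for v in values:
    index = min(int((v - smallest) / width), n_intervals - 1)
    counts[index] += 1

print(width)   # 19.8
print(counts)  # [2, 3, 4, 6, 5]
```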
If the frequency of a variable is 20 and its total frequency of all variables is 120, what is its relative frequency?
0.17 (20 / 120 ≈ 0.167, or about 17%)
What is the most common graph to represent Numerical Data?
The Histogram
Why do the bars in a histogram touch?
To show continuity between the intervals. Each interval ends where the next one starts.
True or False - Relative Frequency is made up of percentages?
True
Can histograms have intervals of unequal width?
Yes
What are two visualization options to represent relationships between two variables?
- Cross tables
- Scatter Plots
What are cross tables? What do they best represent?
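A table that lists the categories of one variable in the rows and the categories of the other variable in the columns, with a frequency in each cell and a total for each row and column. Cross tables best represent relationships between two categorical variables.
A variation is the side-by-side bar chart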
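A cross table can be assembled by counting pairs of categories. The sketch below uses invented survey answers to the two yes/no questions mentioned earlier; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical survey answers: (enrolled in a university?, owns a car?).
answers = [
    ("Yes", "Yes"), ("Yes", "No"), ("Yes", "No"),
    ("No", "Yes"), ("No", "Yes"), ("No", "No"), ("No", "Yes"),
]

cells = Counter(answers)                                   # joint frequencies
row_totals = Counter(enrolled for enrolled, _ in answers)  # totals per row
col_totals = Counter(car for _, car in answers)            # totals per column

print("enrolled  car=Yes  car=No  total")
for enrolled in ("Yes", "No"):
    with_car = cells[(enrolled, "Yes")]
    without_car = cells[(enrolled, "No")]
    print(f"{enrolled:<10}{with_car:>7}{without_car:>8}{row_totals[enrolled]:>7}")
print(f"{'total':<10}{col_totals['Yes']:>7}{col_totals['No']:>8}{len(answers):>7}")
```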
What is a scatter plot best used for?
A scatter plot is used when representing two ‘numerical’ variables
- representing relationships between two variables
- best used to get the main idea on how the data is distributed
What is a definition of an ‘Outlier’?
Outliers are data points that go against the logic of the whole dataset
What are the 3 measures of Central Tendency?
Mean, Median, Mode
What are some uses of Central Tendency?
They give you an idea of how the data in a given dataset is distributed.
The mean is the arithmetic average of all numbers. It is very useful because it indicates the average value in the dataset. However, the mean can be flawed because outliers might impact it significantly.
The median is the value at the 50th percentile of the distribution. It is largely unaffected by outliers and shows you what is in the middle of the distribution.
The mode is the value that is observed most frequently in the distribution. This gives you an idea about the value that reoccurs most often in the dataset.
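The three measures can be compared on a small dataset using Python's standard `statistics` module. The prices are made up, with one deliberate outlier, to show how the mean is pulled away from the median.

```python
import statistics

# Hypothetical prices; the last value is an outlier.
prices = [1, 2, 3, 3, 4, 5, 6, 50]

mean = statistics.mean(prices)        # pulled upward by the outlier
median = statistics.median(prices)    # middle of the sorted values
modes = statistics.multimode(prices)  # list of the most frequent value(s)

print(mean, median, modes)  # 9.25 3.5 [3]
```

Note that `statistics.multimode` returns every value when none repeats, so a result as long as the dataset itself corresponds to a dataset without a mode.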
What is Mean also known as?
The simple average
Denoted as mu (µ) for a population and x-bar (x̄) for a sample
What is the Median?
The middle number in the dataset
What is the Mode?
The mode is the value that occurs most often.
It can be used in both categorical and numerical data
When calculating Mode in a dataset what happens when no number is represented more than once?
We say, there is NO mode
Which Central Tendency measure is best?
The measures should be used together rather than independently. There is no best, but using only one is definitely the worst.
What is Skewness?
Skewness is the most common way to measure asymmetry.
Skewness indicates whether the data is concentrated on one side
What is a Positive or Right Skew?
When the mean > median.
Data points are concentrated on the Left side
(outliers are to the Right. Less data to the Right)
What kind of Skew happens when the Mean, Median and Mode are equal?
Zero or No Skew
the distribution is symmetrical
What is a Negative or Left Skew?
When the Mean < Median
The highest point is defined by the mode.
The outliers are to the left
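The mean-versus-median comparison above can be turned into a quick check. This is a sketch under assumptions: the three datasets are invented, and `moment_skewness` uses one common definition of skewness (the average cubed deviation divided by the cubed population standard deviation), which the cards themselves do not specify.

```python
import statistics

def skew_direction(data):
    """Classify skew by comparing the mean with the median (rule of thumb)."""
    mean, median = statistics.mean(data), statistics.median(data)
    if mean > median:
        return "positive (right)"
    if mean < median:
        return "negative (left)"
    return "zero"

def moment_skewness(data):
    """Average cubed deviation from the mean, divided by sigma cubed."""
    mean = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum((x - mean) ** 3 for x in data) / (len(data) * sigma ** 3)

right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 10]   # outlier on the right
left_skewed = [1, 7, 8, 8, 9, 9, 9, 10, 10]   # outlier on the left
symmetric = [1, 2, 3, 4, 5]

for data in (right_skewed, left_skewed, symmetric):
    print(skew_direction(data), round(moment_skewness(data), 2))
```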
Why is Skew important?
Skew tells us where the data is situated.
Skewness is the link between central tendency measures and probability theory
What are the 3 main measures of Variability?
- Variance
- Standard Deviation
- Coefficient of Variation
Do you use the same formulas when working with Population Data vs Sample Data?
No - different formulas are used
What does Variance measure?
Variance measures the dispersion of a set of data points around their mean
The closer a number is to the mean the lower the result (variance)
The farther away a number is from the mean the higher the result (variance)
Variance can never be negative:
dispersion is about distance, and distance cannot be negative
Because the deviations are squared, the result can be large and hard to compare with the original data
Which is more meaningful, Std Dev or Variance?
Std dev will be much more meaningful than variance
Are there different formulas for Std Deviation?
Yes, one for population data and one for sample data
What are the formulas for Standard deviation?
Population standard deviation = square root of the population variance
Sample standard deviation = square root of the sample variance
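The population and sample versions differ only in the divisor. The sketch below assumes the standard definitions (divide the sum of squared deviations by N for a population and by n - 1 for a sample) and cross-checks the results against Python's `statistics` module; the data is made up.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]  # hypothetical observations
n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n    # divide by N
sample_variance = squared_deviations / (n - 1)  # divide by n - 1

population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

print(population_variance, sample_variance)  # 2.0 2.5

# The standard library offers both versions directly.
assert math.isclose(population_std, statistics.pstdev(data))
assert math.isclose(sample_std, statistics.stdev(data))
```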
What is the formula for Coefficient of Variation (CV)?
standard deviation / mean
What is another name for Coefficient of Variation (CV)?
relative standard deviation
What is the most common measure of variability for a single dataset?
standard deviation
Why do we need the measure of Coefficient of Variation (CV)?
Comparing the standard deviations of two datasets measured in different units or on different scales is meaningless. Comparing their Coefficients of Variation is not.
Why is Standard Deviation the preferred measure of variability?
Because it is directly interpretable. It is given in original units. Variance is given in squared units.
Where is Coefficient of Variation (CV) best used?
When comparing the variability of two datasets
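A short sketch of why the coefficient of variation helps when comparing two datasets. The prices and the conversion rate of 18 are invented for illustration; the function uses the sample standard deviation, which is an assumption about which formula applies.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(data) / statistics.mean(data)

# Hypothetical prices of the same items in two currencies
# (a made-up conversion rate of 18 is assumed).
prices_a = [1, 2, 3, 4, 5]
prices_b = [p * 18 for p in prices_a]

# The standard deviations differ by the conversion factor ...
print(statistics.stdev(prices_a), statistics.stdev(prices_b))

# ... while the coefficients of variation are the same.
print(coefficient_of_variation(prices_a), coefficient_of_variation(prices_b))
```

Because the CV is unit-free, the two datasets show the same relative variability even though their standard deviations look very different.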
What are the 3 univariate measures? (one variable)
- Central Tendency
- Asymmetry
- Variability
What are the two methods to explore the relationship between two variables?
- Covariance
- Linear correlation coefficient
What is the main statistic to measure correlation?
Covariance - it may be positive, negative, or zero
What does the direction of covariance tell us?
> 0, the two variables move together
< 0, the two variables move in opposite directions
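= 0, the two variables have no linear relationship (independent variables have zero covariance, but zero covariance alone does not prove independence)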
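The sign of the covariance can be checked on small examples. This sketch assumes the sample covariance formula (sum of products of deviations divided by n - 1); the data is invented. The last series is worth noting: it is fully determined by `size`, yet its covariance is zero, so a zero result only rules out a linear relationship.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

size = [1, 2, 3, 4, 5]
rising = [10, 20, 30, 40, 50]    # moves together with size
falling = [50, 40, 30, 20, 10]   # moves in the opposite direction
u_shaped = [4, 1, 0, 1, 4]       # equals (size - 3) ** 2, not linear

print(sample_covariance(size, rising))    # 25.0
print(sample_covariance(size, falling))   # -25.0
print(sample_covariance(size, u_shaped))  # 0.0
```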
What does the correlation coefficient do?
It adjusts the covariance, so that the relationship between the two variables becomes easy and intuitive to interpret.
What is the range of the correlation coefficient?
-1 to +1
What does Perfect Positive Correlation mean?
The entire variability of one variable is explained by the other
Correlation coefficient = 1
What does a Correlation coefficient of Zero mean?
There is no linear relationship between the two variables. They may still be related in a non-linear way, so a zero correlation does not by itself prove independence.
What does a Negative Correlation Coefficient mean?
The variables move in opposite directions: when one goes up, the other goes down.
Is the correlation of (x, y) equal to the correlation of (y, x)?
Yes
Causality - Correlation does not imply causation
It is important to understand the direction of causal relationships
In housing, size causes the price and not vice versa
Causality is an asymmetric relation. (x causes y is different from y causes x)
What is the formula for Correlation Coefficient?
Cov(x, y) / (Stdev(x) * Stdev(y))
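The formula can be sketched directly in code. The house sizes and prices below are invented, and the sample versions of covariance and standard deviation are assumed; the helper name `correlation` is illustrative.

```python
import statistics

def correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical house sizes and prices.
size = [650, 785, 1200, 720, 975]
price = [772, 998, 1200, 800, 895]

r = correlation(size, price)
print(round(r, 2))

# The coefficient is symmetric: swapping the arguments gives the same value.
print(round(correlation(price, size), 2))
```

Dividing by both standard deviations is what keeps the result between -1 and +1, regardless of the units of the two variables.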
What are the types of data and the levels of measurement of the following variables: Cust ID, Mortgage, Year of Sale, Age, Price, Gender, State?
Variable        Type of Data                               Level of Measurement
Cust ID         Categorical                                Qualitative, Nominal
Mortgage        Categorical                                Qualitative, Nominal
Year of Sale    Numerical, Discrete                        Quantitative, Interval
Age             Numerical, Discrete (as a whole number)    Quantitative, Ratio
Price           Numerical, Continuous                      Quantitative, Ratio
Gender          Categorical                                Qualitative, Nominal
State           Categorical                                Qualitative, Nominal
What Excel function is used to calculate Correlation Coefficient?
CORREL()
What Excel function is used to calculate Covariance?
COVARIANCE.S()
When should you disregard correlations?
When the absolute value of the correlation coefficient is below 0.2