Organizing, Visualizing, and Describing Data Flashcards
Data (Definition)
a collection of numbers, characters, words, or text that represents FACTS or INFORMATION but NOT KNOWLEDGE (knowledge develops from analyzing and interpreting the facts and information)
What are the two main types of Data?
Numerical (quantitative) and Categorical (qualitative)
What is the definition of Categorical data?
Values that describe a quality or characteristic (mutually exclusive labels or groups)
What are the two types of Categorical data?
Nominal and Ordinal
What is the definition of nominal data?
No logical order (e.g. sectors of the economy)
What is the definition of ordinal data?
Has logical order or rank (note that there is no information in the distance between groups)
What is the definition of Numerical data?
Values that represent measured or counted quantities
What are the two types of Numerical data?
Integer/Discrete - limited to a finite number of values (number of people)
Ratio/Continuous - can take on any value within a range
What does NOIR stand for and how does it relate to data types?
Nominal
Ordinal
Integer/Discrete
Ratio/Continuous
Define variable
a particular quality or characteristic (Stock price, height)
Define observation
a value of a specific variable (GM $53.30 and Trish is 5’9”)
Define cross-sectional data
multiple observations of a particular variable across different observational units at the same point in time (the stock price of 60 companies)
Define time series
multiple observations of a particular variable for the same observational unit over time, i.e. one unit and multiple observations (GM’s stock price over the last 60 months)
Define panel data set
cross-sectional and time-series combined
Define structured data
Highly organized in a pre-defined manner (stock prices, returns, EPS)
Define unstructured data
no organized form (news, social media post, company filings, audio/video)
Define absolute frequency
the actual count of observations per value of the variable
Define relative frequency
Percentage of observations per value of the variable, which is the absolute frequency divided by the total number of observations (N)
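Absolute and relative frequency can be illustrated with a minimal Python sketch. The function name `frequency_table` and the sector labels are hypothetical, chosen only for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Map each distinct value to (absolute frequency, relative frequency)."""
    counts = Counter(values)   # absolute frequency: count per value
    n = len(values)            # total number of observations (N)
    return {value: (count, count / n) for value, count in counts.items()}

# Hypothetical sector labels for five stocks
sectors = ["Tech", "Energy", "Tech", "Health", "Tech"]
for sector, (absolute, relative) in frequency_table(sectors).items():
    print(f"{sector}: absolute={absolute}, relative={relative:.0%}")
```

The relative frequencies always sum to 1 (100%), which is a quick sanity check on any frequency table.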
How to create non-overlapping bins
Sort data in ascending order
Find the range: max-min
Decide on the number of intervals (which is k)
Calculate the interval width by dividing the range by k (always round up)
Add the interval width to the minimum value to get the end of the first bin, then keep adding the width to get each following bin
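The binning steps above can be sketched in Python. This is a rough illustration, not a definitive procedure: the function names `make_bins` and `bin_counts` are hypothetical, the width is rounded up to a whole number (the cards only say "round up"), and the last bin is treated as closed on the right so the maximum value is always counted.

```python
import math

def make_bins(data, k):
    """Return k non-overlapping (lower, upper) intervals covering the data."""
    ordered = sorted(data)                # step 1: sort ascending
    low, high = ordered[0], ordered[-1]
    data_range = high - low               # step 2: range = max - min
    # step 4: width = range / k, rounded up (assumption: to a whole number)
    width = math.ceil(data_range / k)
    # step 5: keep adding the width, starting from the minimum
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]

def bin_counts(data, bins):
    """Absolute frequency per bin; each bin is [lower, upper) except the last."""
    low = bins[0][0]
    width = bins[0][1] - bins[0][0]
    counts = [0] * len(bins)
    for x in data:
        # clamp so a value equal to the top edge lands in the last bin
        index = min(int((x - low) // width), len(bins) - 1)
        counts[index] += 1
    return counts

# Hypothetical observations, k = 4 intervals (step 3)
data = [12, 3, 25, 8, 17, 30, 21]
bins = make_bins(data, 4)
print(bins)
print(bin_counts(data, bins))
```

Rounding the width up guarantees that k bins starting at the minimum reach at least the maximum, so no observation is left outside the bins.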
What is a Contingency table
it’s a table that summarizes data for 2 or more categorical variables (helps visually find patterns)
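A contingency table can be built by counting how often each pair of categories occurs together. The sketch below is only an illustration: `contingency_table` is a hypothetical helper, and the sector and size labels are made up.

```python
from collections import Counter

def contingency_table(pairs):
    """Count joint occurrences of two categorical variables.

    `pairs` is a list of (row_label, column_label) tuples.
    Returns a nested dict: table[row_label][column_label] -> count.
    """
    counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    # Counter returns 0 for combinations that never occur
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}

# Hypothetical (sector, size) labels for four stocks
stocks = [("Tech", "Large"), ("Tech", "Small"), ("Energy", "Large"), ("Tech", "Large")]
table = contingency_table(stocks)
for sector, row in table.items():
    print(sector, row)
```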
What does a histogram or frequency polygon show?
represents the distribution of numerical data (y-axis shows frequency and x-axis shows intervals/values)
What does a bar chart show?
Represent the frequency distribution of categorical data
What does a tree map show
a set of coloured rectangles to represent groups
What is a line chart used for?
Used to visualize ordered observations
Typically used for time series data
Facilitates showing changes and underlying trends
What is a scatter plot used for?
Used to visualize the joint variation in two numerical variables
What is a heat map?
It is a contingency table with color-coded cells
It can also be used to visualize the degree of correlation among different variables
Define “measures of central tendency”
Measures of central tendency specify where data are centered (arithmetic mean, median, mode, weighted mean, geometric mean, harmonic mean)
Define “measures of location”
They include the measures of central tendency plus quantiles such as quartiles, quintiles, deciles, and percentiles
Define median
Median is the middlemost value of a set of observations.
It is not affected by extreme values (i.e. outliers).
It is useful for describing the central tendency of a non-symmetrical distribution.
If the distribution is perfectly symmetrical, then the mean equals the median.
How do you calculate median?
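Sort the observations in ascending order first.
For an odd number of observations: the median is the value at position (n+1)/2
For an even number of observations: the median is the average of the values at positions n/2 and (n+2)/2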
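As a minimal sketch of the position rule in Python: with the data sorted, an odd count takes the value at position (n+1)/2, and an even count averages the values at positions n/2 and (n+2)/2. Positions are 1-based, so the code subtracts 1 to index a Python list. The function name `median` is chosen for illustration; the standard library offers `statistics.median` for real use.

```python
def median(values):
    """Median via the 1-based position rule on sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # odd n: single middle value at position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the values at positions n / 2 and (n + 2) / 2
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2

print(median([5, 1, 9]))      # odd count  -> 5
print(median([4, 1, 3, 2]))   # even count -> 2.5
```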
Define mode
The most frequently occurring value in a distribution
When there is no mode, every value occurs with the same frequency (as in a uniform distribution).
This is the only measure of central tendency that can be used with nominal data
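Because the mode only requires counting, it works on labels as well as numbers. The sketch below is an illustration under one assumption: "no mode" is reported as an empty list when every value is equally frequent. The function name `modes` is hypothetical, and the input is assumed to be non-empty.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if there is no mode."""
    counts = Counter(values)
    top = max(counts.values())
    # assumption: if every distinct value is equally frequent, report no mode
    if len(counts) > 1 and all(count == top for count in counts.values()):
        return []
    return [value for value, count in counts.items() if count == top]

print(modes(["Tech", "Energy", "Tech"]))  # nominal data -> ['Tech']
print(modes([1, 2, 3]))                   # all equally frequent -> []
print(modes([1, 1, 2, 2, 3]))             # two modes -> [1, 2]
```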
When is geometric mean used?
Used with rates of change over time or to compute growth rates
Is the arithmetic mean always greater than the geometric mean?
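No. It is always greater than or equal to the geometric mean; the two are equal only when all observations are the same.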
What is the formula for geometric mean?
( ( (1 + R1) x (1 + R2) x ... x (1 + RN) ) ^ (1 / N) ) - 1
where R1 ... RN are the periodic returns and N is the number of periods
How are geometric mean and arithmetic mean related?
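Xg ≈ Xa - Variance/2 (an approximation; the more variable the returns, the further the geometric mean falls below the arithmetic mean)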
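A short Python sketch of the geometric mean return and its rough relationship to the arithmetic mean. The returns are made-up numbers, `geometric_mean_return` is a hypothetical helper, and using the population variance (`statistics.pvariance`) in the approximation is an assumption, since the card does not say which variance is meant.

```python
import math
import statistics

def geometric_mean_return(returns):
    """Compound the periodic returns, then take the N-th root and subtract 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1

# Hypothetical annual returns: +10%, -5%, +20%
returns = [0.10, -0.05, 0.20]

geometric = geometric_mean_return(returns)
arithmetic = statistics.mean(returns)
# assumption: population variance is used in the approximation
approximation = arithmetic - statistics.pvariance(returns) / 2

print(f"arithmetic mean: {arithmetic:.4f}")
print(f"geometric mean:  {geometric:.4f}")
print(f"Xa - Var/2:      {approximation:.4f}")
```

For these returns the geometric mean is below the arithmetic mean, and the approximation lands close to the exact geometric mean without matching it.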
What is an advantage of the harmonic mean?
It gives much less weight to large outliers (and more weight to small values)
When is it appropriate to use the harmonic mean?
It is appropriate for averaging ratios when the ratios are repeatedly applied to a fixed quantity to yield a variable number of units (e.g. dollar-cost averaging)
What is the formula for harmonic mean?
n / sum of all (1 / Xi)
n = number of observations and
Xi = the specific value for each observation.
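The dollar-cost averaging case can be checked with a small Python sketch: investing the same amount each period at different prices gives an average cost per share equal to the harmonic mean of the prices. The prices and budget are made-up numbers and `harmonic_mean` is written out for illustration; the standard library also provides `statistics.harmonic_mean`.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals of the observations."""
    return len(values) / sum(1 / x for x in values)

# Hypothetical share prices in three months, with a fixed amount invested each month
prices = [8.0, 10.0, 12.5]
budget = 1000.0

shares_bought = sum(budget / price for price in prices)   # more shares when the price is low
average_cost = (budget * len(prices)) / shares_bought     # total spent / total shares

print(f"harmonic mean of prices: {harmonic_mean(prices):.4f}")
print(f"average cost per share:  {average_cost:.4f}")
```

The two printed values agree, and both are below the simple arithmetic average of the prices because more shares are bought in the cheaper months.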