Organising, Visualising and Describing Data Flashcards
3 Classes of Data Types
- Numerical (quantitative) vs Categorical (qualitative)
- Time series vs Cross-sectional
- Structured vs Unstructured
What are the 2 types of Categorical Data?
Nominal - no logical order
Ordinal - logical order
Can you perform mathematical operations on categorical data?
No
What is the difference between time series and cross-sectional data?
Time series - a sequence of observations of one variable taken over successive periods of time.
Cross-sectional - a set of comparable observations all taken at one specific point in time.
What is Panel Data?
A combination of cross-sectional and time series data: the same set of units observed over multiple time periods.
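To make the three layouts concrete, here is a minimal Python sketch. The tickers, years and return figures are made up for illustration, and the helper names `time_series` and `cross_section` are hypothetical. It stores a small panel keyed by (entity, period) and slices it both ways.

```python
# Hypothetical panel: annual returns (%) for three made-up tickers over three years.
panel = {
    ("AAA", 2021): 5.0, ("AAA", 2022): -2.0, ("AAA", 2023): 7.5,
    ("BBB", 2021): 3.1, ("BBB", 2022): 4.4, ("BBB", 2023): -1.2,
    ("CCC", 2021): 8.0, ("CCC", 2022): 0.5, ("CCC", 2023): 2.3,
}

def time_series(panel, entity):
    """One entity observed across time, ordered by period."""
    return [(period, value)
            for (e, period), value in sorted(panel.items())
            if e == entity]

def cross_section(panel, period):
    """Many entities observed at a single point in time."""
    return {e: value for (e, p), value in panel.items() if p == period}

aaa_series = time_series(panel, "AAA")      # time series slice
snapshot_2022 = cross_section(panel, 2022)  # cross-sectional slice
```

Slicing by entity gives a time series, slicing by period gives a cross-section, and the full dictionary is the panel.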
Unstructured data can be classified according to how the data is generated. Give an example of each source.
Individuals (social media post)
Processes (withdrawal)
Sensors (Camera)
Define the following:
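1. Absolute, Relative and Cumulative Frequency
Absolute - the actual (nominal) number of observations in each interval; this is what a histogram plots.
Relative - the absolute frequency expressed as a % of the total number of observations.
Cumulative - the running total of the frequencies up to and including each interval; cumulative relative frequency adds up to 100% at the last interval.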
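A short Python sketch of the three frequency types, assuming a made-up sample of eight observations that have already been assigned to intervals (the interval labels and data are illustrative only):

```python
from collections import Counter

# Made-up sample: each observation is labelled with the interval it falls in.
observations = ["0-5", "0-5", "5-10", "5-10", "5-10", "10-15", "10-15", "15-20"]
intervals = ["0-5", "5-10", "10-15", "15-20"]  # ascending order

absolute = Counter(observations)   # raw count per interval
total = len(observations)

# Relative frequency: each count as a share of all observations.
relative = {iv: absolute[iv] / total for iv in intervals}

# Cumulative relative frequency: running total of counts, divided by the total.
cumulative = {}
running = 0
for iv in intervals:
    running += absolute[iv]
    cumulative[iv] = running / total
```

The last cumulative value is always 1.0 (100%), which is a quick check that no observation was dropped.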
What is a joint frequency and what is a marginal frequency?
Joint - the count in one data cell of a contingency table (two-dimensional array = a normal table with columns and rows). Basically the number of observations where the 2 variables (row and column label) occur simultaneously.
Marginal - the total frequency for a row or a column.
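As an illustration, the sketch below builds the cell counts and the row/column totals of a contingency table in Python. The sector and size labels are invented for the example.

```python
from collections import Counter

# Hypothetical observations: (sector, size) for eight made-up companies.
pairs = [
    ("Tech", "Large"), ("Tech", "Small"), ("Tech", "Small"),
    ("Energy", "Large"), ("Energy", "Large"), ("Energy", "Small"),
    ("Tech", "Large"), ("Energy", "Large"),
]

# Joint frequency: one count per (row, column) cell.
joint = Counter(pairs)

# Marginal frequencies: totals per row and per column.
row_marginal = Counter(sector for sector, _ in pairs)
col_marginal = Counter(size for _, size in pairs)
```

Each marginal frequency equals the sum of the joint frequencies in its row or column, e.g. the "Large" column total is the Tech/Large cell plus the Energy/Large cell.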
What is a contingency table and what is a confusion matrix?
Contingency table - a table used to analyse 2 variables at the same time.
A confusion matrix is an example of a contingency table. One variable is predicted… and the other variable is actual…, so it shows actual vs predicted.
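A minimal sketch of a confusion matrix in Python, assuming a made-up binary classifier (1 = default, 0 = no default). The labels are invented, and the cell names true/false positive/negative are standard terminology rather than something taken from these notes.

```python
from collections import Counter

# Made-up labels for ten observations.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Each cell is the joint frequency of an (actual, predicted) pair.
cells = Counter(zip(actual, predicted))

tp = cells[(1, 1)]  # actual 1, predicted 1 (true positive)
fn = cells[(1, 0)]  # actual 1, predicted 0 (false negative)
fp = cells[(0, 1)]  # actual 0, predicted 1 (false positive)
tn = cells[(0, 0)]  # actual 0, predicted 0 (true negative)
```

The four cells together account for every observation, exactly as in any other 2 x 2 contingency table.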
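Benefits of a: histogram, frequency polygon, cumulative frequency distribution chart, bar chart, grouped (clustered) bar chart, stacked bar chart
1. Histogram - quickly see where the concentration of observations lies.
2. Frequency polygon - joins the midpoints of the histogram intervals; can be either relative or absolute.
3. Cumulative frequency distribution chart - shows the number or share of observations that fall below the upper limit of each interval.
4. Bar chart - illustrates RELATIVE sizes/degrees/magnitudes of categories.
5. Grouped (clustered) bar chart - can illustrate 2 categories at once (adds another variable).
6. Stacked bar chart - shows both the marginal (total) and joint frequencies in the same bar.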
Benefits and features of: a tree map; word cloud; line chart; bubble line chart; scatter plot + scatter plot matrix (3 variables); heat map
- Tree map - visualises the relative size of categories.
- Word cloud - visualises text (categorical data).
- Line chart - illustrates time series data. Can plot multiple lines if the scale is comparable.
- Bubble line chart - adds another dimension to a line chart: each point has a bubble sized in proportion to an additional variable.
- Scatter plot - shows the relationship between 2 variables and how strong it is.
- Heat map - drawn from a contingency table; uses colour to visualise the concentration of data.
Place all charts you can think of into the following 3 categories: Relationships, Comparisons and Distributions. A chart can appear in more than one.
Relationships: Scatter/scatter plot matrix, heat maps
Comparisons: Bar chart, tree maps, heat maps, dual line charts, bubble line charts
Distributions: Histogram, frequency polygon, cumulative distribution charts, bar charts, tree maps, heat maps for categorical data, word clouds for text data
Formula for μ (population mean) and X̄ (sample mean), and what is an arithmetic mean and its 4 properties?
μ = sum of all observations in the population / no. of obs, i.e. μ = ΣXi / N
X̄ = the same but for a sample, i.e. X̄ = ΣXi / n
An arithmetic mean = sum of observations / no. of obs
1. All interval and ratio data sets have an arithmetic mean.
2. All data values are considered and included in the arithmetic mean.
3. A data set has only one arithmetic mean.
4. The deviations of the data points from the mean sum to 0, so: Σ(Xi − X̄) = 0.
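Property 4 is easy to check numerically. A small Python sketch, using a made-up sample chosen so the mean is a whole number:

```python
from statistics import mean

sample = [4, 8, 15, 16, 23, 42]  # made-up data; sum is 108

x_bar = mean(sample)                      # 108 / 6
deviations = [x - x_bar for x in sample]  # each observation minus the mean
deviation_sum = sum(deviations)           # property 4 says this is 0
```

With data whose mean is not exactly representable as a float, the sum may come out as a tiny rounding error rather than exactly 0.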
2 techniques to deal with the pitfalls of the arithmetic mean
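Trimmed mean - excludes a stated percentage of the most extreme observations; a 1% trimmed mean would exclude the top 0.5% and the bottom 0.5%.
Winsorized mean - substitutes values for the extreme observations rather than excluding them (doesn't sound very good).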
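The two adjustments can be sketched in Python as below. This is one simple rank-based variant, not a definitive implementation: it assumes the stated percentage is split evenly between the two tails, rounds the number of affected observations down, and (for winsorizing) replaces each extreme value with the nearest value that is kept. The function names and sample data are made up.

```python
def trimmed_mean(values, pct):
    """Drop pct/2 percent of observations from each tail, then average the rest."""
    data = sorted(values)
    k = int(len(data) * pct / 100 / 2)  # observations cut from each end
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

def winsorized_mean(values, pct):
    """Replace pct/2 percent in each tail with the nearest kept value, then average."""
    data = sorted(values)
    k = int(len(data) * pct / 100 / 2)  # observations replaced at each end
    if k:
        low, high = data[k], data[-k - 1]
        data = [low] * k + data[k:len(data) - k] + [high] * k
    return sum(data) / len(data)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 20, 100]  # 100 is an obvious outlier

plain = sum(sample) / len(sample)        # 15.6
trimmed = trimmed_mean(sample, 20)       # drops 1 and 100
winsorized = winsorized_mean(sample, 20) # 1 becomes 2, 100 becomes 20
```

Both adjusted means land well below the plain mean of 15.6, because the outlier 100 no longer dominates. The winsorized version keeps all ten observations in the count, while the trimmed version averages only eight.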
How do you calculate a weighted mean?
X̄w = w1X1 + w2X2 + … + wnXn, where the weights w1, …, wn sum to 1
e.g. A portfolio consists of 50% common stocks, 40% bonds, and 10% cash. If the return on common stocks is 12%, the return on bonds is 7%, and the return on cash is 3%, what is the portfolio return?
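Applying the weighted mean formula to this portfolio gives 0.50 × 12 + 0.40 × 7 + 0.10 × 3 = 6 + 2.8 + 0.3 = 9.1%. The same arithmetic as a short Python sketch (the variable names are illustrative):

```python
weights = [0.50, 0.40, 0.10]  # stocks, bonds, cash; must sum to 1
returns = [12.0, 7.0, 3.0]    # returns in %, in the same order

# Weighted mean: each return multiplied by its weight, then summed.
portfolio_return = sum(w * r for w, r in zip(weights, returns))

print(f"Portfolio return: {portfolio_return:.1f}%")
```

Because the weights are floats, the raw result can differ from 9.1 by a tiny rounding error, which is why the print statement rounds to one decimal place.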