Chapter 1: Statistics Flashcards

1
Q

What is a measure of central tendency?

A

The central point around which the data appear to be clustered.

2
Q

What is the most important measure of central tendency?

A

The arithmetic mean (the arithmetic average).

3
Q

What is a measure of dispersion?

A

How closely the data points are clustered around the measure of central tendency.

4
Q

What is the most important measure of dispersion?

A

Standard deviation

5
Q

The bigger the SD the bigger the what?

A

The bigger the standard deviation figure the bigger the level of dispersion around the arithmetic mean. In other words, the bigger the standard deviation, the more spread out the data will be.

6
Q

What is primary data?

A

Data that an investigator has collected themselves.

7
Q

What are the advantages of primary data?

A

The investigator knows the conditions under which the data was collected and is aware of any limitations it may contain.

8
Q

What is secondary data?

A

Secondary data is data that was collected by someone other than the investigator. It is collected by many organisations, such as companies, government agencies and other bodies which have been formed specifically to gather economic and social data in a convenient form. The Office for National Statistics (ONS), for example, collects economic data on inflation and employment.

9
Q

What are the disadvantages of using secondary data?

A

Users of secondary data may not have a full understanding of the background and circumstances under which the data was initially collected. Consequently, users of secondary data may be unaware of any limitations it may contain.

10
Q

What are some other sources of secondary data?

A
  • Bank of England
  • HM Treasury
  • Credit rating agencies, such as Fitch, Moody’s and S&P
11
Q

What is discrete data?

A

Data where the units of measurement cannot be split up. For example, if the data refers to the number of people using a particular tube station each day, then the recorded figures might be 824 or 825 people, but never 824½.

12
Q

What is categorical data?

A

Data that can be put into groups or categories. For example, the answers to a question could be coded 1 for yes, 2 for no and 3 for maybe. This process separates the responses to form categorical data.

13
Q

What are descriptive statistics?

A

Descriptive statistics are used to describe the basic features of the data. They provide simple summaries about the sample and the measures.

14
Q

Is it typically possible to apply descriptive statistics to categorical data?

A

No! It is not generally possible to directly apply descriptive statistics to categorical data, as the numbers assigned to the categories are arbitrary.

15
Q

What is ordinal data?

A

Categorical data may be ranked or ordered according to set criteria, e.g. a first or second class degree. It is the order of these numbers that matters. This is known as ordinal data. Ordinal data allows for the use of descriptive statistics to compare the data using numbers and scales.

16
Q

What is continuous data?

A

Continuous data is where the units have a constant scale and all points between the units have meaning. For example, the distance travelled by a person to work can be expressed as 5 miles, 5.1 miles, 5.12 miles and so on, to an unlimited number of decimal places.

17
Q

What does the level of accuracy in recording continuous data depend on?

A

The precision of the measuring device itself.

18
Q

What is a population?

A

A population is the entire set of items which have the desired characteristics under investigation. For example, if the TV viewing habits of males under 40 years of age were under investigation, then the population would be all males under 40 years of age.

19
Q

What is an advantage and disadvantage of using a population?

A

A population will give a complete set of data but will be very difficult and time-consuming to collect.

20
Q

What is a sample?

A

A sub-set of items taken from the population with the characteristics under investigation.

21
Q

What are the two ways a sample can be selected?

A

On a random or non-random basis.

22
Q

What is a random sample?

A

A sample selected in such a way that every member of the population has an equal chance of being selected.

23
Q

What is another name for non-random sampling?

A

Non-probability method of selection.

24
Q

What is quota-sampling?

A

It is a form of non-random selection which is often used in market research. The quota is usually divided into different types of individual, e.g. professional or manual workers, with ‘sub-quotas’ for each type.

25
Q

Explain sampling vs. quota sampling vs. stratified sampling

A

Sampling might involve interviewing the first 100 people an investigator meets in a city centre (i.e. the quota). Quota sampling might involve using data on the first 52 women and 48 men interviewed in order to reflect the gender split of the UK. If the 52 women and 48 men were selected randomly, this would be called stratified sampling. Stratified sampling is designed to reduce sampling error by selecting a sample that represents the population.
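
The stratified case above can be sketched in Python. This is only an illustration: the population of 520 women and 480 men, the function name `stratified_sample` and the fixed seed are all assumptions made for the example.

```python
import random


def stratified_sample(population, stratum_of, sizes, seed=0):
    """Randomly draw sizes[stratum] members from each stratum of the population."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sketch is repeatable
    sample = []
    for stratum, size in sizes.items():
        members = [person for person in population if stratum_of(person) == stratum]
        sample.extend(rng.sample(members, size))  # random selection within the stratum
    return sample


# Hypothetical population: 520 women ("F") and 480 men ("M")
population = [("F", i) for i in range(520)] + [("M", i) for i in range(480)]

sample = stratified_sample(population, lambda person: person[0], {"F": 52, "M": 48})
women = sum(1 for person in sample if person[0] == "F")
men = len(sample) - women
print(women, men)  # 52 48
```

Taking the first 52 women and 48 men encountered, instead of drawing them randomly, would turn this into quota sampling.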

26
Q

What is systematic sampling?

A

It’s another form of non-random sampling (unless the starting point is chosen at random). This is where researchers select every nth record of a population. For example, if analysing how far employees travel to work on average, researchers may ask every fifth person on an alphabetical list of employees.
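
A minimal Python sketch of selecting every fifth record, assuming a made-up alphabetical list of 20 employees (the names and the helper `systematic_sample` are hypothetical):

```python
def systematic_sample(records, step, start=0):
    """Return every `step`-th record, beginning at index `start`."""
    return records[start::step]


# Hypothetical alphabetical list of 20 employees
employees = sorted(f"Employee {i:02d}" for i in range(1, 21))

# start=4 picks the 5th, 10th, 15th and 20th names on the list
every_fifth = systematic_sample(employees, step=5, start=4)
print(every_fifth)
# ['Employee 05', 'Employee 10', 'Employee 15', 'Employee 20']
```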

27
Q

What is convenience sampling?

A

Choosing the sample that is easiest to collect information from, for example, choosing people in your local town to represent the UK.

28
Q

What is judgement sampling?

A

Making a judgement of the sample that would best represent the population, for example, believing Swindon is a good representation of the UK.

29
Q

What is snowball sampling?

A

This is typically used when the subjects of the data are rare. It relies on referrals from initial subjects.

30
Q

What is a relative frequency distribution table?

A

A relative frequency distribution table shows the frequency of each category in comparison with the total frequency. Each frequency is calculated as a percentage of the whole.
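
As an illustration, the Python sketch below builds such a table from ten made-up survey responses (the data are an assumption for the example):

```python
from collections import Counter

# Hypothetical responses to a yes/no/maybe question
responses = ["yes", "no", "yes", "maybe", "yes", "no", "yes", "no", "yes", "maybe"]

counts = Counter(responses)    # frequency of each category
total = sum(counts.values())   # total frequency

# Each frequency expressed as a percentage of the whole
relative = {category: 100 * n / total for category, n in counts.items()}

for category, n in counts.most_common():
    print(f"{category:<6}{n:>4}{relative[category]:>7.1f}%")
```

Here "yes" accounts for 5 of the 10 responses, i.e. 50% of the whole.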

31
Q

What are the two main methods used to present discrete data?

A

Bar charts and pie charts.

32
Q

What are the 4 main methods of visually presenting continuous data?

A
  1. Histograms
  2. Time series graphs
  3. Semi-log graphs
  4. Scatter diagrams
33
Q

Histograms and bar charts look similar, but what is the difference between the two?

A

The area (not the height) of the bar on a histogram represents the frequency of occurrence. On a bar chart, the height of the bar represents the frequency.

34
Q

What is a time-series graph?

A

A time series graph displays the path of a variable (e.g. a share price) in chronological order.

35
Q

What is a log (semi-log) graph used to illustrate?

A

A (semi-) log graph is used to illustrate the rate of change of a variable. A log graph is constructed in order to determine the rate of acceleration over time.

36
Q

What are the 2 key measures often used in descriptive statistics (a.k.a. summary statistics)?

A
  1. The ‘typical’ value contained within the data set, i.e. the measure of central tendency
  2. How widely spread-out the set of data is, i.e. the measure of dispersion
37
Q

What are descriptive statistics used for?

A

They are used to compare two (or more) sets of populations and/or samples. The aim of descriptive statistics is to efficiently summarise large quantities of data to simplify the process of comparing samples and/or populations.

38
Q

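Write the formula for the arithmetic mean.

A

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Where:
• x_i is each value in the data set
• n is the number of values
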
39
Q

What does the standard deviation measure?

A

The level of dispersion, i.e. spread, around the mean of a set of data.

40
Q

Write out the formula for SD.

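A

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

Where:
• x_i is each value in the data set
• \bar{x} is the arithmetic mean of the values
• n is the number of values
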
41
Q

What are the limitations of using sample data?

A

The sample standard deviation is used as a measure of dispersion for samples of data. The limitations of small data sets include the fact that they may not be a good representation of the population as a whole, as they are likely to miss out extreme values within the data.

42
Q

How is the sample standard deviation calculated?

A

To calculate the sample standard deviation, a slight adjustment to the standard deviation formula is made: the number (‘n’) of values in the divisor is reduced by one, to n − 1. The overall effect of this adjustment is to cause the sample standard deviation to be greater than the unadjusted standard deviation of the same set of values.
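
The effect of the adjustment can be checked with Python's standard `statistics` module, where `pstdev` divides by n and `stdev` divides by n − 1. The eight values below are made up for the example.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set, mean = 5

population_sd = statistics.pstdev(values)  # divisor n
sample_sd = statistics.stdev(values)       # divisor n - 1

print(f"standard deviation (n):       {population_sd:.3f}")  # 2.000
print(f"sample standard deviation:    {sample_sd:.3f}")      # 2.138
```

The sum of squared deviations is 32 in both cases; dividing by 7 rather than 8 is what makes the sample figure larger.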

43
Q

What is the sample variance?

A

The name given to the square of the sample standard deviation.

44
Q

What are the problems of using mode and range for descriptive statistics?

A

As a measure of central tendency, the most obvious problem with the mode is that a set of data may not contain a mode at all. Alternatively, there may be more than one mode in a data set, i.e. ‘bi-modal’ (two modes) or ‘tri-modal’ (three modes).

The main problem with using the range as a measure of dispersion is that it is distorted by extreme values.

45
Q

How is the median calculated if the data set has an even number of values?

A

The median is equal to the average of the two middle items.

46
Q

What is the median also known as?

A

The ‘second quartile’

47
Q

What is the ‘first quartile’?

A

The middle item between the start of a series of numbers and the median is known as the ‘first quartile’.

48
Q

What is the ‘third quartile’?

A

The middle item between the median and the end of a series of numbers is known as the ‘third quartile’.

49
Q

What is the inter-quartile range?

A

The inter-quartile range is the value at the third quartile minus the value at the first quartile. The inter-quartile range is the ‘spread’ of the middle 50% of items in a data set. It is not distorted by extreme values, as the top and bottom quarters of the series are excluded.
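
As a rough illustration of the quartile definitions on these cards, the Python sketch below takes the median of each half of the sorted data. This "median of halves" convention is an assumption: other quartile conventions exist and can give slightly different values. The function name and the data are made up for the example.

```python
import statistics


def quartiles(values):
    """Return (first quartile, median, third quartile) using the median of each half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                               # items below the median
    upper = data[half + 1:] if n % 2 else data[half:]  # items above the median
    return statistics.median(lower), statistics.median(data), statistics.median(upper)


q1, q2, q3 = quartiles([13, 1, 9, 5, 3, 11, 7])
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 3 7 11 8

# Replacing the largest value with an extreme one leaves the quartiles unchanged
outlier_q1, outlier_q2, outlier_q3 = quartiles([1000, 1, 9, 5, 3, 11, 7])
print(outlier_q3 - outlier_q1)  # 8
```

The range of the second data set jumps from 12 to 999, while the inter-quartile range stays at 8.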

50
Q

What are probability distributions?

A

A probability distribution shows how likely each possible outcome is. Probability distributions make use of relative frequency as a measure of probability.

51
Q

In a normal distribution curve, what 4 probabilities can be inferred?

A
  1. 50% of observations will fall either side of the mean
  2. Approximately 68.26% of observations in the distribution will be within 1 standard deviation either side of the mean
  3. Approximately 95.5% of all observations will be within 2 standard deviations either side of the mean
  4. Approximately 99.73% of all observations will be within 3 standard deviations either side of the mean
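
These percentages can be checked numerically. For a normal distribution, the proportion of observations within k standard deviations of the mean equals erf(k / √2); the Python sketch below uses `math.erf` from the standard library (the helper name `within_k_sd` is made up for the example).

```python
import math


def within_k_sd(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {within_k_sd(k):.2%}")
# within 1 SD: 68.27%
# within 2 SD: 95.45%
# within 3 SD: 99.73%
```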
52
Q

When extreme events occur more frequently than is predicted by the normal distribution, what is the distribution referred to as having?

A

Fat tails

53
Q

When are the frequency distribution curves positively skewed?

A

When the peak of the curve lies to the left of centre.

54
Q

When are frequency distribution curves negatively skewed?

A

When the peak of the curve lies to the right of the centre.

55
Q

What is the order of the mean, median and mode in a negatively skewed distribution?

A

Alphabetical order: mean, median, mode (from lowest to highest value).

56
Q

What is the order of the mean, median and mode in a positively skewed distribution?

A

Reverse alphabetical order: mode, median, mean (from lowest to highest value).

57
Q

How can the probability of a future prediction be evaluated?

A

The probability of each prediction can be evaluated using a test statistic to test a null hypothesis.

58
Q

Define the geometric mean.

A

The geometric mean measures the average compound rate of change over a given period.

59
Q

When is the geometric mean useful?

A

It is particularly useful when looking at compound changes, such as changes in a share price or changes in portfolio returns.
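
A small Python sketch of why compounding matters, assuming a made-up share that gains 10% in one year and loses 10% in the next (the function name is hypothetical):

```python
import math


def geometric_mean_return(returns):
    """Average compound return per period for a list of periodic returns."""
    growth = math.prod(1 + r for r in returns)  # total growth factor
    return growth ** (1 / len(returns)) - 1


returns = [0.10, -0.10]  # hypothetical: +10% then -10%

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(f"arithmetic mean return: {arithmetic:.4%}")
print(f"geometric mean return:  {geometric:.4%}")
```

The arithmetic mean suggests no change, but 1.10 × 0.90 = 0.99, so the investment has actually lost 1% over the two years. The geometric mean return is slightly negative and reflects that.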

60
Q

What is the covariance?

A

The covariance (cov) is a statistical measure of the relationship between two variables, e.g. two share prices.

61
Q

When is the covariance positive?

A

When variables move in the same direction.

62
Q

When is the covariance negative?

A

When variables move in opposite directions.

63
Q

When is the covariance zero?

A

When there is no linear relationship between the two variables, for example when they are independent of each other.

64
Q

What does the correlation coefficient measure?

A

The strength of the relationship between two variables, such as two share prices.

65
Q

What is the range of the correlation coefficient?

A

The correlation coefficient can range from -1 (perfectly negative) to +1 (perfectly positive) through a zero point (uncorrelated).

66
Q

Define positive correlation.

A

Positive correlation describes a relationship where an increase in one variable is associated with an increase in another, such as the frequency of advertising with number of sales.

67
Q

Define negative correlation.

A

Negative correlation describes a relationship where an increase in one variable is associated with a decrease in another, such as sales of umbrellas and sales of sun-tan lotion.

68
Q

Define perfect correlation.

A

Perfect correlation describes a relationship where changes in one variable are reflected by a proportional change in another, such as mass and weight.

69
Q

Define autocorrelation.

A

Autocorrelation measures the relationship between an asset’s past performance and current/future performance. For example, the correlation of returns of share A in 2015 with the returns of share A in 2016. This correlation can then be used to predict future behaviour of the asset.

70
Q

How is diversification achieved?

A

By combining securities which are not perfectly positively correlated.

71
Q

How is risk reduction through diversification achieved?

A

By combining assets with a low (or negative) correlation of returns.

72
Q

The lower the correlation of returns……

A

The lower the correlation of returns, the greater the fund’s diversification and the lower the risk associated with an expected level of return. For example, a fund manager choosing two securities with a perfectly negative correlation of returns can, with appropriate weightings, achieve a risk-free portfolio.

73
Q

What is the only instance when no diversification benefits are achieved?

A

When assets have a perfect positive correlation of returns.

74
Q

Explain what is meant by ‘correlation does not necessarily imply causation’

A

It should be noted that whilst correlation tells us that two variables have moved in similar patterns previously, it does not necessarily mean that one variable is causing the other to change or vice versa. It could be that both variables are being affected by a third factor or even that the pattern is a coincidence.

75
Q

What is data mining?

A

Data mining is the use of large amounts of information to try to discover relationships.

76
Q

What is the major advantage and disadvantage of data mining?

A

Data mining can be a valuable way to identify previously hidden causation, but the large-scale, indiscriminate processing of data means that correlations may be purely coincidental.

77
Q

How do extreme events affect correlation?

A

In times of extreme market conditions, established relationships can break down and many assets become strongly positively correlated. We see this in times of economic uncertainty and crisis, and in times of great optimism.

78
Q

How is the correlation coefficient calculated?

A

The correlation coefficient is calculated by dividing the covariance of the two assets by the product of their standard deviations. By virtue of its calculation, correlation will always be between +1 and -1.
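
The calculation described above can be sketched in plain Python. The helper names and the three short data series are assumptions made for the example, and the divisor n is used throughout (using n − 1 consistently would give the same correlation).

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Covariance of two equal-length series (divisor n)."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def std_dev(xs):
    """Standard deviation: the square root of a series' covariance with itself."""
    return math.sqrt(covariance(xs, xs))


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]  # moves with a
c = [10, 8, 6, 4, 2]  # moves against a

print(covariance(a, b), round(correlation(a, b), 6))
print(covariance(a, c), round(correlation(a, c), 6))
```

The covariances are +4 and -4, and dividing by the standard deviations rescales them to correlations of +1 and -1.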

79
Q

What is the formula for the correlation coefficient?

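A

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$$

Where:
• cov(x, y) is the covariance of the two variables
• \sigma_x and \sigma_y are the standard deviations of x and y
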
80
Q

What are scattergrams used to determine?

A

Scattergrams are used to determine whether there is a relationship (correlation) between two variables.

81
Q

What is the purpose of a scattergram?

A

The purpose of a scattergram is to demonstrate whether there is any pattern among the plotted points.

82
Q

What is linear regression?

A

A ‘line of best fit’ can be used to indicate a pattern in the data points. The process of fitting this line is called ‘linear regression’.

83
Q

Explain the least squares method.

A

The ‘least squares’ method is used to plot a line across the middle of all of the points. This approach minimises the ‘sum of the squares of the distances’ between the points and the line, which places the line in the centre of the data points. It is considered to be the best linear unbiased estimator (or BLUE) of the line of best fit.

84
Q

What happens in a scatter diagram when variables are perfectly correlated?

A

When variables are perfectly correlated all the points on a scattergram will lie on the line of best fit.

85
Q

What does bivariate linear regression predict?

A

Outcomes on the y-axis

86
Q

What is the equation for bivariate linear regression?

A

y = a + bx

Where:
• y is the dependent variable
• x is the independent variable
• a is the intercept with the y-axis
• b is the slope coefficient, often referred to as the gradient of the line of best fit
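
A minimal Python sketch of fitting y = a + bx by least squares, using the standard closed-form solution for a single independent variable. The function name `fit_line` and the five data points are made up for the example.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x  # the line passes through (mean_x, mean_y)
    return a, b


xs = [1, 2, 3, 4, 5]  # independent variable (hypothetical data)
ys = [2, 4, 5, 4, 5]  # dependent variable (hypothetical data)

a, b = fit_line(xs, ys)
prediction = a + b * 6  # predicted outcome on the y-axis for x = 6
print(f"y = {a:.1f} + {b:.1f}x, prediction at x = 6: {prediction:.1f}")
# y = 2.2 + 0.6x, prediction at x = 6: 5.8
```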

87
Q

How is the line of best fit calculated?

A

The line of best fit is calculated using the Least Squares Method, which minimises the sum of the errors squared. This is also referred to as the residual sum of squares.

88
Q

What is R-Squared?

A

R-Squared is the coefficient of determination. It gives us an impression of the accuracy of our forecasts, using our models or benchmarks, and of what remains unexplained. R-Squared ranges from 0 to 1 (often quoted as 0% to 100%), where the higher the number, the greater the predictive power.
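
One common way to compute R-Squared is 1 minus the residual sum of squares divided by the total sum of squares. The Python sketch below assumes a made-up data set and a set of predictions from a hypothetical fitted line y = 2.2 + 0.6x.

```python
def r_squared(ys, predictions):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # unexplained
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                  # total variation
    return 1 - ss_res / ss_tot


ys = [2, 4, 5, 4, 5]                     # observed values (hypothetical)
predictions = [2.8, 3.4, 4.0, 4.6, 5.2]  # from y = 2.2 + 0.6x at x = 1..5

r2 = r_squared(ys, predictions)
print(f"R-squared: {r2:.2f}")  # R-squared: 0.60
```

Here the line explains 60% of the variation in y, and the remaining 40% is unexplained.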

89
Q

What is the benefit of adjusted R Squared?

A

Multifactor linear regression models add more independent variables (x-variables) to predict the dependent variable (y). The act of adding more factors increases R-squared, but this does not necessarily mean the model is getting better.

Adjusted R-squared compensates for this phenomenon by only increasing if the new factor improves the model by more than pure chance. If it improves the model by less than pure chance, the adjusted R-squared will fall.
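
The widely used formula for the adjustment is 1 − (1 − R²)(n − 1) / (n − k − 1), where n is the number of observations and k the number of factors. The Python sketch below applies it to made-up figures: a one-factor model with R-squared of 0.600 on 30 observations, and a two-factor model whose R-squared rises only slightly to 0.605.

```python
def adjusted_r_squared(r2, n_observations, n_factors):
    """Adjusted R-squared: penalises R-squared for each extra factor."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_factors - 1)


# Hypothetical models fitted to the same 30 observations
one_factor = adjusted_r_squared(0.600, n_observations=30, n_factors=1)
two_factors = adjusted_r_squared(0.605, n_observations=30, n_factors=2)

print(f"one factor:  {one_factor:.4f}")
print(f"two factors: {two_factors:.4f}")
```

R-squared rose when the second factor was added, but the adjusted figure falls (from about 0.586 to about 0.576), suggesting the extra factor adds little.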