Chance and data Flashcards

1
Q

Bar graphs

A

(Always support statements with statistical data from graphs)

  1. Shape
    (Eg. Both graphs have a similar shape because both dot plots are unimodal)
  2. Symmetry
    (Eg. Both dot plots are reasonably symmetrical, but both have a few older competitors which skews the distributions slightly to the right)
  3. Shift
    (Eg. The peak on the athletics graph is located higher up the age scale than that for swimmers)
  4. Overlap
    (Eg. The ages of the middle 50% of the competitors are much the same)
  5. Centre
    (Eg. The median age of swimmers is younger than the median age of athletics competitors)
  6. Spread
    (Eg. The age range for athletics is larger than that for swimmers)
2
Q

Writing probabilities

A
  • Probabilities can be written as fractions, decimals or percentages
  • Probabilities cannot be less than 0 or greater than 1
3
Q

Converting probabilities

A
• fraction > decimal
(Divide numerator by denominator)
• decimal > percentage
(Multiply decimal by 100)
• percentage > fraction
(Write the percentage as a fraction over 100 and simplify)
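
The three conversions above can be sketched in Python. This is only an illustration: the function names are made up for this example, and the standard-library `Fraction` type is assumed so that the percentage-to-fraction step simplifies automatically.

```python
from fractions import Fraction


def fraction_to_decimal(numerator, denominator):
    # Divide numerator by denominator.
    return numerator / denominator


def decimal_to_percentage(decimal):
    # Multiply the decimal by 100.
    return decimal * 100


def percentage_to_fraction(percentage):
    # Write the (whole-number) percentage over 100; Fraction simplifies it.
    return Fraction(percentage, 100)


print(fraction_to_decimal(3, 4))      # 0.75
print(decimal_to_percentage(0.75))    # 75.0
print(percentage_to_fraction(75))     # 3/4
```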
4
Q

Probability equation

Theoretical probability

A

Probability (event) = Number of favorable outcomes / Total possible number of outcomes

(Number of favorable outcomes is how many times the result should occur)
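
As a small worked example of this equation, the sketch below finds the probability of rolling an even number on a fair six-sided die. The die and the function name `theoretical_probability` are assumptions chosen for illustration.

```python
from fractions import Fraction


def theoretical_probability(favorable_outcomes, total_outcomes):
    # Number of favorable outcomes / total possible number of outcomes.
    return Fraction(favorable_outcomes, total_outcomes)


outcomes = [1, 2, 3, 4, 5, 6]                    # all faces of the die
favorable = [n for n in outcomes if n % 2 == 0]  # the even faces: 2, 4, 6

p_even = theoretical_probability(len(favorable), len(outcomes))
print(p_even)  # 1/2
```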

5
Q

Probability equation

Experimental probability

A

Probability (event) = Number of times the event occurred / Total number of trials

(Number of times the event occurred is how many times the result did occur)

• This is used when the probability of an event is difficult or impossible to calculate. Many trials are done and the number of times the event occurs is recorded. The true value of the probability will not be known, but the greater the number of trials, the closer the estimated probability is likely to be to the actual probability.
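
A simulation can illustrate how the estimate settles as the number of trials grows. The sketch below assumes a simulated fair die and counts how often a 6 appears; the function names and the fixed seed are illustrative choices, not part of the original card.

```python
import random


def experimental_probability(times_event_occurred, number_of_trials):
    # Number of times the event occurred / number of trials carried out.
    return times_event_occurred / number_of_trials


def estimate_probability_of_six(number_of_trials, seed=0):
    # Simulate rolling a fair die and count the sixes (seed fixed so runs repeat).
    rng = random.Random(seed)
    sixes = sum(1 for _ in range(number_of_trials) if rng.randint(1, 6) == 6)
    return experimental_probability(sixes, number_of_trials)


# More trials should usually give an estimate closer to 1/6 (about 0.167).
for trials in (10, 1000, 100000):
    print(trials, estimate_probability_of_six(trials))
```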

6
Q

Expected number of outcomes equation

A

Expected number of outcomes = Probability (event) x Number of trials
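
For instance, if a fair die is rolled 60 times, the expected number of sixes can be worked out as below. The die example and the function name are assumptions made for illustration.

```python
from fractions import Fraction


def expected_number(probability, number_of_trials):
    # Expected number of outcomes = probability of the event x number of trials.
    return probability * number_of_trials


# P(six) = 1/6, and the die is rolled 60 times.
print(expected_number(Fraction(1, 6), 60))  # 10
```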

7
Q

Combining probabilities
(Probability tree)

For calculating probabilities where several events occur

A
  1. Make a probability tree by deciding what the events are and in what order they occur. Write the events at the ends of the branches.
  2. Write the probability of each event on the middle of its branch, and check that the probabilities on each set of branches add to 1.
  3. Calculate the probability of each combined outcome by multiplying the probabilities along the branches, and write it at the end of the final branch.
  4. To find the overall probability for a question…
    If one event occurs OR another event occurs > Add the probabilities (of the separate end results)
    If one event occurs AND another event occurs > Multiply the probabilities (along the branches).
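
The steps above can be sketched for a two-stage tree. The example assumes a hypothetical bag of 3 red and 2 blue counters, with one counter drawn, replaced, and a second one drawn; the question is the probability of getting one of each colour.

```python
from fractions import Fraction
from itertools import product

# Steps 1 and 2: events and branch probabilities for each draw (hypothetical bag).
first_draw = {"red": Fraction(3, 5), "blue": Fraction(2, 5)}
second_draw = {"red": Fraction(3, 5), "blue": Fraction(2, 5)}
assert sum(first_draw.values()) == 1
assert sum(second_draw.values()) == 1

# Step 3: multiply along each path (first AND second) to get the end probabilities.
paths = {
    (a, b): first_draw[a] * second_draw[b]
    for a, b in product(first_draw, second_draw)
}

# Step 4: "red then blue" OR "blue then red", so add the two end probabilities.
p_one_of_each = paths[("red", "blue")] + paths[("blue", "red")]
print(p_one_of_each)  # 12/25
```

The end probabilities of all paths should themselves add to 1, which is a useful check on the tree.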
8
Q

Combining probabilities
(Two-way frequency table)

For calculating probabilities where several events occur

A

• When given a table of probabilities, first use the total population given to convert each probability into an actual number, then fill these numbers into the table.
• To find the overall probability for a question…
If one event occurs OR another event occurs > Add the probabilities
If one event occurs AND another event occurs > Multiply the probabilities.

(Tick the boxes to which they apply to help)
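
A minimal sketch of this approach follows, assuming a made-up table of probabilities for gender and handedness and a made-up population of 200. Each cell is first turned into a count, then the cells that apply are "ticked" and their counts are added.

```python
from fractions import Fraction

population = 200  # hypothetical total number of people

# Hypothetical table of probabilities: (gender, handedness) -> probability.
probabilities = {
    ("male", "left"): Fraction(5, 100),
    ("male", "right"): Fraction(45, 100),
    ("female", "left"): Fraction(4, 100),
    ("female", "right"): Fraction(46, 100),
}

# Convert each probability into an actual number of people.
counts = {cell: int(p * population) for cell, p in probabilities.items()}


def probability_of(applies):
    # "Tick" the cells that apply, add their counts, divide by the population.
    ticked = sum(n for cell, n in counts.items() if applies(cell))
    return Fraction(ticked, population)


p_left = probability_of(lambda cell: cell[1] == "left")
p_male_or_left = probability_of(lambda cell: cell[0] == "male" or cell[1] == "left")

print(counts[("male", "left")])  # 10
print(p_left)                    # 9/100
print(p_male_or_left)            # 27/50
```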

9
Q

Data handling

A

When collecting data we take a sample from a population

• A sample of 30 is considered to be sufficient for most purposes. A larger sample means you can have more confidence in findings.
• Bias occurs when some members of the population are more likely than others to be selected for the sample so that it does not accurately represent the population.
(Eg. ‘self-selected’ samples)
• To avoid bias, every member of the population should have an equal chance of being sampled
(Eg. ‘random’ samples)

10
Q

‘Self selected’ samples

A

‘Self selected’ samples occur if a member of the population decides whether they will be selected or not.

Eg. Ringing a radio station, filling in a form, going to a website to give feedback, completing a survey
(People may choose not to respond, and only those with an interest in the topic of the survey will be in the sample)

Eg. Surveying in a particular location
(Only those who go to that location and have time to stop and answer will be in the sample)

11
Q

‘Random’ samples

A

‘Random’ samples occur if every member of the population has an equal chance of being selected.

Eg. Writing names on equal-sized pieces of paper and drawing them from a hat, giving every member of the population a number and using random numbers to decide who is selected, using random numbers to decide who will be selected from the electoral roll, or selecting every (5th) person as the (____).
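
The "give every member a number and use random numbers" method can be sketched as below. The population of 100 numbered members, the sample size of 10 and the fixed seed are all assumptions made for illustration.

```python
import random

# Give every member of a hypothetical population of 100 a number.
population = list(range(1, 101))

# Pick 10 different members so that each has an equal chance of selection.
# The seed is fixed only so that the example gives the same sample each run.
rng = random.Random(42)
sample = rng.sample(population, 10)

print(sorted(sample))
```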

12
Q

Measures of center

A

Measures of center give a measure of where the middle of a distribution lies.

Mean, median, mode

• The median is the middle data value. It is the best measure of centre for the data, as it is not distorted by very large or small values and can clearly be calculated for each set of data.
• The mean is the sum of all data values / the total number of data values. It represents the average data value and is therefore distorted by very large or small values.
• The mode is the data value which occurs most frequently. It is unreliable as a measure of centre, as often there are two modes or none (if there are more than 2 modes, there is no mode).
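
The effect of one very large value on each measure can be seen with Python's standard `statistics` module. The data set below is made up for illustration.

```python
import statistics

ages = [2, 3, 3, 4, 5, 6, 40]  # hypothetical data with one very large value
typical = ages[:-1]            # the same data without the value 40

# Mean: sum of all values / number of values = 63 / 7 = 9.
mean_with_large_value = statistics.mean(ages)
# Median: the middle value of the ordered data, here 4.
median_with_large_value = statistics.median(ages)
# Mode(s): the most frequent value(s), here just 3.
modes = statistics.multimode(ages)

# Removing the 40 changes the mean far more than it changes the median.
print(mean_with_large_value, statistics.mean(typical))
print(median_with_large_value, statistics.median(typical))
print(modes)
```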

13
Q

Measures of spread

A

Measures of spread give a measure of how widely spread the data is.

Upper quartile, lower quartile, inter-quartile range, range

• The inter-quartile range is the difference between the upper and lower quartiles (IQR = UQ - LQ). It is the best measure of spread for the data, as it is not distorted by very large or small values.
• The range is the difference between the maximum and minimum values, and is therefore distorted by a very large maximum or a very small minimum.

14
Q

Upper quartile and lower quartile

A

(UQ) The upper quartile is the middle data value of the top half of the data
(LQ) The lower quartile is the middle data value of the bottom half of the data
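
The quartile definitions above, together with the IQR and range, can be sketched as follows. One assumption is made that the card does not state: when there is an odd number of values, the median itself is left out of both halves. The data set is made up.

```python
import statistics


def quartiles(data):
    # LQ and UQ as the medians of the bottom and top halves of the ordered data.
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]  # assumption: median excluded when the count is odd
    return statistics.median(lower_half), statistics.median(upper_half)


data = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # hypothetical data
lq, uq = quartiles(data)
iqr = uq - lq
data_range = max(data) - min(data)
print(lq, uq, iqr, data_range)  # 3.5 11.0 7.5 14

# Replacing the maximum with a very large value changes the range but not the IQR.
with_large_value = [1, 3, 4, 6, 7, 8, 10, 12, 150]
lq2, uq2 = quartiles(with_large_value)
print(uq2 - lq2, max(with_large_value) - min(with_large_value))  # 7.5 149
```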

15
Q

Displaying data

A
  1. Dot plots
    A visual representation of each data point
  2. Box plots
    A visual representation of each 25% of the data
    (Useful for representing data and comparing sets of data, but it does not show the distribution of all the data points and it is affected by very large or small values)
16
Q

How to describe dot plots/box plots

A

Comment on…

  1. Mean and median (data distribution)
  2. Whisker length/skew OR Modes > 1
  3. Gaps and clusters in data (if relevant)
  4. This shows… (in context to data, eg. women are better than men)
17
Q

Describing dot plots/box plots

(If mean and median are close) (mean, skew)

A
  1. The mean (Mean) and the median (Median) are very close because the data is distributed fairly symmetrically around the mean.
  2. The right whisker is much longer than the left because of just one value (Maximum value).
  3. The data is clustered in the center because the gaps within the box are smaller than the lengths of the whiskers, which means the lower quartile, median and upper quartile are close together (small gaps).
18
Q

Describing dot plots/box plots

(If mean and median are close) (median, even)

A
  1. The mean (Mean) and the median (Median) are very close because the data is distributed fairly symmetrically around the median.
  2. The dot plot shows the data is bimodal (has 2 modes). This is not shown by the box and whisker plot.
  3. The data is fairly evenly spread because there is a gap in the middle of the data, which means the minimum, lower quartile, median, upper quartile, and maximum are fairly evenly spread.
19
Q

Describing dot plots/box plots

If mean and median are not close

A
  1. The mean (Mean) is larger than the median (Median) because the ‘tail’ of bigger numbers to the right increases the mean, but not the median.
  2. The data is skewed to the right.
  3. The data is mostly clustered to the left, because the length of the right whisker is larger than the left, which means the minimum, lower quartile and median are close together (small gaps).
20
Q

Drawing conclusions from box plots
(for the lengths of individuals)

A. |-------| | |------------|
B. |----------| | |---------|

  1. If the boxes of each sample overlap with each other and both medians lie within the box of the other sample.
A

(comment on center)
The (lengths) in sample B tend to be greater than those in sample A, because the box in B extends further to the right than that of A. This is supported by the median in B (__) being bigger than the median in A (__).

(comment on spread)
Although the range is the same for both samples (__), the IQR for B (__) is larger than that of A (__), so the (lengths) of the middle 50% in B are more spread than those in A.

For the population:
We cannot conclude that (individuals) in population B tend to be (longer) than (individuals) in population A because both of the medians lie within the middle 50% of the other sample, and the boxes overlap.

21
Q

Drawing conclusions from box plots
(for the lengths of individuals)

A. |-------| | |------------|
B. |--------------| | |---------|

  1. If the boxes of each sample overlap with each other and one or both medians lie outside the box of the other sample.
A

(comment on center)
The (lengths) in sample B tend to be greater than those in sample A, because the box in B extends further to the right than that of A. This is supported by the median in B (__) being bigger than the median in A (__).

(comment on spread)
Although the range is the same for both samples (__), the IQR for B (__) is larger than that of A (__), so the (lengths) of the middle 50% in B are slightly more spread than those in A.

For the population:
We can conclude that (individuals) in population B tend to be (longer) than (individuals) in population A on average because although the middle 50% of each sample overlap, both medians lie outside the boxes of the other group.

22
Q

Drawing conclusions from box plots
(for the lengths of individuals)

A. |-------| | |--------------|
B. |-------------------| | |---------|

  1. If the boxes of each sample do not overlap with each other and both medians lie outside the box of the other sample.
A

(comment on center)
The (lengths) in sample B tend to be greater than those in sample A, because the box in B extends further to the right than that of A. This is supported by the median in B (__) being bigger than the median in A (__).

(comment on spread)
The IQR for B (__) is larger than that of A (__), so the (lengths) of the middle 50% in B are more spread than those in A.

For the population:
We can conclude that (individuals) in population B tend to be (longer) than (individuals) in population A on average because the boxes do not overlap and both medians lie outside the boxes of the other group.
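
The decision rule used across the three box plot cases can be summarised in a short sketch: a conclusion about the populations is only withheld when both medians lie inside the other sample's box. The function name and the quartile values below are hypothetical.

```python
def can_conclude_difference(box_a, box_b):
    # Each box is a (lower quartile, median, upper quartile) tuple.
    lq_a, median_a, uq_a = box_a
    lq_b, median_b, uq_b = box_b
    median_a_inside_b = lq_b <= median_a <= uq_b
    median_b_inside_a = lq_a <= median_b <= uq_a
    # No conclusion only when BOTH medians lie inside the other box.
    return not (median_a_inside_b and median_b_inside_a)


# Boxes overlap and both medians lie inside the other box.
print(can_conclude_difference((10, 14, 18), (12, 16, 22)))  # False
# Boxes overlap but a median lies outside the other box.
print(can_conclude_difference((10, 12, 15), (14, 17, 22)))  # True
# Boxes do not overlap at all.
print(can_conclude_difference((10, 12, 15), (18, 20, 25)))  # True
```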

23
Q

Time series

A

Draw a trend line

  1. State the overall trend
  2. Describe seasonal variation
    • State the variation around the trend line (remains consistent, increases/decreases)
  3. Unusual features
    Limitations of data

(If two graphs are given, make sure comments are comparative).

24
Q

Time series

  1. State the overall trend
A
(If trend is positive)
For both (graph 1) and (graph 2), the overall trend is increasing over the time frame. This is because the (y axis value) increases by approximately (__) in (graph 1) and by approximately (__) in (graph 2) over the (__) years shown. This increase is greater for (y axis value) in (graph 1).
(If trend is constant)
For both (graph 1) and (graph 2), the overall trend is steady over the time frame. This is because the (y axis value) increases and decreases by approximately (__) over the (__) years shown.
25
Q

Time series

  1. Describe seasonal variation
A

(If both graphs have similarly shaped peaks)
For both graphs, the yearly patterns have jagged peaks, which show the same pattern of increase and decrease. This is because the number of (y axis value) peaks in (Q3) and dips in (Q1) each year.

(If both graphs have differently shaped peaks)
For (graph 1), the yearly patterns have flat peaks and are not jagged like the peaks for (graph 2), which shows a different pattern of increase and decrease. This is because (graph 1) data peaks in (Q4) and (Q1), whereas (graph 2) data only peaks in (Q3) each year.

This may have occurred because…
(see data limitations)

26
Q

Time series

  1. Unusual features
A
  1. State what the unusual feature is
  2. Explain feature using specific numbers and statistics from the graph

Eg. There is a sharp increase in the middle of the graph. This is because the number of people going to Australia for holidays rose sharply during 2004-2005 and then remained elevated.

  1. This may have occurred because…
    (see data limitations)
27
Q

Predicting time series with trend line

A

To draw a trend line, project lines across the top and bottom of the data, then draw a line halfway between them (following the shape of previous years' data)

• I do/do not feel very confident in my prediction because…
(see data limitations)

28
Q

Bivariate data

A

Bivariate data consists of two pieces of numeric data about every individual in a sample or population. Bivariate data is plotted on a scatter graph to see if there is a relationship between the two variables.

29
Q

Scatter graph questions

Calculate the average x, y value

Find the approximate median

A

(Solve for gradient of line of best fit)
• Gradient = rise / run = __ / __
• By drawing a line of best fit and calculating the gradient, we can see that the average (x, y value) is (___) (y value unit) per (x value unit). This is because the gradient represents how much the (y value) changes for every one (x value unit) increase in (x value).

(Count total number of points)
• There are (__) points in total. Halfway through these points is (half of first value) points up, which is (__) on the (x/y axis value depending on question) for (x/y axis value depending on question). This indicates the approximate median of (___).
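
The gradient calculation can be sketched by reading two points off the line of best fit and dividing rise by run. The two points used here are made up for illustration.

```python
def gradient(point_1, point_2):
    # Gradient = rise / run between two points read off the line of best fit.
    (x1, y1), (x2, y2) = point_1, point_2
    rise = y2 - y1
    run = x2 - x1
    return rise / run


# Hypothetical points on a line of best fit: (2, 10) and (8, 25).
print(gradient((2, 10), (8, 25)))  # 2.5
```

A gradient of 2.5 would be read as the y value increasing by 2.5 units for every 1 unit increase in the x value.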

30
Q

Format for answering questions

A
  1. Identify the problem
  2. Explain why there is a problem
    Use specific numbers and statistics from the graph
  3. Discuss improvements and assumptions to data
31
Q

Scatter graph

A

Draw line of best fit

  1. State the relationship
  2. Describe the relationship
    • State the variation around the trend line (remains consistent, increases/decreases) and how close points are to line of best fit
  3. Unusual points
  4. Groups
    Limitations of data
32
Q

Scatter graph

  1. State the relationship
A

(If gradient is positive)
• There appears to be a positive linear relationship. As (x axis value) increases, (y axis value) increases and vice versa.

(If there is no clear line of best fit)
• There appears to be no relationship between the (x axis value) and the (y axis value). This is seen by the way that the dots do not follow any line, and the (y axis values) for each (x axis value) have a large range from (__) to (__).

33
Q

Scatter graph

  1. Describe the relationship
A

(If points are close to the line of best fit)
• Most points are concentrated close to the line of best fit, which means there is a strong relationship between the (x axis value) and the (y axis value). This means the linear model is appropriate for (graph 1), and this data would be useful for calculating and predicting the (y axis value) from the (x axis value).

(If points are not close to the line of best fit)
• Most points are spread away from the line of best fit, which means there is a weak relationship between the (x axis value) and the (y axis value). There is increased variation/significant variation in the amount of the (y axis value) as the (x axis value) increases. This means the linear model is not appropriate for (graph 1), and a non-linear model may better represent the relationship between (x axis value) and (y axis value). This also means the data would not be useful for calculating and predicting the (y axis value) from the (x axis value).

34
Q

Scatter graph

  1. Unusual points
A

• Unusual points are some distance away from most other points and the trend line. Sometimes they are valid; otherwise, they may be the result of measurement error, recording error or reversed coordinates.

  1. State what the unusual points are using coordinates and specific numbers from the graph.
  2. Explain why they are outliers
  3. This may have occurred because…
    (see data limitations)

Eg. There are 3 unusual values (outliers), one at $10 500, one at $16 000 and one at $17 000. These 3 points are much higher than the others. This would need to be investigated, as they may be due to brand, mileage, wear and tear, etc.

35
Q

Scatter graph

  1. Groups
A
  1. State where points are clustered according to the (x axis value) and (y axis value)
  2. Explain why these values may be common
  3. This may have occurred because…
    (see data limitations)
36
Q

Limitations of data

A

• Consider what factors affect results of data.
(Maybe annual changes, seasonal changes due to weather, weekly changes due to weekends, or daily changes due to night/daylight or environmental conditions)

  • Consider the size of the data sample and whether the data accurately represents the entire population.
  • Consider the conditions of the ‘test’ for each sample and whether the test accurately represents real life.
  • Consider whether there is a bias of samples in data.
  • Consider outliers that may have been wrongly included.
  • Consider whether the trend is consistent
37
Q

Possible ways to improve data

A
  • Select equal numbers of males and females from results collected so far (if unequal numbers were used)
  • Could use percentages or relative frequency on the vertical scale