Data Unit 2 Test Flashcards

1
Q

Scatter Plots:

A

Graph used to determine if there is a relationship between two variables. The independent variable is on the horizontal axis and the dependent variable is on the vertical axis.

2
Q

Line of Best Fit (trend line)

A

A straight line that passes as close as possible to all the points in a scatter plot.

3
Q

The line is drawn with the following criteria in mind:

A
  1. The line passes through as many points as possible.
  2. There are evenly distributed points above and below the line (the sum of the perpendicular distances for all the points above the line should equal that of the points below it).
  3. Ignore outliers, whenever possible.
  4. Consider the origin as a possible point (e.g. extrapolate to time 0)
4
Q

Outliers:

A

Data or points that lie significantly away from the majority of the other data. They can skew a regression analysis, especially when the collected data is small. More information should be sought about the outlier before including or excluding it from the analysis.

5
Q

Correlation:

A

In data analysis, one variable may be affected by another. When a change in the independent variable tends to be accompanied by a change in the dependent variable, there is a correlation between them.

6
Q

To describe a correlation, we use 3 attributes:

A
  1. Linear (clear line), Non-linear (curved), No correlation (scattered points)
  2. Positive (positively sloped), Negative (negatively sloped)
  3. Strength: Strong (or high), Moderate (or medium), or Weak
7
Q

Linear Correlation:

A

Variables have a linear correlation if the changes in one variable tend to be
proportional to changes in the other variable.

8
Q

Correlation Coefficient:

A

Gives a quantitative measure of the strength of a linear correlation, that is, a measure of how close the points of a scatter plot are to the line of best fit. It is the covariance of the two variables divided by the product of their standard deviations.
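
The definition above can be written out as a formula. The symbols are notation chosen here rather than taken from the card (s_xy for the covariance, s_x and s_y for the standard deviations, n for the number of points), and the second form is the standard computational version, which is assumed to be the one the course uses.

```latex
r = \frac{s_{xy}}{s_x \, s_y}
  = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]
                \left[n\sum y^2 - \left(\sum y\right)^2\right]}}
```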

9
Q

Negative Linear Correlation Values

A

Strong is -1 to -0.67
Moderate is -0.67 to -0.33
Weak is -0.33 to 0

10
Q

Positive Linear Correlation Values

A

Weak is 0 to 0.33
Moderate is 0.33 to 0.67
Strong is 0.67 to 1
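
The two value scales can be turned into a small lookup. This is only a sketch: the function name is made up for illustration, and two choices are assumptions not stated on the cards (a value exactly on a boundary gets the stronger label, and r = 0 is reported as no linear correlation).

```python
def describe_correlation(r: float) -> str:
    """Label r using the 0.33 / 0.67 cut-offs from the value scales."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        # Assumption: the cards do not say how to label exactly 0.
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    # Assumption: boundary values (0.33, 0.67) take the stronger label.
    if size >= 0.67:
        strength = "strong"
    elif size >= 0.33:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"


print(describe_correlation(0.85))   # strong positive
print(describe_correlation(-0.5))   # moderate negative
print(describe_correlation(0.1))    # weak positive
```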

11
Q

Steps for using the Correlation Coefficient Formula

A
  1. Create a chart with 5 columns (x, y, x^2, y^2, xy)
  2. Calculate the sums of each column (by adding each value)
  3. Plug in the values and solve with BEDMAS
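
The three steps above can be sketched in code. The card does not state the formula itself, so this assumes the standard computational form, r = (nΣxy − ΣxΣy) / √([nΣx² − (Σx)²][nΣy² − (Σy)²]); the function name and the sample data are made up for illustration.

```python
from math import sqrt


def correlation_from_sums(xs, ys):
    """Compute r from the five column sums (x, y, x^2, y^2, xy)."""
    n = len(xs)
    # Steps 1 and 2: build each column and add it up.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Step 3: plug the sums into the formula.
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data: sums are 15, 20, 55, 86 and 66.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation_from_sums(xs, ys)
print(round(r, 4))  # 0.7746
```

Here the numerator is 5(66) − 15(20) = 30 and the denominator is √(50 × 30) = √1500, which gives r ≈ 0.77.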
12
Q

Linear Regressions in Desmos

A
  1. Click on the + sign and insert a table
  2. Type in your x and y values
  3. On the next line, type y1 ~ mx1 + b, and the r and r^2 will appear
13
Q

You can find the equation of the line of best fit by

A

subbing in the m and b that Desmos gives you. Just remember to write x after the m, because it's y = mx + b.

14
Q

A positive relationship between two variables will sound like this:

A

as the # of minutes increases, the cost also increases.

15
Q

To find unknown y values that are within our data sets we can either

A

1) sub that x into the equation and solve, or 2) hover over the x value in Desmos and read off the y.
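
Option 1 can be sketched outside Desmos as well. This example fits a least-squares line by hand and then substitutes an x value; the helper name and the minutes/cost data are hypothetical, and the data were chosen to lie exactly on cost = 0.2(minutes) + 5 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = mx + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data set: minutes used and cost in dollars.
minutes = [10, 20, 30, 40, 50]
cost = [7, 9, 11, 13, 15]

m, b = fit_line(minutes, cost)
# 25 minutes lies inside the data (10 to 50), so this is an interpolation.
predicted_cost = m * 25 + b
print(round(m, 2), round(b, 2), round(predicted_cost, 2))  # 0.2 5.0 10.0
```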

16
Q

Interpolation is

A

estimating values within the range of our data set.

17
Q

Extrapolation is

A

estimating values beyond the range of our data set.

18
Q

To find unknown x values that are within our data set we can either

A

1) sub in the y and solve for x, or 2) hover over the trend line as close to the y value as possible.
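
Option 1 amounts to rearranging y = mx + b into x = (y − b) / m. A minimal sketch, with a made-up function name and hypothetical values for m and b:

```python
def solve_for_x(y, m, b):
    """Rearrange y = mx + b to find the x that gives a chosen y."""
    if m == 0:
        raise ValueError("a horizontal line has no unique x for a given y")
    return (y - b) / m


# Hypothetical line of best fit: y = 0.2x + 5
m, b = 0.2, 5.0
x = solve_for_x(12.0, m, b)
print(round(x, 2))  # 35.0
```

The answer only counts as interpolation if the resulting x falls inside the range of x values in the data set.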

19
Q

A point can be rejected as an outlier if

A

1) There was an error in collecting the data.
ex. Kathy collected data on the heights of students and their arm spans.
When recording the data, she mistakenly recorded a height of 181 cm for 101 cm.

2) Outside factors affected the data.
ex. A company records its total revenue each month.
Last month, the workers at the company went on strike for two weeks.

20
Q

The presence of an outlier can affect the

A

analysis of data

21
Q

Other factors, such as

A

sample size and composition also need to be considered when analyzing data.

22
Q

Linear regression is only appropriate

A

if the data appears to be linear.

23
Q

Non-Linear Regression:

A

Definition — An analytical technique for finding a curve of best fit for data having a non-linear correlation.

24
Q

Examples of other types of relationships are:

A

1) Quadratic: Equation of the form: y= ax^2 + bx +c
2) Cubic: Equation of the form: y= ax^3 + bx^2 + cx +d
3) Quartic: Equation of the form: y= ax^4 + bx^3 + cx^2 + dx + e
4) Exponential: Equation of the form: y = a(b)^x

25
Q

Quadratic: Equation

A

y= ax^2 + bx +c

26
Q

Cubic: Equation

A

y= ax^3 + bx^2 + cx +d

27
Q

Quartic: Equation

A

y= ax^4 + bx^3 + cx^2 + dx + e

28
Q

Exponential: Equation

A

y = a(b)^x

29
Q

Any and all of these relationships can be tested to see

A

which best fits the data.
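
As a rough sketch of what "testing" two of these relationships could look like, the example below fits a linear and an exponential model to the same data and compares their r^2 values. Everything here is an assumption for illustration: the bacteria data are made up, the helper names are hypothetical, and the exponential fit is done by fitting a straight line to ln(y), which is not necessarily how Desmos fits curves.

```python
from math import exp, log


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = mx + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def r_squared(ys, predictions):
    """1 minus (unexplained variation / total variation)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: a bacteria count that doubles every hour.
hours = [0, 1, 2, 3, 4, 5]
bacteria = [100, 200, 400, 800, 1600, 3200]

# Linear model: y = mx + c
m, c = fit_line(hours, bacteria)
linear_r2 = r_squared(bacteria, [m * x + c for x in hours])

# Exponential model: y = a(b)^x, fitted as ln(y) = ln(a) + x ln(b)
slope, intercept = fit_line(hours, [log(y) for y in bacteria])
a, base = exp(intercept), exp(slope)
exp_r2 = r_squared(bacteria, [a * base ** x for x in hours])

print(round(linear_r2, 2))  # roughly 0.82
print(round(exp_r2, 2))     # roughly 1.0
```

The exponential model recovers a ≈ 100 and b ≈ 2 and has the higher r^2, so it is the better candidate for this made-up data set.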

30
Q

When fitting a curve, the measure that is used is called

A

the Coefficient of Determination (r^2).

31
Q

Coefficient of Determination (r^2)

A
  • Measures how close the data are to the curve.
  • It is the Correlation Coefficient squared.
  • It can have values from 0 to 1, with 1 being a perfect fit.
  • The higher the r^2 value, the stronger the relationship. The stronger the relationship, the more accurate interpolated and extrapolated predictions are. Conversely, the lower the r^2 value, the less accurate it is in making predictions.
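
For reference, r^2 is commonly defined for any fitted curve as shown below. This general form is an assumption about what the course and Desmos use, and the symbols are chosen here: y_i is an observed value, ŷ_i is the value the curve predicts for the same x, and ȳ is the mean of the observed values. For a least-squares straight line it works out to the square of the correlation coefficient.

```latex
r^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}
```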
32
Q

Value Scale for Coefficient of Determination (r^2)

A
  • Weak is 0 to 0.33
  • Moderate is 0.33 to 0.67
  • Strong is 0.67 to 1
33
Q

Criteria for Analyzing the Best Regression Model:

A
  • See if a curve is present
  • Is there a negative y-intercept? Does that fit the context (e.g. can there be a negative number of bacteria)?
  • Is the y-intercept higher or lower than it should be? Does it fit within the values of the chart?
  • Look at the full graph: does the curve start to drop for no reason, even though the data values continue to climb? If the graph of the curve does not fit the context of the data, it is not the right curve.
  • The r^2 must be strong, the y-intercept must match the context, and the curve must visually represent the data for it to be the right curve.
34
Q

Degrees of Causal Relationships

A
  • Correlation studies between variables are mainly used to find evidence of a cause-and-effect relationship.
  • A strong correlation does not prove that a change in one variable caused a change in the other.
    = Correlation does not necessarily mean a cause-and-effect relationship
    – There are various types and degrees of relationships between variables.
35
Q
  1. Cause-and-Effect Relationship
A
  • A change in X causes a change in Y.
  • Ex: An increase in the speed of a production line produces an increase in the output of items produced
  • e.g. more studying = better test scores
36
Q
  2. Common-Cause Factor
A
  • An external variable causes two variables to change in the same way.
  • Ex: A hot summer causes more people to go to the beach and increases the sales of water
    = the hot summer caused both of those
37
Q
  3. Reverse Cause-and-Effect Relationship
A
  • The dependent and independent variables are reversed in the process of determining the relation
  • Ex: A research project attempts to show that people who drink coffee get nervous, but finds that nervous people drink more coffee
    = the expected direction was "coffee makes you anxious", but it is anxious people who tend to drink coffee (reverse)
  • Ex: An increase in the number of police officers means more charges laid. The increase in crime has increased the number of police officers.
    = More crime is recorded because more police are there to notice it, but they were hired because of more crime in the first place
  • Ex: Average amount of traffic in a city and the number of roads built.
    = More roads = more cars on the roads = more traffic
  • Constant cycle: the study set out to show one direction and found the reverse
38
Q
  4. Accidental Relationship
A
  • A correlation exists without any causal relationship between the variables (by accident).
  • Ex: An increase in SUV sales and an increase in sales of red pens.
  • Just a fluke - strong correlation with no connection
39
Q
  5. Presumed Relationship
A
  • A correlation seems apparent or present although it can’t be proven - a guess or assumption
  • Ex: Active people will like a new sports streaming service
    = Just because you go to the gym and value wellness doesn’t mean you necessarily love sports
  • Ex: The earth’s average air temperature and the concentration of CO2 in the atmosphere
    = So many other factors are involved
  • An assumption that there’s a correlation
40
Q

Extraneous Variable (Outside Factors)

A
  • Variables that affect the determination of a causal relationship.
  • It can affect the dependent or independent variable.
  • Ex. There is a strong correlation between students’ term mark and their final exam marks. However, extraneous variables that could affect the degree of this correlation are things like a student’s exam schedule, interrupted study time, the pressure of writing an exam, etc.
  • In order to reduce the effect of extraneous variables, researchers often compare an experimental group to a control group. These two groups should be as similar as possible, so that extraneous variables will have about the same effect on both groups. The researcher may vary the independent variable for the experimental group but not the control group. Any difference in the dependent variables for the two groups can then be attributed to changes in the independent variable.
41
Q

Subtle wording that changes the meaning of information.

A

e.g. Last year, the unemployment rate was 8.5%. This year, unemployment has had a 1.5 percentage-point increase.
- 8.5% -> 10% (+1.5 percentage points)
- Highlights a “small” increase
- Trying to cover up the 10% rate (by making people do the math in their head)
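
The arithmetic behind this example can be checked directly. The sketch below uses the two rates from the card and shows how the same change looks as a percentage-point change and as a relative (percent) change; the variable names are made up for illustration.

```python
last_year_rate = 8.5   # unemployment rate last year, in percent
this_year_rate = 10.0  # unemployment rate this year, in percent

# Percentage-point change: a simple subtraction of the two rates.
point_change = this_year_rate - last_year_rate

# Relative change: the increase as a percent of last year's rate.
relative_change = point_change / last_year_rate * 100

print(f"{point_change:.1f} percentage points")     # 1.5 percentage points
print(f"{relative_change:.1f}% relative increase")  # 17.6% relative increase
```

Quoting "1.5 percentage points" sounds much smaller than "unemployment rose by about 17.6%", even though both describe the same change.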

42
Q

The use of large numbers that can lead to misunderstandings about the significance
of data.

A

e.g. Annual healthcare spending in Ontario will increase by $80 million.
- from what?
- $80 million out of how much in total budget? (Ontario’s budget could be billions, so $80 million just seems big)
- how does it compare with education, infrastructure, etc

43
Q

Comparisons that are made where the items are not weighted equally.

A

e.g. The following chart shows unemployment statistics, by province, for Canada.
Instead of numbers of thousands of unemployed people, it should show the unemployment rate
The graph does not take into account the population of each province
Obviously, Quebec and Ontario have much larger populations
Should be a %, not the total amount (the unemployment rate %)

44
Q

Small samples are used to represent larger populations, which distort the data.

A

e.g. A manager uses a systematic sample to choose every 7th employee from a roster of 30 employees to test how an aptitude test compares to employee productivity. He concludes that the company should hire only applicants who do well on the aptitude test.
- Only 4 pieces of data follow the trend line (those 4 just happened to follow this trend; not everyone might follow it)
- A roster of 30 people is small enough that the entire population could have been tested
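
A quick sketch shows why this sample ends up with only 4 data points. The employee numbers are placeholders, and starting at the 7th employee is an assumption (a systematic sample would normally use a random starting point).

```python
# Hypothetical roster: employees numbered 1 to 30.
roster = list(range(1, 31))
step = 7

# Take every 7th employee, starting with the 7th.
sample = roster[step - 1::step]

print(sample)       # [7, 14, 21, 28]
print(len(sample))  # 4
```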

45
Q

The method(s) of presentation may not give the whole picture

A

e.g.
- Pyramids all the same size - misleading
- Months shown in the wrong order (reversed); sales are actually decreasing, not increasing

46
Q

When data follows a general trend with no significant outliers,

A

this may indicate the presence of extraneous variables.

47
Q

Extraneous Variables

A

These are variables which influence the outcome of an experiment, though they are not the variables of interest.

48
Q

If possible, extraneous variables must be eliminated or at least accounted for as they

A

tend to weaken the correlation.

49
Q

Ways to eliminate extraneous variables:

A

1) Remove the variable. ex: if classroom noise is a factor, make sure the experiment is done in a quiet room.

2) Use a control group. ex: the experimental group gets the real pill, the control group gets the placebo pill.

50
Q

When data does not follow a general trend and has an unusual pattern such as two distinct clusters, it is not the work of extraneous variables but rather

A

hidden variables.

51
Q

When data follows an unusual trend, this may indicate the presence of a

A

hidden variable.

52
Q

These are variables which can skew statistical results so much that they can

A

invalidate the statistical results.

53
Q

If possible, hidden variables must be

A

researched and eliminated.

54
Q

How to eliminate hidden variables:

A

Separate your data into 2 clusters:
1) Before the event that caused the clusters
2) After the event that caused the clusters (include the year of the event)
3) Then you can write the equation, the r, and the r^2 for each cluster
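
The splitting step can be sketched as follows. The yearly data, the event year, and the helper name are all made up for illustration; the point is only that each cluster gets its own line, r, and r^2, and that these can look very different from a single fit through all the data.

```python
from math import sqrt


def line_and_r(xs, ys):
    """Return slope m, intercept b and correlation r for a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x, sxy / sqrt(sxx * syy)


# Hypothetical data with a sudden drop in 2015 (the hidden variable).
years = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
values = [50, 52, 54, 55, 58, 20, 23, 24, 27, 29]
event_year = 2015

# Cluster 1: before the event. Cluster 2: the event year and after.
before = [(x, y) for x, y in zip(years, values) if x < event_year]
after = [(x, y) for x, y in zip(years, values) if x >= event_year]

results = {"all data": line_and_r(years, values)}
for label, cluster in (("before", before), ("after", after)):
    xs, ys = zip(*cluster)
    results[label] = line_and_r(xs, ys)

for label, (m, b, r) in results.items():
    print(f"{label}: m = {m:.2f}, r = {r:.2f}, r^2 = {r * r:.2f}")
```

With this made-up data, the fit through all the points slopes downward, while each cluster on its own shows a strong upward trend.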

55
Q

Extraneous Variables Traits

A

Data follows trend
Can be accounted for

56
Q

Hidden Variables Traits

A

Unusual Pattern
Can invalidate results
Therefore must be eliminated

57
Q

Similarities between Extraneous and Hidden

A

Can weaken correlation
No outliers
Can be eliminated