Statistics Flashcards

chapter 1.1-1.4

1
Q

Describe the relationship between a dependent and an independent variable?

A

The dependent variable depends on the independent one.
In an experiment, the outcome measured on the subjects is the dependent variable, and the treatment or intervention is the independent variable. If the independent variable changes, the dependent variable is expected to change as well.

2
Q

Define the four levels of measurement!

A

Variables can be measured at four levels.

categorical variables (the category is what matters):

nominal: the values are categories without any order. Graph: pie chart / bar chart. Central tendency: mode.
ordinal: the order of the categories is meaningful. Graph: bar chart. Central tendency: median (or mode).

quantitative variables:

ratio: a scale with equal intervals and an absolute zero point.
Central tendency: mean, or median (with outliers).

interval: a scale with meaningful intervals between the numbers but no absolute zero point. Central tendency: mean, or median (with outliers).

3
Q

What makes a graph skewed to the left / right?

A

skewed to the left: the skew (tail) is on the left, so the peak is towards the right side.

skewed to the right: the skew (tail) is on the right, so the peak is towards the left side.

4
Q

What attributes do the mode, median and mean have?

A

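If you see a symmetric, single-peaked graph / histogram, the mode, median and mean all lie (roughly) in the middle. If the graph is skewed to the right, the mode is usually on the left, the median in the middle and the mean on the right, because the tail pulls the mean towards it.

The mode is the least informative and the mean is the most informative.

The mean is sensitive to outliers and not as stable (resistant) as the median.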

5
Q

What is the five-number summary?

A

The minimum, Q1, median, Q3 and the maximum.

6
Q

How do you calculate the Median? P.30

A

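Location of M = (n + 1) / 2. Order the scores and take the score at that location. If the location falls between two scores (n is even), put the two scores on either side into the formula: M = (x1 + x2) / 2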
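A minimal sketch of the location rule in Python, in case it helps to see it run. The function name `median` and the example scores are made up for illustration.

```python
def median(scores):
    """Median via the (n + 1) / 2 location rule."""
    ordered = sorted(scores)
    n = len(ordered)
    location = (n + 1) / 2  # 1-based position of the median
    if n % 2 == 1:
        # odd n: the location is a whole number, take that score
        return ordered[int(location) - 1]
    # even n: the location ends in .5, average the two neighbouring scores
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd_example = median([7, 1, 5])      # location 2 in [1, 5, 7]
even_example = median([4, 8, 1, 6])  # location 2.5 in [1, 4, 6, 8]
```

With three scores the median is the second ordered score; with four scores it is the average of the second and third.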

7
Q

How do you calculate the mean? P. 28

A

The sum of all scores divided by the number of scores n.

8
Q

How do you calculate the quartiles ?

A

To calculate the quartiles you first need to find the median, and then take the median of each of the two halves that the median splits the scores into: the median of the lower half is Q1, the median of the upper half is Q3.

9
Q

Define the boxplot! P.34

A

A graph of the five-number summary: a box drawn along a scale from Q1 to Q3 with a line at the median, and lines (whiskers) reaching out to the minimum and the maximum.

10
Q

How do you calculate the interquartile range IQR ?

A

IQR= (Q3-Q1)

11
Q

How do you identify outliers? P.36

A

Multiply the IQR by 1.5 and

- subtract it from Q1 -> everything below that value is an outlier

+ add it to Q3 -> everything above that value is an outlier
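The 1.5 * IQR rule can be sketched in Python as follows. This assumes the quartiles are found as the medians of the lower and upper halves; other textbooks and software use slightly different quartile rules. All function names and the scores are made up for illustration.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(scores):
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]            # scores below the median position
    upper_half = ordered[half + (n % 2):]  # scores above the median position
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

def outliers(scores):
    _, q1, _, q3, _ = five_number_summary(scores)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr   # everything below this is an outlier
    high_fence = q3 + 1.5 * iqr  # everything above this is an outlier
    return [x for x in scores if x < low_fence or x > high_fence]

scores = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = five_number_summary(scores)  # (min, Q1, median, Q3, max)
flagged = outliers(scores)
```

For these scores Q1 = 4 and Q3 = 8.5, so the IQR is 4.5 and only the score 30 lies beyond the upper fence of 15.25.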

12
Q

What's the standard deviation and how does it differ from the variance? P.38

A

The standard deviation looks at how far the scores lie from their mean and gives a kind of average of these distances.

The variance is simply the step before the standard deviation: the sum of the squared deviations from the mean divided by n - 1. The standard deviation is the square root of the variance. The variance also reflects the spread around the mean, but it is expressed in squared units.

The standard deviation is in the same units as the actual scores.
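A small Python sketch of "variance first, then the square root", assuming the sample versions with n - 1 in the denominator. The function names and scores are illustrative only.

```python
import math

def variance(scores):
    n = len(scores)
    mean = sum(scores) / n
    # average squared deviation from the mean (divided by n - 1)
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

def standard_deviation(scores):
    # the square root brings the spread back to the units of the scores
    return math.sqrt(variance(scores))

scores = [1, 2, 3, 4, 5]
var = variance(scores)
sd = standard_deviation(scores)
```

Here the variance is 2.5 (in squared units) and the standard deviation is its square root, about 1.58.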

13
Q

How is it possible to define the correlation between two variables?

A
  • variables are either positively or negatively associated
    • positive association means that above-average values of x tend to go together with above-average values of y (and below-average with below-average)

negative correlation: one goes up, the other goes down.

14
Q

What's a scatterplot?

A
  • visualisation of the relationship between two quantitative variables, e.g. price and rating
  • each individual (case) is represented as a dot in the coordinate system
  • -> you can easily spot outliers and see where the points cluster (whether there is a weak or strong correlation)
15
Q

What do I need to consider when figuring out a relationship between data?

A

To identify the relationship you need to:

  • identify the cases: how many cases are there, and of what kind?
  • check whether the variables are categorical or quantitative
  • decide for each variable whether it is a response variable or an explanatory variable
  • look at the values and labels of each variable
16
Q

What's a log transformation?

A
  • log comes from logarithm: the log (base 10) of a value is the power x for which 10^x (10 to the power of something) gives that value
  • if you apply a log transformation to a scatterplot, you replace the values by their logarithms; this makes the distances between the large scores shorter and the pattern more visible.
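A tiny Python illustration of why the distances shrink, assuming a base-10 logarithm. The prices are invented values that each grow by a factor of 10.

```python
import math

prices = [1, 10, 100, 1000, 10000]
log_prices = [math.log10(p) for p in prices]

# gaps between neighbouring values before and after the transformation
raw_gaps = [b - a for a, b in zip(prices, prices[1:])]
log_gaps = [b - a for a, b in zip(log_prices, log_prices[1:])]
```

On the raw scale the gaps keep growing (9, 90, 900, 9000), while on the log scale every step is the same size, which is why a stretched-out cloud of points becomes easier to read.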
17
Q

What does smoothing have to do with a scatterplot?

A
  • it's a geometrical method of drawing a smooth curve through a scatterplot, which helps us to see whether the relationship is linear.
  • the higher the smoothing value, the straighter the curve.
18
Q

What does the little r stand for? P 101

A
  • the correlation of two quantitative variables that have a linear relationship.
  • -> it describes whether this relationship is positive or negative and how strong or weak it is.
19
Q

What is the formula for r?

A

r = 1/(n-1) * Σ [ ((xi - mean of x) / sx) * ((yi - mean of y) / sy) ]
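The formula can be followed step by step in a short Python sketch: standardise each x and y, multiply the pairs, add them up and divide by n - 1. The function name `correlation` and the data are made up for illustration.

```python
import math

def correlation(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # product of the standardised x and y for every individual
    products = [((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
# changing the unit of x (e.g. metres -> centimetres) leaves r unchanged
r_rescaled = correlation([100 * x for x in xs], ys)
```

For this data r is about 0.77, and it stays the same after rescaling x.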

20
Q

Tell me something about the character of r ?

A
  • r is not affected if you change the scale: meters -> centimetres, the correlation stays the same
  • r is sensitive to outliers
  • r is always between -1 and 1
  • close to 0 = weak linear correlation; close to 1 / -1 = scores lie almost on a straight line
  • negative correlation = x above mean goes with y below mean (and x below mean with y above mean)
    visualisation: left top corner to right bottom corner
  • positive correlation = x above mean goes with y above mean (and x below mean with y below mean)
    visualisation: left bottom corner to right top corner
21
Q

How can you put two categorical variables in a table? P 137

A
  • you need a two-way table, which shows two categorical variables at once, e.g. yes / no and group one / two
  • the yes / no on the left shows whether the requirement was met or not = row variable
  • the column variable runs across the top and describes group 1 / 2
  • each combination of a row and a column is a cell
22
Q

What the hell is a joint distribution? P. 138

A
  • divide the count you write in the cell (the cell entry) by the entire sample size -> you get a proportion
  • -> all proportions together add up to 1

the collection (so not the sum, just all of the proportions written out) = the joint distribution

23
Q

What do you call the distribution of a single variable in a two-way table?

A

marginal distribution: the distribution of the column variable alone (or of the row variable alone), taken from the totals in the margins of the table. Its proportions add up to 1.

24
Q

What is the distribution within one value of the other variable (e.g. only the cases where the requirement was met) called?

A

conditional distribution: the proportions within one row (or one column); they add up to 1
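A small Python sketch of the joint, marginal and conditional distributions of one two-way table. The counts and the labels ("yes" / "no", "group 1" / "group 2") are invented for illustration.

```python
# (row value, column value) -> cell count
counts = {
    ("yes", "group 1"): 30, ("yes", "group 2"): 10,
    ("no", "group 1"): 20, ("no", "group 2"): 40,
}
n = sum(counts.values())

# joint distribution: every cell count divided by the sample size
joint = {cell: count / n for cell, count in counts.items()}

# marginal distribution of the column variable: column totals divided by n
column_totals = {}
for (row, col), count in counts.items():
    column_totals[col] = column_totals.get(col, 0) + count
marginal_columns = {col: total / n for col, total in column_totals.items()}

# conditional distribution of the row variable within group 1
conditional_group1 = {row: count / column_totals["group 1"]
                      for (row, col), count in counts.items()
                      if col == "group 1"}
```

Each of the three distributions adds up to 1; only the denominator differs (the whole sample for the joint and marginal distributions, the column total for the conditional one).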

25
Q

Why is a mosaic plot called mosaic plot?

A

because it consists of tiles (4 for a table with two rows and two columns), like the stones of a mosaic, whose sizes represent the marginal and conditional distributions. The proportions within each column add up to 1.

26
Q

What does a “lurking variable” have to do with Simpson's paradox? P.144

A
  • the lurking variable is a third categorical variable that is not in the two-way table but is important for the evaluation of the table and its values.

third categorical variable: e.g. time

Simpson's paradox occurs if the association you see in a two-way table reverses direction once the data are split by the third variable. The two-way table alone then leads to an incorrect interpretation, and a three-way table is needed to interpret the data correctly.

27
Q

What does the least-squares regression line show us?

A
  • the relation between the explanatory variable (x) and the response variable (y).
  • the difference from correlation is that it is also used to predict the outcome of y (response) for a given value of x (explanatory)
28
Q

What is the slope in relation to the intercept (not to skiing)?

A
  • the slope is the steepness of the least-squares regression line: how much the predicted y changes when x increases by 1
  • the intercept is where the line crosses the y-axis (x = 0)
29
Q

And how do you define them mathematically? The intercept and the slope?

A
  • the straight line relating x to y is ŷ = b0 + b1x
    b1 = slope; b1 = r * sy/sx
    b0 = intercept, mnemonic: (x = 0); b0 = mean of y - b1 * mean of x
30
Q

Why did they give the least-squares regression line this fucking complicated name?

And what does the formula look like?

A
  • because they wanted to state that this line aims to be as close as possible to the actual scores of y: it has the smallest possible sum of squared vertical distances between the line and the actual scores.

the formula is: ŷ = b0 + b1x
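A minimal Python sketch of the line ŷ = b0 + b1x, using slope = r * sy/sx and intercept = mean of y - slope * mean of x. The function name `least_squares_line` and the data are made up for illustration.

```python
import math

def least_squares_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    r = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x          # slope
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(xs, ys)
# prediction at the mean of x (which is 3 here)
predicted_at_mean_x = b0 + b1 * 3
```

For this data the line is roughly ŷ = 2.2 + 0.6x, and the prediction at the mean of x equals the mean of y (4), so the line passes through the point of the two means.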

31
Q

Through which point does the least-squares regression line always pass?

A

through the point (mean of x, mean of y) :)

32
Q

Try to explain the relation between the correlation and the least-squares regression line.

A
  • the correlation is actually the slope of the least-squares regression line when both variables are standardised!
    and r squared = variance of the predicted values ŷ / variance of the observed values y
33
Q

What's a regression line?

A
  • a straight line through a scatterplot that tries to show how the response variable y changes as the explanatory variable x changes.
34
Q

Relation regression and correlation

A
  • The regression line depends on the correlation.

- if there is a perfect linear association, all points lie on the regression line.

35
Q

What's the least-squares regression line?

A
  • a line which shows us the predicted values of y for given values of x and tries to be as precise as possible.
  • It is also called the trendline in Google Sheets
  • the equation (formula) is ŷ = a + b*x
36
Q

When are we able to use the regression line?

A
  • when we have a linear correlation
  • if we can get the outliers out of our data

Not

  • if we have a nonlinear correlation
  • if we have a curvilinear correlation
37
Q

What the heck are residuals?

A
  • the deviations of the real scores from the regression line: residual = observed y - predicted ŷ
  • so how wrong the prediction of the regression line was
  • -> a residual plot (a scatterplot of the residuals against x) helps us to grasp how wrong or right the regression line was
  • perfect association also means a perfect regression line: no residuals (or all at 0)
  • residuals are positive and negative and add up to 0
38
Q

How do we explain variation and what is unexplained variation ?

A
  • explained variation = the sum of squares from the regression
    So: the sum of (ŷ - mean of y) squared
    -> the deviation of the predictions from the mean
  • unexplained variation is what we find in the residuals.
    So: the sum of (y - ŷ) squared, the sum of the squared residuals
  • both together are the total variation: the sum of (y - mean of y) squared
39
Q

And what is then the proportion of explained variation?

A
  • it is what you get if you divide the explained variation by the total variation
  • -> r squared !!!
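A short Python sketch that splits the total variation into an explained and an unexplained part for a least-squares line. The slope is computed here as the sum of cross-products divided by the sum of squared x deviations, which gives the same value as r * sy/sx. The function name and data are made up for illustration.

```python
def variation_parts(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    predicted = [b0 + b1 * x for x in xs]
    residuals = [y - p for y, p in zip(ys, predicted)]
    explained = sum((p - mean_y) ** 2 for p in predicted)
    unexplained = sum(e ** 2 for e in residuals)
    total = sum((y - mean_y) ** 2 for y in ys)
    return explained, unexplained, total, residuals

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
explained, unexplained, total, residuals = variation_parts(xs, ys)
r_squared = explained / total
```

For this data the explained variation is about 3.6 and the unexplained about 2.4, together the total of 6, so the proportion explained is 0.6. The residuals add up to 0.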
40
Q

What is the problem with a restricted range?

A
  • restricting your range is problematic, because it can change your correlation considerably.
  • the more homogeneous a group is, the weaker the correlation tends to be.
41
Q

What's a mediator variable?

A
  • a variable that
42
Q

How do you calculate z- scores ?

A

???

43
Q

How do you calculate r -> the correlation coefficient?

A

???

44
Q

Univariate and bivariate graphs: how do outliers affect those graphs differently?

A

???

45
Q

how to interpret a shape of a curve?

A

unimodal, multimodal, outliers, skewed , symmetrical

46
Q

With what kind of shape of a curve does it make sense to calculate r?

A
  • linear

- linear with outliers( taking out the outliers)

47
Q

What’s a restriction of range - what effects does it have?

A

?

48
Q

synonym for contingency table , what does it represent, and how to represent perfect / no association?

A
? conditional distribution
joint distribution
perfect association
no association
how to calculate
49
Q

What is the kappa coefficient? How to calculate?

A
  • a measure of inter-rater agreement

- how much the raters of an essay, for example, agreed
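A hedged Python sketch of one common way to calculate it, assuming the kappa meant here is Cohen's kappa for two raters: kappa = (observed agreement - agreement expected by chance) / (1 - agreement expected by chance). The function name `cohens_kappa` and the ratings table are invented for illustration.

```python
def cohens_kappa(table):
    """table[i][j] = number of essays that rater A put in category i
    and rater B put in category j."""
    k = len(table)
    total = sum(sum(row) for row in table)
    # observed agreement: proportion of essays on the diagonal
    observed = sum(table[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # agreement expected by chance, from the marginal totals
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    return (observed - expected) / (1 - expected)

# rows: rater A (pass, fail); columns: rater B (pass, fail)
ratings = [[20, 5],
           [10, 15]]
kappa = cohens_kappa(ratings)
```

Here the raters agree on 35 of 50 essays (0.7), chance alone would give 0.5, so kappa is about 0.4. A kappa of 1 would mean perfect agreement.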

50
Q
  1. What do we mean by the term variables?
  2. What do we mean by the term individuals?
  3. What do we mean by the term categorical variable?
  4. What do we mean by the term quantitative variable?
  5. What do we mean by the term nominal level of measurement (course content)?
  6. What do we mean by the term ordinal level of measurement (course content)?
  7. What do we mean by the term interval level of measurement (course content)?
  8. What sort of information can we expect to find in the distribution of a variable?
  9. What sort of information can we expect to find in a frequency table?
  10. What sort of information can we expect to find in a pie chart?
  11. What sort of information can we expect to find in a bar chart?
  12. For which types of variables would we apply a pie or bar chart?
  13. How would you go about creating a stem plot (also referred to as stem-and-leaf plot)?
  14. What sort of information can we expect to find in a stem plot?
  15. To which types of variables (level of measurement) is a stem plot suited?
  16. When inspecting a distribution, we start by assessing its shape. Describe the main potential distribution shapes.
  17. What do we mean by the term skewed to the right, or positively skewed?
  18. Now that we have determined the shape, we will assess whether the distribution is unimodal or multi-modal. Which factors point to a multi-modal distribution?
  19. Once we have assessed the shape and (potential) multi-modality, we will check whether there are any outliers. What are outliers?
  20. What is a histogram, and what sort of information can we expect to find in this type of graph?
  21. To which types of variables (level of measurement) is a histogram suited?
  22. The class interval you choose can influence the shape of the distribution, as represented in a histogram. Why?
  23. What do we mean by the modal value of a distribution?
  24. What do we mean by the term median?
  25. What do we mean by the term average?
  26. How would you go about calculating the median and average?
  27. In which three cases would the median rather than the average be reported as the measure of central tendency?
  28. Outliers can pose a problem when determining the measure of central tendency for a distribution. Why?
  29. How should we deal with these outliers?
  30. What do we mean by the term percentile? For example, what is the 50th percentile?
  31. What is the first quartile Q1? What is the third quartile Q3?
  32. What is the interquartile range, or IQR?
  33. The interquartile range is applied as a measure of dispersion. What does this measure tell us? For example, what can we conclude if the IQR for pre-university education students’ mathematics exam results equals 4?
  34. Of which five statistics does the five-number summary consist?
A
  1. variables: characteristics of the individuals that can take different values, e.g. age or gender. They can be treated as independent and dependent variables or just as observed variables.
  2. individuals are the single subjects (cases) of a sample.
  3. a variable that is not quantitatively measurable but places subjects into different categories, e.g. gender
  4. a quantitative variable takes numerical values and measures the subjects on either a ratio scale or an interval scale.
  5. a measurement where the subjects / individuals get categorised into groups that have no order.
  6. a measurement where the categories have a meaningful order, but the distances between them are not meaningful.
  7. measurement on a scale where the intervals are meaningful and can be compared to each other.
  8. distribution of a variable: which values the variable takes and how often, so how the scores are spread over the scale.
  9. frequency table: a table that shows how often each value (or class of values) of a variable occurs, as counts or percentages.
  10. pie chart: percentages of the categories of a categorical variable
  11. bar chart: counts or percentages of the categories of a categorical variable
  12. nominal / ordinal - categorical variables
  13. split each number into a stem (all digits except the last) and a leaf (the last digit), write the stems in a column and put the leaves in increasing order next to their stem.
  14. the shape of the distribution, the mode and the individual scores themselves
  15. a stem plot is suited to quantitative (interval / ratio) variables with a small number of scores.
  16. main potential distribution shapes: symmetrical or asymmetrical (skewed to the left, skewed to the right), unimodal or multimodal
  17. that the majority of the scores is on the left side of the scale; the skew (tail) is on the right side.
  18. factors for a multimodal shape: the graph has several clearly separated peaks instead of one.
  19. outliers are scores that lie far away from the rest of the distribution. They can be detected with the interquartile range multiplied by 1.5: scores above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR.
  20. histogram: a graph like a bar chart for a quantitative variable. The scores are grouped into classes and the height of each bar shows how many scores fall into that class. It shows the distribution, mode and central tendency.
  21. interval / ratio variables -> quantitative
  22. because the wider the classes are, the more scores fall into each bar and the more detail is hidden; if the classes are narrower, the distribution looks wider and more irregular.
  23. modal value of a distribution: the score that occurs most often (the mode).
  24. the score at the middle position of all ordered scores, the same number of positions away from the minimum as from the maximum.
  25. the average is the sum of all scores divided by the number of scores (n).
  26. calculating the median: (n + 1) / 2 = position; take the score at that position. If the position is 13.5: add the scores at positions 13 and 14 and divide by 2. Calculating the average: see 25.
  27. three cases where to use the median instead of the mean
    - the graph has outliers
    - ordinal scale - categorical variable
    - skewed distributions
  28. some measures of central tendency are very sensitive to outliers (the mean). The outlier can draw a false picture of the central tendency because the mean is influenced by the outlier.
  29. use another measure of central tendency (median / mode), or leave the outliers out.
  30. percentiles are positions in the ordered scores: the p-th percentile is the score with p% of the scores below it. The 50th percentile (0.5) is the median.
  31. Q1 is the 25th percentile (0.25): the score where 25% of the scores are below it and 75% are above. Q3 is the 75th percentile (0.75).
  32. the distance between Q1 and Q3: IQR = Q3 - Q1
  33. how widely the middle 50% of the scores are spread over the scale. An IQR of 4 for the math exam means that the middle half of the exam grades spans 4 grade points, so the grades are quite spread.
  34. Minimum, Q1, Median, Q3 and Maximum
51
Q
  35. How would you go about creating a box plot?
  36. What sort of information could you derive from a box plot?
  37. Describe the 1.5 * IQR criterion used to detect outliers.
  38. According to the handbook, the critical inspection of histograms and box plots is generally more useful than the assessment of summary statistics such as measures of central tendency and of spread. Why is this often the case?
  39. How would you go about calculating the standard deviation?
  40. Why do we apply squares when calculating the standard deviation?
  41. Why do we report the standard deviation, rather than the variance, in practice?
  42. The standard deviation is a measure of dispersion. In which cases would we be more likely to report IQR rather than the standard deviation as a measure of dispersion?
  43. What can we conclude if both the standard deviation and average are not resistant?
  44. What is a linear transformation of scores for variable X?
  45. What will happen to the median and average if we apply a linear transformation to the scores for variable X?
  46. What will happen to the IQR and standard deviation if we apply a linear transformation to the scores for variable X?
  47. What are z-scores (standard scores) and how are they calculated?
  48. Does the transformation of X scores into z-scores count as a linear transformation?
A
  35. box plot: the visualisation of the five-number summary. Draw a box from Q1 to Q3 along the scale, a line in the box at the median, and whiskers out to the minimum and the maximum.
  36. Minimum, Maximum, Q1, Q3 and Median
  37. if you multiply the IQR by 1.5 and add it to Q3, you get the value above which you would define a score as an outlier. Subtract it from Q1 and you get the border below which scores are outliers.
  38. Because the histogram / box plot shows the shape of the distribution, outliers and several peaks, which the summary statistics alone can hide.
  39. standard deviation -> root of the variance
    sx = root of ( sum of (xi - mean) squared / (n - 1) )
  40. why squared when we calculate the standard deviation?
    because the deviations from the mean always add up to 0; squared they all become positive
  41. the standard deviation is in the same units as the actual scores, whereas the variance is in squared units. That makes the standard deviation easier to interpret.
  42. If there are outliers or the distribution is skewed, because the standard deviation is calculated with the mean. sx is sensitive to outliers just like the mean.
  43. That both are strongly influenced by outliers and skewness. For such distributions we can use other measures, for example the median and the IQR.
  44. a linear transformation transforms your data to another scale: new score = a + b * old score. For example kg to mg, or Celsius to Fahrenheit. The shape of the distribution doesn't change.
  45. The median and mean are transformed in the same way as the scores: new value = a + b * old value.
  46. The IQR and the standard deviation are multiplied by b (its absolute value); adding a constant a does not change them. For example, from Celsius to Fahrenheit the numbers get larger by a factor of 1.8.
  47. z-scores indicate how many standard deviations a score is above or below the mean: z = (score - mean) / standard deviation.
  48. yes, because it transforms the scores to another scale by subtracting a constant (the mean) and dividing by a constant (the standard deviation).
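A minimal Python sketch of the z-score calculation, assuming the sample standard deviation (n - 1 in the denominator). The function name `z_scores` and the scores are made up for illustration.

```python
import math

def z_scores(scores):
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    # how many standard deviations each score lies above or below the mean
    return [(x - mean) / sd for x in scores]

scores = [1, 2, 3, 4, 5]
z = z_scores(scores)
z_mean = sum(z) / len(z)
z_sd = math.sqrt(sum((v - z_mean) ** 2 for v in z) / (len(z) - 1))
```

After the transformation the z-scores have a mean of 0 and a standard deviation of 1, and the score equal to the mean (3) gets z = 0.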

52
Q
  1. What do we mean by the terms univariate frequency distribution and bivariate frequency distribution? (course content)
  2. Under which circumstances can we expect an inevitable association between variables X and Y?
  3. How can we apply a contingency table to determine whether there is any association between categorical variables X and Y?
  4. What do we mean by the term joint distribution?
  5. What are contingency table marginals?
A
  1. univariate is measuring one variable and comparing the scores among each other. Bivariate is measuring two variables and looking at the relationship between those variables. For categorical variables these are shown in a one-way and a two-way table.
  2. if we have a linear correlation, a correlation coefficient of 1.
  3. If the relative frequency distributions (the conditional distributions) differ between the columns or between the rows, there is an association; if they are equal, there is no association.
  4. The joint distribution is the distribution where both variables are taken into account: the proportion of the whole sample in each cell.
  5. Marginals show us the univariate distributions: the distributions where only one variable is taken into account (the row totals and the column totals).
53
Q
  8. What do we mean by the term conditional distribution?
  9. Let’s say there is no association whatsoever between categorical variables X and Y. We can then calculate the cell frequencies on the basis of the row totals. How would you go about this calculation? (course content)
  10. Under which circumstances can we expect to see a perfect association between variables X and Y? (course content)
  11. Why don’t we simply apply the proportion of corresponding ratings by two raters as the measure for inter-rater agreement?
  12. How can the Kappa measure help us to correct this problem?
  13. What is the maximum value of Kappa?
  14. Does a high Kappa value reflect a high or low level of inter-rater agreement?
  15. What do we mean by the terms response variable and explanatory variable?
  16. What sort of information can we expect to find in a scatter plot?
  17. What is the level of measurement for the variables in a scatter plot?
  18. When inspecting a scatter plot, you will focus on the nature, direction and strength of the relationship. What could you expect to see?
  19. When inspecting a scatter plot, it is important to identify any outliers. What does this involve?
  20. Under which circumstances can we expect to see a negative association between two interval variables, and under which conditions can we expect a positive association?
A
  8. conditional distribution: the distribution of one variable within a single category of the other variable in a contingency table, so the cell counts of one row (or column) divided by that row (or column) total.
  9. Cell frequency on the basis of the row totals: with no association every row has the same distribution over the columns, so the expected cell frequency = row total * column total / total sample size.
  10. if we see in the columns that one condition is 0% and the other is 100%. Then we can perfectly predict under which condition there will be a failure and under which there will be a win.
  11. Because the proportion of corresponding ratings only shows the observed agreement. Part of that agreement is expected by chance alone and is not taken into account.
  12. the Kappa value corrects the observed agreement for the agreement expected by chance.
  13. the maximum is 1 (perfect agreement).
  14. a high kappa value reflects a high level of agreement.
  15. response variable = y
    explanatory variable = x
  16. The visualisation of the relationship between two measurements: the relationship between the explanatory and the response variable.
  17. quantitative: interval and ratio
  18. whether it is linear or not, whether it goes upwards or downwards, and how strong it is: if the points are close together it is a strong relationship, if not it is a weak one.
  19. looking for points that lie far away from the overall pattern of the point cloud.
  20. when high scores on x go together with high scores on y and low with low (the points lie mostly in the lower left and upper right of the plot), there is an upward trend = positive association. If high scores on x go together with low scores on y (upper left and lower right), it is a downward trend = negative association.
54
Q
  21. What shape would the scatter plot point cloud take if there was a certain (but not perfect) degree of association between X and Y that could be categorised as positive linear?
  22. Now answer the same question for a negative linear association.
  23. What shape would the point cloud take in the event of an entirely positive or negative linear association between X and Y?
  24. Under which circumstances would the point cloud lead you to expect the existence of sub groups?
  25. Imagine you want to visualise the association between age and sensation seeking in a scatter plot. How could you incorporate information on the individuals’ gender into such a point cloud?
  26. Give an example of a case in which the association between two interval variables could be influenced by a third – lurking – variable.
  27. What sort of graph could we use to visualise the association between a categorical variable and an interval variable?
  28. How can we deduce whether there is any association between two variables in such a graph?
  29. What does the correlation rxy express?
  30. Specify the boundary values for rxy.
  31. What is the meaning of an rxy value of 1 and -1, respectively?
  32. If rxy equals 0, can we conclude that there is no association whatsoever between X and Y?
  33. Specify the requisite level of measurement for X and Y needed to conduct a meaningful calculation of rxy.
  34. What will happen to the correlation between age and income if we express age in months instead of years, and express income in guilders instead of euros?
  35. What will happen to the scatter plot that expresses the relationship between Age and Income?
  36. What does the negative rxy value point to?
  37. How will the outliers in the point cloud affect the correlation between X and Y?
  38. If the correlation between variables X and Y equals +1, can we safely conclude that the averages of X and Y are equal?
A
  21. it would go from the lower left to the upper right, and the points would be tight together, forming almost a line.
  22. the same holds for the negative correlation, with the exception that the cloud goes in the opposite direction: a so-called downward trend from the upper left to the lower right.
  23. all points would lie exactly on a straight line, either rising (positive) or falling (negative).
  24. if there were two (or more) separated point clusters.
  25. I would use separate colours (or symbols) for the separate genders.
  26. for example if you want to compare the rate of car accidents between different age groups and you only look at the sum of accidents in one year and the age intervals. You don't take into account the time intervals in which those accidents happened.
  27. side-by-side box plots (or a histogram per category)
  28. we can see by looking at the graph whether the distributions of the different categories differ from each other, e.g. in their medians. If they differ, there is an association.
  29. it expresses the strength and the direction of a linear correlation, and whether a linear correlation is present or not.
  30. they reach from -1 to 1.
  31. 1 = perfect positive cor.; -1 = perfect negative cor.
  32. No, this only means that there is no linear association between X and Y. There can still be a non-linear (e.g. curved) association.
  33. requisite level of measurement: both variables must be quantitative (interval or ratio).
  34. nothing will happen, because the relation between the two variables doesn't change.
  35. the scatterplot will have the same shape; only the numbers on the axes change to the larger scale.
  36. a negative rxy tells us that higher values of x tend to go together with lower values of y. It does not tell us that x causes the decrease of y.
  37. the outliers will pull the correlation in their direction, so to say: they can make it much stronger or much weaker.
  38. no we can not, because they may have different scales. Only the linear relationship between them is perfect. That doesn't mean that they are the same in numbers.
55
Q
  1. What do we mean by regression of Y on X? (course content)
  2. b is the slope in a Y = a + bX linear regression equation. What does it express?
  3. a is the intercept in a linear regression equation. What does it express?
  4. Can the slope tell us anything about the extent to which X and Y are associated?
  5. Which criterion do we apply to determine the best fitting straight line in a point cloud and what does this entail?
  6. According to this criterion, what would be the most accurate prediction of Y given X, if rxy = 0? (course content)
  7. We can use the best fitting straight line to predict Y on the basis of X as accurately as possible. However, we must be wary of extrapolation. What is extrapolation, and why does it pose a risk?
  8. If we convert X and Y into standard scores (see module 1), what is the value of the intercept and slope in the regression equation?
  9. What does the linear correlation tell us about the linear regression of Y on X?
  10. Does it matter whether we regress Y on X or vice versa?
  11. Will a decision as to whether we regress Y on X or vice versa affect the regression equation?
  12. What can we deduce from the square of the linear correlation coefficient?
  13. If we draw the best fitting straight line in the point cloud, the degree of dispersion around this line may vary. Can this degree of dispersion tell us anything about rxy? About a? About b? About the quality of the predicted value of Y, given X?
  14. What do we mean by the term residual?
  15. How would you go about creating a residual plot?
  16. If the association between X and Y is non-linear, how would this be reflected in a residual plot?
  17. Why will an outlier in X direction have a clear influence on the slope of the regression line, whereas this will not apply in the case of an outlier in Y direction?
  18. What do we mean by the statement ‘correlation and regression are non-robust’?
  19. What is the restriction of range problem, and when could this phenomenon occur?
A
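  1. that the regression line is going to predict the values of y on the basis of the values of x.
  2. it expresses the steepness of the regression line: how much the predicted Y changes when X increases by 1.
  3. the intercept is the predicted value of Y at the point where x is zero. Crossing of the y axis.
  4. no, not on its own: the slope depends on the units of X and Y. It tells us the direction of the association; the strength is expressed by the correlation.
  5. the least-squares criterion: the best fitting straight line is the one with the smallest sum of squared vertical distances (residuals) between the points and the line.
  6. the mean of Y, for every value of X (the slope is 0).
  7. extrapolation is predicting Y for values of X outside the range of the observed data. It is risky because we don't know whether the linear relationship still holds there.
  8. the intercept is 0 and the slope is equal to rxy.
  9. the sign of the correlation is the sign of the slope (b = r * sy/sx), and the stronger the correlation, the closer the points lie to the regression line and the better the predictions.
  10. yes, it matters: the line that predicts Y from X is not the same line as the one that predicts X from Y. The correlation stays the same.
  11. yes, the regression equation changes (different slope and intercept), unless the correlation is perfect.
  12. square of the correlation coefficient: the proportion of the variation in Y that is explained by the regression on X.
  13. the tighter the points are together around the line, the stronger the correlation is and the better the predicted values of Y are. The dispersion tells us nothing about a or b.
  14. the deviation of an observed score from the regression line: observed y - predicted y.
  15. collecting the deviations from the regression line and putting those in a scatterplot against x, with a central line at 0 in the middle.
  16. the residual plot would show a pattern, for example waves or a curve, instead of points scattered randomly around 0.
  17. an outlier in x direction lies far from the mean of x and therefore pulls the regression line towards itself, which changes the slope. An outlier in y direction near the middle of the x values mostly shifts the line up or down a little.

18. that they are sensitive to outliers.

  19. The restriction of range problem occurs when only part of the range of scores is observed: the correlation is then usually weaker than in the whole range, so it is important to keep the whole range in order to correctly interpret the data.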
    for example restricting the range in order to