6 | Statistics for Proportions and Frequencies I Flashcards

Question

What are some requirements for doing chi square? (statscast)

Answer 1

* At least one case in every cell in the table * At least 80% of table cells should have 5 or more cases (some say all cells) * → if counts are too low, you can combine some levels or collect more data * All data should be independent, ie scores should not influence one another
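
The cell-count rule can be checked before running the test. The sketch below is a minimal illustration, assuming the rule is applied to the expected counts (a common convention); the helper name `check_chisq_counts` and both example tables are made up for illustration.

```r
# Hypothetical helper (not part of the course material): applies the rule of
# thumb to the expected counts of a contingency table.
check_chisq_counts <- function(tab) {
  # chisq.test() warns when counts are small; only the expected table is needed
  e <- suppressWarnings(chisq.test(tab)$expected)
  list(
    min_expected     = min(e),        # should be at least 1
    share_at_least_5 = mean(e >= 5),  # should be at least 0.8
    ok               = min(e) >= 1 && mean(e >= 5) >= 0.8
  )
}

big   <- matrix(c(12, 8, 5, 15), nrow = 2)  # illustrative counts
small <- matrix(c(1, 2, 3, 4), nrow = 2)    # illustrative counts

check_chisq_counts(big)$ok    # TRUE: all expected counts are above 5
check_chisq_counts(small)$ok  # FALSE: no expected count reaches 5
```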

Answer 2

* At least 1 case in every cell, 80% should have at least 5 * A significant chi square test doesn’t tell you which levels of your variables are driving the effect → Chi squares with many levels can be difficult to interpret * Results of inferential statistics should only be applied to populations that resemble the sample * All data should be independent, ie scores should not influence one another

Answer 3

* Non-parametric test → more robust than parametric tests (t-test, ANOVA, ..) * Don’t even need normal distributions!

Answer 4

A Chi Square test of independence was used to check for a relationship between x and y, Χ² (df) = 6.0, p < .05, indicating a statistically significant relationship between x and y. (statscast)

Answer 5

* Part of chi squared test * normalized measure for the distance of each cell frequency to the expected data * residuals = (Observed – Expected) / √Expected * → see any under- or overrepresentation / how far data from an independence table
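
The residual formula can be reproduced from the pieces that `chisq.test()` returns. This is a minimal sketch; the 2x2 table `tab` and its labels are invented for illustration.

```r
# Hypothetical 2x2 table of counts (illustrative values only)
tab <- matrix(c(20, 30, 30, 20), nrow = 2,
              dimnames = list(group = c("a", "b"), outcome = c("yes", "no")))

ct <- chisq.test(tab)

# Pearson residuals by hand: (observed - expected) / sqrt(expected)
res_manual <- (ct$observed - ct$expected) / sqrt(ct$expected)

res_manual     # negative = underrepresented, positive = overrepresented
ct$residuals   # the same values as computed by chisq.test()
```

Every expected count is 25 here, so the residuals are -1 for the cells with 20 cases and +1 for the cells with 30.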

Answer 6

* table() * returns a contingency table

Answer 7

* ftable() * returns a contingency table

Answer 8

apply() → you can ignore some fields and just extract some dimensions

Answer 9

* use table() eg tab = table(x) * check what the maximum is eg idx = which(max(tab)==tab) * return(names(tab)[idx])
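
The three steps can be wrapped into a small function. This is a sketch; the name `stat_mode` is an assumption chosen to avoid clashing with base R's `mode()`, which reports the storage mode rather than the most frequent value.

```r
# Return the most frequent value(s) of a vector
stat_mode <- function(x) {
  tab <- table(x)                # frequency of each distinct value
  idx <- which(tab == max(tab))  # position(s) of the highest count
  names(tab)[idx]                # value(s) as character; ties are all returned
}

stat_mode(c("a", "b", "b", "c"))  # "b"
stat_mode(c(1, 1, 2, 2, 3))       # "1" "2"
```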

Answer 10

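cut()
```
# stand-alone version: builds the data frame instead of adding a column
survey = data.frame(cm = c(150, 187, 165, 166, 170, 180, 145, 191, 160))
# bins (0,160], (160,185], (185,250]: 'dwarves', 'normal', 'giant'
cSize = cut(survey$cm, c(0, 160, 185, 250))
```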
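
`cut()` can also name the bins directly through its `labels` argument. A minimal sketch with the same heights; the label names are placeholders chosen here.

```r
cm <- c(150, 187, 165, 166, 170, 180, 145, 191, 160)

# Intervals are right-closed by default: (0,160], (160,185], (185,250]
size <- cut(cm, breaks = c(0, 160, 185, 250),
            labels = c("small", "medium", "tall"))  # placeholder labels

table(size)  # counts per bin: 3, 4 and 2
```

Because 160 sits on a right-closed boundary, it falls into the first bin.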

Answer 11

* either: addmargins(tab), then calculate for each cell row total x column total / grand total, eg m = addmargins(tab); m[i,"Sum"]*m["Sum",j]/m["Sum","Sum"] * or: chisq.test(tab)$expected
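
Both routes give the same expected counts. The sketch below uses an invented 2x2 table; the variable and level names are illustrative only.

```r
# Hypothetical table (illustrative counts)
tab <- as.table(matrix(c(12, 8, 5, 15), nrow = 2,
                       dimnames = list(treatment = c("drug", "placebo"),
                                       outcome   = c("better", "same"))))

m <- addmargins(tab)  # appends a "Sum" row and a "Sum" column

# Expected count for one cell: row total * column total / grand total
e_11 <- m["drug", "Sum"] * m["Sum", "better"] / m["Sum", "Sum"]
e_11                                        # 17 * 20 / 40 = 8.5

# All cells at once, straight from the test object
chisq.test(tab)$expected
```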

Answer 12

chisq.test(tab)$residuals

Answer 13

Calculate the proportions with prop.table() eg to: * Summarize proportions from contingency tables. * Normalize data (row-wise, column-wise, or total proportion). * Check categorical distributions in datasets. * Compare observed vs. expected values (like in Chi-square tests)

Answer 14

Function for a proportion table? And how to control whether the rows or columns sum to 1? * prop.table(table) * rowwise: prop.table(table,1) * columnwise: prop.table(table,2)
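
A short sketch of the three variants on an invented table (names and counts are illustrative):

```r
# Hypothetical table (illustrative counts)
tab <- as.table(matrix(c(10, 30, 20, 40), nrow = 2,
                       dimnames = list(sex = c("f", "m"),
                                       answer = c("no", "yes"))))

p_all <- prop.table(tab)     # all cells together sum to 1
p_row <- prop.table(tab, 1)  # each row sums to 1
p_col <- prop.table(tab, 2)  # each column sums to 1

rowSums(p_row)
colSums(p_col)
```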

Answer 15

* Pie chart * Barplot * Dotchart

Answer 16

* Mosaicplot * Assocplot (or assoc from vcd) * Fourfold

Answer 17

configure graphical parameters plotting multiple graphs in a single figure: ``` > par(mfrow=c(1,3),mai=c(0.5,0.5,0.5,0.3)) # three figures - 1 row, 3 columns; margins in inches around plot > pie(6:11,col=1:6,cex=2) > barplot(6:11,col=1:6,cex.axis=2,cex.names=2) > box() > dotchart(6:11,cex=1.4,col=1:6,xlim=c(0,12),pch=15) ```

Answer 18

Human eye is not so good at noticing the differences in a pie chart → sometimes better to use eg a barplot or a dotchart.

Answer 19

* less cluttered * less ink * less redundant * overlay second variable

Answer 20

* cex: marker size * col: colour * pch: shape of marker

Answer 21

* Exploring the relationship between two variables * absolute numbers visualized * Visualises proportions with area

Answer 22

* assocplot() * Exploring the relationship between two variables * Visualisation of pearson residuals

Answer 23

``` > par(mfrow=c(1,2), mai=c(0.4,0.7,0.5,0.0)) > mosaicplot(table(cSize, survey$gender), col=c(2,4),cex=1.0,main="mosaic") > assocplot(table(cSize, survey$gender),main="assoc") ```

Answer 24

Mosaicplot / assocplot * Width: actual proportions by absolute value / proportional to the square root of the expected count * Height: actual proportions by absolute value / proportional to the pearson residual (signed: above or below the baseline)

Answer 25

* One can also use a stacked barplot to visualise two variables - However this does not show actual numbers in the width so mosaicplot is more useful * One can also show pearson residuals with a dotchart, but it's much more difficult to grasp than with an assocplot

Answer 26

* vcd library assocplot, shows the scale of the pearson residuals * ie shows the significance of the pearson residuals, also with colour coding if desired ``` > library(vcd) > assoc(aids.azt,shade=T) ```

Answer 27

A fourfold plot provides a graphical expression of the association in a 2×2 contingency table, visualising the odds ratio. Each cell entry is represented as a quarter-circle (denoted by the middle of the three rings). * Actual numbers from contingency table shown in the corners * Proportions in relative terms * Dark blue: significant changes between groups * Confidence intervals – lines above and below circle

Answer 28

``` cotabplot(aids.azt,panel=cotab_fourfold) ```

Answer 29

Hist, density

Answer 30

boxplot (stripchart) (violinplot)

Answer 31

* (pie) * (barplot) * Dotchart – best option

Answer 32

Boxplot (stripchart) (violinplot)

Answer 33

* Association plot * Mosaic plot * Fourfold plot

Answer 34

* Contingency & proportion tables * Mode * Visualising: assoc, mosaic, fourfold plots

Answer 35

* generalize from sample to population * not only for the average, but also spread, form of the distribution

Answer 36

* Probability used as a measure of uncertainty * Uncertain as long as our sample is smaller than population → we estimate parameters of population and we specify the extent of uncertainty * How: compare our data with random data where we know there is no effect

Answer 37

* compare our data with random data where we know there is no effect

Answer 38

* a phenomenon is called random if the outcome cannot be calculated with certainty * ex: coin tossing, we don’t know outcome before

Answer 39

* Sample space S: collection of possible outcomes * Eg coin: S = {Head, Tail} (Coin, ns = 2) * Eg dice: S = {1, 2, 3, 4, 5, 6} (Dice, ns = 6) * sample space can contain an infinite number of outcomes – eg body height if it could be measured exactly

Answer 40

P(Head) + P(Tail) = 1

Answer 41

* An event (E) is a subset of the sample space * possible event for a die roll, made up of three outcomes: E = {1, 2, 4} * probability of this event P(E) = 1/6 + 1/6 + 1/6 = 1/2

Answer 42

* for any event there exists a complement: those items in sample space but not in event (disjoint) * probability of the complement: P(E^c) = 1 - P(E)

Answer 43

* the set of outcomes that are at least in one of the events
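
Events, complements and unions map directly onto R's set functions. A minimal sketch for one fair die roll; `E2` and the helper `prob` are made up for illustration.

```r
S  <- 1:6         # sample space of one die roll
E1 <- c(1, 2, 4)  # an event with three outcomes
E2 <- c(4, 5)     # second event, made up for illustration

# Probability of an event when all outcomes are equally likely
prob <- function(event) length(event) / length(S)

prob(E1)               # 3/6 = 0.5
prob(setdiff(S, E1))   # complement {3, 5, 6}: also 0.5
1 - prob(E1)           # same value via P(E^c) = 1 - P(E)
prob(union(E1, E2))    # union {1, 2, 4, 5}: 4/6
```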

Answer 44

* two events → conditional probability * P(E1|E2) = probability of event E1 if E2 has occurred

Answer 45

* two events independent if knowledge of outcome of E1 does not alter prob. for event E2 * P(E1|E2) = P(E1) and P(E2|E1) = P(E2)

Answer 46

* Product of the marginal probabilities * P(E1 ∩ E2) = P(E1) x P(E2) * Dice throw: two times a six in sequence → P(E_6∩6) = 1/6 x 1/6 = 0.028
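
The product rule for two sixes in a row can be verified by listing all 36 equally likely outcomes of two dice. A minimal sketch (variable names chosen here):

```r
# All 36 combinations of two die rolls
rolls <- expand.grid(first = 1:6, second = 1:6)

# Joint probability of a six followed by a six
p_joint <- mean(rolls$first == 6 & rolls$second == 6)

# Product of the two marginal probabilities
p_product <- mean(rolls$first == 6) * mean(rolls$second == 6)

# Independence: knowing the first roll does not change the second
p_cond <- mean(rolls$second[rolls$first == 6] == 6)

c(p_joint, p_product, p_cond)  # 1/36, 1/36, 1/6
```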

Answer 47

P(A|B) = P(A∩B) / P(B), if P(B) ≠ 0

Answer 48

* Allows calculation of Ps for events which are not independent of each other * mathematical rule for inverting conditional probabilities → find P of a cause, given effect. * (= Bayes’ law or Bayes' rule, after Thomas Bayes)

Answer 49

P(A|B) = P(B|A)P(A) / P(B)

Answer 50

* P(C|SM) = P(SM ∩ C) / P(SM) (conditional probability) * with P(SM|C) = 0.4, P(C) = 0.001, P(SM) = 0.25: (0.4 * 0.001)/0.25 = 0.0016 * smokers have around 60% higher risk of lung cancer (0.0016 vs 0.001) * with Bayes theorem: P(C|SM) = P(SM|C) * P(C) / P(SM) https://www.youtube.com/watch?v=HZGCoVF3YvM
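
The arithmetic can be replayed in R. This is a sketch under one assumption: the three numbers on the card are read as P(SM|C) = 0.4, P(C) = 0.001 and P(SM) = 0.25, which is the reading that matches the Bayes formula on the same card.

```r
# Assumed interpretation of the card's numbers
p_sm_given_c <- 0.4    # P(smoker | cancer)
p_c          <- 0.001  # P(cancer)
p_sm         <- 0.25   # P(smoker)

# Bayes' theorem: P(C|SM) = P(SM|C) * P(C) / P(SM)
p_c_given_sm <- p_sm_given_c * p_c / p_sm
p_c_given_sm        # 0.0016

# Relative risk compared with the base rate
p_c_given_sm / p_c  # 1.6, ie about 60% higher
```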

Answer 51

* supports statistical needs of experimental scientists and pollsters; * does not support all needs; gamblers typically require estimates of the odds without experiment * in SBI course: mostly frequentist

Answer 52

Bayesian: * having prior probability and posterior probability * degree of belief in event (based on prior knowledge, previous experiments, or beliefs) * combining old data with new evidence Frequentist: * probability = limit of relative frequency of event after many trials * → probabilities can be found (in principle) by a repeatable objective process * → thus ideally devoid of opinion

Answer 53

* A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. * The term 'random variable' in its mathematical definition refers to neither randomness nor variability Eg random variable X for head or tail outcome of a coin: * P(H) + P(T) = P(X = 0) + P(X = 1) = 0.5 + 0.5 = 1 * set of possible values is the range for the variable: range of X: X = {0, 1} Eg dice: * P(X=1) = 1/6 * range of X for a die: X = {1, 2, 3, 4, 5, 6}

Answer 54

* Bernoulli special case of Binomial distribution with n = 1 (just 1 trial) * Binomial distribution: seen when sequence of independent random variables with same P * parameter Θ is the probability of success (for events like: 1, survived, YES, female, tails) * P(X = 1) = Θ * P(X = 0) = 1 – Θ

Answer 55

* Has parameters n and p * discrete P distribution of number of successes in sequence of n independent experiments * each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 – p) * basis for the binomial test of statistical significance.
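
The binomial probabilities follow from n and p alone. A minimal sketch comparing the textbook formula with `dbinom()`, using n = 10 and p = 0.5 as example values:

```r
n <- 10   # number of independent trials (example value)
p <- 0.5  # probability of success per trial (example value)
k <- 0:n  # possible numbers of successes

# P(X = k) = choose(n, k) * p^k * q^(n - k), with q = 1 - p
pmf_manual <- choose(n, k) * p^k * (1 - p)^(n - k)
pmf_r      <- dbinom(k, size = n, prob = p)

all.equal(pmf_manual, pmf_r)  # TRUE
sum(pmf_r)                    # probabilities sum to 1
sum(k * pmf_r)                # mean = n * p = 5
```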

Answer 56

* Special case of binomial distribution * A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, * sequence of outcomes is called a Bernoulli process * for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution.

Answer 57

* r: random number generator * p: probability function (cumulative probability function c.d.f) * d: density function (point probability) * q: quantile function (inverse c.d.f)

Answer 58

* pnorm(): Cumulative probability * dnorm(): Probability density * qnorm(): Quantile function * rnorm(): Generate random numbers
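
A short sketch of how the four function types relate for the standard normal distribution; the same prefixes work for other distributions (eg `dbinom`, `pbinom`, `qbinom`, `rbinom`).

```r
dnorm(0)      # density at 0, equal to 1 / sqrt(2 * pi)
pnorm(1.96)   # cumulative probability P(X <= 1.96), about 0.975
qnorm(0.975)  # quantile for 0.975, about 1.96

# q is the inverse of p
qnorm(pnorm(1.5))  # 1.5

# r draws random numbers; set.seed() makes the draw reproducible
set.seed(1)
x <- rnorm(5)
length(x)          # 5
```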

Answer 59

> # 20 times each time 10 coin trials # number of tails > rbinom(20,10,p=0.5) [1] 4 6 1 6 3 4 7 7 5 6 7 6 7 7 6 7 3 8 7 6 > # bernoulli special case with n=1 just one coin trial > rbinom(20,1,p=0.5) [1] 1 0 1 0 1 1 0 0 1 1 1 0 1 0 0 0 1 0 0 1

Answer 60

* binomial distribution has upper limit: throw coin 50 times → max head value is 50 * count numbers without theoretical limits (spatial or temporal) → often follow Poisson distribution * lower limit is zero, but no upper limit

Answer 61

* count cells in a grid, number of doctor visits by a patient … * parameter λ = rate of occurrence within a certain time or space; it is the mean of the distribution (estimated by the sample mean) * Higher λ → higher average count (and higher variance, since the variance is also λ)
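
The role of λ can be checked numerically. A minimal sketch with λ = 3 as an example value; the sums run up to 200, far enough that the remaining probability mass is negligible.

```r
lambda <- 3  # example rate
k <- 0:10

# P(X = k) = exp(-lambda) * lambda^k / k!
pmf_manual <- exp(-lambda) * lambda^k / factorial(k)
pmf_r      <- dpois(k, lambda)
all.equal(pmf_manual, pmf_r)  # TRUE

# Mean and variance of the distribution are both lambda
kk <- 0:200
mean_pois <- sum(kk * dpois(kk, lambda))
var_pois  <- sum((kk - mean_pois)^2 * dpois(kk, lambda))
c(mean_pois, var_pois)        # both 3
```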

Answer 62

A random variable has a Chi-square distribution if it can be written as a sum of squares of independent standard normal variables. Sums of this kind are encountered very often in statistics, especially in the estimation of variance and in hypothesis testing. (wiki) (watch youtube videos!)
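
The definition can be illustrated by simulation. This is a sketch only; the number of draws and df = 3 are arbitrary choices, and the simulated mean will be close to, not exactly, the theoretical value.

```r
set.seed(42)
k <- 3       # degrees of freedom = number of squared normals summed
n <- 10000   # number of simulated values

# Each row holds k independent standard normal draws
z <- matrix(rnorm(n * k), ncol = k)

# Sum of squares per row: one chi-square(k) value each
x <- rowSums(z^2)

mean(x)      # close to k, the mean of a chi-square distribution with k df
mean(rchisq(n, df = k))  # direct draws for comparison, also close to k
```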

Answer 63

If we cross-tabulate two randomly distributed variables (eg binomial, Poisson) → χ² = Σ_{1 ≤ j ≤ m} (n_jo – n_je)² / n_je; with Yates' continuity correction: χ²_yates = Σ_{1 ≤ j ≤ m} (|n_jo – n_je| – 0.5)² / n_je; n_jo = observed, n_je = expected count in cell j (m cells)
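
Both statistics can be computed by hand and compared with `chisq.test()`, whose `correct` argument switches the Yates correction on or off for 2x2 tables. A minimal sketch on invented counts:

```r
obs <- matrix(c(12, 8, 5, 15), nrow = 2)  # illustrative observed counts

# Expected counts under independence
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Plain chi-square and the Yates-corrected version
chi2       <- sum((obs - expd)^2 / expd)
chi2_yates <- sum((abs(obs - expd) - 0.5)^2 / expd)

chi2
chisq.test(obs, correct = FALSE)$statistic  # same as chi2
chi2_yates
chisq.test(obs, correct = TRUE)$statistic   # same as chi2_yates
```

The corrected value is always the smaller of the two, so the correction makes the test more conservative.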

Answer 64

> chisq.vals=rchisq(1000*1000,df=1) > hist(chisq.vals,col='light blue') > box() > abline(v=res$statistic,lwd=3,col='blue') > length(which(chisq.vals>res$statistic))/ + length(chisq.vals) [1] 1e‐05 (watch youtube videos!)
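
The simulated tail fraction approximates what `pchisq()` returns exactly. In the session above, `res` is presumably the result of an earlier `chisq.test()` call; the sketch below assumes a stand-in statistic of 6.0 with df = 1 instead.

```r
set.seed(1)
stat <- 6.0  # hypothetical chi-square statistic, df = 1

# Simulated p-value: share of random chi-square values above the statistic
sim   <- rchisq(1e5, df = 1)
p_sim <- mean(sim > stat)

# Exact upper-tail probability
p_exact <- pchisq(stat, df = 1, lower.tail = FALSE)

c(simulated = p_sim, exact = p_exact)  # the two agree closely
```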

Answer 65

Bernoulli, Binomial, Poisson

Answer 66

From a 2x2 contingency _table_ we can calculate the so-called _independence_ table, which holds the counts we would get if there is no relationship between our two variables. The deviations _observed_ minus _expected_ values can be used to calculate the _Pearson residuals_.

Answer 67

The formula to calculate the Pearson residuals for every cell of a contingency table is: (_observed_ - _expected_) / _sqrt(expected)_ The _chisq_ value calculation uses a similar formula (squared difference, without the sqrt) and sums up the values for every cell. _Higher_ values of this measure are more likely to produce low p-values than _lower_ values

Answer 68

mode, mosaicplot, assocplot

Answer 69

p-values, confidence intervals, significant, effect