5 | Introduction To Statistics Flashcards

Question

R: how do you get the median of the column *cm* in a data object *survey*? What if some cells are empty (NA)?

Answer 1

> median(survey$cm)
> median(survey$cm, na.rm=TRUE)
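A minimal sketch of the difference between the two calls, using a small made-up data frame as a stand-in for *survey* (the values are hypothetical, not the course data):

```r
# Hypothetical stand-in for the survey data; values are invented for illustration.
survey <- data.frame(cm = c(160, 172, NA, 181, 168))

with_na    <- median(survey$cm)                # NA: one missing value makes the result missing
without_na <- median(survey$cm, na.rm = TRUE)  # median of the four remaining values

print(with_na)
print(without_na)
```

Without `na.rm = TRUE`, a single empty cell is enough to turn the whole result into NA.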

Answer 2

returns:
- min
- 1st qu.
- median
- mean
- 3rd qu.
- max
- number of NAs in that dataset

Answer 3

> mean(survey$cm, na.rm=TRUE, trim=0.1)

trim=0.1 removes the upper and lower 10% of the values before the mean is computed.
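As a rough illustration of what trimming does, here is a sketch with ten invented heights plus one NA; the vector `cm` is an assumption made for the example:

```r
# Hypothetical heights in cm with one extreme value (250) and one missing value.
cm <- c(150, 160, 165, 170, 172, 175, 178, 180, 185, 250, NA)

plain   <- mean(cm, na.rm = TRUE)              # uses all ten non-missing values
trimmed <- mean(cm, na.rm = TRUE, trim = 0.1)  # drops the lowest and highest 10% (one value each)

print(plain)
print(trimmed)
```

With ten non-missing values, `trim = 0.1` discards exactly one value at each end, so the extreme 250 no longer pulls the mean upwards.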

Answer 4

min() max() sd()

Answer 5

> quantile(survey$cm, c(0.25,0.5,0.75), na.rm=TRUE)
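A small sketch of the same call on invented data (the vector `cm` is hypothetical), together with `IQR()`, which returns the distance between the 25% and 75% quantiles:

```r
# Hypothetical heights in cm; nine evenly spaced values and one NA.
cm <- c(150, 155, 160, 165, 170, 175, 180, 185, 190, NA)

q     <- quantile(cm, c(0.25, 0.5, 0.75), na.rm = TRUE)  # quartiles (default type 7)
iqr_x <- IQR(cm, na.rm = TRUE)                           # 75% quantile minus 25% quantile

print(q)
print(iqr_x)
```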

Answer 6

> summary(survey[,c(1:4)])

Answer 7

- mean: too sensitive to outliers!
- trimmed mean: less sensitive
- median: not sensitive, the most robust measure

--> use the trimmed mean or the median if there are outliers

Answer 8

- compute summary stats for data subsets
- very similar to tapply() but:
  - can also take a formula or a time series object as input
  - output is a data frame (df)

Answer 9

> aggregate(survey$cm, by=list(survey$gender), median, na.rm=TRUE)
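The following sketch compares the `by=` form, the formula form and `tapply()` on a tiny invented data frame; the data and the column names in the result are assumptions for illustration:

```r
# Hypothetical stand-in for the survey data.
survey <- data.frame(
  cm     = c(160, 165, 170, 180, 185, NA),
  gender = c("f", "f", "f", "m", "m", "m")
)

# by= interface: result is a data frame with a grouping column and a column named x
by_list <- aggregate(survey$cm, by = list(gender = survey$gender),
                     FUN = median, na.rm = TRUE)

# formula interface: rows with NA are dropped by default (na.action = na.omit)
by_formula <- aggregate(cm ~ gender, data = survey, FUN = median)

# tapply() computes the same medians but returns a named array, not a data frame
by_tapply <- tapply(survey$cm, survey$gender, median, na.rm = TRUE)

print(by_list)
print(by_formula)
print(by_tapply)
```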

Answer 10

```
> boxplot(survey$cm~survey$gender, col=c("red","blue"))
```

Answer 11

The whiskers extend from the box (the IQR) to the most extreme data point within 1.5 x IQR of it. Anything beyond this is drawn as an outlier.
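The 1.5 x IQR rule can be checked without drawing a plot. The sketch below uses `boxplot.stats()` on invented heights; note that R's boxplot works with hinges, which can differ slightly from the quartiles returned by `quantile()`:

```r
# Hypothetical heights in cm; 250 is far away from the rest.
cm <- c(150, 160, 165, 170, 172, 175, 178, 180, 185, 250)

bs <- boxplot.stats(cm, coef = 1.5)  # coef = 1.5 is the default whisker length

whisker_ends <- bs$stats[c(1, 5)]    # most extreme points still inside the fences
outliers     <- bs$out               # points beyond 1.5 x IQR from the box

print(whisker_ends)
print(outliers)
```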

Answer 12

Normalisation procedure
- data transformation
- data now have a mean of 0
- data now have an sd of 1
- for approximately normally distributed data, 95% of values have a z-score between -1.96 and +1.96

z = (x - x̄) / s

Answer 13

> table(survey$gender)
> table(survey$gender, survey$smoker)

NAs are removed automatically. Yes, you can.
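A short sketch of one-way and two-way tabulation on invented data. The data frame is hypothetical; `useNA = "ifany"` is the standard `table()` argument for showing the NAs that are otherwise dropped:

```r
# Hypothetical stand-in for the survey data, with one NA in each column.
survey <- data.frame(
  gender = c("f", "m", "f", NA, "m", "f"),
  smoker = c("no", "yes", "no", "no", NA, "yes")
)

one_way <- table(survey$gender)                   # NAs are dropped by default
with_na <- table(survey$gender, useNA = "ifany")  # adds a count for NA
two_way <- table(survey$gender, survey$smoker)    # cross-tabulation of two variables

print(one_way)
print(with_na)
print(two_way)
```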

Answer 14

Two ways:
> summary((survey$cm - mean(survey$cm, na.rm=TRUE)) / sd(survey$cm, na.rm=TRUE))
> summary(scale(survey$cm))
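To check that both ways give the same z-scores, with mean 0 and sd 1, here is a sketch on an invented vector (`cm` is hypothetical):

```r
# Hypothetical heights in cm with one missing value.
cm <- c(160, 172, NA, 181, 168)

z_manual <- (cm - mean(cm, na.rm = TRUE)) / sd(cm, na.rm = TRUE)
z_scale  <- as.numeric(scale(cm))  # scale() returns a one-column matrix; NAs stay NA

print(z_manual)
print(mean(z_manual, na.rm = TRUE))  # approximately 0
print(sd(z_manual, na.rm = TRUE))    # approximately 1
```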

Answer 15

The cut function is used in R to cut a numeric vector into bins of continuous values:
numeric --> categorical (factor)
A 'poor man's start' to get a first overview of the data.

Answer 16

categorical data

Answer 17

> cSize=cut(survey$cm,c(0,160,185,250))

(the values are breakpoints, so the intervals will be (0,160], (160,185], (185,250])

> levels(cSize)=c("dwarfs","normals","giants")

Answer 18

```
> cSize=cut(survey$cm,c(0,160,185,250),ordered_result=TRUE)
```
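A sketch of the binning on a few invented heights, showing which bin the boundary values 160 and 185 fall into. The vector `cm` is hypothetical; `labels=` and `ordered_result=` are the full argument names of `cut()`:

```r
# Hypothetical heights in cm, including the two breakpoints themselves.
cm <- c(150, 160, 170, 185, 190)

cSize <- cut(cm,
             breaks = c(0, 160, 185, 250),             # intervals are right-closed by default
             labels = c("dwarfs", "normals", "giants"),
             ordered_result = TRUE)                     # keep dwarfs < normals < giants

print(cSize)
print(table(cSize))
```

Because the intervals are closed on the right, a value of exactly 160 still counts as "dwarfs" and 185 as "normals".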

Answer 19

* Describe and summarize data
* Visualize trends for better understanding
* Make conclusions about a population based on analysis of the sample

Answer 20

* Decide whether two groups can be taken as different
* Decide if sample is different to total population
* Describe the relationship between two variables (proportional analysis)
* Decide if data difference is just a random one

Answer 21

* Statistics just estimates “degree” of randomness
* Statistics can’t tell you how likely it is that there is really a difference

Answer 22

* Sample items must be randomly selected (not a subset but a random selection!)
* Items must be independent from each other (selection of one item should not alter the chance of other items to be selected)
* Having a plan before entering the greenhouse
* More samples are better
* Groups: balanced sampling is better

Answer 23

* Cats falling from windows: information usually came from veterinarians → just those that survived falling! (= convenience sample)
* Roosevelt survey - telephone interviewing. Mostly just wealthy people had phones → Republicans
* Coronavirus: selective testing - no proper sampling of population!

Answer 24

* your cohort and topic might change over time (cancer, virus)
* true population is more diverse than the population you were sampling from
* using a convenience sample rather than a random sample (falling cats)
* your measured variable is just a proxy for another variable
* imprecise measurements (misunderstandings, wrong scale for some people)
* combination of different measurements required
* even with clear results: there is room for interpretation
* statistics can’t help against bad experimental design

Answer 25

* Accidental sampling (close to hand) → often very biased
* Simple random sampling
* Systematic sampling (eg every 10th person)
* Stratified sampling (making subgroups based on categories)
* Cluster sampling

Answer 26

* SRS
* Every element has the same chance of being selected for the sample (eg dice)
* Randomness might be a problem, especially for large populations or small samples
* Systematic or stratified sampling might overcome this
* Example: balancing sexes 50/50 or 60/40 (UP ratio)

Answer 27

* Arranging population by some ordering
* Selecting elements in regular intervals
* Aka interval sampling

Example:
* Every kth person, but don’t start at the beginning

Issues:
* Vulnerable to periodicities, eg every 10th house is on a street crossing …

Answer 28

* You know categories in your data eg male, female, old, young…
* Sample from each category according to the distribution of those categories in your population
* Use one of the methods above for sampling within the strata
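The three sampling schemes described in the cards above (simple random, systematic, stratified) could be sketched in R roughly as follows. The population, its 60/40 split, the interval `k` and the sampling fraction are all invented for the example:

```r
set.seed(1)  # only to make the example reproducible

# Hypothetical population of 100 people with a 60/40 split between two groups.
population <- data.frame(id  = 1:100,
                         sex = rep(c("f", "m"), times = c(60, 40)))

# Simple random sampling: every element has the same chance of selection.
srs <- population[sample(nrow(population), 10), ]

# Systematic sampling: every k-th element, starting at a random position.
k <- 10
start <- sample(k, 1)
systematic <- population[seq(start, nrow(population), by = k), ]

# Stratified sampling: draw 10% within each stratum, so the 60/40 ratio is kept.
frac <- 0.1
strata <- split(population, population$sex)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), round(frac * nrow(s))), ]
}))

print(table(srs$sex))         # ratio varies from draw to draw
print(table(stratified$sex))  # always 6 f and 4 m
```

The simple random sample can miss the population ratio by chance, which is the problem that stratification is meant to address.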

Answer 29

* correlational research
* experimental research

Answer 30

* Just observe what happens in nature
* we don’t manipulate a variable
* Example: does reading books improve learning?
* we just collect answers about reading behaviour

Answer 31

* we manipulate a variable
* we divide our sample into two groups
* one group must read statistics books
* other group is not allowed to read statistics books
* after a month we summarize

Answer 32

It's difficult to research experimentally for some reasons:
- ethical reasons
- financial reasons

Answer 33

* Measurement levels
* Categorical (qualitative) vs numerical

Answer 34

* Uni-, bi-, or multivariate data

Answer 35

* Nominal: eg gender, smoker, protein structures, nucleotides etc
* Ordinal: age (young, medium, old), grade, month etc

Answer 36

* Discrete: eg age (1,2,3,4..), height (eg 100, 101, 102 cm), length of helices
* Continuous: eg height (eg 100.100… cm), weight (79.998… kg)

Answer 37

A categorical level of measurement that doesn't have any order, eg:
- gender: female, male (binary)
- smoker: yes, no (binary)
- protein structures: H, E, …
- nucleotides: A, C, G, T, U

Answer 38

A categorical level of measurement with a specified order, eg:
- age: young, medium, old
- grade: 1,2,3,4,5
- month: 01..12(?)

Answer 39

A numerical level of measurement with discrete numbers, eg:
- age: 6, 8, 84
- height: 112, 176, 161 cm
- number of helices per 1000 AA
- length of helices

Answer 40

A numerical level of measurement, non-discrete, eg:
- weight: 79.99 kg, 72 kg, …
- height: 12.2, 12.5, 15.0

Answer 41

Can you calculate a mean?
- Yes --> is the mean always a possible value?
  - Yes --> continuous numerical
  - No --> discrete numerical
- No --> is there a logical order of values?
  - Yes --> ordinal categorical
  - No --> nominal categorical

Answer 42

how many variables there are in the statistical problem

Answer 43

* Different group number and distribution → different statistical tests to use

Answer 44

* Mean – sensitive to outliers
* Trimmed mean – less sensitive to outliers
* Median – not sensitive to outliers
* Mode – most frequent value; more relevant to categorical data

Answer 45

* independent variable

Answer 46

* Gender
* Mode: most frequent value
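Base R's `mode()` reports the storage mode of an object, not the most frequent value, so the statistical mode needs a small helper. The function name `stat_mode` and the data below are hypothetical, a minimal sketch rather than a standard implementation:

```r
# Hypothetical helper: returns the most frequent value(s) of a vector as character.
stat_mode <- function(x) {
  counts <- table(x)                                  # frequency of each distinct value
  names(counts)[as.vector(counts) == max(counts)]     # all values that reach the maximum
}

gender <- c("f", "m", "f", "f", "m")  # invented nominal data
most_frequent <- stat_mode(gender)

print(most_frequent)
print(mode(gender))  # "character": the storage mode, not the statistical mode
```

If several values tie for the highest count, this sketch returns all of them.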

Answer 47

* Age (young, medium, old)
* Mode is most appropriate
* (Median, mean could also possibly be used if the data can be converted to numerical)

Answer 48

* Numerical data – both discrete and continuous

Answer 49

* Max, min, IQR, standard deviation, CV
* (To describe the population: SEM, CI, p-value)

Answer 50

* Sample → characterised by statistic
* Population → characterised by parameter
* We use the sample to estimate the parameters of the population
* parameters and statistics have the same name but mean different things
* Usually: Latin letters for sample, Greek letters for population

Answer 51

How can we look at the dispersion of a sample?
* SD
* Quartiles
* IQR = the range covering the middle 50% of the data, around the median (the box of a boxplot is the IQR)

Answer 52

What is a z-score?
* A transformation → normalisation procedure
* New mean: 0
* New sd: 1
* For approximately normally distributed data, 95% of values have a z-score between -1.96 and +1.96 …

Answer 53

z = (x - x̄) / s

Answer 54

* converts a numeric or character vector into a factor with levels
* converts a vector into a factor, preserving its categorical nature.
* Factors are essential for handling categorical data in R.

Answer 55

* Returns the mean of data, removing any NAs and trimming off outliers.

Answer 56

* median()
* min()
* max()
* sd()

Answer 57

* Returns the median (value at 50%, which divides the sample in 2)
* Along with the other quartiles (25%, 75%)

Answer 58

* Numerical data: returns min, max, quartiles, mean, number of NAs
* Categorical data: returns count of groups, NAs

Answer 59

* Trick question – there isn’t one
* But it can easily be implemented

Answer 60

* Syntax: aggregate(x, by, FUN, ...)
* Applies a function to subsets of data, typically grouped by one or more factors or variables.
* The data in x is divided into groups based on the by variable(s).
* The specified function FUN is applied to each subset of the data.
* Result is a summary table (df) showing grouped values, computed summaries.

Answer 61

* Tabulates data to give summary (no NAs)
* Table is a matrix so relevant functions can be applied
* useful mostly for categorical data
* (numerical data can be transformed into categorical data and that way tabulated as well)

Answer 62

* Categorical data

Answer 63

* cut() function, assign levels with the levels() function
* Example:
  * csize = cut(survey$cm, c(0,160,185,250)) [add "ordered_result=TRUE" to keep the order]
  * levels(csize) = c("dwarfs","normal","giants")

Answer 64

* nominal factors: can only use == or !=
* ordered factors: can also use comparison operators such as <, <=, >, >=
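A brief sketch of the difference, using invented size categories (the level names and data are hypothetical):

```r
size_levels <- c("small", "medium", "large")       # hypothetical categories
sizes <- c("small", "large", "medium", "large")

nominal <- factor(sizes, levels = size_levels)                  # unordered factor
ordinal <- factor(sizes, levels = size_levels, ordered = TRUE)  # ordered factor

is_large   <- nominal == "large"    # equality works for any factor
at_least_m <- ordinal >= "medium"   # order comparison works for ordered factors

# For a nominal factor, '<' is not meaningful: R warns and returns NA.
nominal_cmp <- suppressWarnings(nominal < "large")

print(is_large)
print(at_least_m)
print(nominal_cmp)
```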

Answer 65

The goal of _descriptive_ statistics is to summarize and describe the _sample_, whereas the goal of _inferential_ statistics is to draw conclusions about the _population_. We use sample _statistics_ to estimate unknown _parameters_ of the population.


(95 cards)