Statistics Flashcards

1
Q

What is discrete data?

A

Data that can take only specific, separate values, typically whole-number counts with no decimals or fractions.

For example, the number of children in a classroom (you can’t have a fractional number of children).

2
Q

What is continuous data?

A

Data that can take any value within a range, including decimals.

E.g. height, weight

3
Q

What is a measure of central tendency?

A

A single value that describes where the center of the data falls (e.g. the mean, median or mode)

4
Q

What does it mean when data is skewed?

What is a positive skew and what is a negative skew? Draw them both

A

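Data can be “skewed”, meaning it tends to have a long tail on one side or the other. A positive skew has its long tail on the right (towards higher values); a negative skew has its long tail on the left (towards lower values).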

5
Q

When do you use mean as a measure of central tendency? When do you not use the mean as a measure of central tendency?

A

when your data is quantitative and its distribution is continuous and symmetrical; the mean uses every piece of the data

Not:

  • when you have extreme values (outliers)
  • when you have skewed data
6
Q

When do you use the median as a measure of central tendency?

A
  • when the data is quantitative
  • when there are extreme values (outliers), as these have little effect on the median
7
Q

When do you use the mode as a measure of central tendency?

A

used for nominal data: data that can be labelled or classified into mutually exclusive categories within a variable.

These categories cannot be ordered in a meaningful way.

For example, for the nominal variable of preferred mode of transportation, you may have the categories of car, bus, train, tram or bicycle.

8
Q

What are the advantages and disadvantages of box plots?

A

Pros:

  • helps us to see the spread of data more easily
  • plot is clear and easy to understand
  • it uses range and median values
  • it is easy to compare the stratified data

Cons:

  • Original data is not clearly shown in the box plots
  • mean and mode cannot be identified using the box plots
  • it is easily misinterpreted
  • if large outliers are present, the box plot is more likely to give an incorrect representation
9
Q

When is the regression line a valid model?

A

when the data shows linear correlation

stronger correlation = more reliable estimates

when estimating the DEPENDENT variable (y-coordinate) from the independent variable (x-coordinate)
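
As an illustration, here is a minimal sketch of fitting a regression line y = a + bx and using it to estimate the dependent variable. It assumes the standard least-squares formulas b = Sxy / Sxx and a = (mean of y) - b × (mean of x), which the cards do not state. The function name and the data are hypothetical.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxx: sum of squared deviations of x; Sxy: sum of cross deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx               # gradient
    a = mean_y - b * mean_x     # intercept
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = regression_line(xs, ys)

# Estimate the dependent variable y for an x inside the observed range
estimate = a + b * 3.5
```

With real data the points will not lie exactly on the line, and an estimate is only as trustworthy as the linear correlation is strong.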

10
Q

What is a census? Name the pros and cons

A

A census observes or measures every member of a population

Pros:

  • should give a completely accurate result

Cons:

  • time-consuming
  • expensive
  • cannot be used when the testing process destroys the item (it would destroy the population)
  • hard to process the large quantity of data
11
Q

What does a sampling frame mean?

A

the source material or device from which a sample is drawn, typically a list of all the sampling units in the population

12
Q

What are the three METHODS of random sampling?

Give definitions

A
  • Simple random sampling = every possible sample of size n has an equal chance of being selected (uses a sampling frame)
  • Systematic sampling = the required elements are chosen at regular intervals from an ordered list
  • Stratified sampling = the population is split into strata (groups) and a random sample is taken from each
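
The three methods can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names, the sampling frame of 100 numbered members and the two strata are all hypothetical, and the stratified version assumes the stratum sizes divide up cleanly (rounding could otherwise make the total differ from n).

```python
import random


def simple_random_sample(frame, n, rng):
    """Every possible sample of size n is equally likely."""
    return rng.sample(frame, n)


def systematic_sample(frame, n, rng):
    """Take every k-th element of the ordered list from a random start."""
    k = len(frame) // n          # sampling interval
    start = rng.randrange(k)     # random starting point in the first interval
    return [frame[start + i * k] for i in range(n)]


def stratified_sample(strata, n, rng):
    """Take a random sample from each stratum, in proportion to its size."""
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        size = round(len(members) / total * n)
        sample.extend(rng.sample(members, size))
    return sample


rng = random.Random(0)                  # seeded so the run is repeatable
frame = list(range(1, 101))             # hypothetical sampling frame of 100 IDs
strata = {"year_12": frame[:60], "year_13": frame[60:]}  # hypothetical strata

srs = simple_random_sample(frame, 10, rng)
sys_sample = systematic_sample(frame, 10, rng)
strat = stratified_sample(strata, 10, rng)
```

Here the stratified sample takes 6 members from the stratum of 60 and 4 from the stratum of 40.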
13
Q

What is the equation for the number sampled in strata?

A

number sampled in a stratum = (number in stratum ÷ number in population) × overall sample size
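
A worked example with hypothetical numbers: suppose a population of 200 contains a stratum of 50, and the overall sample size is 40. The number to sample from that stratum would then be:

```latex
\[
\frac{\text{number in stratum}}{\text{number in population}} \times \text{overall sample size}
= \frac{50}{200} \times 40 = 10
\]
```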

14
Q

Why is random sampling useful?

A

it helps to remove bias from the sample, because the researcher does not choose who or what is selected

15
Q

What are the pros and cons of simple random sampling?

A
16
Q

What are the pros and cons of systematic sampling?

A
17
Q

What are the pros and cons of stratified sampling?

A
18
Q

What are the two types of NON-random sampling?

What are the definitions?

A
  • Quota sampling = an interviewer/researcher selects a sample that reflects the characteristics of the whole population
  • Opportunity (or convenience) sampling = taking the sample from people who are available at the time the study is carried out and who fit the criteria you are looking for.
19
Q

What are the pros and cons of Quota sampling?

A
20
Q

What are pros and cons of opportunity sampling?

A
21
Q

What are the types of data/variables?

(data is interchangeable with variables)

A
  • Quantitative data = data/variables associated with numerical observations
  • Qualitative data = data/variables associated with non-numerical observations
  • Continuous variable = can take any value in a given range
  • Discrete variable = can take only specific values in a given range (no decimals)
22
Q

What is meant by population?

A

a set of data that you can take a sample of

the whole set of items that are of interest

23
Q

When do you extend the lower and upper class bounds by 0.5?

A

when there are gaps between the classes (the classes do not overlap or meet)

e.g. 10

24
Q

What are the equations for the variance and standard deviation for raw data?

A
25
Q

What does Sxx mean?

What is the variance in terms of Sxx?

A

$S_{xx} = \sum (x_i - \bar{x})^2$, the sum of the squared deviations from the mean
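
A minimal sketch of computing Sxx for a small data set and deriving the variance and standard deviation from it. It assumes the convention variance = Sxx / n (dividing by the number of values); some courses and software divide by n - 1 for a sample instead. The function name and the data are hypothetical.

```python
import math


def sxx(xs):
    """Sum of squared deviations from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical raw data, mean = 5
s_xx = sxx(data)                  # 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32
variance = s_xx / len(data)       # assumption: divide by n
std_dev = math.sqrt(variance)     # standard deviation = square root of variance
```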

26
Q

What is the equation for the mean and standard deviation for frequency tables?

A
27
Q

What does strata mean?

A

the groups that a population is split into, based on similar characteristics

28
Q

What does a statistic mean?

A

a random variable that is a function of a sample which contains no unknown quantities/parameters

29
Q

Inferential statistics - definition

A

methods of making decisions and predictions about a population based on a sample selected from the population.

30
Q

Sample- definition

A

A sample provides a set of data values of a random variable, drawn from all such possible values. A sample is a subset of the target population.

31
Q

Parameter - definition

A

a numerical summary of the population, examples are the population mean, and the population standard deviation

Population parameters are denoted using the Greek alphabet.

32
Q

Proportional stratified sampling - definition

A

Stratified sampling in which the frequencies for each group in the sample are proportional to the frequencies for each group in the population

33
Q

sampling distribution - definition

A

all possible values of a statistic together with their associated probabilities

34
Q

What is a residual?

A

residual = observed value - predicted value

data above the line = positive residual

data below the line = negative residual
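
A short sketch of computing residuals for a given regression line y = a + bx. The line (a = 1, b = 2) and the data points are hypothetical, chosen only to show the sign of each residual.

```python
def residuals(xs, ys, a, b):
    """Observed y minus the y predicted by the line y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


xs = [1, 2, 3]
ys = [4, 4, 8]                       # observed values
res = residuals(xs, ys, a=1, b=2)    # predicted values are 3, 5, 7

# Points above the line have positive residuals, points below have negative ones
above_line = [r > 0 for r in res]
```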

35
Q

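What letter represents the correlation coefficient?

What is the correlation coefficient for positive, negative and no correlation?

A

r

perfect positive correlation: r = 1 (positive correlation in general: 0 < r ≤ 1)

perfect negative correlation: r = -1 (negative correlation in general: -1 ≤ r < 0)

no linear correlation: r = 0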
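
A minimal sketch of calculating r, assuming the product moment correlation coefficient formula r = Sxy / sqrt(Sxx × Syy), which the card does not state. The function name and the three small data sets are hypothetical.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4]
r_pos = pmcc(xs, [2, 4, 6, 8])    # points on a rising straight line
r_neg = pmcc(xs, [8, 6, 4, 2])    # points on a falling straight line
r_none = pmcc(xs, [1, 3, 3, 1])   # no linear trend
```

The first two data sets give values of r at (or within rounding of) 1 and -1, and the third gives 0.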

36
Q

What is bivariate data?

What is it represented on?

A

data which has pairs of values for two variables

represented on scatter diagrams

37
Q

What does an explanatory variable and dependent variable mean?

A

explanatory variable = independent variable

dependent = response variable

38
Q

When can you use the binomial distribution?

A

When there is:

  • a fixed number of trials
  • two possible outcomes for each trial (success and failure)
  • a fixed probability of success
  • independence between trials
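
When these conditions hold, probabilities can be calculated as in the sketch below. It assumes the standard binomial probability formula P(X = k) = C(n, k) × p^k × (1 - p)^(n - k), which the card does not state. The function name and the coin-toss scenario are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials, each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical scenario: 10 tosses of a fair coin, "success" = heads
p_three_heads = binomial_pmf(3, 10, 0.5)

# The probabilities over all possible numbers of successes sum to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```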
39
Q

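What is the definition of a significance level?

A

the probability of incorrectly rejecting the null hypothesis, i.e. the probability that the test statistic falls in the critical region when the null hypothesis is true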

40
Q

State the assumption involved with using class midpoints to calculate an estimate of a mean from a grouped frequency table.

A

The spread of data values inside each class is evenly distributed around the midpoint.
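
A minimal sketch of the midpoint estimate itself, using a hypothetical grouped frequency table. Each class is treated as if all of its values sat at the class midpoint, which is where the assumption above comes in.

```python
# Hypothetical grouped frequency table: (lower bound, upper bound, frequency)
classes = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]

total_freq = sum(freq for _, _, freq in classes)

# Multiply each class midpoint by its frequency, then divide by the total frequency
estimated_mean = sum((lo + hi) / 2 * freq for lo, hi, freq in classes) / total_freq
```

The midpoints here are 5, 15 and 25, so the estimate is (15 + 75 + 50) / 10 = 14.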

41
Q

When do you use the normal distribution? What are the assumptions of the normal distribution?

A

Used for large sets of quantitative data

Symmetry: The distribution is symmetrical around its mean. This means that the mean, median, and mode are all equal and located at the center of the distribution.

Bell-shaped Curve: The distribution has a bell-shaped curve, meaning that it has a single peak and tails that extend indefinitely in both directions.

Independence: The observations or measurements are assumed to be independent of each other, so that one value does not influence the probability of another.

Continuous Data: The normal distribution is appropriate for continuous data, where the values can take any real number. It may not be suitable for discrete or categorical data.

42
Q

When can you not use normal distribution?

A

Skewed Data: If your data is significantly skewed, meaning it is asymmetric with a long tail on one side, the normal distribution may not accurately represent the underlying distribution.

Outliers: When your data contains outliers, extreme values that are significantly different from the majority of the observations, the normal distribution may be sensitive to these outliers. Outliers can strongly influence the mean and standard deviation.

Categorical or Discrete Data: The normal distribution is suited for continuous data, where values can take any real number. However, if your data is categorical (e.g., yes/no, red/blue/green) or discrete (e.g., counts or whole numbers), the normal distribution is not applicable.

Small Sample Sizes: When you have a small sample size, the assumption of normality may be difficult to verify, and the distribution of your data may deviate from the normal distribution.

Non-Linear Relationships

43
Q

When can you not use the binomial distribution?

A

Multiple Outcomes: There are more than two possible outcomes, instead of purely success or failure.

Dependent Trials: The binomial distribution assumes that each trial is independent of the others.

Continuous Data: The binomial distribution is designed for discrete data, where the outcomes are counted or represented as whole numbers.

Large Samples: A large number of trials does not by itself invalidate the binomial distribution, but the trials must still be independent with the same probability of success. Independence can fail when, for example, a large sample is drawn without replacement from a small population. For large samples where the conditions do hold, the normal distribution or the Poisson distribution is often used as an approximation.

Small Samples: The binomial distribution assumes a fixed number of independent trials. With a small sample there may not be enough data to estimate the underlying probability of success (p) accurately, which leads to unstable estimates and imprecise results.