Week 1 ML Flashcards

1
Q

Scientific Approach

A
  • Systematic pursuit of knowledge
  • Logical steps: Problem - Hypotheses
  • Data collection: Observation of behaviour or experimentation
  • Test hypotheses and draw conclusions
2
Q

Research Methods

A

The systematic approach to answering questions

3
Q

Statistics

A
  • Numbers that summarise observations
  • Mathematical procedures to produce those numbers

4
Q

Scientific Method

A
  • Theory
  • Hypothesis
  • Exp/Observe (research methods)
  • Evidence (statistics)
  • Theory
5
Q

Testing hypothesis

A
  • Reproducible observation of the hypothesised effect in action
  • Controlled (reproducible) circumstances
  • Empirically observed
  • Variables measured and/or controlled
  • Alternative explanations controlled/eliminated
  • Observation/interpretation is unbiased
6
Q

The scientist practitioner model

A

Furthers understanding through research
• Consumers of research
• Evidence-based practice
• Inform own practice and methodology

7
Q

Scientific Enquiry

A
  1. Choose something to observe
  2. Choose method of observation
  3. Describe observations
  4. Identify variation in observations
  5. Explain variations
8
Q

Types of data

A

Categorical

  • > Nominal
  • > Ordinal

Continuous

  • > Interval
  • > Ratio
9
Q

Nominal data

A

Refers only to identity information: values are ascribed that have no inherent order or magnitude.
For example, gender, nationality, or the number assigned to a competitor in a race are all types of nominal data.
-> names or labels of things, with no order or magnitude

10
Q

Ordinal data

A

Describes identity and also has magnitude (order).
For example, medal positions in a race are ordinal data. They have a sequential order (the gold medalist beat the silver medalist, who beat the bronze medalist), but this measure tells us nothing about the interval between competitors; they are simply categorised as 1 - 2 - 3.
-> data that is ordered without fixed intervals

11
Q

Interval data

A

A continuous type of variable that measures identity and magnitude, with fixed intervals between units of measurement.
For example, temperature (in Celsius or Fahrenheit). Here, the difference between 20 degrees and 30 degrees is the same as between 60 degrees and 70 degrees. We can order our data points by magnitude as we do with ordinal data, but we can also quantify the amount of difference between data points.
-> ordered data with fixed intervals between values, but no true zero

12
Q

Ratio data

A

Has identity, magnitude, fixed intervals and a true zero.
For example, the time taken to run a race cannot be a negative value.
-> e.g. time, height: there are fixed intervals and a “true zero” below which the data cannot fall

13
Q

Descriptive statistics

A
  • Each observation is a “Datum”: Plural is “Data”
  • A bunch of data is often called a “Data Set”
  • Different types of data are analysed in different ways
  • Most basic description is how frequently similar observations occurred
  • Easiest description to follow is a picture
14
Q

Data through pictures

A
  • Bar graph
  • Line graph
  • Pie chart
  • Scatter Plot
15
Q

Categorical data through pictures

A
  • Pie chart
  • Bar graph

16
Q

Continuous data through pictures

A
  • Histogram
  • Box plot

17
Q

Histogram components

A
  • Title
  • X label and data (bins/classes of variable)
  • Y label and data (frequency/total)
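
As a rough sketch of how the x-axis bins and y-axis frequencies of a histogram relate, the snippet below groups values into equal-width bins and counts them. The ages, the bin width of 10 and the helper name histogram_counts are all made up for illustration.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Map each bin's lower edge (x axis) to its frequency (y axis)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

ages = [18, 19, 21, 22, 22, 25, 31, 34, 47]  # hypothetical data
bin_width = 10                               # assumed bin width
counts = histogram_counts(ages, bin_width)

# Text version of the picture: one row per bin, one '#' per observation
for start, freq in counts.items():
    print(f"{start}-{start + bin_width - 1}: {'#' * freq}")
```

Changing the bin width changes the picture, so the bin choice is worth stating alongside the axis labels.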
18
Q

Negative skew

A

Skewed to the left: the longer tail points towards the lower end of the x axis.
There are more points towards the higher end of the x axis.
-> right side higher

19
Q

Positive skew

A

Skewed to the right: the longer tail points towards the higher end of the x axis.
There are more points towards the lower end of the x axis.
-> left side higher
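
One way to check the direction of skew with a number rather than a picture is the moment coefficient of skewness, which is positive for a right tail and negative for a left tail. This is a minimal sketch: the formula is a standard one that these cards do not name, and the data sets are invented.

```python
from statistics import mean, pstdev

def skewness(values):
    """Moment coefficient of skewness: mean cubed deviation / SD cubed."""
    m = mean(values)
    sd = pstdev(values)  # population SD (assumption: no sample correction)
    return mean([(x - m) ** 3 for x in values]) / sd ** 3

right_tailed = [1, 2, 2, 3, 3, 3, 10]         # hypothetical: most values low
left_tailed = [11 - x for x in right_tailed]  # mirror image: most values high

print(skewness(right_tailed))  # positive skew
print(skewness(left_tailed))   # negative skew
```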

20
Q

Kurtosis

A
  • Kurtosis refers to the ‘peakedness’ of a distribution, the relative concentration of values at the centre, tails or shoulders of the histogram.
  • Normal distributions have the distinctive bell shaped curve.
  • Distributions with positive kurtosis have an overly high concentration of values at the centre, giving a pronounced central peak.
  • Distributions with negative kurtosis are much flatter, and have a less distinctive peak.
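
"Positive" and "negative" kurtosis are usually read relative to the normal distribution, so a common convention (assumed here, the cards do not state a formula) is excess kurtosis: the fourth moment divided by the squared variance, minus 3. The two small data sets are invented to show one peaked and one flat shape.

```python
from statistics import mean

def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (normal = 0)."""
    m = mean(values)
    m2 = mean([(x - m) ** 2 for x in values])
    m4 = mean([(x - m) ** 4 for x in values])
    return m4 / m2 ** 2 - 3

peaked = [0, 5, 5, 5, 5, 5, 5, 5, 10]  # hypothetical: values piled at centre
flat = [1, 2, 3, 4, 5]                 # hypothetical: evenly spread values

print(excess_kurtosis(peaked))  # above 0: pronounced central peak
print(excess_kurtosis(flat))    # below 0: flatter than a normal curve
```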
21
Q

Number of peaks

A

1: unimodal
2: bimodal
3+: multimodal

22
Q

Spread

A

Distributions can give the histogram a thinner or thicker appearance. A thin histogram tells us that our participants vary very little; a thick one tells us that they vary more. As with shape, we can use pictures together with numbers to describe the spread of data.

23
Q

Deviations from pattern

A
  • Outliers
24
Q

Shape

A
  • skew
  • kurtosis
  • number of peaks
25
Q

Deviations from mean

A

above the mean
-> positive deviation

below the mean
-> negative deviation

*sum of the deviations is equal to zero
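
A quick check of the zero-sum property, using made-up scores: subtract the mean from each score and add up the results.

```python
from statistics import mean

scores = [2, 4, 4, 6, 9]  # hypothetical scores, chosen so the mean is 5
m = mean(scores)

# Positive entries sit above the mean, negative entries below it
deviations = [x - m for x in scores]
total = sum(deviations)

print(deviations)  # [-3, -1, -1, 1, 4]
print(total)       # 0
```

With a mean that is not a whole number, floating-point rounding can leave a total that is very close to, rather than exactly, zero.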

26
Q

Skewed distribution reporting

A
  • do not use the mean because it is affected by outliers, whereas the median is a resistant measure of centre
  • mode does not always report centre accurately
  • typically report using median
27
Q

Which average to use

A

categorical: mode

continuous and skewed: median

continuous and normal dist: mean
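
The rule of thumb above might look like this in practice. All three data sets are invented; the income list includes one very large value to show why the median is preferred for skewed data.

```python
from statistics import mean, median, mode

colours = ["red", "blue", "blue", "green"]  # categorical (hypothetical)
incomes = [20, 22, 25, 27, 300]             # continuous, skewed (hypothetical)
heights = [160, 165, 170, 175, 180]         # continuous, symmetric (hypothetical)

typical_colour = mode(colours)    # most frequent category
typical_income = median(incomes)  # middle value, resistant to the 300
income_mean = mean(incomes)       # pulled upwards by the 300
typical_height = mean(heights)    # mean is fine for symmetric data

print(typical_colour, typical_income, income_mean, typical_height)
```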

28
Q

Spread

A
  • variability
  • minimum and maximum
  • quartiles
29
Q

Variability

A

when people are quite similar to each other - we have low variability in scores
when people are quite different from each other - there is high variability in the data.

30
Q

Minimum and maximum

A

probably the easiest measure of spread to calculate. Quite simply, the lowest and highest values. This gives an indication of the range over which the scores occur
- not robust to outliers

31
Q

Quartiles

A
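  • arrange the data from the lowest to the highest value and split it into quarters, with the median in the centre
  • Q1 (lower quartile), Q2 (the median) and Q3 (upper quartile) are the three cut points: the lowest two quarters of the data lie below the median and the highest two quarters lie above it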
  • fairly robust to outliers, a useful measure of spread
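
A small sketch of the quartile cut points using Python's statistics.quantiles. The data are invented, and the function's default "exclusive" method is only one of several conventions, so other tools may give slightly different Q1 and Q3 values for the same data.

```python
from statistics import median, quantiles

data = [7, 1, 5, 3, 2, 6, 4]  # hypothetical scores; order does not matter

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle half

print(q1, q2, q3, iqr)
print(q2 == median(data))  # the second quartile is the median
```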
32
Q

Standard Deviation

A
  • a rough measure of the average amount by which scores deviate from the mean
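  • the majority of the data will fall within one standard deviation of the mean (about 68% for a normal distribution)
  • only a small minority of the data will lie more than two standard deviations from the mean (about 5% for a normal distribution)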
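
To make the "within one standard deviation" idea concrete, the sketch below counts how many made-up scores fall inside that band. It uses the population formula (pstdev) as an assumption; statistics.stdev, which divides by n - 1, is the usual choice when the data are a sample.

```python
from statistics import mean, pstdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
m = mean(scores)
sd = pstdev(scores)  # population SD (assumption, see note above)

within_one_sd = [x for x in scores if abs(x - m) <= sd]
beyond_two_sd = [x for x in scores if abs(x - m) > 2 * sd]
share_within_one = len(within_one_sd) / len(scores)

print(m, sd)             # 5 2.0
print(share_within_one)  # 0.75
print(beyond_two_sd)     # []
```

Small data sets like this one will not match the normal-distribution percentages exactly.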
33
Q

Keeping outliers

A

Reason: Can’t pick and choose your data
Pro: You are considering all of your data set
Con: Your measure of the mean and standard deviation will be overly influenced by the outlier

34
Q

Disregarding outliers

A

Reason: This individual is not representative
Pro: Your mean and standard deviation are less influenced by one individual
Con: Your statistics will not apply to all of the individuals
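
The trade-off between keeping and disregarding an outlier can be seen by computing the mean and standard deviation both ways. The scores below are invented, with 50 standing in for the unrepresentative individual.

```python
from statistics import mean, stdev

scores = [10, 12, 11, 13, 12, 50]  # hypothetical; 50 is the outlier
without_outlier = [x for x in scores if x != 50]

mean_all, sd_all = mean(scores), stdev(scores)
mean_trimmed, sd_trimmed = mean(without_outlier), stdev(without_outlier)

print(mean_all, sd_all)          # both inflated by the single outlier
print(mean_trimmed, sd_trimmed)  # describes the remaining five only
```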

35
Q

Pie charts

A
  • Divide up circles (the pie) into different areas (the slices).
  • Each slice represents the percentage values or counts for categorical variables
  • The size of each slice of pie should be proportional to the percentage or count
  • difficult to make by hand
  • work for 10 categories or fewer
  • useful for proportional categorical data
36
Q

Bar graph

A
  • can show data across more than one variable
  • can show error bars for SD
  • difficult with many categories
  • difficult with categories close in value
  • useful for counts (frequency) or summary of data
37
Q

Stem and leaf plots

A
  • similar to histograms
  • tens digit on the left (the stem), with the units digits listed after it (the leaves)
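
A minimal sketch of building a stem and leaf plot, assuming non-negative whole numbers below 100 so that each value splits cleanly into a tens digit and a units digit. The function name and the data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group units digits (leaves) under their tens digit (stem)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

plot = stem_and_leaf([12, 15, 21, 23, 23, 30, 8])  # hypothetical data

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Read sideways, the rows of leaves give the same shape as a histogram with a bin width of 10.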

38
Q

Box plots

A
  • box and whisker plots
  • can show distribution of continuous variable
  • show median surrounded by box of Q1 and Q3
  • “whiskers” can be the min and max points
  • more typically, “whiskers” are the min and max points that lie within the lower and upper fences
  • lower and upper fences are Q1 - 1.5 x IQR and Q3 + 1.5 x IQR, where IQR = Q3 - Q1
  • can show skew through where the median line sits
    • > median line lower when positive skew
    • > median line higher when negative skew
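
The fence rule can be sketched as follows, with invented data. Quartiles come from statistics.quantiles with its default method, which is an assumption: plotting tools may compute Q1 and Q3 slightly differently.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 30]  # hypothetical; 30 looks suspicious

q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers stop at the most extreme points still inside the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(lower_fence, upper_fence)   # -4.5 13.5
print(outliers)                   # [30]
print(whisker_low, whisker_high)  # 1 7
```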
39
Q

Showing data

A
  • shape of data - this is best achieved using pictures, such as histograms of the data.
  • describing the centre and spread of data is best achieved using numbers, such as means, medians and modes for centre, and minimum and maximum values, quartiles, variance and standard deviations for spread.
  • describing deviations from what is ‘normal’ in our sample requires both pictures (to show where the data is) and numbers (to describe how it deviates, based on rules such as the interquartile range rule).