Week 1 ML Flashcards
Scientific Approach
- Systematic pursuit of knowledge
- Logical steps: Problem - Hypotheses
- Data collection: Observation of behaviour or experimentation
- Test hypotheses and draw conclusions
Research Methods
The systematic approach to answering questions
Statistics
- Numbers that summarise observations
- Mathematical procedures to produce those numbers
Scientific Method
- Theory
- Hypothesis
- Exp/Observe (research methods)
- Evidence (statistics)
- Theory
Testing a hypothesis
- Reproducible observation of the hypothesised effect in action
- Controlled (reproducible) circumstances
- Empirically observed
- Variables measured and/or controlled
- Alternative explanations controlled/eliminated
- Observation/interpretation is unbiased
The scientist practitioner model
Furthers understanding through research
- Consumers of research
- Evidence-based practice
- Inform own practice and methodology
Scientific Enquiry
- Choose something to observe
- Choose method of observation
- Describe observations
- Identify variation in observations
- Explain variations
Types of data
Categorical
-> Nominal
-> Ordinal
Continuous
-> Interval
-> Ratio
Nominal data
refers only to identity information; that is, the values assigned have no inherent order or magnitude.
For example, gender, nationality, or the number assigned in a race are all types of nominal data.
-> labels that name things, with no order or magnitude
Ordinal data
describes identity and also has magnitude (order).
For example, medal positions in a race are ordinal data. They have a sequential order (the gold medalist beat the silver medalist, who beat the bronze medalist), but this measure doesn’t tell us anything about the interval between competitors; they are simply categorised as 1 - 2 - 3.
-> data that is ordered without fixed intervals
Interval data
a continuous type of variable, measuring identity and magnitude, with fixed intervals between units of measurement.
For example, temperature. Here, the difference between 20 degrees and 30 degrees is the same as between 60 degrees and 70 degrees. We can order our data points by magnitude as we do with ordinal data, but we can also quantify the amount of difference between data points.
-> data with fixed intervals, so we can order it and quantify the differences between values
Ratio data
has identity, magnitude and fixed intervals, and there is a true zero.
For example, the time taken to run a race cannot be a negative value.
-> e.g. time or height, where there are fixed intervals and a “true zero” that the data cannot fall below
Descriptive statistics
- Each observation is a “Datum”: Plural is “Data”
- A bunch of data is often called a “Data Set”
- Different types of data are analysed in different ways
- Most basic description is how frequently similar observations occurred
- Easiest description to follow is a picture
Data through pictures
- Bar graph
- Line graph
- Pie chart
- Scatter Plot
Categorical data through pictures
- Pie chart
- Bar graph
Continuous data through pictures
- Histogram
- Box plot
Histogram components
- Title
- X label and data (bins/classes of variable)
- Y label and data (frequency/total)
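A minimal sketch of how the bins (x axis) and frequencies (y axis) of a histogram can be tallied, using only the standard library. The scores, the bin width of 10 and the helper name `histogram_counts` are made up for illustration.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

scores = [12, 15, 21, 22, 25, 31, 33, 34, 38, 47]  # made-up data
counts = histogram_counts(scores, 10)  # {10: 2, 20: 3, 30: 4, 40: 1}

# Text version of the histogram: one row per bin, one '#' per observation.
for start, freq in counts.items():
    print(f"{start}-{start + 9}: {'#' * freq}")
```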
Negative skew
skewed to the left.
There are more points towards the higher end of the x axis.
-> right side higher
Positive skew
skewed to the right.
There are more points towards the lower end of the x axis
-> left side higher
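As a rough illustration of the two skew cards, the sketch below computes a moment-based skewness (the average cubed z-score). This is only one of several conventions, and the data and the function name `skewness` are assumptions made for the example. A long tail to the right gives a positive value and pulls the mean above the median.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Average cubed z-score (population, moment-based skewness)."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

right_tailed = [1, 2, 2, 3, 3, 3, 4, 15]   # most points low, long tail to the right
left_tailed = [-v for v in right_tailed]   # mirror image: long tail to the left

print(skewness(right_tailed) > 0)  # positive skew
print(skewness(left_tailed) < 0)   # negative skew
print(mean(right_tailed), median(right_tailed))  # mean is pulled towards the tail
```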
Kurtosis
- Kurtosis refers to the ‘peakedness’ of a distribution, the relative concentration of values at the centre, tails or shoulders of the histogram.
- Normal distributions have the distinctive bell shaped curve.
- Distributions with positive kurtosis have an overly high concentration of values at the centre, giving a pronounced central peak.
- Distributions with negative kurtosis are much flatter, and have a less distinctive peak.
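One common way to put a number on kurtosis is "excess kurtosis": the average fourth-power z-score minus 3, so that a normal distribution scores roughly zero. The sketch below assumes that definition; the data sets and the name `excess_kurtosis` are invented for the example.

```python
from statistics import mean, pstdev

def excess_kurtosis(values):
    """Average fourth-power z-score, minus 3 (the value for a normal curve)."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3

peaked = [5, 5, 5, 5, 5, 5, 5, 5, 0, 10]  # most values piled at the centre
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # values spread evenly, no clear peak

print(excess_kurtosis(peaked) > 0)  # positive kurtosis
print(excess_kurtosis(flat) < 0)    # negative kurtosis
```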
Number of peaks
1: unimodal
2: bimodal
3+: multimodal
Spread
A distribution can give the histogram a thinner or thicker appearance, which tells us that our participants vary very little or vary a lot. Again, we can use pictures together with numbers to describe the spread of data.
Deviations from pattern
- Outliers
Shape
- skew
- kurtosis
- number of peaks
Deviations from mean
above the mean
-> positive deviation
below the mean
-> negative deviation
*sum of the deviations is equal to zero
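A quick check of the claim that deviations from the mean sum to zero, using made-up scores:

```python
from statistics import mean

scores = [4, 7, 9, 10, 15]  # made-up data
m = mean(scores)            # 9

# Positive deviations sit above the mean, negative ones below it.
deviations = [x - m for x in scores]  # [-5, -2, 0, 1, 6]
print(sum(deviations))                # 0
```

The positive and negative deviations always cancel out, which is why measures of spread square the deviations (or take absolute values) before averaging them.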
Skewed distribution reporting
- do not use the mean because it is affected by outliers, whereas the median is a resistant measure of centre
- mode does not always report centre accurately
- typically report using median
Which average to use
categorical: mode
continuous and skewed: median
continuous and normal dist: mean
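The rule of thumb above can be sketched with Python's `statistics` module. All three data sets are invented for illustration.

```python
from statistics import mean, median, mode

eye_colour = ["brown", "blue", "brown", "green", "brown"]  # categorical
reaction_ms = [310, 320, 330, 340, 350, 900]               # continuous, positively skewed
heights_cm = [160, 165, 170, 175, 180]                     # continuous, roughly symmetric

print(mode(eye_colour))     # brown: the most frequent category
print(median(reaction_ms))  # 335.0: not dragged upwards by the 900
print(mean(reaction_ms))    # 425.0: pulled towards the outlier
print(mean(heights_cm))     # 170: mean and median agree for symmetric data
```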
Spread
- variability
- minimum and maximum
- quartiles
Variability
when people are quite similar to each other - we have low variability in scores
when people are quite different from each other - there is high variability in the data.
Minimum and maximum
probably the easiest measure of spread to calculate. Quite simply, the lowest and highest values. This gives an indication of the range over which the scores occur
- not robust to outliers
Quartiles
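- arrange the data from lowest to highest value, then split it into quarters, with the median in the centre
- Q1 is the cut point below the median (between the first and second quarters), Q2 is the median itself, and Q3 is the cut point above the median (between the third and fourth quarters)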
- fairly robust to outliers, a useful measure of spread
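A short sketch of these measures of spread using `statistics.quantiles`. The data are made up, and the "inclusive" method is an assumption: textbooks and software use several slightly different quartile conventions, so other tools may report slightly different values for the same data.

```python
from statistics import median, quantiles

data = [2, 4, 4, 5, 7, 8, 9, 11, 12]  # made-up data, already in order

lowest, highest = min(data), max(data)
data_range = highest - lowest  # 10

# n=4 splits the ordered data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)  # 4.0 7.0 9.0

iqr = q3 - q1  # interquartile range: spread of the middle half of the data
```

Changing the largest value from 12 to 120 would change the maximum and the range, but leave all three quartiles where they are.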
Standard Deviation
- a rough measure of the average amount by which scores deviate from the mean
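- the majority of data will fall within one standard deviation of the mean (about 68% if the data are normally distributed)
- only a small minority of data will fall more than two standard deviations from the mean (about 5% if the data are normally distributed)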
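A minimal sketch of the standard deviation and of counting how much data sits close to the mean. The scores are made up, and `pstdev` (the population formula) is used for simplicity; `stdev` would give the sample version.

```python
from statistics import mean, pstdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores
m = mean(scores)      # 5
sd = pstdev(scores)   # 2.0 (population standard deviation)

within_one = [x for x in scores if abs(x - m) <= sd]      # between 3 and 7
beyond_two = [x for x in scores if abs(x - m) > 2 * sd]   # below 1 or above 9

print(len(within_one) / len(scores))  # 0.75
print(beyond_two)                     # []
```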
Keeping outliers
Reason: Can’t pick and choose your data
Pro: You are considering all of your data set
Con: Your measure of the mean and standard deviation will be overly influenced by the outlier
Disregarding outliers
Reason: This individual is not representative
Pro: Your mean and standard deviation are less influenced by one individual
Con: Your statistics will not apply to all of the individuals
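The trade-off between keeping and disregarding an outlier can be seen by computing the statistics both ways. The scores below are invented, with 60 playing the role of the unrepresentative individual.

```python
from statistics import mean, median, stdev

scores = [10, 11, 12, 12, 13, 14, 60]     # made-up data with one extreme score
trimmed = [x for x in scores if x != 60]  # same data with the outlier disregarded

print(mean(scores), mean(trimmed))      # the mean drops sharply without the outlier
print(stdev(scores), stdev(trimmed))    # so does the standard deviation
print(median(scores), median(trimmed))  # the median barely notices
```

Whichever choice is made, reporting the median alongside the mean shows the reader how much the outlier is driving the result.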
Pie charts
- Divide up circles (the pie) into different areas (the slices).
- Each slice represents the percentage values or counts for categorical variables
- The size of each slice of pie should be proportional to the percentage or count
- difficult to make by hand
- work best for 10 categories or fewer
- useful for proportional categorical data
Bar graph
- can show data across more than one variable
- can show error bars for SD
- difficult with many categories
- difficult with categories close in value
- useful for counts (frequency) or summary of data
Stem and leaf plots
- similar to histograms
- tens digit on the left (the stem) and units digits after it (the leaves)
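A small sketch of building a stem and leaf plot, assuming non-negative whole numbers. The ages and the helper name `stem_and_leaf` are made up for the example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem); the units digits become the leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

ages = [23, 25, 31, 34, 34, 38, 42, 47, 51]  # made-up data
plot = stem_and_leaf(ages)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Turned on its side, the rows of leaves form the bars of a histogram with a bin width of 10, while still showing every individual value.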
Box plots
- box and whisker plots
- can show distribution of continuous variable
- show median surrounded by box of Q1 and Q3
- “whiskers” can be the min and max points
- more typically, “whiskers” extend to the min and max points that lie within the lower and upper fences
- lower and upper fences are Q1 - 1.5 x IQR and Q3 + 1.5 x IQR, where IQR = Q3 - Q1
- can show skew through where the median line sits
-> median line lower when positive skew
-> median line higher when negative skew
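The fence rule can be sketched as follows. The data and the helper name `iqr_fences` are assumptions for illustration, and the quartiles use the "inclusive" method of `statistics.quantiles`; plotting software may use a slightly different quartile convention.

```python
from statistics import quantiles

def iqr_fences(values):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 4, 4, 5, 7, 8, 9, 11, 30]  # made-up data with one extreme value
lower, upper = iqr_fences(data)       # (-3.5, 16.5)

# Points outside the fences are flagged as outliers.
outliers = [x for x in data if x < lower or x > upper]  # [30]

# Whiskers stop at the most extreme points that are still inside the fences.
whisker_low = min(x for x in data if x >= lower)   # 2
whisker_high = max(x for x in data if x <= upper)  # 11
```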
Showing data
- the shape of data is best shown using pictures, such as histograms of the data.
- the centre and spread of data are best described using numbers: means, medians and modes for centre; minimum and maximum values, quartiles, variance and standard deviations for spread.
- describing deviations from what is ‘normal’ in our sample requires both pictures (to show where the data is) and numbers (to describe how it deviates, based on rules such as the interquartile range rule).