Lec 19 and 20 Flashcards

1
Q

Plenty of time, context known, looks unimportant?

A

exploring data

2
Q

can afford detail but looks vital?

A

results in paper

3
Q

Must appeal from afar but have close-up detail?

A

poster at conference

4
Q

must sell idea in seconds

A

presentation or talk at conference

5
Q

counts in categories?

A

Table or bar chart (barplot() in R); these show frequencies more clearly than a pie chart.
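A minimal sketch of counting observations per category and drawing the counts as bars whose lengths can be compared by eye. It is plain Python rather than R's barplot(); the function names and the habitat data are made up for illustration.

```python
from collections import Counter

def count_categories(observations):
    """Return (category, count) pairs, most frequent first."""
    return Counter(observations).most_common()

def text_barchart(counts, symbol="#"):
    """Render each count as a horizontal bar followed by the count itself."""
    width = max(len(str(cat)) for cat, _ in counts)
    return [f"{str(cat):<{width}} | {symbol * n} {n}" for cat, n in counts]

# Hypothetical habitat records (illustrative data, not from the lecture).
habitats = ["wood", "field", "wood", "marsh", "wood", "field"]
for line in text_barchart(count_categories(habitats)):
    print(line)
```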

6
Q

Why not use a pie chart?

A

Pie charts are usually a poor way of representing frequency (count) data because the human eye is far better at comparing lengths than angles

7
Q

Does a paper graph need colour and title?

A

no

8
Q

Stacking of groups: helpful or not?

A

This might be useful if you are particularly interested in the sums across the primary x-axis categories, but it isn’t easy to compare the breakdown by the second categorical variable. For this, it is easier to have unstacked bars, side by side.

9
Q

when to use colour?

A

for talk or poster

10
Q

What is a mosaic plot for?

A

Mosaic plot for exploration of complex contingency tables

11
Q

how many colours does default R palette have?

A

8

12
Q

How to overcome overlapping points?

A

use open symbols and jitter
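A small sketch of jittering: adding a little random noise to each value so that identical points no longer sit on top of each other when plotted. The function below is a hypothetical stand-in, not R's jitter(); the `amount` and `seed` parameters are assumptions.

```python
import random

def jitter(values, amount=0.2, seed=None):
    """Shift each value by uniform noise in [-amount, +amount]."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

# Tied scores that would overlap exactly on a plot (illustrative data).
scores = [1, 1, 1, 2, 2]
print(jitter(scores, amount=0.2, seed=1))
```

The noise is only for display; any analysis should still use the original values.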

13
Q

Which charts show the distribution in different groups?

A

stripchart and histogram

14
Q

When to represent the analysis and when to represent the data themselves?

A

Analysis for a talk or poster.
The data themselves for a paper (although many articles present analysis summaries, so it is best to show both so the reader can decide whether the right analysis has been used).

15
Q

Repeated measures: what to plot and what not to plot?

A

Don't plot the simple treatment medians (or means), as this conceals the paired design. Instead, plot the differences between treatments (only really viable for two treatments).
For two or more treatments, remove subject differences first.
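A minimal sketch of both ideas, under stated assumptions: for two treatments, take the within-subject difference; for more, centre each subject's values on that subject's own mean, which is one common way to remove subject differences (the lecture may have used another). Function names and numbers are illustrative.

```python
from statistics import mean

def paired_differences(treatment_a, treatment_b):
    """Within-subject differences (b minus a), one per subject."""
    return [b - a for a, b in zip(treatment_a, treatment_b)]

def remove_subject_effects(rows):
    """rows holds one list per subject, one value per treatment.
    Subtract each subject's own mean so only treatment effects remain."""
    centred = []
    for row in rows:
        subject_mean = mean(row)
        centred.append([x - subject_mean for x in row])
    return centred

# Two treatments measured on the same three subjects (made-up data).
print(paired_differences([10, 20, 30], [12, 23, 31]))
# Three treatments measured on two subjects (made-up data).
print(remove_subject_effects([[10, 12, 14], [20, 22, 24]]))
```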

16
Q

plotting correlation or regression

A

y ~ x as it matches the analysis

17
Q

regression line?

A

Ordinary Least Squares regression

When you are predicting y from x
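As a sketch of what the ordinary least squares line is: the slope is the sum of cross-products of x and y deviations divided by the sum of squared x deviations, and the line passes through the point of means. The helper name `ols_line` and the data are assumptions for illustration.

```python
from statistics import mean

def ols_line(x, y):
    """Return (slope, intercept) of the least squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up data lying exactly on y = 1 + 2x.
slope, intercept = ols_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```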

18
Q

Reduced major axis regression line?

A

Only when no prediction is involved and no causation is implied

for correlation or allometry
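A sketch of the reduced major axis slope, taken here as the ratio of the standard deviations of y and x, signed by the direction of the association. Unlike least squares it treats the two variables symmetrically: swapping x and y gives the reciprocal slope. The function name and data are illustrative assumptions.

```python
from statistics import mean, stdev

def rma_line(x, y):
    """Return (slope, intercept) of the reduced major axis line."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sign = 1 if sxy >= 0 else -1          # direction of the association
    slope = sign * stdev(y) / stdev(x)    # ratio of standard deviations
    return slope, my - slope * mx

# Made-up measurements of two body dimensions.
length = [1, 2, 3, 4, 5]
mass = [2, 1, 4, 3, 6]
slope_yx, _ = rma_line(length, mass)
slope_xy, _ = rma_line(mass, length)
print(slope_yx, slope_xy, slope_yx * slope_xy)
```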

19
Q

relationship between many variables

A

pairs() plots every pair of variables against each other, using the whole dataset

20
Q

shingling

A

break down the second continuous variable into categories. By default, the ‘shingles’ overlap by 50%
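A rough sketch of how overlapping 'shingles' can be formed: sort the values, then slide a window of equal size along them so that neighbouring windows share a set fraction of points. This is a simplified illustration and not necessarily the exact algorithm R uses; the `shingles` function and its parameters are assumptions.

```python
def shingles(values, number=3, overlap=0.5):
    """Return (low, high) limits of `number` overlapping intervals."""
    xs = sorted(values)
    n = len(xs)
    size = n / (number * (1 - overlap) + overlap)  # points per shingle
    step = size * (1 - overlap)                    # shift between shingles
    intervals = []
    for i in range(number):
        lo = round(i * step)
        hi = min(n - 1, round(i * step + size) - 1)
        intervals.append((xs[lo], xs[hi]))
    return intervals

# Twelve made-up values of the conditioning variable.
print(shingles(range(1, 13), number=3, overlap=0.5))
```

With twelve values and three shingles, each interval holds six points and shares three of them with its neighbour.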

21
Q

how to make 3d plot easier to read

A

Add drop lines, or add the plane describing the regression of z on y and x

22
Q

How is analysis represented? and how is data represented?

A

analysis represented using means and CI

data represented using boxplots

23
Q

what is data mining

A

True exploration when you don’t know what patterns might exist
May have a large number of candidate predictors and no strong theory to predict which should be more important
And as the number of variables goes up, the number of possible interactions goes up geometrically
** Before ‘data mining’ check for missing values and outliers and ‘clean up’ the data

24
Q

what does vertical spacing in CART represent

A

variance explained

25
When pruning a tree in CART, does error go up or down with the number of predictors?
Error goes down as the number of predictors increases.
26
What is cross-validation?
For any classification method, your 'best model' will always perform well on the sample you developed it on, so there is a risk of over-fitting. So build the model using one set of data and test it on another (= cross-validation): partition the original sample into a training set to train the model and a test set to evaluate it.
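A minimal sketch of the partitioning step, assuming a single random split into training and test sets. The function name, the 25% test fraction and the seed are illustrative choices, not values from the lecture.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Twenty made-up observation IDs.
train, test = train_test_split(range(20))
print(len(train), len(test))
```

The model is fitted on `train` only, and its error is then measured on `test`, which it has never seen.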
27
Pros and cons of CART
It's crude: doesn't make use of continuous information; simply splits such variables as 'high' or 'low'
But it's robust: variables can have any distribution
And powerful: examines all interactions
And gives results in an easy-to-use form: a decision tree
28
Two classes of clustering: definitions
'Supervised' learning: know the true identity of some clusters and use these to develop a predictive model for data where you don't know group membership
'Unsupervised' learning: don't know what's 'right' or 'wrong', so try to find natural clustering patterns in the data (which can then be used in future prediction)
29
Examples of supervised and unsupervised
Supervised:
Discriminant function analysis
Logistic regression (two groups)
Multinomial logistic regression (>2 groups and no order)
Support Vector Machines
Neural networks
Genetic algorithms
Unsupervised:
k-means clustering (specify the number of clusters to find; allocates data to clusters to minimise within-cluster sums-of-squares)
30
When to use CART vs clustering?
Use robust methods like CART to explore predictive relationships. Use clustering to expose unexpected 'structure' (groupings) in your data.
31
What is multinomial logistic regression?
Logistic regression for a response with more than two groups and no natural order
32
plotting raw data?
stripchart or dotplot
33
A histogram represents...
partially summarised data
34
plotting continuous data
stripchart, scatterplot, histogram (to explore whole distributions)
35
notched boxplot
If the notches of 2 boxes don’t overlap, medians are likely to differ
36
how much of the data is represented in a boxplot? and what is the interquartile range?
The box of a boxplot starts at the 25th percentile and ends at the 75th percentile, so the box contains 50% of the data. The interquartile range is the difference between the 25th and 75th percentiles, which are also referred to as the lower and upper quartiles.
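A short sketch of computing the quartiles and the interquartile range with Python's standard library. Note that several definitions of a sample quantile exist, so other software may give slightly different quartiles for the same data; the data here are made up.

```python
from statistics import quantiles

def box_limits(data):
    """Return (lower quartile, upper quartile, interquartile range)."""
    q1, _median, q3 = quantiles(data, n=4)  # three cut points for four quarters
    return q1, q3, q3 - q1

# Eleven made-up measurements.
q1, q3, iqr = box_limits([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
print(q1, q3, iqr)
```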
37
ellipses package
Simplify correlations to ellipses with hue & intensity indicating direction & strength
38
Which test is this? The slopes are assumed to be the same (parallel) for each level of the categorical factor.
ANCOVA. It can be shown in separate panels, or in the same plot with separate regression lines. Use shingling to help visualisation.
39
Which graph summarises the data?
boxplot
40
What is inferential statistics?
Testing hypotheses: making inferences about populations using samples drawn from those populations.
41
What is a conditioning plot and when is it used?
It is used for investigating the relationship between multiple variables: a plot of y against x conditioned upon (or broken down by) a third variable (or more if you have them).
If the third variable is categorical (a factor with discrete levels/groups), you get a plot of y against x for each level of the third variable, each in a separate panel.
If the third variable is continuous, it is broken into similarly-sized, somewhat overlapping groups, and you get a plot of y against x for each of those.
42
What is k-means clustering?
allocates data to clusters to minimise within-cluster sum of squares
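A bare-bones sketch of the k-means idea: assign each point to its nearest centre, move each centre to the mean of its points, and repeat, which drives the within-cluster sum of squares down. Starting from the first k points is a naive assumption made here for reproducibility; real implementations usually use random starts. All names and data are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Return (centres, clusters) after a fixed number of update rounds."""
    centres = [list(p) for p in points[:k]]  # naive start: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # leave an empty cluster's centre where it is
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, clusters

def within_ss(centres, clusters):
    """Total within-cluster sum of squares, the quantity being minimised."""
    return sum(sq_dist(p, centre)
               for centre, members in zip(centres, clusters)
               for p in members)

# Two obvious made-up groups of points.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centres, clusters = kmeans(points, k=2)
print(clusters, within_ss(centres, clusters))
```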
43
How does hierarchical or agglomerative clustering work?
You need a measure of distance between points. Join the closest two points; once two or more points are joined they are treated as a single item with a single 'location'. Repeat until everything is joined up, then draw the pattern of linkage as a tree.
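A small sketch of the joining procedure on one-dimensional data, assuming a joined group's 'location' is the mean of its members (centroid linkage, one of several possible linkage rules). The function names and data are illustrative.

```python
def centroid(cluster):
    """Single 'location' of a group: the mean of its members."""
    return sum(cluster) / len(cluster)

def agglomerate(values):
    """Repeatedly join the two closest groups; return the merge history
    as (group_a, group_b, distance) tuples in the order they were joined."""
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        joined = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(joined)
    return merges

# Five made-up measurements.
for step in agglomerate([1, 2, 6, 7, 20]):
    print(step)
```

The merge history is what a linkage tree displays: each join becomes a branch point, drawn at a height equal to the distance at which the two groups were joined.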
44
When to use CART vs hierarchical/agglomerative clustering?
CART for exploring predictive relationships; agglomerative/hierarchical clustering to expose unexpected structure.