Behavioural Analytics Flashcards
What does GAM stand for?
Generalised Additive Model
Why are GAMs useful in looking at emotion?
We need to understand how emotions change over time - when observing people interacting we often have time series data.
What are often used to look at emotional states?
Valence and arousal are often used to describe the emotional state.
What is trace annotation?
Raters watch a video and continuously report the valence and arousal they perceive at each moment in time.
These traces typically go up and down, and the classical statistical techniques we use (eg linear models) are not great for this. Therefore we need GAMs.
What do GAMs allow?
They allow you to analyse trace data using ideas similar to those of regression, with more advanced modelling capabilities.
What command in R is used for linear models?
lm()
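A minimal sketch of fitting and inspecting a linear model (the data here are simulated purely for illustration):
# simulated example data
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)
model <- lm(y ~ x)
summary(model)  # coefficients, R-squared, p-values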
What is the equation for a straight line?
y = mx + c
y = α + βx + ε
where ε (epsilon) represents the errors around the line
In regression, what should we always be checking?
Our assumptions of linearity
Check the partial plots (in a multiple regression) or regression plots and investigate to see if the data has some curved nature.
Eg curvature in residuals vs the fitted values suggests a straight line is not a good way to capture the model.
What is Anscombe's Quartet?
A visual warning that you should always visualise the data and do some EDA.
Four datasets share nearly identical summary statistics yet look very different when plotted.
What is a more modern version of Anscombe's Quartet?
The Datasaurus Dozen
What is Simpson’s Paradox?
A statistical phenomenon where an association between two variables in a population emerges, disappears or reverses when the population is divided into subpopulations.
What are other non-linear models like GAMs?
LOESS - locally estimated scatterplot smoothing
ARIMA - auto regressive integrated moving average
What is the function for a GAM?
y = α + f(x) + ε
It is the regression function as before, but we swap out the single beta coefficient for a function.
This function allows us to come up with a way of capturing the data - splines that combine to make a non-linear smooth representation of the function.
Coefficients tell you about the nature of the basis functions; you add them together to give an overall "wiggly" line.
What do we call the combination of basis functions?
A smooth
What is the line of code to produce a GAM model?
model <- gam(dependent ~ s(predictor), data = data)
What is different about gam() compared to lm()?
The predictor variable is wrapped in s() which instructs R to come up with a function which best fits this data.
Once we have created a GAM model, what functions do we call on the model?
- summary(model)
- coef(model)
- plot(model)
- gam.check(model, pages = 1)
In the output, what is the EDF?
Effective degrees of freedom
What does a very small p-value represent?
That the smooth term is significant - the fitted curve captures the data better than a flat line.
What does gam.check() do?
Runs model diagnostics - in particular it checks whether the basis dimension k allows enough curvature.
We don’t want a very small P value in the GAM check.
Higher p-values are preferred because they suggest the model residuals are well-behaved.
What should we change about the GAM model?
Change the number of basis functions/knots by adjusting the k argument within the smooth function; k controls the wiggliness of the line.
model <- gam(response ~ s(predictor, k = 15), data = data)
What is concurvity?
An issue we need to deal with, the smooth equivalent of collinearity.
What determines the wiggliness of a smooth?
- The number of knots / basis functions
- The smoothing parameter - lambda
How do we change the lambda smoothing parameter?
Use the term sp = within the GAM specification
model <- gam(response ~ s(predictor, k = 15), sp = 0.1, data = data)
What other way can we change the smoothing?
REstricted Maximum Likelihood method
model <- gam(response ~ s(predictor, k = 15), method = "REML", data = data)
How can we improve GAM plots?
- Adding confidence intervals
- Adding residuals
How can we improve GAM plots by adding confidence intervals?
Adding confidence intervals (variability bands) and shading them.
plot(model, se = TRUE, shade = TRUE, shade.col = "rosybrown2")
How can we improve GAM plots by adding residuals?
plot(model, se = TRUE, shade = TRUE, shade.col = "rosybrown2", residuals = TRUE, pch = 1, cex = 1)
How can you add in covariates to the GAM?
As it is an “additive” model, the separate components simply add to create the overall model.
Looking at more than one function in the same model
model <- gam(response ~ s(pred1) + s(pred2), data = data)
Or looking at them together as an interaction
model <- gam(response ~ s(pred1, pred2), data = data)
In addition to covariates in GAM, how else can you have a multivariate GAM?
Include linear variables
model <- gam(response ~ s(pred1) + pred2, data = data)
Include factor/categorical variables
model <- gam(response ~ s(pred1, by = sex), data = data)
Tensor product smooths can be made for 2D and spatial data. What do tensors allow for?
Two differing scales to interact
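A hedged sketch of a tensor product smooth in mgcv (variable names are illustrative):
model <- gam(response ~ te(pred1, pred2), data = data)
te() builds the surface from marginal bases, so pred1 and pred2 can sit on completely different scales.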
What function is used to visualise GAMs?
vis.gam()
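For example (assuming a fitted model containing smooths of pred1 and pred2):
vis.gam(model, view = c("pred1", "pred2"), plot.type = "persp", theta = 30)  # 3D perspective plot of the fitted surface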
What is a GAMM?
Generalised additive mixed model
They are the multi-level mixed model form. They are more sophisticated.
How do you initialise a new plot?
ggplot()
How do you change the “dot” of plotted points?
geom_point(shape = 1) creates a hollow circle
How do you plot lines on a plot?
geom_abline() - diagonal line
geom_hline() - horizontal line
geom_vline() - vertical line
How do you expand the axis in view of ggplot?
coord_cartesian(xlim = c(-1, 3), ylim = c(-1, 3))
Why may linear regression not be ideal for a real-world scenario?
In a real world data collection situation we never get data that falls along a straight line. We always have some aspects of the data that are not explained by the model.
What is a residual in linear regression?
The difference between the actual value and the value predicted by the model (y-ŷ) for any given point
How do you determine the predictions for data based on a model?
predict(model)
How do you determine the residuals of the data based on model predictions?
residuals(model)
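For example, adding both to the data frame used for plotting (assuming a data frame called curveData, as used elsewhere in these notes):
curveData$predicted <- predict(model)    # fitted value for each observation
curveData$residuals <- residuals(model)  # actual minus predicted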
How do you add the predicted values onto a plot?
+ geom_point(aes(y = predicted), shape = 1)
Predicted must be added to the dataframe
How do you add on vertical lines to show the difference between a predicted value and the actual value (residuals)?
+ geom_segment(aes(xend = Time, yend = predicted), alpha = 0.5)
What does geom_segment() do?
Draws a straight line between two points - we use it to show the residual
How do you add labels to show the residuals in a plot?
geom_text(aes(y = predicted + (residuals / 2),
label = paste0(round(residuals, 1))),
nudge_x = 0.5, size = 2)
nudge - so they don’t overlap with the points
What does adding fill = NA to geom_smooth() do?
Removes the default shaded confidence interval
What is Anscombe’s quartet?
A statistical warning - it comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when visualised.
What kind of non-linear relationship is better captured by Generalised Additive Models than linear models?
Curvilinear
Describe the data in the built-in anscombe data.
Columns 1-4 (x1, x2, x3, x4) contain x-values.
Columns 5-8 (y1, y2, y3, y4) contain corresponding y-values.
How do you get a fitted line to extend beyond the range of the data?
fullrange = TRUE
How do you plot the anscombe data?
anscombeData <- data.frame()
for (i in 1:4) {
  anscombeData <- rbind(anscombeData, data.frame(set = i, x = anscombe[, i], y = anscombe[, i + 4]))
}
ggplot(anscombeData, aes(x, y)) +
  theme_bw() +
  geom_point(size = 3, color = "red", fill = "orange", shape = 21) +
  geom_smooth(method = "lm", fill = NA, fullrange = TRUE) +
  facet_wrap(~ set, ncol = 2)
How do you account for randomness?
set.seed(123)
How do you add randomness to a curve?
randomError <- rnorm(mean=0, sd=0.5*sd(z), n=length(x))
where z is the vector of function outcomes and x is the vector of data points, so n = length(x) gives the number of points
What is the difference between se = FALSE and fill = NA in geom_smooth()?
se = FALSE - Removes Confidence Interval
fill = NA - Removes the Fill Colour of the Confidence Interval
What does geom_path() do?
Adds a line which connects the points in the order they are given in the data frame.
What should we always do when fitting a linear model?
Check our assumptions
What is the quick “cheat” way to get a smooth curve?
Using the geom_smooth() function.
What are the automatic methods of geom_smooth()?
LOESS if the number of observations < 1000
GAM if the number of observations is 1000 or more
What are curvilinear lines also known as?
Smooths
What does LOESS stand for?
Locally estimated scatterplot smoothing
What is the disadvantage of using geom_smooth() over a GAM?
It is not as sophisticated.
It draws the line but does not give us an output to interpret and understand (we need a GAM package such as mgcv for this)
What package do we use for GAMs?
mgcv
What is nlme?
A mixed modelling regression package.
How is the wiggliness of a GAM curve determined?
The number of joining knots that are allowed in a spline
What is a spline?
A piecewise combination of lots of little cubic sections
How much data is there usually in emotion data?
Emotion examples use intensive longitudinal data so we typically have lots of data.
How do we check the number of basis functions or knots?
Using the p-value of the statistic reported by gam.check() - its output notes that a "low p-value (k-index<1) may indicate that k is too low".
When using the REML method, what happens to the P value when you re-run?
When using REML we have a p value that actually jumps around a bit.
If we run this more than once we can see that the p-value changes. This is telling us that something is wrong, as this is not reliable.
Which method should we use when running a GAM?
REML is probably best as a default
If you don’t specify this it defaults to using Generalised Cross-Validation (GCV)
What does it mean if the GAM is not stable?
It typically means that the model is overfitting, underfitting, or failing to converge properly.
How can you tell if the model is unstable?
The p-value from the k-check (which tests whether the basis dimension k is sufficient) changes each time you run the model. This variability is a sign that the model is not robust.
What is one way to check stability?
run gam.check(model)
What does gamSim() do?
The function gamSim() generates example datasets that are commonly used for demonstrating GAMs.
What does a lower GCV score suggest?
Better model generalisability
What is concurvity?
Concurvity describes the situation in GAMs (and similar non-linear models) where two or more smooth terms or predictors are highly correlated in their functional forms - they move in similar ways - leading to potential issues with model estimation and interpretation.
What is physio dash?
An R Shiny App
What are the three parts of a Shiny app?
- Initialisation code
- A UI part
- A server part
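A minimal sketch of that three-part structure (the contents are placeholders, not the physio_dash code):
library(shiny)  # initialisation code: load packages, define helper functions

ui <- fluidPage(  # UI part
  sliderInput("n", "Number of points:", min = 10, max = 100, value = 50),
  plotOutput("plot")
)

server <- function(input, output) {  # server part
  output$plot <- renderPlot(plot(rnorm(input$n)))
}

shinyApp(ui, server)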
What is the popular software design pattern, MVC?
Model–View–Controller (MVC)
The model is where the data is stored and manipulated - this is the server part in Shiny.
The View part is the code for the user interface.
The controller code glues these aspects together and deals with user inputs.
Why is simulated data beneficial to use?
It gives you some fast and straightforward data in a shape you define, this allows you to quickly show how data will work throughout the model and UI parts of your app.
In the physio_dash application, what do the sliders do?
Collect information concerning the nature of the simulated data - this information will be sent to the server part of the Shiny app to feed into the simulation functions.
Describe how time is created for the physiological features when the "simulate_data" button is pressed?
It is created as a vector with the seq() function.
Time can be difficult to work with - one of the easiest ways is to treat it as an integer, using UNIX time/POSIX time.
How does UNIX deal with time?
By taking the number of seconds that have elapsed since Jan 01 1970 (UTC)
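For instance, a sketch of building such a time vector with seq() (values are illustrative):
start <- as.numeric(Sys.time())  # current time as seconds since 1970-01-01 (UTC)
time <- seq(from = start, by = 1, length.out = 60)  # one sample per second for a minute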
What is the arima.sim() function?
ARIMA stands for AutoRegressive Integrated Moving Average
arima.sim() simulates data for these kinds of models
In arima.sim(), what does the model argument take?
model = list(ar = input$autocorrelation)
It takes the input from the autocorrelation slider in our UI as the model and uses it as a simple autoregressive (ar) model.
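A minimal sketch (the ar value here stands in for the slider input):
simulated <- arima.sim(model = list(ar = 0.7), n = 200)  # AR(1) series with autocorrelation 0.7
plot(simulated)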
How do you add an icon to the shiny app?
shiny::icon("heart-o")
How do you output a graph in the Shiny app?
UI - dygraphOutput()
Server - renderDygraph()
What is a dygraph, what does this package do?
Dynamic graphic
Dygraphs produces nice dynamic interactive graphs for time series style data.
What are some ways that dplyr is used for tidy data manipulation in physio dash?
(Exercise 4 in summary)
data <- data %>%
  dplyr::select(time, Heartrate) %>%  # keep only the time and Heartrate columns
  dplyr::mutate(biometric = "HR") %>%  # add a biometric column with the constant value "HR", identifying the data type
  dplyr::rename(value = Heartrate) %>%  # rename "Heartrate" to "value"
  dplyr::mutate(time_date = as.POSIXct(as.numeric(as.character(time)) / 1000, origin = "1970-01-01", tz = "Europe/London"))  # convert the UNIX timestamp (milliseconds) to human-readable time
return(data)
How is HTML code placed into the shiny app?
Using the tags object, eg tags$div(), tags$h1()
What is a requirement for using the dygraphs package for time series data?
Time series data must be presented in xts format.
Type ?xts
We opt for a POSIX form of data representation, using the POSIXct class with the as.POSIXct() function.
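A hedged sketch of preparing data for dygraphs (assuming a data frame with value and POSIXct time_date columns, as above):
library(xts)
library(dygraphs)
series <- xts(data$value, order.by = data$time_date)  # order.by must be a time-based class such as POSIXct
dygraph(series)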
What does the combination of dygraph and Shiny allow?
Interactivity
In physio dash, what is a difference in dygraphs for heart rate and ECG?
ECG processing detects peaks, extracts RR intervals, and computes NIHR, while HR processing simply plots BPM values.
ECG processing extracts HRV metrics from the raw ECG signal, while HR processing only displays BPM trends.
Additional plots in ECG - ECG processing allows advanced HRV visualizations, while HR processing is limited to basic HR plotting.
What is the features tab of physio dash for?
Taking a deeper dive into the data on an individual level
What is the biometrics tab of physio dash for?
Seeks to try and combine the data for a view of the different data streams all at the same time.
In a dygraph, how do you add a zoomable range selector?
dyRangeSelector()
In a dygraph, how do you change the colours?
eg define custom_palette <- c("red", "blue", "orange", "green", "darkgreen", "purple") outside renderDygraph()
then within dygraph: dyOptions(colors = custom_palette)
How do you adjust the legend size within dygraph?
dyLegend(width = 500)
How do you input a conditional message for loading, if the data is taking time?
conditionalPanel(condition = "$('html').hasClass('shiny-busy')", tags$div("Loading...", id = "loadmessage")),
How do you add in a piece of text (not a heading/paragraph) to inform the user what a certain aspect is for?
helpText("Click and drag on the plot to zoom and select date ranges"),
What does the descriptive tab of physio dash show?
The plot presents the time series together in a series of faceted plots
How do you extract and transform the galvanic skin response data for plotting in the descriptive tab?
The descriptive tab plots six plots (facet wrapped) so the GSR must be split into its two components
data_SCL <- data_GSR() %>%
  select(time, SCL, time_date) %>%
  mutate(biometric = "SCL") %>%
  dplyr::rename(value = SCL)
data_SCR <- data_GSR() %>%
  select(time, SCR, time_date) %>%
  mutate(biometric = "SCR") %>%
  dplyr::rename(value = SCR)
What is the difference between merge.zoo() and rbind()?
merge.zoo - biometric plot
rbind - descriptive plot
Use rbind() when you want a tidy, long-format dataset for faceted plotting (ggplot2) (vertical stacking)
Use merge.zoo() when you need to align time series and create a wide-format dataset for dygraph()
How are GAMs utilised in physio dash?
Generalised Additive Mixed Models (GAMMs) are used to analyse the biometric data and display key statistics in value boxes.
What does renderValueBox() allow you to do?
Create an infographic-style display - in particular we extract summary statistics and present them prominently.
How do you fit a GAMM model for the physio data?
eg within renderValueBox() - using gamm() rather than gam() so that the corAR1() correlation structure is actually applied
gamm.HR <- gamm(value ~ s(time), data = data_HR(), method = "REML", correlation = corAR1())
summary_gamm.HR <- summary(gamm.HR$gam)  # a gamm fit stores the GAM part in $gam
edf.HR <- round(summary_gamm.HR$edf, 2)
valueBox(
  edf.HR, "Heart Rate", icon = icon("heartbeat"),
  color = "red"
)
What does GAMM allow compared to GAM?
This is a more complicated style of regression model that allows us to incorporate correlations in the data into the model.
This is useful where the independence assumptions of a regression are violated, we can bring them into the model as an approach to dealing with that violation.
What do we need to be aware of with time series data?
As these are time series data we have to be aware of autocorrelation; GAMM models allow us to deal with such issues, at least to some extent.
What does gather() do?
eg tidyr::gather(key = emotion, value = emotion_score, c("joy", "fear", "disgust", "sadness", "anger", "surprise"))
Creates a longer format, where each row represents a time point and the corresponding score for each emotion.
The wide format data (where each emotion has its own column) isn’t suitable for plotting or time series analysis because it complicates visualizing how emotion scores change over time.
gather() converts the data into a long format, where you can easily plot emotion scores (emotion_score) against time (time), making it simpler to work with for visualization.
What is the difference when using readr::read_csv()?
The readr function creates a tibble rather than a dataframe.
This is the Tidyverse version of a data.frame/data.table that works in a way that is compatible with Tidyverse code
What is a gauge chart and how do you create one?
A Gauge chart is a type of chart that uses a radial scale to display data in the form of a dial.
renderGauge()
What kind of machine learning is inter-rater reliability relevant to?
Supervised machine learning - we get an algorithm to learn the relationship that exists between some set of inputs and known outcome.
This is carried out on a training set of data, then we provide a novel set of data as an input and the algorithm makes a prediction or classification based on the learnings of the training set.
What does “ground truth” refer to?
In machine learning, "ground truth" refers to the actual true values or correct labels used for training, validating or testing a model - the benchmark that a model's predictions are compared against.
It is an assumption or “operationalisation” of the truth - it typically depends on many theoretical assumptions that have been made when collecting material for the training set.
What do poorly defined assumptions result in?
A poorly defined and collected training set of data means the algorithms developed will not function well when given new data from the real world, as it will not perform according to the poorly-defined “ground truth”.
What should you do with a training set?
Reserve some of it as a test set.
We want to be able to test the extent to which the algorithm can do its job. We want to test it on previously unseen data. This also helps us to investigate if the algorithm is overfit to the training data.
What does it mean if the model is overfit?
If the algorithm is overfit to a particular set of data, it will perform very well on this data (eg even to go as far as accounting for the unique statistical anomalies) and will not generalise well to unseen data.
What is the disadvantage of labelled data?
Labelled data is usually expensive, labour intensive and takes a lot of time to create.
Humans are often the people who create labelled data sets, and in effect the machine learning algorithms are trying to copy the human behaviour and set of decision processes that went into labelling data.
Why is there a temptation to use as much of the training set as possible, thus minimising the amount of data for testing? Why is this a problem?
It is so expensive to create training sets and machine learning performs better with large quantities of data.
It will almost certainly lead to creating an algorithm that is overfit.
You need to choose the proportions of test and training data wisely.
What does testing on a test set allow?
Allows you to improve the accuracy of the algorithm.
How well the algorithm performs on the test set provides an error metric that you can use to improve your algorithm by changing the parameters, or adding components to more complex models.
What are synonyms for labelling?
Labelling, annotating, rating and coding
All mean someone observing data and making a judgement about it.
What is the gold standard for labelling?
Using well trained experts with a well defined coding scheme.
eg PhD students at universities have the time to create the labels or annotations.
Due to the intensity of the work, large numbers of raters are not normally possible.
What is one way to get more raters?
Use naive raters who do not need to be trained in as much depth.
This typically requires a simple coding scheme and payment for the recruits.
This is useful as control can be retained over the performance of the raters, but it takes a lot of organisation and teaching of the raters.
What is another alternative to expert or naive raters?
Crowdsourcing - using the web and internet to get access to a large number of raters.
- eg set up a website and use gamification to get people to do labelling
- eg use a paid crowdsourcing site such as Amazon Mechanical Turk; typically these tasks need to be very simple and there is very limited control over the people who participate
What is a coding scheme?
A coding scheme is a structured system used to classify, categorise, and interpret data.
It involves assigning labels, numbers, or categories to different elements of a dataset based on predefined rules.
- eg transcription - there could be variations in punctuation choices
What can be a problem with coding schemes?
There can be a lot of room for subjective judgement
What do we need to do in response to the fact that there can be a lot of room for subjective judgement?
We need to check how well the subjective decision makers agree, to give some idea of the consistency of the coding scheme and a measure of how objective its use is.
What are the two main ways coding can be done?
- Discrete (categorical or nominal) coding
- Continuous (ordinal, interval or ratio variables) coding
What is inter rater reliability (IRR)?
Inter-rater reliability (IRR) is a measure of how much agreement there is between multiple raters or observers. It’s used to ensure that data is consistent and reliable, regardless of who collects or analyses it.
What is Hallgren’s account of measurement error?
Observed score = True score + Measurement Error
Var(X) = Var(T) + Var(E)
ie the variance (the variability in the scores we observe) can usefully be thought of as the variance of the true score that we want to observe plus the variance of the measurement error
How are IRR scores typically set up?
To give an estimate of how much of the true scores we are getting.
eg an IRR estimate of 0.8 indicates that 80% of the observed variance is due to the true score variance, or similarity in ratings between coders and 20% is due to error variance or differences in ratings between coders.
What is the most simple type of inter rater agreement?
Percentage rater agreement
- This captures the amount of times two raters agree in a very simple sense
% Agreement = (no. observations agreed by raters) / (total no. observations)
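A quick sketch of this in R (the rating vectors are hypothetical):
rater1 <- c("A", "B", "A", "A", "B")
rater2 <- c("A", "B", "B", "A", "B")
mean(rater1 == rater2) * 100  # percentage agreement: 4 of 5 match = 80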
What does percentage agreement work easily for?
Categorical data.
It does not work so well for continuous data, where some sort of agreement interval is needed (ie turning the continuous data into a form of categorical data).
What is the downside of percentage agreement?
It does not take into account chance agreement
When can chance agreement be a really big problem?
In a simple classification problem where there are only two categories - it is likely that a lot of agreement occurs by chance.
eg rating cells as cancerous or not - about 1 in 20 cells is expected to be cancerous (p = 0.05)
What function is used to tell us how similar ratings are?
The agree() function from the irr package.
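For example (the ratings are hypothetical; columns are raters, rows are subjects):
library(irr)
ratings <- cbind(rater1 = c(1, 2, 2, 3), rater2 = c(1, 2, 3, 3))
agree(ratings)  # reports the percentage agreement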
How do we account for chance agreement between raters?
Cohen’s Kappa
What does Cohen’s Kappa range from?
-1 to +1
0 - no agreement
1 - perfect agreement
Negative - systematic disagreement
How do we combine the ratings of different raters?
cbind()
Each column is a different rater and rows are subjects
What are the limitations of Cohen’s Kappa?
It is limited to cases where there is categorical data and two raters (nominal - Hallgren).
If we want to have faith in our ground truth, we would rather have it coming from more than the opinion of just two raters.
For continuous data with more than two raters, what is the appropriate statistic?
Intraclass correlation coefficient (ICC)
What is the code for Cohen’s Kappa?
kappa2()
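A minimal sketch (hypothetical ratings from exactly two raters):
library(irr)
ratings <- cbind(rater1 = c("yes", "no", "yes", "yes"),
                 rater2 = c("yes", "no", "no", "yes"))
kappa2(ratings)  # Cohen's kappa for two raters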
Who proposed different ICCs?
Shrout and Fleiss
How many ways of calculating ICCs did Shrout and Fleiss suggest?
6 different ways
Appropriate depending on the characteristics of the data and the goals of the researchers
What is something to consider in ICC?
The spread of ratings across the whole data set.
Due to expense and time associated with rating, often sets of ratings are partially coded by different people.
Eg full coding: every rater coded every subject
Eg one rater has much more time available - a subset of the material is coded by other raters to ensure that the main rater is coding according to the coding scheme (often 10% may be coded by others)
Eg (Often in online ratings with many naive raters) different subsets of material are rated by different people, in a way that means no single rater rates all the data.
ICC can handle all of these situations but you need to be aware of what style of rating you are using.
What is the ideal scenario for ICC?
Fully crossed - two way model
Describe the two way model.
We have information about all of the raters rating all of the subjects.
This allows you to see how the two things interact - the ratings and the subjects that they rate.
What are the four things to consider for ICC?
- If it is fully crossed or not
- How the ratings should be interpreted (absolute values or consistency of ratings)
- The way the coding is set up (average or single measures)
- Whether coders selected for the study are considered to be random or fixed effects
What is the difference between IRR and ICC?
Inter-Rater Reliability (IRR) is the general concept of agreement between raters; the Intraclass Correlation Coefficient (ICC) is a specific family of statistics that measures this agreement for continuous data, taking the magnitude of disagreements into account.
What does a fully crossed design take into account that a non-fully crossed model does not?
In a fully crossed design, the ICC can take into account systematic deviations between the coders because it has that information
Why are there different equations for fully crossed and not fully crossed models?
Models which are not fully crossed do not have enough information so the systematic deviation must be left out
When do we use a two way model and when do we choose a one way model?
- Two way model: when it is fully crossed
- One way model: when it is not fully crossed
What is the difference in equation for the fully crossed and not fully crossed?
- In the not fully crossed, only information about the ratings (r) can be used
- In the fully crossed, information about both the ratings (r) and the raters/coders (c) can be used, and there is an interaction between them, making this a two way design (rc)
Discuss why you would want to know how the ratings should be interpreted.
- Sometimes we are interested in absolute values ie that the raters get the right value correct
- Eg the intensity of smiles
- Sometimes we are interested in the consistency of ratings ie how the ratings change, we want to see if values go up and down in the same way but it does not matter if the exact numbers are different.
Discuss why you want to consider the way people set up the coding.
- We can use the average of all the raters to calculate the ICC in a fully crossed design - we have enough information to use this and it will allow more confidence and a higher ICC as we have more of the relevant data
- When we use a subset of ratings to justify the ratings of a single coder, we have to use single measures, which is a more conservative calculation
Discuss why you should consider if coders selected for the study are considered to be random or fixed effects.
- If coders are selected from a larger population and the ratings are meant to generalise to the population, can use random effects model - Random Model
- If you do not wish to generalise the results to a larger population of coders, or if the coders in the sample are not randomly sampled, use a fixed effects model (subjects considered random but coders considered fixed)
What are the different types of ICC we considered?
Different types of ICC exist depending on the study design and whether raters are considered random or fixed effects (following Shrout and Fleiss).
What do A and C refer to in ICC notation?
Uses nomenclature from McGraw and Wong
C - consistency
A - absolute agreement
What is the code for computing the ICC?
eg ICC(1,1)
icc(dataicc1, model = "oneway", type = "agreement", unit = "single")
What is one issue that can be difficult and common in ICC?
Missing data
How do we deal with missing data for ICC?
A package has become available on CRAN that helps deal with missing data
irrNA - copes with randomly missing data
What format does irrNA expect the data to be in?
Columns - raters
Rows - subjects
May need to transpose the data
How do we transpose the data?
t(data)
What is Krippendorff's alpha?
A modern approach to IRR which can be used for all kinds of data (eg nominal, ordinal, interval, ratio etc).
It is newer and is therefore not as familiar as other methods (as not been adopted as widely).
It is also robust to missing values.
Where is Krippendorff's alpha less flexible than the various types of intraclass correlation?
Interval data
How do we conduct a Krippendorff's alpha?
kripp.alpha()
Pass in the data and the "method" (type of data) if applicable
May need to transpose the data
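A minimal sketch (hypothetical ratings; note the transpose, since kripp.alpha() expects raters in rows):
library(irr)
ratings <- cbind(rater1 = c(1, 2, 3, 3), rater2 = c(1, 2, 3, 4))  # subjects in rows
kripp.alpha(t(ratings), method = "interval")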
What is one of the advantages of Krippendorff's alpha?
The ability to apply the same measurement across different forms of data.
What is the drawback of Krippendorff's alpha?
It does not have the flexibility offered by the varieties of ICC that we can engage in.
When applied to interval data, what is Krippendorff's alpha equivalent to?
ICC(1)
This is a special case of Krippendorff's alpha.
How does the data expected by kripp.alpha() differ to that expected by icc()?
The data format expected by kripp.alpha() is transposed relative to that expected by icc() - raters are in rows rather than columns.
What are regular expressions?
They are a way to describe a set of strings.
They allow us to create patterns that can then be used to search and replace very efficiently.
In regular expressions, what is the difference between [0-9] and [0-9]+?
[0-9] matches a single digit; to match numbers longer than one digit add a plus at the end: [0-9]+ (one or more digits).
[A-Za-z0-9]+ matches runs of uppercase letters, lowercase letters and digits.
What is * in regular expressions?
A quantifier - it matches the preceding element zero or more times.
What are the quantifiers in regular expressions?
*, + and ?
What is + in regular expressions?
Matches something one or more times
Regex: go+d
Matches: god, good, goood (but NOT gd)
What is ? in regular expressions?
Matches something zero or one times
a? → Matches "", "a" (but NOT "aa")
colou?r → Matches "color" and "colour"
What is "\d" in regex?
Matches a digit, in the same way as [0-9]
What is “\w” in regex?
Matches any word character (letters, digits and underscore, like [A-Za-z0-9_]); add a + to match full words.
How do we find words within a string?
str_detect()
In our example, we specified the column the text was in
str_detect(headlines$title, "word1|word2|word3")
str_detect(headlines$title, "[0-9]+") - for headlines with numbers
str_detect(headlines$title, "\"[A-Za-z]+\"") - for headlines with quoted words - the quote marks require the escape character \
How do we find the position of matched words?
str_match to return matched patterns
eg str_match(headlines$title, "word")
How do we find matched lines?
str_subset to return matched lines
eg str_subset(headlines$title, "word")
How do you replace matches with new text?
str_replace
str_replace_all(headlines$title, "Cameron", "Pancake")
How do we import the IMDB dataset?
from datasets import load_dataset
imdb_dataset = load_dataset("imdb")
How should we investigate a dataset?
dataset.shape
dataset.num_columns
dataset.num_rows
dataset.column_names
type(dataset)
What is one of the biggest barriers of natural language processing?
We have to get our text into a shape that can be used by these pre-trained models.
They each expect the text to come in a very precise format with special tokens added to the text that inform the model of the start and end of sentences for example.
eg for BERT, words get turned into tokens with integer IDs, and a few special tokens are used to delimit the boundaries of the sentences.
What are the tokens for BERT?
[CLS] - all sentences start with a special token
[SEP] - all sentences end with
[UNK] - when a word is unknown
[PAD] - fills out the empty space at the end of sentences; all sentences in a batch need to be the same length so they fit into a rectangular matrix
How do we create the autotokeniser to prepare the data?
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
This now contains the tokeniser associated with the “bert-base-cased” model
What kind of design can you use Cohen’s kappa on?
Fully-crossed designs with exactly two coders
What is a difference between Cohen’s Kappa and ICC?
ICCs incorporate the magnitude of the disagreement to compute IRR estimates.
Larger magnitude disagreements result in lower ICCs than smaller-magnitude disagreements.
Which ICCs tend to have higher values?
Average-measure ICCs higher than single-measure ICCs.
What kind of deletion for missing data does ICC use?
List-wise deletion
Therefore it cannot accommodate datasets in fully-crossed designs with large amounts of missing data - Krippendorff's alpha may be more suitable when missing data poses problems in fully crossed designs.
When would you use average-measures?
When you have all subjects rated by all coders.
The researcher is likely interested in the reliability of the mean ratings provided by all coders.
What is the null hypothesis for IRR?
That ICC = 0
What is the code you need if a package is not installed?
install.packages("readr")
What outputs can you run after a model?
The sum of squares of the residuals
print(sum(curveData$residuals^2))
or anova(model)
How do you check the assumption of linearity?
plot(model, 1) - we want the first plot: residuals against the predicted model / fitted values
What function generates diagnostic plots to check if the model assumptions hold?
appraise(model)
How can you plot the model using gratia?
draw(curveGAMModel, rug = FALSE)
How do you extract the basis dimensions from the model using gratia?
model$smooth[[1]]$bs.dim
How should you investigate imported data?
data <- read.csv("file.csv")
- str(data)
- glimpse(data)
- head(data)
What does REML help with?
model stability
How do you rearrange the data to select only columns y and x2, removing all others, sorting based on x2?
stableData1 <- dplyr::select(stableData1, y, x2) %>%
arrange(x2)
How do you check for concurvity?
concurvity(model, full = TRUE)
full = FALSE for pairwise comparison
Want things to be < 0.8
For adding a different smooth type, what should we look at?
?smooth.terms
What help commands can you use for the plots in physio dash?
?dygraphs