W3: Data Visualization Flashcards
What are the 2 methods of scoring scales?
- Added together for total sum score
- Average of all items
What are the two ways of averaging item scales using rowMeans()?
- Normal, just average
- Multiply by the number of items after averaging
What is .SD?
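A minimal base R sketch of the two scoring methods and the two rowMeans() variants. The item names and responses are made up for illustration.

```r
# Hypothetical 4-item scale answered by three people (made-up data)
items <- data.frame(
  PSS1 = c(1, 2, 4),
  PSS2 = c(2, 2, 3),
  PSS3 = c(3, 1, 4),
  PSS4 = c(2, 3, 1)
)

sum_score  <- rowSums(items)                 # method 1: total sum score
mean_score <- rowMeans(items)                # method 2: average of all items
rescaled   <- rowMeans(items) * ncol(items)  # average put back on the sum metric

print(data.frame(sum_score, mean_score, rescaled))
```

With complete data the rescaled average equals the sum score; the two only differ once items are missing.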
- Refers to Subset (S) of Data (D)
- On its own returns all data you’re working on
E.g. unicorn[, .SD]
- .SDcols tells data.table which columns you want
E.g. db[, StressAVG := rowMeans(.SD, na.rm = TRUE), .SDcols = c("PSS1", "PSS2", "PSS3", "PSS4")]
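A small runnable sketch of .SD and .SDcols, assuming the data.table package is installed. The table contents are made up; only the column names follow the card.

```r
library(data.table)

# Made-up data; db and the PSS columns mirror the names used on the card
db <- data.table(
  ID   = 1:3,
  PSS1 = c(1, 2, NA),
  PSS2 = c(2, 2, 3),
  PSS3 = c(3, 1, 4),
  PSS4 = c(2, 3, 1)
)

# .SD on its own is the whole table being worked on
all_cols <- db[, .SD]

# .SDcols restricts .SD to the listed columns before rowMeans() sees it
db[, StressAVG := rowMeans(.SD, na.rm = TRUE),
   .SDcols = c("PSS1", "PSS2", "PSS3", "PSS4")]

print(db)
```

Without .SDcols, the ID column would be averaged in with the items.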
What function is used to calculate reliability of a scale?
psych::alpha()
* refers to Cronbach’s alpha
What should you add to psych::alpha when using reverse scored scales?
- check.keys = TRUE
E.g. psych::alpha(
as.data.frame(db[, .(PSS1, PSS2r, PSS3r, PSS4)]),
check.keys = TRUE)
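To show why reverse-scored items matter for reliability, here is a hand-rolled sketch of raw Cronbach's alpha in base R. It is not psych::alpha() itself, only the textbook formula; the helper name cronbach_alpha, the 0-4 response scale and the data are all assumptions made for illustration.

```r
# Raw Cronbach's alpha from the usual formula:
# alpha = k / (k - 1) * (1 - sum(item variances) / variance of total score)
cronbach_alpha <- function(items) {
  items <- na.omit(as.data.frame(items))  # complete cases only, for simplicity
  k <- ncol(items)
  item_vars <- sapply(items, var)
  total_var <- var(rowSums(items))
  k / (k - 1) * (1 - sum(item_vars) / total_var)
}

# Made-up responses on an assumed 0-4 scale; PSS2 and PSS3 are worded
# in the opposite direction to PSS1 and PSS4
raw <- data.frame(
  PSS1 = c(0, 1, 3, 4, 2),
  PSS2 = c(4, 3, 1, 0, 2),
  PSS3 = c(4, 3, 0, 1, 2),
  PSS4 = c(1, 0, 4, 3, 2)
)

# Reverse scoring: (min + max) - x, here 0 + 4 - x
scored <- transform(raw, PSS2r = 4 - PSS2, PSS3r = 4 - PSS3)

alpha_raw      <- cronbach_alpha(raw[, c("PSS1", "PSS2", "PSS3", "PSS4")])
alpha_reversed <- cronbach_alpha(scored[, c("PSS1", "PSS2r", "PSS3r", "PSS4")])
print(c(unreversed = alpha_raw, reversed = alpha_reversed))
```

Mixing item directions drags alpha down (here it goes negative), which is the situation a key check is meant to catch.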
What do aesthetics do and what are 4 examples of them in ggplot2?
- Control how geoms (geometric objects) are displayed
- Size, shape, colour, transparency level
What are 4 common geoms used for univariate graphs?
geom_histogram(), geom_density(), geom_dotplot(), geom_qq()
What argument does geom_qq() need?
The sample aesthetic, e.g. aes(sample = scale(predictor)), where scale() z-scores the data
* z = (x - mean) / SD
* geom_abline(intercept = 0, slope = 1): the line on which all points would fall if the data were normally distributed
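A short sketch of the z-scoring and QQ plot steps, assuming ggplot2 is installed. The data frame d and its simulated Stress values are made up.

```r
library(ggplot2)

set.seed(1)
d <- data.frame(Stress = rnorm(200, mean = 10, sd = 3))  # simulated scores

# scale() returns a one-column matrix, so flatten it to a plain vector
d$z <- as.numeric(scale(d$Stress))

# Same result by hand: z = (x - mean) / SD
z_manual <- (d$Stress - mean(d$Stress)) / sd(d$Stress)

# geom_qq() reads the data from the `sample` aesthetic
p <- ggplot(d, aes(sample = z)) +
  geom_qq() +
  geom_abline(intercept = 0, slope = 1)
print(p)
```

Because both axes are in z-score units, the reference line with intercept 0 and slope 1 is the right comparison.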
What function is used to check distribution?
plot( testDistribution() )
E.g. plot(testDistribution(db$Stress,
extremevalues = "theoretical", ev.perc = .005))
What 3 plots are shown from using plot( testDistribution() )?
Density plot, rug plot, deviates plot
What do you need to do when mapping additional categorical variables onto graphs?
Convert variable into a factor
* db[, sex := factor(sex, levels = c(1, 2),
labels = c("male", "female"))]
* ggplot(db[!is.na(sex)], aes(Stress, colour = sex)) + geom_density()
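A base R sketch of the factor conversion on its own, using a made-up vector of numeric codes rather than a data.table column.

```r
# Made-up numeric codes: 1 = male, 2 = female, NA = not reported
sex_raw <- c(1, 2, 2, NA, 1)

# levels lists the stored codes; labels gives the names shown in legends
sex <- factor(sex_raw, levels = c(1, 2), labels = c("male", "female"))
print(table(sex, useNA = "ifany"))

# Rows to keep when plotting, mirroring db[!is.na(sex)]
keep <- !is.na(sex)
```

Left as a number, sex would be treated as continuous and could not be mapped to discrete colours.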
When using geom_histogram, it is more helpful to control what?
Fill colour
E.g. ggplot(db[!is.na(sex)], aes(Stress, fill = sex)) +
geom_histogram()
What is the argument to have bars side by side when using geom_histogram?
geom_histogram(position = "dodge")
* bars are stacked by default
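A sketch comparing the default stacked histogram with the dodged one, assuming ggplot2 is installed. The data are simulated and bins = 20 is an arbitrary choice.

```r
library(ggplot2)

set.seed(2)
d <- data.frame(
  Stress = c(rnorm(100, mean = 10, sd = 2), rnorm(100, mean = 12, sd = 2)),
  sex    = factor(rep(c("male", "female"), each = 100))
)

# Default: bars for the two groups are stacked on top of each other
p_stack <- ggplot(d, aes(Stress, fill = sex)) +
  geom_histogram(bins = 20)

# position = "dodge": bars for the two groups sit side by side
p_dodge <- ggplot(d, aes(Stress, fill = sex)) +
  geom_histogram(bins = 20, position = "dodge")

print(p_stack)
print(p_dodge)
```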
What are 3 common geoms used for bivariate graphs?
geom_point() for a scatter plot, geom_line(), and geom_bar(stat = "identity") so that the values are the actual bar heights
What is best practice for data visualization?
More data, less ink
What are 4 ways to reduce ink and provide more data in graphs?
- Remove background borders - theme_pubr()
- Remove axis lines - theme(axis.line = element_blank() )
- Replace geom_bar with geom_point
- Using shapes for values - scale_shape_manual(
name = "Sex",
values = c("male" = 1, "female" = 3))
How do you limit the axes to the range of the observed data?
Using geom_rangeframe() (from the ggthemes package)
What are 2 ways to add [interquartile] break points to axis labels?
- Using quantile()
* scale_x_continuous(breaks = as.numeric(quantile(db$Stress)))
* scale_y_continuous(breaks = as.numeric(quantile(db$SE)))
- Using scale_x/y_discrete
scale_y_continuous(labels = percent) +
scale_x_discrete(
breaks = c("High SE", "Low SE"),
labels = c("High SE (median)", "Low SE (median)"))
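A sketch of the quantile() approach, assuming ggplot2 is installed. The vectors are made up; quantile() with default settings returns the minimum, the three quartiles and the maximum, which then become the axis breaks.

```r
library(ggplot2)

# Made-up scores
d <- data.frame(
  Stress = c(2, 4, 4, 5, 7, 9, 10, 12, 15),
  SE     = c(9, 8, 8, 7, 6, 5, 4, 3, 1)
)

# as.numeric() drops the "0%", "25%", ... names that quantile() attaches
x_breaks <- as.numeric(quantile(d$Stress))
print(x_breaks)  # 2 4 7 10 15

p <- ggplot(d, aes(Stress, SE)) +
  geom_point() +
  scale_x_continuous(breaks = x_breaks) +
  scale_y_continuous(breaks = as.numeric(quantile(d$SE)))
print(p)
```

The tick marks themselves then carry the five-number summary, so no extra ink is needed to show it.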
What are the functions used for boxplot with raw data shown?
geom_boxplot() + geom_jitter()
What are the 2 ways to provide mean and/or 95% CI on graphs?
- stat_summary(fun.data = mean_cl_normal)
- Using prop.test()
LL = prop.test(
x = sum(sex == "female", na.rm = TRUE),
n = sum(!is.na(sex)), correct = FALSE)$conf.int[1],
UL = prop.test(
x = sum(sex == "female", na.rm = TRUE),
n = sum(!is.na(sex)), correct = FALSE)$conf.int[2])
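A standalone base R sketch of the prop.test() approach. The sex vector is made up; LL and UL follow the names on the card.

```r
# Made-up responses with one missing value
sex <- factor(
  c("female", "male", "female", NA, "female", "male", "female", "male"),
  levels = c("male", "female")
)

x <- sum(sex == "female", na.rm = TRUE)  # number of females
n <- sum(!is.na(sex))                    # number of non-missing responses

# correct = FALSE turns off the continuity correction
ci <- prop.test(x = x, n = n, correct = FALSE)$conf.int
LL <- ci[1]  # lower limit of the 95% CI
UL <- ci[2]  # upper limit of the 95% CI

print(c(proportion = x / n, LL = LL, UL = UL))
```

Calling prop.test() once and indexing the interval twice avoids repeating the same test for each limit.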
What should you do before graphing all categorical variables?
Make one variable “continuous” by getting its percentages using egltable()
What function provides a multi-panel plot, which is useful for graphing all categorical variables?
facet_grid(), optionally combined with coord_flip() to flip the axes
What is the common graph/geom for all continuous variables?
geom_point(), i.e. a scatter plot
What are 4 things you can add to a graph with all continuous variables?
- Correlation coefficient and p-value using
cor.test(~ SE + Stress, data = db)
- Regression line using
stat_smooth(method = "lm")
- Text annotation using
annotate("text", x = max(db$Stress), y = max(db$SE),
label = "r = -0.65, p < .001",
size = 6, hjust = 1, vjust = 1)
- Histograms in the margins using
ggMarginal(x, type = "histogram")
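A sketch that combines the first three additions, assuming ggplot2 is installed. The data are simulated, and the label is built from the cor.test() result rather than typed by hand; the marginal histograms are left out because they need the separate ggExtra package.

```r
library(ggplot2)

set.seed(3)
d <- data.frame(Stress = rnorm(100))
d$SE <- -0.6 * d$Stress + rnorm(100, sd = 0.8)  # simulated negative relation

# Correlation coefficient and p-value
ct <- cor.test(~ SE + Stress, data = d)
p_txt <- if (ct$p.value < .001) "p < .001" else sprintf("p = %.3f", ct$p.value)
lab <- sprintf("r = %.2f, %s", ct$estimate, p_txt)

p <- ggplot(d, aes(Stress, SE)) +
  geom_point() +
  stat_smooth(method = "lm") +              # regression line
  annotate("text", x = max(d$Stress), y = max(d$SE),
           label = lab, size = 6, hjust = 1, vjust = 1)  # top-right corner
print(p)
```

Building the label from ct keeps the annotation in step with the data if the data change.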
How do you make more space for long axis labels?
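Either rotate the axis text or rotate the graph; the two versions can be compared side by side with ggarrange(), titled ggtitle("rotate text") and ggtitle("rotate graph")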
What are 3 ways to improve geom_dotplot visualizations?
- binwidth = .1 to shrink dot size
- alpha = .2 for dot transparency
- y = jitter to add noise to the scores
What are 2 scenarios you would use geom_violin?
- For large datasets
- To compare distributions across variables
What is a benefit of using rowMeans instead of simply adding all variable scores together?
With na.rm = TRUE, it averages the items a person did answer, which in effect imputes that person's own mean for their missing items
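A base R sketch of the difference, using made-up responses where the second person skipped one item.

```r
# Made-up data: person 2 did not answer PSS2
items <- data.frame(
  PSS1 = c(2, 3),
  PSS2 = c(2, NA),
  PSS3 = c(2, 3),
  PSS4 = c(2, 3)
)

plain_sum   <- rowSums(items)                # NA for person 2
zero_filled <- rowSums(items, na.rm = TRUE)  # treats the missing item as 0
mean_based  <- rowMeans(items, na.rm = TRUE) * ncol(items)  # person-mean imputed

print(data.frame(plain_sum, zero_filled, mean_based))
```

Person 2 answered 3 on every completed item, so the mean-based score of 12 is what a fourth answer of 3 would have produced, whereas the zero-filled sum of 9 understates it.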