S1 Study Flashcards

1
Q

What are the five steps of data science?

A
Ask
Get
Explore
Model
Communicate
2
Q

What type of variable is gender?

A

Categorical

3
Q

What type of variable is height?

A

Numerical

4
Q

What is the width of a bar on a histogram called?

A

Bin width (also called bin size)

5
Q

What are the three V’s of data science?

A

Volume
Velocity
Variety

6
Q

What does EDA stand for?

A

Exploratory Data Analysis

7
Q

What is the goal of EDA?

A

To understand the data better and search for patterns

8
Q

What is data wrangling?

A

Transforming raw data to be usable later in the process

9
Q

In which of these situations are boxplots used?

  • there are lots of outliers?
  • the data is skewed?
  • the data has high dimensionality?
  • we want to see some specific features?
A

When the data is skewed

10
Q

What is GDPR?

A

General Data Protection Regulation

11
Q

According to GDPR, can anyone ask a search engine to remove irrelevant data about them from its search results?

A

Yes

This applies regardless of where the search engine's servers are located, because its services are offered in the EU

12
Q

Name three rights guaranteed by the personal data protection provisions of GDPR

A

The right to…

  • transfer my data to another server
  • delete false data about me
  • challenge an outcome based on my data

There is NO right to have a paper based on my data withdrawn

13
Q

True or False

To scrape data properly, you should provide a user agent string

A

True

14
Q

True or False

To scrape data properly, you should request data at a reasonable rate

A

True
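
Cards 13 and 14 can be sketched in code. This is a minimal illustration using only the Python standard library, not a complete scraper: the user agent string, the `Throttle` helper and the one-second default interval are all assumptions made for the example, and a real project should also check the site's terms and robots.txt.

```python
import time
import urllib.request

# Hypothetical identity string; replace with your own project and contact details.
USER_AGENT = "example-study-scraper/0.1 (contact: you@example.com)"


def build_request(url: str) -> urllib.request.Request:
    """Create a request that identifies the scraper via a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


class Throttle:
    """Enforce a minimum delay (in seconds) between consecutive requests."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def fetch_all(urls, min_interval=1.0):
    """Fetch each URL in turn, pausing between requests."""
    throttle = Throttle(min_interval)
    pages = []
    for url in urls:
        throttle.wait()
        with urllib.request.urlopen(build_request(url)) as response:
            pages.append(response.read())
    return pages
```

Calling `fetch_all([...])` with a list of page URLs would send one identified request at a time, never faster than `min_interval` allows.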

15
Q

True or False

To scrape data properly, you should not use an adblocker

A

False

16
Q

True or False

To scrape data properly, you should launch the project when the site isn’t busy

A

False

17
Q

True or False

To scrape data properly, you should use an API rather than your own code

A

True

18
Q

How do we check if a distribution is skewed and in what direction?

Answer in terms of a physical diagram

A

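Skewed right: the peak ‘tilts’ to the left and the long tail points right; the skew is positive

Skewed left: the peak ‘tilts’ to the right and the long tail points left; the skew is negative

Not skewed: the distribution is symmetrical and the skew is 0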

19
Q

How do we check if a distribution is skewed and in what direction?

Answer in terms of mathematics

A

Mean > median -> positive skew

Mean = median -> symmetric

Mean < median -> negative skew

Think of the skew’s sign as the relative size of the mean (a high mean compared to median means a ‘high’ skew)

20
Q

Is the distribution shown positively skewed, negatively skewed, or not skewed?

2 3 5 6 7 7

(Hint: their sum is 30)

A

The values given are already in order

Median is 5.5 (between 5 and 6)

Mean is 30/6 = 5

Since the mean < median, the distribution is negatively skewed
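
The mean-versus-median check from cards 19 and 20 can be reproduced in a few lines. This is a sketch of that rule of thumb only; the function name `skew_direction` is made up for the example.

```python
from statistics import mean, median


def skew_direction(values):
    """Classify skew by comparing the mean with the median (rule of thumb)."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "symmetric"


data = [2, 3, 5, 6, 7, 7]
print(mean(data), median(data), skew_direction(data))
```

For the card's data the mean is 5 and the median is 5.5, so the function reports a negative skew.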

21
Q

What is the symbol for the sample variance?

A

S_x^2

(the x is a subscript)

22
Q

What is the symbol for the sample standard deviation?

A

S_x

(the x is a subscript)
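
As a worked check of cards 21 and 22, the sketch below computes the sample variance and standard deviation by hand and compares them with Python's `statistics` module. It assumes the usual n - 1 divisor for a sample; the data are reused from card 20 purely for illustration.

```python
from math import isclose, sqrt
from statistics import mean, stdev, variance


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


xs = [2, 3, 5, 6, 7, 7]
s2 = sample_variance(xs)  # squared deviations sum to 22, so 22 / 5
s = sqrt(s2)              # the standard deviation is the square root of the variance

# The hand-rolled values agree with the standard library.
print(isclose(s2, variance(xs)), isclose(s, stdev(xs)))
```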

23
Q

What is the symbol for the correlation coefficient?

A

r (often written r_xy, where x and y are subscripts)

24
Q

True or False

The correlation coefficient is always dimensionless

A

True

25
What is the symbol for the covariance?
S_xy (x and y are subscripts)
26
True or False: The covariance is always dimensionless
False
27
True or False: The correlation coefficient is the same regardless of how we order x and y
True
28
What values does the correlation coefficient range between?
-1 and 1
29
Does the covariance range between -1 and 1?
No - the correlation coefficient does
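
The properties in cards 24 to 29 can be checked numerically. This is a minimal sketch with hypothetical height and weight values: changing the units of x rescales the covariance but leaves the correlation coefficient unchanged, and the correlation is symmetric in x and y.

```python
from math import isclose, sqrt
from statistics import mean


def sample_covariance(xs, ys):
    """Sample covariance S_xy with an n - 1 divisor."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    s_x = sqrt(sample_covariance(xs, xs))
    s_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical data, for illustration only.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52.0, 58.0, 69.0, 75.0, 88.0]
heights_cm = [100 * h for h in heights_m]

# Covariance carries units (m*kg versus cm*kg), so it changes by a factor of 100.
print(sample_covariance(heights_m, weights_kg), sample_covariance(heights_cm, weights_kg))
# Correlation is dimensionless, so the change of units has no effect.
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```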
30
What is the symbol for a standardised variable?
z
31
What is the formula for a standardised variable?
z_i = (x_i - x̄) / S_x, where x_i is a specific data point, x̄ is the mean and S_x is the standard deviation
32
What is the mean of a standardised variable?
0
33
What is the variance of a standardised variable?
1
34
What is the dimension of a standardised variable?
None - z is always dimensionless
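
Cards 31 to 33 can be verified with a short sketch: standardising a sample gives values with mean 0 and variance 1. The sample standard deviation (n - 1 divisor) is assumed here, and the data are illustrative.

```python
from math import isclose
from statistics import mean, stdev, variance


def standardise(xs):
    """Subtract the mean from each value, then divide by the standard deviation."""
    x_bar, s_x = mean(xs), stdev(xs)
    return [(x - x_bar) / s_x for x in xs]


z = standardise([2, 3, 5, 6, 7, 7])

# Up to floating-point rounding, the mean is 0 and the variance is 1.
print(isclose(mean(z), 0.0, abs_tol=1e-12), isclose(variance(z), 1.0))
```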
35
What is regressing y on x?
Predicting a y value from its x value using the regression line
36
What is the gradient of a linear regression line?
r × S_y / S_x (equivalently S_xy / S_x^2); this equals the correlation coefficient r only when x and y are both standardised
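
A small numerical check of the regression gradient may help here. In this sketch (hypothetical data, hand-written helper names) the least-squares gradient is computed as S_xy / S_x^2, which matches r × S_y / S_x and reduces to r itself once both variables are standardised.

```python
from math import isclose, sqrt
from statistics import mean


def covariance(xs, ys):
    """Sample covariance S_xy with an n - 1 divisor."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


def regression_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line for y on x."""
    gradient = covariance(xs, ys) / covariance(xs, xs)
    intercept = mean(ys) - gradient * mean(xs)
    return gradient, intercept


def standardise(xs):
    """Convert values to z-scores."""
    x_bar, s_x = mean(xs), sqrt(covariance(xs, xs))
    return [(x - x_bar) / s_x for x in xs]


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

s_x, s_y = sqrt(covariance(xs, xs)), sqrt(covariance(ys, ys))
r = covariance(xs, ys) / (s_x * s_y)

gradient, intercept = regression_line(xs, ys)
z_gradient, z_intercept = regression_line(standardise(xs), standardise(ys))

print(isclose(gradient, r * s_y / s_x))  # raw variables: gradient is r * S_y / S_x
print(isclose(z_gradient, r))            # standardised variables: gradient is r
```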
37
Which is larger: R^2 or adjusted R^2 for single regression?
R^2 for single regression; adjusted R^2 applies a penalty for the number of predictors, so it never exceeds R^2
38
Which is larger: R^2 or adjusted R^2 for multiple regression?
R^2 for multiple regression; the adjustment penalty grows with the number of predictors
39
Which is larger: R^2 for single regression or R^2 for multiple regression?
R^2 for multiple regression; adding predictors to a model fitted on the same data never decreases R^2
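
The relationship between R^2 and adjusted R^2 can be seen from the usual formula, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), for n observations and p predictors. The sketch below assumes that definition is the one the course uses; the R^2 value and sample size are made up for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (needs n > p + 1)."""
    if n <= p + 1:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical fit: R^2 = 0.80 on 30 observations.
r2, n = 0.80, 30
single = adjusted_r2(r2, n, p=1)    # one predictor
multiple = adjusted_r2(r2, n, p=5)  # five predictors, same raw R^2

# Both adjusted values fall below the raw R^2, and the penalty grows with p.
print(single, multiple)
```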
40
What is the name of an unseen variable that affects the data?
A lurking variable
41
What is PCA?
Principal Component Analysis: dimensionality reduction that keeps as much of the variance (the important information) in the data as possible
42
What is a scree plot?
A plot showing the amount of variance explained by each principal component (PC)
43
What are the two types of scree plot?
Cumulative (running total of variance explained) and normal (only the individual variance explained by each PC)
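
The two kinds of scree plot differ only in what is drawn on the vertical axis. As a sketch, the values for both can be computed from the eigenvalues of the covariance matrix (one per principal component); the eigenvalues below are hypothetical and the plotting itself is left out.

```python
from itertools import accumulate


def explained_variance(eigenvalues):
    """Return (individual, cumulative) proportions of variance per PC."""
    total = sum(eigenvalues)
    # Principal components are ordered from largest to smallest eigenvalue.
    individual = [ev / total for ev in sorted(eigenvalues, reverse=True)]
    cumulative = list(accumulate(individual))
    return individual, cumulative


# Hypothetical eigenvalues, for illustration only.
eigenvalues = [4.0, 2.5, 1.0, 0.5]
individual, cumulative = explained_variance(eigenvalues)

print(individual)  # heights for a normal scree plot
print(cumulative)  # heights for a cumulative scree plot
```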
44
What is clustering?
Partitioning data into groups based on the distances between points
45
Finish the statement: Clustering is a (supervised/unsupervised) process
Unsupervised
46
What is another method similar to clustering, and what is the difference?
Classification. It uses a training set of labelled points to predict which group new points belong to, whereas clustering works on unlabelled points
47
Finish the statement: Classification is a (supervised/unsupervised) process
Supervised
48
What is partitional clustering?
Dividing the set into a fixed number of non-overlapping clusters
49
What is hierarchical clustering?
Levels of clustering: join two points into a cluster, then join two clusters, and so on until we have the desired groups
50
What is top-down clustering?
Split everything into two clusters, then split again, until we get clusters of only single points
51
What is agglomerative clustering?
The reverse of top-down (bottom-up): start with every point as its own cluster, then repeatedly merge the two closest clusters
52
Finish the statement: The cluster centre in K-means is determined by ______.
The mean of the assigned points
53
True or False: K-means always returns the same set of clusters
False
54
True or False: The K-means algorithm may never converge
False
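
Cards 52 to 54 can be illustrated with a minimal one-dimensional sketch of the standard K-means loop (assign each point to its nearest centre, then move each centre to the mean of its points). The data and starting centres are hypothetical; they are chosen so that two different initialisations converge to two different clusterings.

```python
from statistics import mean


def k_means_1d(points, centres, max_iter=100):
    """K-means on 1-D data, starting from the given centres."""
    centres = list(centres)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its assigned points.
        new_centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new_centres == centres:
            break  # nothing moved, so the algorithm has converged
        centres = new_centres
    return centres, clusters


points = [0, 1, 10, 11, 20, 21]

# Two hypothetical initialisations for k = 3.
centres_a, clusters_a = k_means_1d(points, [0, 1, 10])
centres_b, clusters_b = k_means_1d(points, [0, 10, 20])

print(clusters_a)
print(clusters_b)
```

Both runs stop after a couple of iterations, but the first ends with one large cluster and two single-point clusters while the second finds three pairs, which is why the result depends on the starting centres.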
55
What is the highest number of misclassification errors a 1-NN classifier can make on the training set?
0
56
k-NN gives smoother boundaries the (higher/lower) k is
Higher
57
What is the test error?
An estimate of how well the classifier generalises to unseen data
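
Cards 55 and 57 can be tied together with a small k-NN sketch on one-dimensional data. The training and held-out points are hypothetical. With k = 1 every training point is its own nearest neighbour, so the training error is 0 (assuming no two training points share a position with different labels), while the error on held-out points estimates how well the classifier generalises.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Predict a label by majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def error_rate(train, data, k):
    """Fraction of points in `data` that the classifier gets wrong."""
    wrong = sum(knn_predict(train, x, k) != label for x, label in data)
    return wrong / len(data)


# Hypothetical labelled points (position, label), for illustration only.
train = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (6.0, "b"), (7.0, "b"), (8.0, "a")]
held_out = [(1.5, "a"), (5.0, "b"), (7.8, "b")]

print(error_rate(train, train, k=1))     # training error of 1-NN
print(error_rate(train, held_out, k=1))  # test error on unseen points
```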