S1 Study Flashcards

1
Q

What are the five steps of data science?

A
Ask
Get
Explore
Model
Communicate
2
Q

What type of variable is gender?

A

Categorical

3
Q

What type of variable is height?

A

Numerical

4
Q

What is the name of the bar’s width on a histogram?

A

Bin size

5
Q

What are the three V’s of data science?

A

Volume
Velocity
Variety

6
Q

What does EDA stand for?

A

Exploratory Data Analysis

7
Q

What is the goal of EDA?

A

To understand the data better and search for patterns

8
Q

What is data wrangling?

A

Transforming raw data to be usable later in the process

9
Q

Are boxplots used when

  • there are lots of outliers?
  • the data is skewed?
  • the data has high dimensionality?
  • we want to see some specific features?
A

When the data is skewed

10
Q

What is GDPR?

A

General Data Protection Regulation

11
Q

According to GDPR, can anyone ask a search engine to remove irrelevant data from their search results?

A

Yes

This applies regardless of where the search engine’s servers are located, since its services are offered in the EU

12
Q

Name 3 rights guaranteed by the personal protection rights of GDPR

A

The right to…

  • transfer my data to another server
  • delete false data about me
  • challenge an outcome based on my data

NOT the right to withdraw a paper based on my data

13
Q

True or False

To properly data scrape, you should provide a user agent string

A

True

14
Q

True or False

To properly data scrape, you should request data at a reasonable rate

A

True
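
Example: a minimal sketch of the two scraping habits above (send a user agent string, request at a reasonable rate), using Python's standard urllib. The user agent text, the one-second delay, and the function names are assumptions made up for illustration, not values from the course.

```python
import time
import urllib.request

# Hypothetical values, chosen only for illustration.
USER_AGENT = "s1-study-scraper/0.1 (contact: student@example.com)"
DELAY_SECONDS = 1.0


def build_request(url):
    """Create a request that identifies the scraper via a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_all(urls, delay=DELAY_SECONDS):
    """Fetch each URL in turn, pausing between requests to limit the load."""
    pages = []
    for url in urls:
        with urllib.request.urlopen(build_request(url)) as response:
            pages.append(response.read())
        time.sleep(delay)  # wait before the next request
    return pages
```

A fixed pause is the simplest way to keep the request rate reasonable; a site's own guidance, if it publishes any, should take priority over the delay assumed here.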

15
Q

True or False

To properly data scrape, you should not use an adblocker

A

False

16
Q

True or False

To properly data scrape, you should launch the project when the site isn’t busy

A

False

17
Q

True or False

To properly data scrape, you should use an API rather than your own code

A

True

18
Q

How do we check if a distribution is skewed and in what direction?

Answer in terms of a physical diagram

A

Skewed right: the long tail is on the right, so the peak ‘tilts’ to the left; the skew is positive

Skewed left: the long tail is on the left, so the peak ‘tilts’ to the right; the skew is negative

Not skewed is symmetrical and is 0

19
Q

How do we check if a distribution is skewed and in what direction?

Answer in terms of mathematics

A

Mean > median -> positive skew

Mean = median -> symmetric

Mean < median -> negative skew

Think of the skew’s sign as the relative size of the mean (a high mean compared to median means a ‘high’ skew)

20
Q

Is the distribution shown positive, negative, or not skewed?

2 3 5 6 7 7

(Hint: their sum is 30)

A

Distribution given was already in order

Median is 5.5 (between 5 and 6)

Mean is 30/6 = 5

Since the mean < median, the distribution is negatively skewed
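
Example: the working above can be checked with a short sketch that uses Python's statistics module. The function name skew_direction is made up for illustration, and the mean/median comparison is only the rule of thumb from these cards, not a full skewness calculation.

```python
from statistics import mean, median


def skew_direction(values):
    """Classify skew by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "symmetric"


data = [2, 3, 5, 6, 7, 7]
print(mean(data), median(data), skew_direction(data))
```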

21
Q

What is the variable for Sample Variance?

A

Sx ^ 2

x is a subscript

22
Q

What is the variable for Standard Deviation?

A

Sx

x is a subscript

23
Q

What is the variable for the correlation coefficient?

A

r

24
Q

True or False

The correlation coefficient is always dimensionless

A

True

25
Q

What is the variable for the covariance?

A

Sxy

x and y are subscripts

26
Q

True or False

The covariance is always dimensionless

A

False

27
Q

True or False

The correlation coefficient is the same regardless of how we order x and y

A

True

28
Q

What values does the correlation coefficient range between?

A

-1 and 1

29
Q

Does the covariance range between -1 and 1?

A

No - the correlation coefficient does
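
Example: a minimal sketch comparing the covariance and the correlation coefficient. It assumes the sample (n - 1) denominator, and the height and weight numbers are made up. Changing the units of x rescales Sxy but leaves r unchanged, and swapping x and y changes neither.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance Sxy, assuming the n - 1 denominator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Correlation coefficient r = Sxy / (Sx * Sy)."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


heights_m = [1.50, 1.60, 1.70, 1.80]      # made-up data
weights_kg = [50.0, 58.0, 69.0, 80.0]     # made-up data
heights_cm = [100 * h for h in heights_m]  # same heights, different units

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```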

30
Q

What is the symbol for a standardised variable?

A

z

31
Q

What is the formula for a standardised variable?

A

Zi = (Xi - mean) / Sx

Xi is a specific point
Sx is the standard deviation

32
Q

What is the mean of a standardised variable?

A

0

33
Q

What is the variance of a standardised variable?

A

1
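
Example: a minimal sketch of standardising a variable with Zi = (Xi - mean) / Sx, then checking that the result has mean 0 and variance 1. It assumes Sx is the sample standard deviation (statistics.stdev), and the data values are made up.

```python
from statistics import mean, stdev, variance


def standardise(values):
    """Return Zi = (Xi - mean) / Sx for every Xi."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up data
z = standardise(heights_cm)
print(mean(z), variance(z))
```

Floating-point rounding means the printed values may differ very slightly from exactly 0 and 1.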

34
Q

What is the dimension of a standardised variable?

A

None - z is always dimensionless

35
Q

What is regressing y on x?

A

Predicting a y value from its x value using the regression line

36
Q

What is the gradient of a linear regression line?

A

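The correlation coefficient multiplied by Sy / Sx (r × Sy / Sx)

This equals the correlation coefficient itself when x and y are both standardised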
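
Example: for a least-squares line of y on x, the gradient can be written as Sxy / Sx^2, which is the same as r × Sy / Sx, so it reduces to r when both variables are standardised. A minimal sketch with made-up data, assuming the n - 1 denominator; the function name is hypothetical.

```python
from math import sqrt


def regression_summary(xs, ys):
    """Return (gradient, r, Sx, Sy) for the least-squares line of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    gradient = sxy / sx ** 2
    r = sxy / (sx * sy)
    return gradient, r, sx, sy


xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up data
gradient, r, sx, sy = regression_summary(xs, ys)
print(gradient, r * sy / sx)
```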

37
Q

What is larger?

R^2 for single regression

Adjusted R^2 for single regression

A

R^2 for single regression

Adjusted R^2 applies a penalty for the number of predictors, so it is never larger than R^2

38
Q

What is larger?

R^2 for multiple regression

Adjusted R^2 for multiple regression

A

R^2 for multiple regression

Adjusted R^2 applies a penalty for the number of predictors, so it is never larger than R^2
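
Example: a minimal sketch of the usual adjusted R^2 formula, 1 - (1 - R^2)(n - 1) / (n - p - 1), for n observations and p predictors. The R^2 value and sample size below are made up; the point is that the adjustment only ever pulls the value down, and more so as p grows.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (needs n > p + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


r2 = 0.80  # made-up R^2
n = 30     # made-up number of observations

print(adjusted_r2(r2, n, p=1))  # single regression
print(adjusted_r2(r2, n, p=4))  # multiple regression with 4 predictors
```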

39
Q

What is larger?

R^2 for single regression

R^2 for multiple regression

A

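R^2 for multiple regression

Adding predictors to a model never lowers R^2, so a multiple regression that includes the single regression’s predictor has an R^2 at least as large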

40
Q

What is the name of an unseen variable that affects the data?

A

A lurking variable

41
Q

What is PCA?

A

Dimensionality reduction whilst keeping as much important information as possible

42
Q

What is a scree plot?

A

Shows the amount of variance explained by each PC

43
Q

What are the two types of scree plot?

A

Cumulative (added variance explained)

Normal (only individual variance explained for each PC)
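
Example: a minimal sketch of the numbers behind both kinds of scree plot. It assumes the variance of each PC is already known (the eigenvalues below are made up) and only computes the proportions that would be plotted.

```python
from itertools import accumulate


def explained_variance(eigenvalues):
    """Return (individual, cumulative) proportions of variance per PC."""
    total = sum(eigenvalues)
    individual = [ev / total for ev in eigenvalues]
    cumulative = list(accumulate(individual))
    return individual, cumulative


eigenvalues = [4.0, 2.0, 1.0, 1.0]  # made-up variances for PC1..PC4
individual, cumulative = explained_variance(eigenvalues)
print(individual)   # heights for the normal scree plot
print(cumulative)   # heights for the cumulative scree plot
```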

44
Q

What is clustering?

A

Partitioning data into groups based on the distances between points

45
Q

Finish the statement

Clustering is a (supervised/unsupervised) process

A

Unsupervised

46
Q

What is another method similar to clustering? What is the difference?

A

Classification

This uses a labelled training set to predict which group each point belongs to.
Clustering, by contrast, has no labelled points

47
Q

Finish the statement

Classification is a (supervised/unsupervised) process

A

Supervised

48
Q

What is partitional clustering?

A

Dividing the set into a fixed number of non-overlapping clusters

49
Q

What is hierarchical clustering?

A

Clustering in levels: join the two closest points into a cluster, then keep joining the closest points or clusters, and so on until we have the desired groups

50
Q

What is top-down clustering?

A

Start with everything in one cluster, split it into two clusters, then split again, until we get clusters of only single points

51
Q

What is agglomerative clustering?

A

The same as top-down but reversed: start with single points and keep merging clusters

52
Q

Finish the statement

The cluster centre in K-means is determined by ______.

A

The mean of the assigned points

53
Q

True or False

K-means always returns the same set of clusters

A

False

54
Q

True or False

The K-means algorithm may never converge

A

False
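
Example: a minimal sketch of the K-means loop described in the cards above: assign each point to its nearest centre, then move each centre to the mean of its assigned points, and stop when the centres no longer change. The points, the starting centres, and the choice to keep an empty cluster's old centre are assumptions for illustration. Because the starting centres are an input, different starts can give different clusters.

```python
from math import dist


def kmeans(points, centres, max_iter=100):
    """Run K-means from the given starting centres; return (centres, clusters)."""
    centres = list(centres)
    clusters = [[] for _ in centres]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: each centre becomes the mean of its assigned points
        # (an empty cluster keeps its old centre, an assumption of this sketch).
        new_centres = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centre
            for cluster, centre in zip(clusters, centres)
        ]
        if new_centres == centres:  # nothing moved, so the algorithm has converged
            break
        centres = new_centres
    return centres, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]  # made-up data
centres, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centres)
```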

55
Q

What is the highest number of misclassification errors a 1-NN classifier can make on the training set?

A

0
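
Example: a minimal sketch of why the answer is 0. Each training point is its own nearest neighbour (distance 0), so 1-NN returns that point's own label. This assumes no two training points share the same position with different labels; the data and function name are made up.

```python
from math import dist


def predict_1nn(train, query):
    """Return the label of the training point closest to the query."""
    _, label = min(train, key=lambda item: dist(item[0], query))
    return label


train = [((0.0, 0.0), "a"), ((1.0, 0.0), "a"),
         ((5.0, 5.0), "b"), ((6.0, 5.0), "b")]  # made-up labelled points

training_errors = sum(predict_1nn(train, x) != y for x, y in train)
print(training_errors)
```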

56
Q

k-NN gives smoother boundaries the (higher/lower) k is

A

Higher

57
Q

What is the test error?

A

An estimate of how well the classifier generalises to unseen data