S1 Study Flashcards by Eila Tether

What are the five steps of data science?

Ask
Get
Explore
Model
Communicate

What type of variable is gender?

Categorical

What type of variable is height?

Numerical

What is the name of the bar’s width on a histogram?

Bin size

What are the three V’s of data science?

Volume
Velocity
Variety

What does EDA stand for?

Exploratory Data Analysis

What is the goal of EDA?

To understand the data better and search for patterns

What is data wrangling?

Transforming raw data to be usable later in the process

Are boxplots used when

there are lots of outliers?
the data is skewed?
the data has high dimensionality?
we want to see some specific features?

When the data is skewed

What is GDPR?

General Data Protection Regulation

According to GDPR, can anyone ask a search engine to remove irrelevant data from their search results?

Yes

This applies regardless of where the search engine's servers are located, as long as its services are offered in the EU

Name 3 rights guaranteed by the personal protection rights of GDPR

The right to…

transfer my data to another server
delete false data about me
challenge an outcome based on my data

NOT to withdraw a paper based on my data

True or False

To properly data scrape, you should provide a user agent string

True

True or False

To properly data scrape, you should request data at a reasonable rate

True
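
The two scraping rules above (send a user agent string, request at a reasonable rate) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the user agent text, the one-second delay, and the function names `build_request` and `fetch_all` are hypothetical choices for illustration, not part of the course material.

```python
import time
import urllib.request

# Hypothetical identifier: tells the site who is scraping and how to reach them.
USER_AGENT = "s1-study-scraper/0.1 (contact: student@example.com)"
# Assumed polite gap between requests; a real site may ask for more.
DELAY_SECONDS = 1.0


def build_request(url):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_all(urls, delay=DELAY_SECONDS):
    """Download each URL in turn, pausing between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # keep the request rate reasonable
        with urllib.request.urlopen(build_request(url)) as response:
            pages.append(response.read())
    return pages
```

Calling `fetch_all` with a short list of page URLs would download them one at a time with a pause in between. If the site offers an API, that is usually the better route than code like this.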

True or False

To properly data scrape, you should not use an adblocker

False

True or False

To properly data scrape, you should launch the project when the site isn’t busy

False

True or False

To properly data scrape, you should use an API rather than your own code

True

How do we check if a distribution is skewed and in what direction?

Answer in terms of a physical diagram

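Skewed right: the peak sits to the left with a long tail to the right, and the skew is positive

Skewed left: the peak sits to the right with a long tail to the left, and the skew is negative

Not skewed: the distribution is symmetrical and the skew is 0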

How do we check if a distribution is skewed and in what direction?

Answer in terms of mathematics

Mean > median -> positive skew

Mean = median -> symmetric

Mean < median -> negative skew

Think of the skew’s sign as the relative size of the mean (a high mean compared to median means a ‘high’ skew)

Is the distribution shown positive, negative, or not skewed?

2 3 5 6 7 7

(Hint: their sum is 30)

Distribution given was already in order

Median is 5.5 (between 5 and 6)

Mean is 30/6 = 5

Since the mean < median, the distribution is negatively skewed
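
The same check can be run in Python as a quick sanity test of the arithmetic. This is a minimal sketch using the standard `statistics` module. The variable names are illustrative, and the mean-versus-median comparison is the rule of thumb from the card above, not a formal skewness statistic.

```python
from statistics import mean, median

data = [2, 3, 5, 6, 7, 7]

m = mean(data)      # 30 / 6
med = median(data)  # average of the two middle values, 5 and 6

# Rule of thumb from the card: compare the mean with the median.
if m > med:
    direction = "positive"
elif m < med:
    direction = "negative"
else:
    direction = "symmetric"

print(m, med, direction)
```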

What is the variable for Sample Variance?

s_x^2 (x is a subscript)

What is the variable for Standard Deviation?

s_x (x is a subscript)

What is the variable for the correlation coefficient?

True or False

The correlation coefficient is always dimensionless

True

What is the variable for the covariance?

s_xy (x and y are subscripts)

True or False

The covariance is always dimensionless

False

True or False

The correlation coefficient is the same regardless of how we order x and y

True

What values does the correlation coefficient range between?

-1 and 1

Does the covariance range between -1 and 1?

No - the correlation coefficient does
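
The covariance and correlation cards above can be checked numerically. The sketch below assumes the sample (n - 1) divisor and uses made-up height and weight data. The helper names `covariance` and `correlation` are illustrative. Changing the units of x rescales the covariance but leaves the correlation coefficient untouched, and swapping x and y does not change the correlation either.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance s_xy, using the n - 1 divisor (an assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Correlation coefficient: s_xy / (s_x * s_y)."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Made-up data for illustration only.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [50.0, 58.0, 63.0, 72.0, 80.0]
heights_cm = [100 * h for h in heights_m]

cov_m = covariance(heights_m, weights_kg)    # units: metre * kilogram
cov_cm = covariance(heights_cm, weights_kg)  # units: centimetre * kilogram
r_m = correlation(heights_m, weights_kg)     # no units
r_cm = correlation(heights_cm, weights_kg)   # no units, same value as r_m
r_swapped = correlation(weights_kg, heights_m)

print(cov_m, cov_cm, r_m, r_cm, r_swapped)
```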

What is the symbol for a standardised variable?

z

What is the formula for a standardised variable?

z_i = (x_i - mean) / s_x

x_i is a specific data point
s_x is the standard deviation

What is the mean of a standardised variable?

0

What is the variance of a standardised variable?

1

What is the dimension of a standardised variable?

None - z is always dimensionless
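
A short numerical check of the standardisation cards above, as a sketch assuming the sample standard deviation with the n - 1 divisor. The function name `standardise` is illustrative, and the data is the small set from the skewness card. After standardising, the values have mean 0 and sample variance 1, up to floating-point rounding.

```python
from math import sqrt


def standardise(xs):
    """Return z_i = (x_i - mean) / s_x for every x_i."""
    n = len(xs)
    mean = sum(xs) / n
    s_x = sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return [(x - mean) / s_x for x in xs]


z = standardise([2, 3, 5, 6, 7, 7])

z_mean = sum(z) / len(z)
z_var = sum((zi - z_mean) ** 2 for zi in z) / (len(z) - 1)

print(z_mean, z_var)
```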

What is regressing y on x?

Predicting a y value from its x value using the regression line

What is the gradient of a linear regression line?

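The correlation coefficient, when both variables are standardised. For unstandardised variables it is the correlation coefficient multiplied by s_y / s_x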
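
The link between the gradient and the correlation coefficient can be verified on a small example. This is a sketch with made-up data, assuming simple least-squares regression of y on x. The helper `sample_cov` is illustrative. The least-squares gradient s_xy / s_x^2 equals r * s_y / s_x, and it reduces to r itself once both variables are standardised.

```python
from math import sqrt


def sample_cov(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

s_x = sqrt(sample_cov(xs, xs))
s_y = sqrt(sample_cov(ys, ys))
r = sample_cov(xs, ys) / (s_x * s_y)

# Least-squares gradient when regressing y on x.
slope = sample_cov(xs, ys) / sample_cov(xs, xs)

# Standardise both variables and fit again.
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
zx = [(x - mean_x) / s_x for x in xs]
zy = [(y - mean_y) / s_y for y in ys]
slope_standardised = sample_cov(zx, zy) / sample_cov(zx, zx)

print(slope, r * s_y / s_x, slope_standardised, r)
```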

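What is larger?

R^2 for single regression
Adjusted R^2 for single regression

R^2 for single regression (adjusted R^2 applies a penalty for each predictor, so it is never larger than R^2)

What is larger?

R^2 for multiple regression
Adjusted R^2 for multiple regression

R^2 for multiple regression (for the same reason)

What is larger?

R^2 for single regression
R^2 for multiple regression

R^2 for multiple regression, when the multiple model contains the single model's predictor (adding predictors to a least-squares fit never lowers R^2)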
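
The R^2 comparisons can be tested directly. The sketch below fits least-squares models through the normal equations in plain Python and uses the standard definition adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), with p predictors. The data and the helper names `solve` and `fit_r2` are made up for illustration. On any such fit, adjusted R^2 does not exceed R^2, and adding a predictor does not lower R^2.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_r2(predictors, y):
    """Least-squares fit with an intercept; returns (R^2, adjusted R^2)."""
    n, p = len(y), len(predictors)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    size = p + 1
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(size)]
           for i in range(size)]
    xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(size)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj_r2


# Made-up data for illustration only.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
y = [3.1, 4.2, 6.8, 7.9, 11.2, 11.8, 15.1, 16.3]

r2_single, adj_single = fit_r2([x1], y)
r2_multi, adj_multi = fit_r2([x1, x2], y)

print(r2_single, adj_single, r2_multi, adj_multi)
```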

What is the name of an unseen variable that affects the data?

A lurking variable

What is PCA?

Dimensionality reduction whilst keeping as much important information as possible

What is a scree plot?

Shows the amount of variance explained by each PC

What are the two types of scree plot?

Cumulative (running total of the variance explained)
Normal (variance explained by each PC individually)

What is clustering?

Partitioning data into groups based on the distances between points

Finish the statement

Clustering is a (supervised/unsupervised) process

Unsupervised

What is another method similar to clustering? What is the difference?

Classification

This uses a labelled training set to predict which group each new point belongs to. Clustering, by contrast, works on unlabelled points

Finish the statement

Classification is a (supervised/unsupervised) process

Supervised

What is partitional clustering?

Dividing the set into a fixed number of non-overlapping clusters

What is hierarchical clustering?

Levels of clustering - joins the two closest points into a cluster, then joins clusters together, and so on until we have the desired groups

What is top-down clustering?

Split everything into two clusters, then split again, until we get clusters of only single points

What is agglomerative clustering?

The reverse of top-down: start with every point as its own cluster, then repeatedly merge the closest clusters

Finish the statement

The cluster centre in K-means is determined by ______.

The mean of the assigned points

True or False

K-means always returns the same set of clusters

False

True or False

The K-means algorithm may never converge

False
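
The K-means cards above can be illustrated with a small sketch of the standard assign-then-update loop. It assumes random initial centres drawn from the data and squared Euclidean distance. The function name `kmeans`, the `seed` parameter, and the points are made up for illustration. Each centre is recomputed as the mean of its assigned points, the loop stops once assignments stop changing, and a different `seed` can lead to a different result on less cleanly separated data.

```python
import random


def kmeans(points, k, seed, max_iter=100):
    """Plain K-means: returns (centres, assignment) for a list of tuples."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # random start, so results can vary
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the algorithm has converged
        assignment = new_assignment
        # Update step: each centre becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return centres, assignment


# Made-up, well-separated data for illustration only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, assignment = kmeans(points, k=2, seed=0)

print(centres, assignment)
```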

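What is the highest number of misclassification errors a 1-NN classifier can make on the training set?

0, provided no two identical training points carry different labels (each training point is its own nearest neighbour)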

k-NN gives smoother boundaries the (higher/lower) k is

Higher

What is the test error?

An estimate of how well the classifier generalises to unseen data
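
The k-NN cards above can be tried out on a toy example. This is a sketch with made-up, distinct training points and a hypothetical held-out test set. The function name `knn_predict` is illustrative. With k = 1 every training point is its own nearest neighbour, so the training error is zero. With k = 3 the lone "b" point at (1, 1) is outvoted by its neighbours, which is the smoothing effect of a higher k. The test error is measured on points that were not used for training.

```python
def knn_predict(train, query, k):
    """Majority label among the k training points closest to query."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)


# Made-up training data: (point, label). The "b" at (1, 1) sits among "a" points.
train = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "b"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]
# Hypothetical held-out points, used only to estimate the test error.
test = [((0.2, 0.3), "a"), ((5.5, 5.5), "b"), ((0.9, 0.9), "a")]

train_errors_1nn = sum(knn_predict(train, x, 1) != y for x, y in train)
train_errors_3nn = sum(knn_predict(train, x, 3) != y for x, y in train)
test_error_1nn = sum(knn_predict(train, x, 1) != y for x, y in test) / len(test)

print(train_errors_1nn, train_errors_3nn, test_error_1nn)
```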