Weeks 1 to 3 Flashcards

1
Q

What are the three aspects of Data Science?

A

Domain Expertise, Maths, and Computer Science

2
Q

What is the difference between a data engineer and a data scientist?

A

Data Engineer: creates the physical technology and fixes it

Data Scientist: focuses on the data, fixing it in order to build models that solve a problem

3
Q

What is the data science process? List the 6 steps.

A

Define the problem; define the machine learning problem; data preparation; exploratory data analysis; modelling; deployment and evaluation

4
Q

State and explain the first step of the data science process

A

Define the problem: establish clear success criteria

5
Q

State and explain the second step of the data science process

A

Define the machine learning problem: think of concrete tasks for the machine to do

6
Q

State and explain the third step of the data science process

A

Data preparation: raw data is evaluated and may need to be changed (e.g. by scaling the data or removing irrelevant instances) before it is fed into the machine

7
Q

State and explain the fourth step of the data science process

A

Exploratory data analysis: the data is explored with basic analysis methods, such as plotting it, and the variables are refined

8
Q

State and explain the fifth step of the data science process

A

Modelling; trying out the model intended to solve the problem through basic testing and lots of trial and error

9
Q

State and explain the sixth step of the data science process

A

Deployment and evaluation: apply the model and see whether it has to be updated by returning to the process

10
Q

Which steps in the process does 80% of the work go to?

A

Define Problem; Define Machine Learning Problem; Data preparation; Exploratory Data Analysis

11
Q

What are the types of problems in data science?

A

classification; regression; similarity matching; clustering; co-occurrence grouping; profiling; link prediction; data reduction

12
Q

Explain classification

A

predict which class/category each individual in a group belongs to; the target is discrete

13
Q

Explain regression

A

predict a numeric value for each individual in a group, e.g. the price of a house based on the properties of the house; the target can be continuous

14
Q

Explain similarity matching

A

identify similar individuals; often underlies certain solutions for other types of problems

15
Q

Explain clustering

A

group individuals by similarity, without being driven by any specific purpose

16
Q

Explain co-occurrence grouping

A

find associations between entities (things) based on previous transactions, e.g. items that appear together in a shopping basket

17
Q

Explain profiling

A

characterise the behaviour of an individual

18
Q

Explain link prediction

A

predict links between individuals from the existing links, e.g. suggested friends on social media

19
Q

Explain Data reduction

A

a large set of data is replaced by a smaller set, e.g. by reducing the number of variables

20
Q

What is an unsupervised problem?

A

One where a target is NOT specified for the group and the machine is given no information beforehand about what to look for

21
Q

What types of problems are unsupervised generally?

A

clustering; co-occurrence grouping; profiling

22
Q

What is a supervised problem?

A

One where a target is defined and the machine is given examples and information beforehand to look for that target, e.g. which customers are likely to cancel their service

23
Q

What types of problems are supervised generally?

A

classification; regression; causal modelling

24
Q

What can be either supervised or unsupervised generally?

A

similarity matching; link prediction; and data reduction

25
Q

How does the method of k-means clustering group individuals?

A

By distance: points have a small distance to the other points in their own cluster and a larger distance to points in other clusters

26
Q

What types of distance can be measured? Explain them

A

Euclidean distance: use Pythagoras' theorem, i.e. the square root of (x2-x1)^2 + (y2-y1)^2 for points (x1, y1) and (x2, y2)
Manhattan distance: add the absolute differences in each coordinate, i.e. |x2-x1| + |y2-y1|
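
A minimal sketch of both distances for points given as coordinate tuples. The function names `euclidean` and `manhattan` and the two sample points are illustrative assumptions, not part of the course material.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1, 2), (4, 6)   # differences are 3 and 4
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```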

27
Q

What is the method of k-means clustering? (4 steps)

A
  1. Randomly pick k points as the initial centroids
  2. Calculate the distance between each data point and each centroid. Assign each point to its closest centroid.
  3. Take the average position of all points within each cluster (the mean of the x and y values) and move the centroid to that average
  4. Repeat steps 2 and 3 until the centroid assignments stop changing. Check by eye that the centroids make sense, and rerun with different starting points many times.
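
The four steps can be sketched in plain Python as below. This is a rough illustration under assumptions: the function name `kmeans`, the `seed` and `max_iter` parameters, and the six toy points are all made up for the example, and Euclidean distance is used.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Toy k-means on a list of coordinate tuples; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random starting centroids
    assignment = None
    for _ in range(max_iter):
        # step 2: assign every point to its closest centroid
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # step 3: move each centroid to the mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

Which cluster gets label 0 or 1 depends on the random start, so only the grouping itself is meaningful.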
28
Q

What is the cluster distortion?

A

The sum of the squared distances of points to their respective centroid
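
As a small worked example, assuming squared Euclidean distance and a made-up set of four points with two centroids (the name `distortion` is also just illustrative):

```python
def distortion(points, centroids, assignment):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[c]))
        for p, c in zip(points, assignment)
    )

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
centroids = [(1.0, 0.0), (10.0, 2.0)]  # the means of the two clusters
assignment = [0, 0, 1, 1]              # cluster index of each point

total = distortion(points, centroids, assignment)
print(total)  # 1 + 1 + 4 + 4 = 10.0
```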

29
Q

How do you choose the number of k i.e. the number of clusters?

A

Run k-means with several different values of k and plot the cluster distortion against k; choose the k at the elbow, where the distortion stops dropping dramatically (the curve looks like an exponential decay with a bend like an elbow)
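
A rough sketch of the elbow procedure on made-up data with three obvious clusters. The helper `kmeans_distortion`, the nine points, and the choice of 50 random restarts per k are all assumptions for illustration; the distortions are printed rather than plotted to keep the example dependency-free.

```python
import math
import random

def kmeans_distortion(points, k, seed, max_iter=100):
    """Run one toy k-means fit and return its final cluster distortion."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))

points = [
    (1.0, 1.0), (1.5, 1.2), (0.8, 1.4),   # cluster near (1, 1)
    (8.0, 1.0), (8.4, 1.3), (7.7, 0.8),   # cluster near (8, 1)
    (4.5, 8.0), (4.8, 8.4), (4.2, 7.7),   # cluster near (4.5, 8)
]

# For each k, keep the best (lowest) distortion over several random starts.
distortions = {
    k: min(kmeans_distortion(points, k, seed) for seed in range(50))
    for k in range(1, 6)
}
for k, d in distortions.items():
    print(k, round(d, 2))
```

On data like this the distortion falls steeply up to k = 3 and only slightly afterwards, so the elbow sits at k = 3.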

30
Q

What are the limitations of k-means clustering?

A

It cannot give the probability that a data point belongs to a cluster; it will generally fail if the clusters are of unequal size, are non-spherical, or are not well separated, or if there are outliers in the data (especially with Euclidean distance)

31
Q

Is linear regression supervised?

A

Yes

32
Q

What is a deterministic relationship?

A

An exact relationship between two or more quantities

33
Q

What is a statistical relationship?

A

One where there is no precise formula that gives one quantity from another

34
Q

How do we determine the prediction error?

A

By subtracting the predicted value (y-hat) from the actual y value, such that e(i) = y(i) - yhat(i)

35
Q

What is the most popular way to minimise the errors? Why?

A

To minimise the sum of squared prediction errors (least squares); squaring stops positive and negative errors from cancelling each other out
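
A small sketch of a least squares fit on made-up data, using the standard closed-form slope and intercept for one predictor. The variable names and the five data points are assumptions for illustration. It shows that the raw errors sum to (about) zero while the squared errors do not.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form least squares estimates for the line y-hat = b0 + b1 * x
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

# Prediction errors e(i) = y(i) - y-hat(i)
errors = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in errors)

print(b0, b1)       # about 2.2 and 0.6
print(sum(errors))  # about 0: positive and negative errors cancel
print(sse)          # about 2.4: squared errors cannot cancel
```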

36
Q

What is a population regression line?

A

a line that summarises the trend in the population between the predictor x and the mean of y at each value of x

37
Q

What does the sample regression line b_0 + b_1x estimate?

A

the population regression line E(Y)=B_0+B_1x

38
Q

What assumptions are needed to estimate the real population parameters B_0 and B_1 for linear regression?

A

the errors are independent random variables with mean zero and constant variance sigma^2; the relationship is linear and the errors are normally distributed

39
Q

What is sigma^2 (i.e. the variance of the errors around the population regression line) estimated by?

A

the mean squared error from the least squares regression line (the line of best fit that minimises sum of the squared errors) of the sample data
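
A short sketch of the mean squared error for a one-predictor fit, assuming the usual divisor of n - 2 (two parameters, b0 and b1, are estimated from the sample). The data and variable names are made up for illustration.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

# Sum of squared errors around the least squares line
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Assumption: divide by n - 2 because two parameters were estimated
mse = sse / (n - 2)
print(mse)  # about 0.8, the estimate of the error variance
```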

40
Q

What does a large mean squared error mean?

A

the data are widely spread around the regression line, so the predictions are not accurate

41
Q

What is the correlation coefficient r derived from?

A

the sample covariance of x and y divided by the product of their sample standard deviations

42
Q

How do we interpret the correlation coefficient r?

A

If r is close to -1 or 1, there is a strong linear relationship; a negative r means a negative relationship. If r is close to 0, the linear relationship is weak
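
A small sketch of r on made-up data, computed from the sums of squares and cross-products (the n - 1 factors in the covariance and standard deviations cancel). The function name `pearson_r` and the data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(xs, [2.0, 4.0, 6.0, 8.0, 10.0]))  # 1.0, perfect positive line
print(pearson_r(xs, [10.0, 8.0, 6.0, 4.0, 2.0]))  # -1.0, perfect negative line
print(pearson_r(xs, [2.0, 4.0, 5.0, 4.0, 5.0]))   # about 0.77, fairly strong positive
```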

43
Q

What is the coefficient of determination r^2?

A

the proportion/ percentage of the total variability explained by linear model (i.e. how well does predictor x explain the change in y)
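
One way to see this is to compute r^2 as 1 minus the unexplained share of the variation, i.e. 1 - SSE/SSTO, where SSTO is the total sum of squares of y around its mean. The data and names below are made up for illustration.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
ssto = sum((y - y_bar) ** 2 for y in ys)                     # total variation in y

r_squared = 1 - sse / ssto
print(r_squared)  # about 0.6: x explains roughly 60% of the variation in y
```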

44
Q

How do we interpret r^2?

A

Between 0 and 1; if close to 1, then predictor x accounts for a sizeable amount of the variation in y;
if close to 0, predictor x accounts for very little of the variation in y

45
Q

What can r^2=0 mean?

A

The fitted line is horizontal, so there is no linear relationship: either there is no relationship at all, or the relationship is not linear
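
A quick illustration on made-up data: y = x^2 on points symmetric around zero is an exact relationship, yet its linear correlation (and so r^2) is zero. The helper name `pearson_r` is an assumption for the example.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]  # exact curved relationship

r = pearson_r(xs, ys)
print(r ** 2)  # 0.0 even though y is fully determined by x
```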

46
Q

What are the assumptions for multiple linear regression?

A

that the errors are normally distributed with mean 0 and variance sigma^2 for ALL values of the predictor variables

47
Q

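For multiple linear regression, can we fit a non-linear relationship between the response variable and predictor variables?

A

Yes, e.g. by adding transformed predictors such as x^2; the model only has to be linear in its parameters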
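
A minimal sketch of the idea: treat x and x^2 as two predictors and fit an ordinary multiple linear regression. Everything here is an assumption for illustration: the helper names `solve` and `fit_linear_model`, the use of the normal equations, and the synthetic data generated from y = 1 + 2x + 3x^2.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_linear_model(X, y):
    """Least squares coefficients from the normal equations (X^T X) b = X^T y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]  # synthetic, noise-free curve

# Columns: intercept, x, x^2. The model is still linear in b0, b1, b2.
X = [[1.0, x, x ** 2] for x in xs]
b0, b1, b2 = fit_linear_model(X, ys)
print(b0, b1, b2)  # close to 1, 2, 3
```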