Weeks 1 to 3 Flashcards by Myka Lumibao

What are the three aspects of data science?

Domain Expertise, Maths, and Computer Science

What is the difference between a data engineer and a data scientist?

Data engineer: builds and maintains the technology (infrastructure) that the data runs on

Data scientist: focuses on the data itself, cleaning and fixing it in order to build models that solve a problem

What is the data science process, i.e. list the 6 steps?

Define problem; define machine learning problem; data preparation; exploratory data analysis; modelling; deployment and evaluation

State and explain the first step of the data science process

Define problem; where clear success criteria are established

State and explain the second step of the data science process

Define machine learning problem; think of concrete tasks for the machine to do

State and explain the third step of the data science process

Data preparation; where raw data is evaluated and may need to be changed (e.g. by scaling the data or removing irrelevant instances) before it is fed into the machine

State and explain the fourth step of the data science process

Exploratory data analysis; where data is explored using basic analysis methods by plotting the data and refining the variables

State and explain the fifth step of the data science process

Modelling; trying out the model intended to solve the problem through basic testing and lots of trial and error

State and explain the sixth step of the data science process

Deployment and evaluation; apply the model and check whether it has to be updated by returning to earlier steps of the process

Which steps in the process does 80% of the work go to?

Define Problem; Define Machine Learning Problem; Data preparation; Exploratory Data Analysis

What are the types of problems in data science?

classification; regression; similarity matching; clustering; co-occurrence grouping; profiling; link prediction; data reduction

Explain classification

predict which class/category each individual in a group belongs to; the target is discrete

Explain regression

predict the value of a numeric variable for each individual in a group, e.g. the price of a house based on the properties of the house; the target can be continuous

Explain similarity matching

identify similar individuals; often underlies certain solutions for other types of problems

Explain clustering

group individuals by similarity, without being driven by any specific purpose

Explain co-occurrence grouping

find associations between entities (things) based on previous transactions; shopping basket context

Explain profiling

characterise the typical behaviour of an individual

Explain link prediction

predict links between individuals from their previous links; social media context, e.g. suggested friends

Explain data reduction

where a large set of data is replaced by a smaller set; the number of variables is reduced

What is an unsupervised problem?

Where a target is NOT specified for the group and the machine is given no information beforehand about what to look for

What types of problems are unsupervised generally?

clustering; co-occurrence grouping; profiling

What is a supervised problem?

Where a target is defined and the machine is given examples and information beforehand to look for that target, e.g. which customers are likely to cancel their service

What types of problems are supervised generally?

classification; regression; causal modelling

What can be either supervised or unsupervised generally?

similarity matching; link prediction; and data reduction

How does the method of k-means clustering group individuals?

By distance; points end up with a smaller distance to the other points within their own cluster and a larger distance to points in other clusters

What types of distance can be measured? Explain them

For two points (x1, y1) and (x2, y2):
Euclidean distance: uses Pythagoras' theorem, i.e. sqrt((x2-x1)^2 + (y2-y1)^2)
Manhattan distance: the absolute differences along each axis are added, i.e. |x2-x1| + |y2-y1|
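
As a rough illustration of the two distance measures for 2-D points, here is a minimal Python sketch. The function names `euclidean` and `manhattan` and the sample points are made up for this example.

```python
import math


def euclidean(p, q):
    """Straight-line distance between 2-D points p and q (Pythagoras)."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def manhattan(p, q):
    """Sum of the absolute differences along each axis."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])


a, b = (1, 2), (4, 6)      # hypothetical sample points
print(euclidean(a, b))     # 5.0  (sqrt(3^2 + 4^2))
print(manhattan(a, b))     # 7    (3 + 4)
```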

What is the method of k-means clustering? (4 steps)

1. Randomly pick k points as the initial centroids.
2. Calculate the distance between each centroid and each data point, and assign each point to its closest centroid.
3. Take the average position of all points within each cluster (the mean of their x and y values) and move the centroid to that average.
4. Check by eye that the centroids make sense, and repeat steps 2 and 3 until the assignment of points to centroids stops changing.
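
The four steps can be sketched in plain Python as below. This is only an illustrative sketch, not a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the toy points are assumptions, squared Euclidean distance is used for the assignment step, and the "check by eye" part of step 4 is left to the reader.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Toy k-means for 2-D points; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: k random starting points
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid
        # (squared Euclidean distance gives the same ordering).
        new_assignment = [
            min(range(k),
                key=lambda c: (x - centroids[c][0]) ** 2
                            + (y - centroids[c][1]) ** 2)
            for x, y in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignment


# Hypothetical data: two visibly separate groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which cluster gets label 0 or 1 depends on the random starting points, so only the grouping itself is meaningful.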

What is the cluster distortion?

The sum of the squared distances of points to their respective centroid
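
A small sketch of how the distortion could be computed once every point has been assigned to a centroid. The function name `distortion` and the example points, centroids, and assignment are hypothetical.

```python
def distortion(points, centroids, assignment):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2
        for (x, y), c in zip(points, assignment)
    )


# Hypothetical clustering: two clusters of two points each.
points = [(1, 1), (1, 3), (8, 8), (10, 8)]
centroids = [(1, 2), (9, 8)]
assignment = [0, 0, 1, 1]                         # index of each point's centroid
print(distortion(points, centroids, assignment))  # 4 (each point is 1 unit away)
```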

How do you choose k, i.e. the number of clusters?

Run multiple scenarios with different values of k and plot the cluster distortion against k; find the elbow, where the cluster distortion stops dropping dramatically (the bend in the decaying curve looks like an elbow)

What are the limitations of k-means clustering?

cannot give the probability that a data point belongs to a cluster; will generally fail if clusters are of unequal size, are non-spherical, or are not well separated, or if there are outliers in the data (especially with Euclidean distance)

Is linear regression supervised?

Yes

What is a deterministic relationship?

An exact relationship between two or more quantities

What is a statistical relationship?

Where there is no precise formula that gives one quantity exactly in terms of another

How do we determine the prediction error?

By subtracting the predicted y (y hat) from the actual y value, such that e(i) = y(i) - yhat(i)

What is the most popular way to minimise the errors? Why?

To minimise the sum of squared prediction errors; squaring stops positive and negative errors from cancelling each other out
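
To see why the errors are squared, here is a minimal sketch that fits a least squares line to a toy data set and compares the plain sum of errors with the sum of squared errors. The closed-form slope and intercept used here are the standard least squares formulas rather than something stated on these cards, and the name `fit_line` and the data are made up for the example.

```python
def fit_line(xs, ys):
    """Least squares intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


xs = [1, 2, 3, 4, 5]   # hypothetical predictor values
ys = [2, 4, 5, 4, 5]   # hypothetical responses
b0, b1 = fit_line(xs, ys)

# Prediction errors e(i) = y(i) - yhat(i)
errors = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(b0, b1)                       # roughly 2.2 and 0.6
print(sum(errors))                  # roughly 0: positives and negatives cancel
print(sum(e ** 2 for e in errors))  # roughly 2.4: squaring removes the cancelling
```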

What is a population regression line?

a line that summarises the trend in the population between the predictor x and the mean of y at each value of x

What does the sample regression line b_0 + b_1x estimate?

the population regression line E(Y)=B_0+B_1x

What are the assumptions needed to estimate the real population parameters B_0 and B_1 for linear regression?

the relationship between x and the mean of y is linear; the errors are independent, normally distributed random variables with mean zero and constant variance sigma^2

What is sigma^2 (i.e. the variance of the errors around the population regression line) estimated by?

the mean squared error from the least squares regression line (the line of best fit that minimises sum of the squared errors) of the sample data
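
A short sketch of the mean squared error for a fitted simple regression line. It assumes the usual convention of dividing the sum of squared errors by n - 2 (two parameters are estimated), which is not spelled out on the card. The name `mean_squared_error` is made up, and b0 = 2.2, b1 = 0.6 are the least squares coefficients for this particular toy data set.

```python
import math


def mean_squared_error(xs, ys, b0, b1):
    """SSE / (n - 2) for the line yhat = b0 + b1*x (simple linear regression)."""
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return sse / (len(xs) - 2)


xs = [1, 2, 3, 4, 5]   # hypothetical predictor values
ys = [2, 4, 5, 4, 5]   # hypothetical responses
mse = mean_squared_error(xs, ys, b0=2.2, b1=0.6)
print(mse)             # roughly 0.8, the estimate of sigma^2
print(math.sqrt(mse))  # the corresponding estimate of sigma
```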

What does a large mean squared error mean?

the data are very spread out around the regression line and the predictions are not accurate

What is the correlation coefficient r derived from?

the sample covariance of x and y, scaled by the sample standard deviations of x and y

How do we interpret the correlation coefficient r?

If r is close to -1 or 1, there is a stronger linear relationship, where a negative r means a negative relationship. If r is closer to 0, the linear relationship is weak

What is the coefficient of determination r^2?

the proportion/ percentage of the total variability explained by linear model (i.e. how well does predictor x explain the change in y)

How do we interpret r^2?

Between 0 and 1; if close to 1, then predictor x accounts for a sizeable amount of the variation in y; if close to 0, predictor x does not really explain the variation in y
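
The following sketch computes r and r^2 for a toy data set, to show that in simple linear regression the proportion of variability explained (1 - SSE/SSTO) equals r squared. The formulas are the standard textbook ones rather than something stated on these cards, and the function name `r_and_r_squared` and the data are made up.

```python
import math


def r_and_r_squared(xs, ys):
    """Return (r, r^2) where r^2 is computed as 1 - SSE/SSTO."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)   # total variability (SSTO)

    r = s_xy / math.sqrt(s_xx * s_yy)          # correlation coefficient

    b1 = s_xy / s_xx                           # least squares slope
    b0 = y_bar - b1 * x_bar                    # least squares intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return r, 1 - sse / s_yy


xs = [1, 2, 3, 4, 5]   # hypothetical predictor values
ys = [2, 4, 5, 4, 5]   # hypothetical responses
r, r2 = r_and_r_squared(xs, ys)
print(r)    # roughly 0.77: a fairly strong positive linear relationship
print(r2)   # roughly 0.6: x explains about 60% of the variation in y
```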

What can r^2=0 mean?

Either the fitted line is horizontal or there is no linear relationship at all

What are the assumptions for multiple linear regression?

that the errors are normally distributed with mean 0 and constant variance sigma^2 for ALL values of the predictor variables

For multiple linear regression, can we fit non-linear relationship between the response variable and predictor variables?

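Yes; the model only has to be linear in the coefficients, so transformed predictors such as x^2 can be included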
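
One way to see this: a curved relationship such as y = 1 + 2x^2 is still linear in its coefficients, so regressing y on the transformed predictor z = x^2 recovers it with ordinary least squares. The sketch below uses a single transformed predictor to keep the arithmetic short; the name `fit_line` and the data are made up for the example.

```python
def fit_line(zs, ys):
    """Least squares intercept and slope for y = b0 + b1*z."""
    n = len(zs)
    z_bar = sum(zs) / n
    y_bar = sum(ys) / n
    s_zy = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys))
    s_zz = sum((z - z_bar) ** 2 for z in zs)
    b1 = s_zy / s_zz
    return y_bar - b1 * z_bar, b1


xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x ** 2 for x in xs]   # hypothetical curved relationship in x

zs = [x ** 2 for x in xs]           # transformed predictor z = x^2
b0, b1 = fit_line(zs, ys)
print(b0, b1)                       # roughly 1 and 2: the curve is recovered
```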