Introduction Flashcards

1
Q

Machine Learning

A

It is the process of extracting patterns from the data.
Data includes features and target.
If an expert can deduce a pattern from the data, so can ML.

2
Q

Features

A

All the information we have about an object, e.g. the characteristics of a car: year, make, model, etc.

3
Q

Target

A

Output or labels we want to predict based on the features

4
Q

Training

A

Features + Target = Supervised Learning
Training results in a model that has learned the patterns relating the features to the target.

5
Q

Predictions

A

Features + Model = Predictions
We feed the features into the model, which predicts the target variable/label.

6
Q

Why not use rule-based systems?

A

It is difficult to write rules for every possible scenario. With spam emails, for example, hand-written rules quickly turn into huge and messy code.
With ML we simply provide the data, i.e. the features and the target (spam/not spam), train a model, and then use the model on new data. Most of the old rules are converted to features.

7
Q

What does a model output?

A

A model outputs probabilities. A threshold can then be applied to the probabilities to make the final decision.
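
A minimal sketch of turning probabilities into decisions. The probability values and the 0.5 threshold are made-up assumptions for illustration, not values from the course.

```python
import numpy as np

# Hypothetical predicted probabilities of "spam" for five emails (made-up values).
probabilities = np.array([0.05, 0.40, 0.50, 0.73, 0.98])

# Assumed decision threshold; 0.5 is a common default, not a fixed rule.
threshold = 0.5

# Emails at or above the threshold are flagged as spam.
is_spam = probabilities >= threshold
print(is_spam)
```

Raising the threshold flags fewer emails as spam; lowering it flags more.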

8
Q

Rule based systems

A

Software that takes in data and hand-written code (the rules) and outputs the outcome. As the rules accumulate, it can become difficult to maintain.

9
Q

Supervised ML

A

We show the model labelled examples of the data so that it can learn the patterns relating the features to the labels, using mathematics and statistics.

10
Q

Feature Matrix

A

A two-dimensional array (matrix) in which columns are features and rows are the objects/observations we want to predict for. It is usually denoted by X.

11
Q

Target Vector

A

A vector with one target value for each row of the feature matrix X. It is denoted by y.

12
Q

Mathematical Expression for Model

A

g(X) ≈ y
g is the model, X is the feature matrix and y is the target variable.

13
Q

Types of Supervised ML are based on

A

output of the model and the type of target variable

14
Q

Regression

A

A type of supervised machine learning where the model returns a number. In general this can be any real value; for prices it lies between 0 and infinity.
E.g. prediction of car prices, prediction of house prices.

15
Q

Classification

A

A type of supervised machine learning where the output is a category, e.g. an image is classified as a car, or an email as spam/not spam.

16
Q

Multi Class Classification

A

A type of supervised machine learning and a sub type of classification where output can be multiple categories e.g. a car, a dog or a cat. It can be as many categories as you need as long as they are more than 2

17
Q

Binary Classification

A

A type of supervised machine learning and a sub type of classification where the target variable can only be either of two categories e.g. Spam/Not Spam.

18
Q

Ranking

A

A type of supervised machine learning where you want to rank something e.g. a recommendation system. When we search something on Google, it ranks web pages based on the user and search relevance.

19
Q

CRISP-DM

A

A methodology for organizing ML projects, from problem understanding to deployment. It stands for Cross-Industry Standard Process for Data Mining.
It is an old methodology from the 90s, commonly associated with IBM.

20
Q

ML Projects (6 steps of CRISP-DM)

A
  1. Business understanding of the problem
    +Do we need ML to solve the problem? If not, what is the alternative solution?
    +Identify the problem and whether it is important
    +What’s the measurable goal? e.g. reducing the number of spam messages.
  2. Data understanding
    +What data is available to solve the problem?
    +How can we get the data? Buy or maybe collect the dataset.
    +Is it reliable?
    +Do we track it correctly?
    +Is the dataset large enough? Do we need to get more data?
    +Sometimes we go back to the first step if the problem or data is not suitable.
  3. Data Preparation
    +Extracting features
    +Cleaning data
    +Pipelines that transform raw data into suitable features
  4. Modelling
    +Train the model
    +Try different models and choose the best one
    +Add new features or fix data issues
  5. Evaluation
    +Go back to the business understanding and check whether the results meet our metrics.
    +Maybe we need to go back to the business understanding and start again
  6. Deployment (with online evaluation):
    +Online evaluation on live users
    +Deploy the model and evaluate it
    +Evaluate it on a small number of users
    +Roll the model out to all users, with proper monitoring, ensuring quality and maintainability
  After the six steps: Iterate
    +Is it good? Should we improve it or not?
21
Q

Selecting the best model

A

We need to estimate how the model will perform on real, unseen data. To do this, we set aside 20% of the data and train the model on the remaining 80%. The held-out 20% is our validation dataset. We apply g to the validation features to get predictions, compare them with the actual values, and count in how many cases they are correct (accuracy). We repeat this for each candidate model and choose the one with the best validation accuracy. Candidate models include logistic regression, decision trees, random forests, neural networks, etc.

22
Q

Multiple Comparisons Problem

A

If we try many different models, one of them may simply get lucky on a particular validation dataset. If we took another 20% of the data, the results could be totally different. This is a statistics problem.

23
Q

Validation / Test dataset Split

A

To guard against the multiple comparisons problem, we use three non-overlapping datasets, e.g. 60% training, 20% validation and 20% testing. We select the best model on the validation dataset and then check it on the testing dataset to make sure it didn’t just get lucky.

24
Q

Steps of Model selection

A
  1. Split dataset into train, validation and test
  2. Training
  3. Validation
  4. Repeat 2,3 for different models and select the best one
  5. Apply on the test dataset and check
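
The five steps above can be sketched with NumPy alone. Everything concrete here is an assumption for illustration: the synthetic data, the 60/20/20 split, the use of polynomial fits as the "different models", and RMSE (rather than accuracy) as the score, since this toy problem is a regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): y depends linearly on x, plus noise.
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=n)

# 1. Split into 60% train, 20% validation, 20% test (non-overlapping).
idx = rng.permutation(n)
n_train, n_val = int(0.6 * n), int(0.2 * n)
train = idx[:n_train]
val = idx[n_train:n_train + n_val]
test = idx[n_train + n_val:]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# 2-4. Train each candidate model and score it on the validation set.
models, scores = {}, {}
for degree in [1, 3, 5]:
    coefs = np.polyfit(x[train], y[train], deg=degree)         # training
    models[degree] = coefs
    scores[degree] = rmse(y[val], np.polyval(coefs, x[val]))   # validation

best_degree = min(scores, key=scores.get)  # lower RMSE is better

# 5. Check the selected model once on the untouched test set.
test_rmse = rmse(y[test], np.polyval(models[best_degree], x[test]))
print(best_degree, scores[best_degree], test_rmse)
```

If the test score is much worse than the validation score, the selected model probably got lucky on the validation set.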
25
Q

Examples of Supervised Machine Learning

A

1) Spam filtering: features from an email dataset, labels spam or not spam.
2) Online advertising: ad and user info as input data, label whether the user is likely to click on the ad or not.
3) Self-driving cars: image and radar info as input, positions of other cars as labels.
4) Fuel optimization: ship route as input, fuel consumed as label.
5) Visual inspection: image of a phone as input, detecting whether there is a defect in the image.
6) Restaurant reputation monitoring: restaurant reviews as input, positive or negative sentiment as label.

26
Q

[2010-2020] Large Scale Supervised Learning

A

If you train an AI model on increasingly large datasets on your local machine, the performance plateaus. But if you train it on increasingly large datasets with large computation, the model improves significantly.

27
Q

This decade (2020-2030) adding Generative AI to Supervised Learning

A

Generative AI is built by using supervised learning (A > B) to repeatedly predict the next word. If we train a very large AI system on a lot of data (hundreds of billions of words) we get a large language model like ChatGPT

28
Q

Opportunity: AI application Development

A

Prompting and LLMs are revolutionizing AI application development. e.g. Supervised learning (Get labelled data (1 month), Train AI model on data (3 months), Deploy (run) model). So commercial grade projects usually took 3-6 months.
Prompt based AI (Specify Prompt (minutes/hours) and deploy model (hours/days))

29
Q

AI opportunities

A

Today (Supervised Learning is the most popular)
3 years (Generative AI will become more popular, Unsupervised Learning)
The light regions around Supervised Learning and Generative AI are opportunities for new startups and companies.
These AI technologies are general purpose technologies and have huge opportunities.
There will be fads along the way that may not have much long-term value. The opportunity is to create deep and hard applications that provide long-term value out of Generative AI.

30
Q

Why isn’t AI widely adopted yet?

A

Customization (long tail) problem and low/no code tools
Most of the billion-dollar AI work is being done in advertisements and web search.
As we move towards other industries, we can create more value out of AI products, e.g. food inspection (pictures of pizza to see if it’s evenly distributed or not), wheat harvesting (How tall is the wheat? More food to sell and better for the environment), material grading, etc. Each of these is a 5 million dollar project, and it does not make sense to hire hundreds of highly skilled engineers to work on it, so until now such projects were difficult.
Low/no code tools enable users to customize their AI system by providing prompts (or data) instead of writing code.

31
Q

AI Stack

A

Top layer: Applications (must be successful enough to pay for the layers below)
Middle layer: Infrastructure (AWS, Google Cloud) and Developer Tools (Landing AI, Rapid Fire)
Bottom layer: Hardware (Nvidia, Intel, AMD), capital heavy

32
Q

Building Startups

A

1) Ideas e.g. Fuel efficient ships
2) Validate i.e. Market & Technical Validation by AI Fund team
3) Recruit CEO to build with us (Founder in residence)
4) Build with CEO (3 months): deep customer and technical validation. Build prototype.
5) Pre-Seed Growth (12 months) $1M Pre-Seed, Hire key executives, Build MVP, Get early customer traction
6) Indefinite (Seed, Growth scale): ~$2.5 M seed funding. Startup graduates and is well on its way.

33
Q

Numpy: Array of zeros or ones

A

np.zeros(5) -> Takes one argument which is size of array
np.ones(5) -> Takes one argument which is size of array

34
Q

Numpy: Function to fill array with an arbitrary number

A

np.full(10, 2.5)
first argument is size of the array and the second argument is the number with which to fill

35
Q

Numpy: Convert python list to numpy array

A

a = [1, 2, 3]
np.array(a)

36
Q

Numpy accessing an array element

A

Index in Python starts from 0 so we can access it as a[2] which is the 3rd element of the array.
We can also change the value using assignment operator i.e. a[2] = 10

37
Q

Numpy: arange

A

np.arange(10)
It creates an array from 0 to 9
np.arange(3, 10)
It creates an array from 3 to 9
Note that the stop value 10 is not included, just like in Python's range.

38
Q

Numpy: linspace

A

np.linspace(0, 1, 11)
Creates an array of evenly spaced values between the first and second parameters; the third parameter is the number of values. The first and second parameters are both inclusive.
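
A short sketch contrasting the inclusive endpoints of np.linspace with the exclusive stop value of np.arange. The variable names are arbitrary.

```python
import numpy as np

# arange: the stop value (10) is excluded, like Python's range.
r = np.arange(3, 10)

# linspace: both endpoints are included; the third argument is the number of values.
l = np.linspace(0, 1, 11)

print(r)
print(l)
```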

39
Q

Numpy: Multidimensional arrays

A

np.zeros((5, 2))
The shape is passed as a tuple: the first value is the number of rows and the second is the number of columns.

40
Q

Numpy: Multidimensional arrays from python list of list

A

a= [[1,2,3], [4, 5, 6], [7,8,9]]
n=np.array(a)
Indexes start from 0 for both rows and columns.
n[0, 1]= 20
If we want to get the row instead of indices we can just say n[0]
We can update row as well n[0]=[1,1,1]
To access columns
n[:,1] get the second column with all rows
We can also assign something else to it e.g. n[:,1]=[2,3,4]
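
The indexing operations on this card, collected into one runnable sketch. The .copy() calls are an addition for illustration: plain row and column slices are views, so they would change when n is modified later.

```python
import numpy as np

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
n = np.array(a)

n[0, 1] = 20                  # single element: row 0, column 1
first_row = n[0].copy()       # whole first row
n[0] = [1, 1, 1]              # overwrite the first row
second_col = n[:, 1].copy()   # second column, all rows
n[:, 1] = [2, 3, 4]           # overwrite the second column

print(first_row, second_col)
print(n)
```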

41
Q

Numpy: Random generated arrays

A

np.random.seed(2)
np.random.rand(5,2)
If we want to make sure that it produces same random number, we add seed.
You can multiply every number with 100 to get numbers between 0 and 100. e.g. 100 * np.random.rand(5,2)

42
Q

Numpy: Random Number Distributions

A

There are many different distributions. E.g. to draw from the standard normal distribution, use np.random.randn(5,2)

43
Q

Numpy: Random Integers

A

If we want to produce a matrix of random integers, we use np.random.randint(low=0, high=100, size=(5,2)). The high parameter is not inclusive.

44
Q

Numpy: Element-wise operations

A

Numpy makes it easy to apply operations to every element without a for loop.
e.g. a = np.arange(5)
a + 1
a * 2
a / 20
(10 + (a * 2)) ** 2 / 100
We can apply operations between two arrays as well, e.g. with a second array b of the same shape.
a / b * 10
a + b
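
A runnable sketch of element-wise operations. The second array b is not defined on the card, so its values here are an assumption.

```python
import numpy as np

a = np.arange(5)                      # [0, 1, 2, 3, 4]
b = np.array([10, 20, 30, 40, 50])    # hypothetical second array, same length as a

plus_one = a + 1                      # adds 1 to every element
doubled = a * 2                       # multiplies every element by 2
c = (10 + (a * 2)) ** 2 / 100         # operations can be chained
summed = a + b                        # element-by-element sum of two arrays

print(plus_one, doubled, c, summed)
```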

45
Q

Numpy: Comparison operations

A

e.g.
a>=2
a > b returns an array of True/False values
a[a > b] returns the elements of a for which the comparison is True
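
A small sketch of comparison and boolean filtering; the array values are made up.

```python
import numpy as np

a = np.array([1, 5, 3, 8])
b = np.array([2, 4, 3, 6])

mask = a > b          # element-wise comparison, gives a boolean array
selected = a[mask]    # keeps only the elements of a where the mask is True

print(mask, selected)
```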

46
Q

Numpy: Summarizing operations

A

Operations that return a single number rather than an array, e.g.
a.min()
a.max()
a.sum()
a.mean()
a.std()
They work for 2-D arrays as well.

47
Q

Linear Algebra: Vector Operations

A

Column Vector by 2 = multiply every element with 2
Add 2 vectors = add each element of the vector

48
Q

Linear Algebra: Vector by Vector multiplication (2 column vectors)

A

It is also called dot product or inner product.
Vector u by vector v. In NumPy, u * v is element-by-element multiplication and returns a vector. In linear algebra, the dot product multiplies the corresponding elements and then adds up the results, producing a single number.

49
Q

Linear Algebra: Row Vector by column Vector multiplications

A

The column vector u is transposed into a row vector and multiplied with the column vector v:
u^T v = sum(i=1..n) u_i * v_i

Both vectors should have the same shape.

50
Q

Numpy: Dot product

A

We can implement a function but we also have a function in numpy which is called u.dot(v)
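
A sketch of both options: a hand-written dot product and NumPy's built-in one. The function name and the vector values are assumptions for illustration.

```python
import numpy as np

def vector_vector_multiplication(u, v):
    # Both vectors must have the same number of elements.
    assert u.shape[0] == v.shape[0]
    result = 0.0
    for i in range(u.shape[0]):
        result += u[i] * v[i]   # multiply matching elements and accumulate
    return result

u = np.array([2, 4, 5, 6])
v = np.array([1, 0, 0, 2])

manual = vector_vector_multiplication(u, v)
builtin = u.dot(v)
elementwise = u * v             # not a dot product: this is still a vector

print(manual, builtin, elementwise)
```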

51
Q

Linear Algebra: Matrix and Vector Multiplication

A

We multiply each row of the matrix U with the column vector v.
The result is a vector with elements u_0^T v to u_(k-1)^T v, where u_i is the i-th row of U.
The number of columns in the matrix should match the number of rows (elements) of the column vector.
In numpy, we can use the dot function for matrix-vector multiplication, i.e. U.dot(v)
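
A sketch of matrix-vector multiplication as one dot product per row, compared with NumPy's dot. The function name and the values are assumptions for illustration.

```python
import numpy as np

def matrix_vector_multiplication(U, v):
    # Columns of U must match the number of elements in v.
    assert U.shape[1] == v.shape[0]
    num_rows = U.shape[0]
    result = np.zeros(num_rows)
    for i in range(num_rows):
        result[i] = U[i].dot(v)   # dot product of row i with v
    return result

U = np.array([
    [2, 4, 5, 6],
    [1, 2, 1, 2],
    [3, 1, 2, 1],
])
v = np.array([1, 0.5, 2, 1])

manual_mv = matrix_vector_multiplication(U, v)
builtin_mv = U.dot(v)
print(manual_mv, builtin_mv)
```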

52
Q

Linear Algebra: Matrix by Matrix multiplication

A

We take the first matrix as a whole and split the second matrix into column vectors.
Then we multiply the first matrix with each column vector.
We check the dimensions: the number of columns in the first matrix must equal the number of rows in the second matrix (the number of elements in each column vector).
The output matrix has as many rows as the first matrix and as many columns as the second matrix.
With numpy we can do U.dot(V)
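
A sketch of matrix-matrix multiplication built column by column, as described on this card. The function name and the matrix values are assumptions for illustration.

```python
import numpy as np

def matrix_matrix_multiplication(U, V):
    # Columns of the first matrix must match rows of the second.
    assert U.shape[1] == V.shape[0]
    num_rows = U.shape[0]
    num_cols = V.shape[1]
    result = np.zeros((num_rows, num_cols))
    for j in range(num_cols):
        # Multiply the whole first matrix with column j of the second.
        result[:, j] = U.dot(V[:, j])
    return result

U = np.array([
    [2, 4, 5, 6],
    [1, 2, 1, 2],
    [3, 1, 2, 1],
])
V = np.array([
    [1, 1, 2],
    [0, 0.5, 1],
    [0, 2, 1],
    [2, 1, 0],
])

manual_mm = matrix_matrix_multiplication(U, V)
builtin_mm = U.dot(V)
print(manual_mm.shape)
```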

53
Q

Linear Algebra: Identity Matrix

A

A square matrix with 1s on diagonal and zeros everywhere else.
If I is multiplied by U matrix, we get U back.
In numpy, np.eye(3)

54
Q

Linear Algebra: Matrix Inverse

A

A matrix A multiplied by its inverse A^-1 returns the identity matrix.
Only square matrices can have an inverse (and not every square matrix has one).
A square matrix is one with an equal number of rows and columns.
The inverse of a matrix in numpy is calculated with np.linalg.inv, e.g. np.linalg.inv(Vs)
This is quite useful for linear regression.
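
A minimal check that a matrix times its inverse gives the identity matrix. The 2x2 matrix is a made-up invertible example.

```python
import numpy as np

# Hypothetical invertible square matrix (its determinant is 5, so an inverse exists).
A = np.array([
    [2.0, 1.0],
    [1.0, 3.0],
])

A_inv = np.linalg.inv(A)
I = np.eye(2)

# Floating-point arithmetic is inexact, so compare with a tolerance.
is_identity = np.allclose(A_inv.dot(A), I)
print(is_identity)
```

np.linalg.inv raises LinAlgError when the matrix is singular, i.e. has no inverse.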

55
Q

Pandas

A

It’s a library in python for manipulating tabular data

56
Q

Pandas: Dataframe

A

Dataframe is a table.
df = pd.DataFrame(data, columns=columns)
data is a list of lists and columns is a list of column names
Each sublist is a row in the dataframe.
We can also define a list of dictionaries where each dictionary is a row and keys are columns.
We can use head() to check first few rows of the data. E.g. df.head(n=5)
Every column is a pandas series
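
A sketch of both ways of building a dataframe. The car rows are made-up sample data.

```python
import pandas as pd

# Hypothetical sample data: each sublist is one row.
data = [
    ['Nissan', 'Stanza', 1991],
    ['Hyundai', 'Sonata', 2017],
    ['Lotus', 'Elise', 2010],
]
columns = ['Make', 'Model', 'Year']
df = pd.DataFrame(data, columns=columns)

# Same table from a list of dictionaries: keys become column names.
rows = [
    {'Make': 'Nissan', 'Model': 'Stanza', 'Year': 1991},
    {'Make': 'Hyundai', 'Model': 'Sonata', 'Year': 2017},
    {'Make': 'Lotus', 'Model': 'Elise', 'Year': 2010},
]
df_from_dicts = pd.DataFrame(rows)

print(df.head(n=2))
```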

57
Q

Pandas: Series

A

Accessing a column in pandas e.g.
df.Make or df['Make']
Subset of columns e.g. df[['Make', 'Model']]
Adding another column to the dataframe e.g. df['id'] = [1, 2, 3, 4, 5]
To delete a column e.g. del df['id']

58
Q

Pandas: Index

A

The index holds the IDs (labels) of the rows in pandas
e.g. df.index
df.Make.index
We can access elements of the dataframe using this index
e.g. df.loc[1] returns the row with label 1
We can also select multiple rows e.g. df.loc[[1, 2]]
We can replace the index too e.g. df.index = ['a', 'b'] (one label per row)
To access rows by position instead of label, we use iloc
e.g. df.iloc[[1, 2, 4]]
df.reset_index(drop=True) resets the index to 0, 1, 2, ... and drops the old one
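
A sketch of label-based versus position-based access. The dataframe contents and the index labels are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    'Make': ['Nissan', 'Hyundai', 'Lotus'],
    'Year': [1991, 2017, 2010],
})
df.index = ['a', 'b', 'c']        # replace the default 0, 1, 2 index

by_label = df.loc['b']            # row whose index label is 'b'
by_position = df.iloc[1]          # second row, regardless of labels
subset = df.loc[['a', 'c']]       # several rows by label

# Back to a fresh 0..n-1 index; drop=True discards the old labels.
subset_reset = subset.reset_index(drop=True)
print(subset_reset)
```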

59
Q

Pandas: Element-wise operations

A

We can apply all the operations that work on a numpy array, e.g. division, subtraction, multiplication, to a pandas series. If a value is null, the operation is not applied to it and the result for that element is simply NaN.

We can use comparators and do filtering as well.
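
A sketch of element-wise arithmetic with a missing value, followed by filtering with a comparison. The column names and values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing horsepower value.
df = pd.DataFrame({
    'Make': ['Nissan', 'Hyundai', 'Lotus'],
    'Engine HP': [138.0, np.nan, 218.0],
    'Year': [1991, 2017, 2010],
})

hp_doubled = df['Engine HP'] * 2      # the NaN stays NaN
recent = df[df['Year'] >= 2010]       # keep rows where the comparison is True

print(hp_doubled)
print(recent)
```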

60
Q

Pandas: String and Summarizing operations

A

String operations include str.lower(), str.upper(), str.replace(value-to-replace, value-to-replace-with), etc.

Summarizing operations include sum(), min(), max() and mean(). The describe() function reports the main statistics of a column, e.g. mean and standard deviation. describe() works on both a series and a dataframe.

The string operations are applied to a pandas series and return a series with every element transformed.
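
A sketch of string and summarizing operations on series. The values are made up.

```python
import pandas as pd

# Hypothetical sample series.
names = pd.Series(['Nissan Stanza', 'Hyundai Sonata'])
prices = pd.Series([2000, 27150, 54990])

# String operations go through the .str accessor and act on every element.
cleaned = names.str.lower().str.replace(' ', '_')

# Summarizing operations reduce the series to a single number.
highest = prices.max()
average = prices.mean()

print(cleaned.tolist(), highest, average)
print(prices.describe())
```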

61
Q

Pandas: Unique Values

A

nunique() gives the number of unique values for a single column; on a dataframe it gives the number of unique values for each column.
df.nunique()

62
Q

Pandas: Null Values

A

df.isnull().sum()
It gives the number of null values in each column.
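
A small sketch of counting missing values per column. The data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values.
df = pd.DataFrame({
    'Make': ['Nissan', None, 'Lotus'],
    'Engine HP': [138.0, np.nan, np.nan],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
null_counts = df.isnull().sum()
print(null_counts)
```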

63
Q

Pandas: Underlying numpy

A

df.col.values returns the underlying numpy array
df.to_numpy() also returns a numpy array (a 2-D matrix for a dataframe)

64
Q

Regression Steps

A

1) Select relevant columns and create a numpy matrix X.
2) Compute matrix - matrix multiplication between X transpose and X called XTX.
3) Take the inverse of XTX.
4) Create a target variable y.
5) Multiply the inverse of XTX with the transpose of X and then multiply by y.
6) The result is the weight vector w, i.e. w = (XTX)^-1 XT y.
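
The steps above can be sketched in NumPy. The data is made up, and the leading column of ones (a bias term) is an assumption added for illustration; the card itself does not mention it.

```python
import numpy as np

# 1) Hypothetical feature matrix X; the first column of ones is an assumed bias term.
X = np.array([
    [1.0, 1.0],
    [1.0, 2.0],
    [1.0, 3.0],
    [1.0, 4.0],
])

# 4) Target vector, generated here as y = 1 + 2 * x so the answer is known.
y = np.array([3.0, 5.0, 7.0, 9.0])

XTX = X.T.dot(X)                    # 2) X transpose times X
XTX_inv = np.linalg.inv(XTX)        # 3) inverse of XTX
w = XTX_inv.dot(X.T).dot(y)         # 5-6) multiply by X transpose, then by y

print(w)
```

With this toy data, w recovers the intercept and slope used to generate y, up to floating-point error.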

65
Q

Real World ML project steps

A

1) Prepare Data and do EDA (Exploratory Data Analysis)
2) Use a model e.g. linear regression
3) Evaluate the model with RMSE
4) Feature Engineering
5) Regularization
6) Using the model
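
Step 3 evaluates the model with RMSE (root mean squared error). A minimal sketch, with made-up targets and predictions:

```python
import numpy as np

def rmse(y, y_pred):
    error = y - y_pred
    mse = (error ** 2).mean()   # mean squared error
    return np.sqrt(mse)         # the square root brings it back to the units of y

# Hypothetical actual values and predictions.
y = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

score = rmse(y, y_pred)
print(score)
```

A lower RMSE means the predictions are closer to the actual values on average.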