Introduction Flashcards

Question 1

Q

Machine Learning

Answer

A

It is the process of extracting patterns from data.
The data includes features and a target.
If an expert can deduce a pattern from the data, so can ML.

Question 2

Q

Features

Answer

A

All the information we know about an object, e.g. the characteristics of a car: year, make, model, etc.

Question 3

Q

Target

Answer

A

Output or labels we want to predict based on the features

Question 4

Q

Training

Answer

A

Features + Target = Model (this is training in supervised learning)
Training results in a model that has learned the patterns from the features and the target.

Question 5

Q

Predictions

Answer

A

Features + Model = Predictions
We put the features into the model, which predicts the target variable/label.

Question 6

Q

Why not use Rule based Systems

Answer

A

It is difficult to write rules for every possible scenario e.g. in case of spam emails, writing rules would quickly end up with huge and messy code.
We can use ML by simply giving it the data, i.e. features and a target (spam/not spam). We train the model and then use it on new data. Most of the hand-written rules can be converted to features.

Question 7

Q

What does a model output?

Answer

A

For classification, the model outputs probabilities. A threshold can be applied to a probability to make the final decision, e.g. spam if the probability is at least 0.5.
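
A minimal sketch of turning probabilities into decisions. The probabilities and the 0.5 threshold are made-up values for illustration, not taken from a real model.

```python
import numpy as np

# Hypothetical predicted probabilities of "spam" for five emails
probabilities = np.array([0.05, 0.40, 0.51, 0.80, 0.97])

# Assumed decision threshold; 0.5 is a common default, not a rule
threshold = 0.5

# True where the predicted probability reaches the threshold
is_spam = probabilities >= threshold
print(is_spam)  # [False False  True  True  True]
```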

Question 8

Q

Rule based systems

Answer

A

Software takes in data and hand-written code (rules) to produce the outcome. The rules can become difficult to maintain.

Question 9

Q

Supervised ML

Answer

A

We show the model labelled examples of the data so that it can learn patterns from the features and labels using mathematics and statistics.

Question 10

Q

Feature Matrix

Answer

A

A two-dimensional array (matrix) with features as columns and the objects/observations we want to predict for as rows. It is usually denoted by X.

Question 11

Q

Target Vector

Answer

A

A vector with one target value for each row of the feature matrix X. It is denoted by y.

Question 12

Q

Mathematical Expression for Model

Answer

A

g(X) ≈ y
g is the model, X is the feature matrix and y is the target vector.

Question 13

Q

Types of Supervised ML are based on

Answer

A

output of the model and the type of target variable

Question 14

Q

Regression

Answer

A

A type of supervised machine learning where the model returns a number. For prices the number lies between 0 and infinity, but in general it can be any real value.
E.g. prediction of car prices, prediction of house prices

Question 15

Q

Classification

Answer

A

A type of supervised machine learning where the output is a category, e.g. the output for an image is "car", or the output for an email is spam/not spam.

Question 16

Q

Multi Class Classification

Answer

A

A type of supervised machine learning and a subtype of classification where the output is one of more than two categories, e.g. a car, a dog or a cat. There can be as many categories as you need.

Question 17

Q

Binary Classification

Answer

A

A type of supervised machine learning and a sub type of classification where the target variable can only be either of two categories e.g. Spam/Not Spam.

Question 18

Q

Ranking

Answer

A

A type of supervised machine learning where you want to rank something e.g. a recommendation system. When we search something on Google, it ranks web pages based on the user and search relevance.

Question 19

Q

CRISP-DM

Answer

A

A methodology for organizing ML projects, from problem understanding to deployment. It stands for Cross-Industry Standard Process for Data Mining.
It’s an old methodology from the 90s, commonly associated with IBM.

Question 20

Q

ML Projects (6 steps of CRISP-DM)

Answer

A

1) Business understanding
+Do we need ML to solve the problem? If not, what is the alternative solution?
+Identify the problem and whether it is important
+What’s the measurable goal? e.g. reducing the number of spam messages.
2) Data understanding
+What data is available to solve the problem?
+How can we get the data? Buy it or maybe collect it.
+Is it reliable?
+Do we track it correctly?
+Is the dataset large enough? Do we need to get more data?
+Sometimes we go back to the first step if the problem or data is not suitable.
3) Data preparation
+Extracting features
+Cleaning data
+Pipelines that transform raw data into suitable features
4) Modelling
+Train the model
+Try different models and choose the best one
+Sometimes we go back to data preparation to add new features or fix data issues
5) Evaluation
+Go back to the business understanding and check whether the results meet our metrics.
+Maybe we need to go back to the business understanding and start again
6) Deployment (often done together with evaluation)
+Online evaluation with live users
+Deploy the model and evaluate it
+Evaluate it on a small number of users
+Roll the model out to all users, with proper monitoring, ensuring quality and maintainability
Iterate
+Is it good? Should we improve it or not?

Question 21

Q

Selecting the best model

Answer

A

We need to mimic the model’s performance on real, unseen data. To do this, we set aside 20% of the data and train the model on the remaining 80%. The held-out 20% is our validation dataset. We apply the model g to the validation features to get predictions, compare them with the actual values and see in how many cases the model is correct (accuracy). We repeat this for different types of models, e.g. logistic regression, decision trees, random forests, neural networks, and choose the one with the best validation accuracy.
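
A minimal sketch of comparing models by validation accuracy. The labels and the two sets of predictions are made-up values, and `accuracy` is a helper written for this example.

```python
import numpy as np

# Hypothetical validation labels and predictions from two candidate models
y_val = np.array([1, 0, 1, 1, 0, 0, 1, 0])
pred_model_a = np.array([1, 0, 0, 1, 0, 1, 1, 0])
pred_model_b = np.array([1, 0, 1, 1, 0, 0, 0, 0])

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual values."""
    return (y_true == y_pred).mean()

acc_a = accuracy(y_val, pred_model_a)  # 6 of 8 correct
acc_b = accuracy(y_val, pred_model_b)  # 7 of 8 correct

# Keep the model with the higher validation accuracy
best = "A" if acc_a > acc_b else "B"
print(acc_a, acc_b, best)  # 0.75 0.875 B
```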

Question 22

Q

Multiple Comparisons Problem

Answer

A

If we try many different models, one of them may get lucky on a particular validation dataset just by chance. If we took another 20% of the data, the results could be totally different. This is a statistics problem.

Question 23

Q

Validation / Test dataset Split

Answer

A

To guard against the multiple comparisons problem, we can use three non-overlapping datasets, e.g. 60% training, 20% validation and 20% testing. We select the best model on the validation dataset and then check it on the testing dataset to make sure it didn’t just get lucky.

Question 24

Q

Steps of Model selection

Answer

A

1) Split the dataset into train, validation and test
2) Training
3) Validation
4) Repeat 2 and 3 for different models and select the best one
5) Apply the best model on the test dataset and check
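
A minimal sketch of the first step, a 60/20/20 split with numpy. The toy data, the seed and all variable names are assumptions for illustration; libraries such as scikit-learn offer ready-made helpers for the same job.

```python
import numpy as np

n = 100                             # hypothetical dataset size
X = np.arange(n * 2).reshape(n, 2)  # toy feature matrix
y = np.arange(n)                    # toy target vector

# Shuffle the row indices reproducibly before splitting
rng = np.random.default_rng(seed=1)
idx = rng.permutation(n)

n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

# Three non-overlapping slices of the shuffled indices
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```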

Question 25

Q

Examples of Supervised Machine Learning

Answer

A

1) Spam filtering: features from the email dataset, labels spam or not spam
2) Online advertising: ad and user info as input data, label is whether the user is likely to click on the ad or not
3) Self-driving cars: image and radar info as input, positions of other cars as labels
4) Fuel optimization: ship route as input, fuel consumed as label
5) Visual inspection: image of a phone as input, detecting a defect in the image
6) Restaurant reputation monitoring: restaurant reviews as input, detecting positive or negative sentiment

Question 26

Q

[2010-2020] Large Scale Supervised Learning

Answer

A

If you train an AI model on your local machine, its performance plateaus as the dataset grows from small to large. If you train with large-scale computation, the model keeps improving significantly as the dataset grows.

Question 27

Q

This decade (2020-2030) adding Generative AI to Supervised Learning

Answer

A

Generative AI is built by using supervised learning (A → B) to repeatedly predict the next word. If we train a very large AI system on a lot of data (hundreds of billions of words), we get a large language model like ChatGPT.

Question 28

Q

Opportunity: AI application Development

Answer

A

Prompting and LLMs are revolutionizing AI application development.
Supervised learning: get labelled data (1 month), train an AI model on the data (3 months), deploy (run) the model. Commercial-grade projects usually took 3-6 months.
Prompt-based AI: specify a prompt (minutes/hours) and deploy the model (hours/days).

Question 29

Q

AI opportunities

Answer

A

Today (Supervised Learning is the most popular)
3 years (Generative AI will become more popular, Unsupervised Learning )
The light regions around Supervised Learning and Generative AI are opportunities for new startups and companies.
These AI technologies are general-purpose technologies and offer huge opportunities.
There will be fads along the way that may not have much long-term value. The opportunity is to create deep and hard applications that provide long-term value out of Generative AI.

Question 30

Q

Why isn’t AI widely adopted yet?

Answer

A

Customization (long tail) problem and low/no code tools
Most of the billion-dollar work for AI is being done in advertisements and web search.
As we move towards other industries, we can create more value out of AI products, e.g. food inspection (pictures of pizza to see if it’s evenly distributed or not), wheat harvesting (how tall is the wheat? more food to sell and better for the environment), material grading, etc. Each of these is roughly a 5 million dollar project, and it does not make sense to hire hundreds of highly skilled engineers for a project of that size, so until now such projects were difficult to do.
Low/no code tools are enabling the user to customize their AI system. Users do this by providing prompts (or data) instead of writing code.

Question 31

Q

AI Stack

Answer

A

Top layer: Applications (must be successful enough to pay for the layers below)
Middle layer: Infrastructure (AWS, Google Cloud) and Developer Tools (Landing AI, Rapid Fire)
Bottom layer: Hardware (Nvidia, Intel, AMD), capital heavy

Question 32

Q

Building Startups

Answer

A

1) Ideas e.g. Fuel efficient ships
2) Validate i.e. Market & Technical Validation by AI Fund team
3) Recruit CEO to build with us (Founder in residence)
4) Build with CEO (3 months): deep customer and technical validation, build prototype.
5) Pre-Seed Growth (12 months) $1M Pre-Seed, Hire key executives, Build MVP, Get early customer traction
6) Indefinite (Seed, Growth scale): ~$2.5 M seed funding. Startup graduates and is well on its way.

Question 33

Q

Numpy: Array of zeros or ones

Answer

A

np.zeros(5) -> Takes one argument which is size of array
np.ones(5) -> Takes one argument which is size of array
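
A short runnable sketch of both calls; the variable names are chosen for this example.

```python
import numpy as np

zeros = np.zeros(5)  # five elements, all 0.0
ones = np.ones(5)    # five elements, all 1.0

# Both functions return floating-point arrays by default
print(zeros.dtype, ones.dtype)  # float64 float64
```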

Question 34

Q

Numpy: Function to fill array with an arbitrary number

Answer

A

np.full(10, 2.5)
The first argument is the size of the array and the second argument is the number to fill it with.
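
The same call as a runnable sketch; `filled` is a name chosen for this example.

```python
import numpy as np

filled = np.full(10, 2.5)  # ten elements, each equal to 2.5
print(filled.shape)  # (10,)
```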

Question 35

Q

Numpy: Convert python list to numpy array

Answer

A

a = [1, 2, 3]
np.array(a)

Question 36

Q

Numpy accessing an array element

Answer

A

Index in Python starts from 0 so we can access it as a[2] which is the 3rd element of the array.
We can also change the value using assignment operator i.e. a[2] = 10
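
A small sketch combining this card with the previous one: convert a list, read the third element, then overwrite it.

```python
import numpy as np

a = np.array([1, 2, 3])  # convert a Python list to a numpy array

third = a[2]  # 3, because indexing starts at 0
a[2] = 10     # assignment changes the element in place
print(a)      # [ 1  2 10]
```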

Question 37

Q

Numpy: arange

Answer

A

np.arange(10)
It creates an array from 0 to 9
np.arange(3, 10)
It creates an array from 3 to 9
Note that the stop value 10 is not included: as with Python's range, the end is exclusive.
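
A runnable sketch of both calls. Note the function is spelled `np.arange`, with a single "r"; the variable names are chosen for this example.

```python
import numpy as np

first_ten = np.arange(10)         # 0, 1, ..., 9 (stop value excluded)
three_to_nine = np.arange(3, 10)  # 3, 4, ..., 9

print(first_ten)      # [0 1 2 3 4 5 6 7 8 9]
print(three_to_nine)  # [3 4 5 6 7 8 9]
```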

Question 38

Q

Numpy: linspace

Answer

A

np.linspace(0, 1, 11)
Creates an array of evenly spaced values between the first and second parameters; the third parameter is the number of values. The first and second parameters are both inclusive.
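
A quick sketch of the call above; `points` is a name chosen for this example.

```python
import numpy as np

# 11 evenly spaced values from 0 to 1, both ends included: 0.0, 0.1, ..., 1.0
points = np.linspace(0, 1, 11)

print(len(points))            # 11
print(points[0], points[-1])  # 0.0 1.0
```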

Question 39

Q

Numpy: Multidimensional arrays

Answer

A

np.zeros((5, 2))
The shape is passed as a single tuple: the first value is rows and the second is columns.
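
A runnable sketch. `np.zeros` takes the shape as one tuple argument, because the second positional argument of `np.zeros` is the dtype.

```python
import numpy as np

m = np.zeros((5, 2))  # 5 rows, 2 columns
print(m.shape)  # (5, 2)
```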

Question 40

Q

Numpy: Multidimensional arrays from python list of list

Answer

A

a= [[1,2,3], [4, 5, 6], [7,8,9]]
n=np.array(a)
Indexes start from 0 for both rows and columns.
n[0, 1]= 20
If we want a whole row instead of a single element, we can just say n[0]
We can update row as well n[0]=[1,1,1]
To access columns
n[:,1] gets the second column with all rows
We can also assign something else to it e.g. n[:,1]=[2,3,4]
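
The operations above as one runnable sketch; `first_row` is a name added for illustration.

```python
import numpy as np

n = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

n[0, 1] = 20              # row 0, column 1
first_row = n[0].copy()   # [1 20 3]; a copy, so later edits to n do not change it
n[0] = [1, 1, 1]          # overwrite the whole first row
n[:, 1] = [2, 3, 4]       # overwrite the whole second column

print(n)
# [[1 2 1]
#  [4 3 6]
#  [7 4 9]]
```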

Question 41

Q

Numpy: Random generated arrays

Answer

A

np.random.seed(2)
np.random.rand(5,2)
If we want to make sure that it produces the same random numbers every time, we set a seed.
np.random.rand returns numbers between 0 and 1, so you can multiply by 100 to get numbers between 0 and 100, e.g. 100 * np.random.rand(5,2)
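
A small sketch showing that resetting the seed replays the same numbers. Variable names are chosen for this example.

```python
import numpy as np

np.random.seed(2)
first = np.random.rand(5, 2)

np.random.seed(2)              # same seed, so the same sequence is replayed
second = np.random.rand(5, 2)

scaled = 100 * first           # values now lie in [0, 100)
print((first == second).all())  # True
```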

Question 42

Q

Numpy: Random Number Distributions

Answer

A

There are many different distributions, e.g. to draw from the standard normal distribution use np.random.randn(5,2)

Question 43

Q

Numpy: Random Integers

Answer

A

If we want to produce a matrix of random integers, we use np.random.randint(low=0, high=100, size=(5,2)). The high parameter is not inclusive.
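
A runnable sketch using `np.random.randint`; the seed value is an arbitrary choice for reproducibility.

```python
import numpy as np

np.random.seed(2)
ints = np.random.randint(low=0, high=100, size=(5, 2))

# Every value is an integer in 0..99, since high is excluded
print(ints.shape)  # (5, 2)
```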

Question 44

Q

Numpy: Element-wise operations

Answer

A

Numpy makes it easy to apply operations without a for loop.
e.g. a = np.arange(5)
a + 1
a * 2
a / 20
(10 + (a * 2)) ** 2 / 100
We can apply operations between two arrays as well.
a / b * 10
a + b
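
A runnable sketch of element-wise operations. The second array `b` and the result names are made up for this example.

```python
import numpy as np

a = np.arange(5)               # [0 1 2 3 4]
b = np.array([1, 2, 3, 4, 5])  # hypothetical second array of the same length

plus_one = a + 1               # [1 2 3 4 5]
doubled = a * 2                # [0 2 4 6 8]
combined = (10 + (a * 2)) ** 2 / 100  # [1.   1.44 1.96 2.56 3.24]

total = a + b                  # [1 3 5 7 9], element by element
ratio = a / b * 10             # each a[i] / b[i], then times 10
```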

Question 45

Q

Numpy: Comparison operations

Answer

A

e.g.
a>=2
a > b returns an array of True/False values
a[a > b] returns the elements of a for which the condition is True
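
A small sketch of comparison and filtering; the values of `b` are made up for illustration.

```python
import numpy as np

a = np.arange(5)               # [0 1 2 3 4]
b = np.array([3, 0, 5, 1, 2])  # hypothetical values

mask = a > b        # [False  True False  True  True]
selected = a[mask]  # [1 3 4], the elements of a where the mask is True
```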

Question 46

Q

Numpy: Summarizing operations

Answer

A

Operations that return a single number rather than an array, e.g.
a.min()
a.max()
a.sum()
a.mean()
a.std()
They work for 2-D arrays as well.
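
A runnable sketch of the summarizing operations. The 2-D example and the `axis` argument go slightly beyond the card and are included as an assumption about what is useful to know.

```python
import numpy as np

a = np.arange(5)  # [0 1 2 3 4]
print(a.min(), a.max(), a.sum(), a.mean())  # 0 4 10 2.0
spread = a.std()  # square root of 2 for this array

m = np.array([[1, 2], [3, 4]])
total = m.sum()              # 10, summed over all elements
per_column = m.sum(axis=0)   # [4 6], one sum per column
```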

Question 47

Q

Linear Algebra: Vector Operations

Answer

A

Multiplying a column vector by 2 = multiply every element by 2
Adding 2 vectors = add the corresponding elements of the vectors

Question 48

Q

Linear Algebra: Vector by Vector multiplication (2 column vectors)

Answer

A

It is also called the dot product or inner product.
Vector u by vector v. In numpy, u * v is element-by-element multiplication and returns a vector, but in linear algebra the dot product returns a single number: multiply the corresponding elements and then add up the results.

Question 49

Q

Linear Algebra: Row Vector by column Vector multiplications

Answer

A

The column vector u is transposed into a row vector and multiplied by the column vector v
$u^T v = \sum_{i=1}^{n} u_i v_i$

Both vectors must have the same number of elements.

Question 50

Q

Numpy: Dot product

Answer

A

We can implement the dot product ourselves, but numpy also provides it as u.dot(v)
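
A minimal sketch comparing a hand-written dot product with `u.dot(v)`. The vectors and the function name are made up for this example.

```python
import numpy as np

u = np.array([2, 4, 5, 6])
v = np.array([1, 0, 0, 2])

def vector_vector_multiplication(u, v):
    # Both vectors must have the same number of elements
    assert u.shape[0] == v.shape[0]
    result = 0.0
    for i in range(u.shape[0]):
        result += u[i] * v[i]  # multiply matching elements and accumulate
    return result

manual = vector_vector_multiplication(u, v)  # 2*1 + 4*0 + 5*0 + 6*2 = 14.0
builtin = u.dot(v)                           # 14
```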

Question 51

Q

Linear Algebra: Matrix and Vector Multiplication

Answer

A

We multiply each row of the matrix by the column vector.
The result has elements $u_0^T v$ to $u_{k-1}^T v$, where $u_i^T$ is row i of the matrix U.
The number of columns in the matrix should match the number of elements (rows) of the column vector.
In numpy, we can use the dot function for matrix-vector multiplication, i.e. U.dot(v)
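
A minimal sketch of matrix-vector multiplication done row by row and checked against `U.dot(v)`. The values and the function name are made up for this example.

```python
import numpy as np

U = np.array([
    [2, 4, 5, 6],
    [1, 2, 1, 2],
    [3, 1, 2, 1],
])
v = np.array([1, 0.5, 2, 1])

def matrix_vector_multiplication(U, v):
    # Columns of U must match the number of elements of v
    assert U.shape[1] == v.shape[0]
    result = np.zeros(U.shape[0])
    for i in range(U.shape[0]):
        result[i] = U[i].dot(v)  # dot product of row i with v
    return result

manual = matrix_vector_multiplication(U, v)  # [20.   6.   8.5]
builtin = U.dot(v)
```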

Question 52

Q

Linear Algebra: Matrix by Matrix multiplication

Answer

A

We take the first matrix as a whole and split the second matrix into column vectors.
Then we multiply the first matrix by each column vector; each result is one column of the output.
We check the dimensions: the number of columns in the first matrix must equal the number of rows in the second matrix.
The output matrix has as many rows as the first matrix and as many columns as the second matrix.
With numpy we can do U.dot(V)
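
A minimal sketch of matrix-matrix multiplication built column by column and checked against `U.dot(V)`. The values and the function name are made up for this example.

```python
import numpy as np

U = np.array([
    [2, 4, 5, 6],
    [1, 2, 1, 2],
    [3, 1, 2, 1],
])
V = np.array([
    [1, 1, 2],
    [0, 0.5, 1],
    [0, 2, 1],
    [2, 1, 0],
])

def matrix_matrix_multiplication(U, V):
    # Columns of U must match rows of V
    assert U.shape[1] == V.shape[0]
    result = np.zeros((U.shape[0], V.shape[1]))
    for i in range(V.shape[1]):
        result[:, i] = U.dot(V[:, i])  # U times column i of V
    return result

manual = matrix_matrix_multiplication(U, V)
builtin = U.dot(V)
print(manual.shape)  # (3, 3)
```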

Question 53

Q

Linear Algebra: Identity Matrix

Answer

A

A square matrix with 1s on diagonal and zeros everywhere else.
If I is multiplied by U matrix, we get U back.
In numpy, np.eye(3)
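
A quick sketch checking that multiplying by the identity matrix leaves a matrix unchanged; the matrix U is made up.

```python
import numpy as np

I = np.eye(3)  # 3x3 identity matrix
U = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
])

same_right = np.allclose(U.dot(I), U)  # U times I gives U
same_left = np.allclose(I.dot(U), U)   # I times U gives U
print(same_right, same_left)  # True True
```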

Question 54

Q

Linear Algebra: Matrix Inverse

Answer

A

Multiplying a matrix A by its inverse $A^{-1}$ returns the identity matrix.
Only a square matrix can have an inverse (and not every square matrix has one).
A square matrix is one with an equal number of rows and columns.
The inverse is calculated in numpy as np.linalg.inv(Vs), where Vs is a square matrix.
This is quite useful for linear regression.
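
A small sketch with a made-up 2x2 matrix, checking that the matrix times its inverse gives the identity.

```python
import numpy as np

A = np.array([
    [2.0, 1.0],
    [1.0, 1.0],
])

A_inv = np.linalg.inv(A)  # [[ 1. -1.], [-1.  2.]] up to rounding
product = A.dot(A_inv)    # approximately the 2x2 identity matrix

print(np.allclose(product, np.eye(2)))  # True
```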

Question 55

Q

Pandas

Answer

A

It’s a library in python for manipulating tabular data

Question 56

Q

Pandas: Dataframe

Answer

A

Dataframe is a table.
df = pd.DataFrame(data, columns=columns)
data is a list of lists and columns is a list of column names
Each sublist is a row in the dataframe.
We can also define a list of dictionaries where each dictionary is a row and keys are columns.
We can use head() to check first few rows of the data. E.g. df.head(n=5)
Every column is a pandas series
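
A minimal sketch of both ways to build a dataframe. The car data is made up for illustration. `columns` is passed by keyword because the second positional argument of `pd.DataFrame` is the index.

```python
import pandas as pd

data = [
    ["Nissan", "Stanza", 1991],
    ["Hyundai", "Sonata", 2017],
    ["Lotus", "Elise", 2010],
]
columns = ["Make", "Model", "Year"]

# From a list of lists: each sublist is a row
df = pd.DataFrame(data, columns=columns)

# From a list of dictionaries: each dictionary is a row, keys are columns
records = [dict(zip(columns, row)) for row in data]
df_from_dicts = pd.DataFrame(records)

print(df.head(n=2))
```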

Question 57

Q

Pandas: Series

Answer

A

Accessing a column in pandas e.g.
df.Make or df['Make']
Subset of columns e.g. df[['Make', 'Model']]
Adding another column to the dataframe e.g. df['id'] = [1, 2, 3, 4, 5]
To delete a column e.g. del df['id']
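
The column operations above as a runnable sketch, on a made-up three-row dataframe. `had_id` is a name added to record the intermediate state.

```python
import pandas as pd

df = pd.DataFrame({
    "Make": ["Nissan", "Hyundai", "Lotus"],
    "Model": ["Stanza", "Sonata", "Elise"],
    "Year": [1991, 2017, 2010],
})

make = df["Make"]               # a single column is a Series
subset = df[["Make", "Model"]]  # a list of names returns a DataFrame

df["id"] = [1, 2, 3]            # add a column (one value per row)
had_id = "id" in df.columns     # True at this point
del df["id"]                    # delete the column again
```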

Question 58

Q

Pandas: Index

Answer

A

The IDs of rows in pandas are the index
e.g. df.index
df.Make.index
We can access elements of the dataframe using this index
e.g. df.loc[1] returns the row with index label 1
We can also get multiple rows e.g. df.loc[[1, 2]]
We can replace the index too e.g. df.index = ['a', 'b'] (one label per row)
To access rows by position instead of by label, we use iloc
e.g. df.iloc[[1, 2, 4]]
df.reset_index(drop=True) resets the index to 0, 1, 2, ... and drops the old one
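
A minimal sketch of label-based and positional access on a made-up three-row dataframe.

```python
import pandas as pd

df = pd.DataFrame({"Make": ["Nissan", "Hyundai", "Lotus"]})

row = df.loc[1]                 # row with index label 1 (default labels are 0, 1, 2)

df.index = ["a", "b", "c"]      # replace the index: one label per row
by_label = df.loc["b"]          # label-based access
by_position = df.iloc[[0, 2]]   # position-based access, whatever the labels are

# Back to 0, 1, 2; drop=True discards the old labels instead of keeping them as a column
df = df.reset_index(drop=True)
```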

Question 59

Q

Pandas: Element-wise operations

Answer

A

We can apply all the operations that can be done on a numpy array, e.g. division, subtraction, multiplication, but we apply them to a pandas Series. If an element is null, the result for that element is simply NaN.

We can use comparators and do filtering as well.
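
A small sketch of element-wise operations and filtering on a Series. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Engine HP": [138.0, np.nan, 300.0],
    "Year": [1991, 2017, 2010],
})

doubled = df["Engine HP"] * 2   # 276.0, NaN, 600.0: the missing value stays NaN
recent = df[df["Year"] >= 2010] # keep only rows where the comparison is True
```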

Question 60

Q

Pandas: String and Summarizing operations

Answer

A

String operations include str.lower(), str.upper(), str.replace(value_to_replace, value_to_replace_with) etc. They are applied to a pandas Series, so each element of the Series is transformed.

Summarizing operations include sum(), min(), max() and mean(). The describe() function reports summary statistics of a column, e.g. mean and standard deviation. We can call describe() on both a Series and a DataFrame.
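
A minimal sketch of string and summarizing operations on made-up Series.

```python
import pandas as pd

s = pd.Series(["Front Wheel Drive", "All Wheel Drive"])

# String operations transform every element of the Series
cleaned = s.str.lower().str.replace(" ", "_")

prices = pd.Series([2000, 27150, 54990])

# Summarizing operations return a single number
lowest, highest, average = prices.min(), prices.max(), prices.mean()

# describe() returns several statistics at once (count, mean, std, min, ...)
summary = prices.describe()
```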

Question 61

Q

Pandas: Unique Values

Answer

A

The number of unique values. For a single column it gives one number; for the dataframe it gives the number of unique values for each column.
df.nunique()
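
A quick sketch on a made-up dataframe.

```python
import pandas as pd

df = pd.DataFrame({
    "Make": ["Nissan", "Nissan", "Lotus"],
    "Year": [1991, 2017, 2010],
})

per_column = df.nunique()      # one count per column: Make 2, Year 3
single = df["Make"].nunique()  # 2
```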

Question 62

Q

Pandas: Null Values

Answer

A

df.isnull().sum()
It gives the number of null values in each column.
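
A quick sketch on a made-up dataframe with two missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Engine HP": [138.0, np.nan, np.nan],
    "Year": [1991, 2017, 2010],
})

# isnull() marks each cell True/False; sum() counts the True values per column
missing = df.isnull().sum()
print(missing["Engine HP"], missing["Year"])  # 2 0
```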

Question 63

Q

Pandas: Underlying numpy

Answer

A

df.col.values returns the underlying numpy array of a column
df.to_numpy() returns the whole dataframe as a numpy matrix
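
A short sketch on a made-up dataframe.

```python
import pandas as pd

df = pd.DataFrame({"Year": [1991, 2017], "Price": [2000, 27150]})

col_array = df["Year"].values  # 1-D numpy array for one column
matrix = df.to_numpy()         # 2-D numpy array for the whole dataframe

print(col_array.shape, matrix.shape)  # (2,) (2, 2)
```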

Question 64

Q

Regression Steps

Answer

A

1) Select relevant columns and create a numpy matrix X.
2) Compute the matrix-matrix multiplication between the transpose of X and X, called $X^T X$.
3) Take the inverse of $X^T X$.
4) Create a target variable y.
5) Multiply the inverse of $X^T X$ by the transpose of X and then multiply by y.
6) The result is called w, i.e. $w = (X^T X)^{-1} X^T y$.
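
The steps above can be sketched as follows. The data is made up so that y = 1 + 2x exactly. The column of ones in X is an assumption added here so that w also contains a bias term; the card itself does not mention it.

```python
import numpy as np

# 1) Feature matrix X: first column of ones (assumed bias term), second column is x
X = np.array([
    [1.0, 1.0],
    [1.0, 2.0],
    [1.0, 3.0],
    [1.0, 4.0],
])

# 2) X^T X
XTX = X.T.dot(X)

# 3) Inverse of X^T X
XTX_inv = np.linalg.inv(XTX)

# 4) Target variable y, generated here as y = 1 + 2 * x
y = np.array([3.0, 5.0, 7.0, 9.0])

# 5) and 6) w = (X^T X)^-1 X^T y
w = XTX_inv.dot(X.T).dot(y)
print(w)  # approximately [1. 2.]
```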

Answer 65

A

1) Prepare Data and do EDA (Exploratory Data Analysis)
2) Use a model e.g. linear regression
3) Evaluate the model with RMSE
4) Feature Engineering
5) Regularization
6) Using the model
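
Step 3 can be sketched as below. RMSE is the root mean squared error; the function name and the numbers are made up for this example.

```python
import numpy as np

def rmse(y, y_pred):
    """Root mean squared error between actual and predicted values."""
    error = y - y_pred
    return np.sqrt((error ** 2).mean())

y = np.array([3.0, 5.0, 7.0])       # hypothetical actual values
y_pred = np.array([2.0, 5.0, 9.0])  # hypothetical predictions

# Errors are 1, 0, -2, so the mean squared error is 5/3
score = rmse(y, y_pred)
```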