AI+ML+DS Flashcards
- Deep Learning, Goodfellow - Artificial Intelligence, Norvig - Introduction to Statistical Learning in Python, Hastie
Knowledge Based Approach to AI
Hard-coding knowledge about the world in formal languages for the computer system to make logical inference rules.
Machine Learning
Subset of artificial intelligence that enables computers to learn from data and improve their performance on a specific task without being explicitly programmed. Instead of being hand-coded with specific rules, machine learning algorithms can identify patterns and make predictions based on the data they are trained on.
Artificial Intelligence
A broad field of computer science that aims to create intelligent agents, which are systems that can reason, learn, and act autonomously.
In simpler terms, AI involves developing machines that can think and behave like humans.
Deep Learning
A subset of machine learning that uses artificial neural networks with multiple layers to learn from data. These neural networks are inspired by the structure and function of the human brain, and they can learn complex patterns and relationships in data that are difficult for traditional machine learning algorithms to capture.
Solves a central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations
Representation of Data
Refers to the way data is structured and encoded so that it can be processed by machine learning algorithms. The choice of data representation can significantly impact the performance and efficiency of a machine learning model.
Feature
Each piece of information included in the representation of an observation
Representation Learning
Use of machine learning to discover not only the mapping from representation to output but also the representation itself
Autoencoder
Quintessential example of representation learning
Combination of an encoder function, which converts the input data into a different representation, and a decoder function, which converts the new representation back into the original format
Trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but also trained to make the new representation have various nice properties
Encoder
Converts the input data into a different representation
Decoder
Converts the new representation (encoded) back into the original format
Factors of Variation
Concepts of abstractions that help us make sense of the rich variability in the data
Multilayer Perceptron (MLP)
A type of artificial neural network that consists of multiple layers of interconnected neurons. Each neuron takes a weighted sum of its inputs, applies an activation function, and passes the result to the next layer.
Input/Visible Layer
First layer of neural network that contains the variables we are able to observe
Hidden Layer
Layers of a neural network that are not the first (input) or last (output) layers of the network. Extracts increasingly abstract features from the data. Their values are not given in the data; instead, the model must determine which concepts are useful for explaining the relationships in the observed data.
Adaptive Linear Element (ADALINE)
A type of single-layer artificial neural network used for linear regression and classification tasks. It is similar to the perceptron but uses a least mean squares (LMS) algorithm for training, which allows it to learn more efficiently.
Rectified Linear Unit (ReLu)
A popular activation function used in artificial neural networks. It introduces non-linearity into the model, allowing it to learn complex patterns.
How does it work?
- If the input (x) is positive, the output is the input itself.
- If the input is negative, the output is zero.
Linear Algebra
Branch of mathematics that deals with the study of vectors, matrices, and linear transformations. It provides a framework for solving systems of linear equations and analyzing the properties of linear relationships between variables.
Scalar
A single number
Written in italics with lowercase variable names
Can be thought of as a matrix with a single entry
Vector
An array of ordered scalars = 1-D
(each number has a specific location in the array)
Written in lowercase names with bold typeface
Can be thought of as matrices that contain only one column
Matrix
2-D array of scalars
Written in uppercase names with bold typeface
Tensor
N-D array of scalars
Written in uppercase names with bold-tensor typeface
bold-tensor typeface is slightly different than our traditional bold typeface
Matrix Operation: Transpose
Taking the mirror image of a matrix across the main diagonal
Main Diagonal
Diagonal line on a matrix running down to the right, starting from its upper left corner.
Broadcasting
The implicit copying of a scalar to many locations when performing a matrix operation
Turing Test
A test of a machine’s ability to exhibit intelligent behavior indistinguishable from that of a human. It was proposed by Alan Turing in 1950.
Control Theory
Deals with designing devices that act optimally on the basis of feedback from the environment.
4 Approaches to AI
- Acting Humanly: The Turing Test approach
- Thinking Humanly: The cognitive modeling approach
- Thinking Rationally: The “laws of thought” approach
- Acting Rationally: the rational agent approach
Agent
Anything that can be viewed as perceiving its environment and acting upon that environment.
Perceives through sensors
Acts through actuators
Agent = architecture + program
Rational Agent
An Agent that acts so as to achieve the best outcome, or, when there is uncertainty, the best expected outcome
Rational != Perfect
Rationality maximizes expected performance, while perfection maximizes actual performance
Standard Model
General framework for representing and analyzing systems
The focus on the study and construction of agents that “do the right thing”
What is the “right thing”? -> Defined by the standard model
Control Theory: controller minimizes a cost function
Operations Research: policy maximizes a sum of rewards
Statistics: Decision rule minimizes a loss function
Economics: Where a decision maker maximizes utility or some measure of social welfare
Value Alignment Problem
The values or objectives put into the machine must be aligned with those of the human
Behaviors are not “unintelligent” or “insane”; they are a logical consequence of defining winning as the sole objective for a machine
Algorithm
A step-by-step set of instructions or rules to be followed to solve a specific problem or achieve a particular outcome. It’s like a recipe for a computer, providing a precise sequence of actions to perform
Incompleteness Theorem
States that any consistent formal system that is powerful enough to express basic arithmetic statements is necessarily incomplete. In other words, there will always be true statements within the system that cannot be proven or disproven using the axioms and rules of inference within that system.
Computability
Capable of being computed by an effective procedure
Tractability
Refers to the property of a problem that can be solved by an algorithm in a reasonable amount of time. In other words, a tractable problem is one that can be efficiently solved using a computer.
Intractable
Time required to solve instances of the problem grows exponentially with the size of the instances
NP-Completeness
A concept in theoretical computer science that refers to a class of decision problems that are considered to be among the most difficult to solve. These problems are characterized by the fact that while it is relatively easy to verify a solution, it is extremely difficult to find a solution.
Decision Theory
A field of study that deals with making rational choices in the face of uncertainty. It provides a framework for analyzing and making decisions when there are multiple possible outcomes and associated probabilities.
Game Theory
A branch of mathematics and economics that studies strategic decision-making among rational agents. It analyzes situations where the outcome for one agent depends on the choices made by other agents.
Multiagent Systems
Systems composed of multiple autonomous agents that interact with each other and their environment to achieve a common goal. These agents can be software programs, robots, or even humans.
Operations Research
A field of study that applies mathematical and analytical techniques to solve complex problems that arise in business, industry, and other organizations. It focuses on optimizing systems, processes, and decisions to improve efficiency and effectiveness.
Singularity
A hypothetical future point in time when technological growth becomes uncontrollable and irreversible, resulting in unforeseeable changes to human civilization. It is often associated with the development of artificial intelligence (AI) that surpasses human intelligence in all aspects.
Hebbian Learning
A principle in neuroscience that states that neurons that fire together wire together. This means that if two neurons are frequently activated simultaneously, the connection between them is strengthened. Conversely, if two neurons are rarely activated simultaneously, the connection between them weakens.
Matrix Operation: Dot Product
C=AB
If A is of shape m x n and B is of shape n x p, then C is of shape m x p
A must have the same number of columns as B has rows
A = (2x3) B = (3x2) C = (2x2)
A = [[1, 2, 3],
[4, 5, 6]]
B = [[7, 8],
[9, 10],
[11, 12]]
Dot product of A and B = C:
C = [[17+29+311, 18+210+312],
[47+59+611, 48+510+612]]
Matrix Operation: Element-wise Product
Matrix A and Matrix B must be of same shape. Then you multiply the corresponding elements in each matrix for result.
Matrix Operation: Scalar Product
Multiplying every value of a matrix by a scalar constant
Resulting matrix is the same shape and output is original matrix with each element multiplied by constant scalar
Matrix Notation: Rows and Columns
m x n
m = # of rows
n = # of columns
System of Linear Equations
A collection of two or more linear equations with the same variables. Each equation represents a straight line on a graph, and the solution to the system is the point(s) where these lines intersect.
A set of two or more equations with the same variables. The goal is to find values for the variables that satisfy all of the equations simultaneously.
Identity Matrix
I
Matrix that does not change any vector when we multiple that vector by that matrix
The structure of the identify matrix is simple: all the entries along the main diagonal are 1, while all other entries are zero
Matrix Operation: Inversion
Produces the Inverse Matrix
The inverted matrix is another matrix that, when multiplied by our original matrix, produces the identity matrix.
A * B = B * A = I
Linear Algebra: Origin
The point specified by the vector of all zeros
Linear Combination of vectors
Produced by multiplying each vector, in a list of vectors (matrix), by a corresponding scalar coefficient and then adding the results.
Matrix: Span
The set of all points obtainable by linear combination of the original vectors
Column Span/ Range
Refers to the set of all linear combinations of the columns of a matrix. It is a subspace of the vector space containing the columns.
Linear Dependence
Refers to a relationship between a set of vectors where one vector can be expressed as a linear combination of the others.
In other words, if you can find a set of non-zero scalars (coefficients) such that a linear combination of the vectors equals the zero vector, then the vectors are linearly dependent.
If you can add 2 columns together, after first multiplying them by a scalar constant, to get another column - those columns are linearly dependent.
Linear Independence
True of a set of vectors if no vector in the set is a linear combination of the other vectors
Matrix Attribute: Square
m = n and all columns are linear independent
Square Matrix Attribute: Singular
A square matrix with linear independent columns
Matrix: Norm
Tools used to measure the size of vectors
Functions mapping vectors to non-negative values
Measures the distance from the origin to the point x
Lp norm
Generalization of L2 and L1 norm.
L1 norm (p = 1)
L2 norm (p = 2)
Linf norm (p = inf)
L2 Norm
aka Euclidean Norm
Measures the ordinary Euclidean distance from the origin
L1 Norm
aka Manhattan Norm
Measures the sum of the absolute values of the components of a vector
Used in ML applications when the difference between zero and nonzero elements is very important
L-inf Norm
aka Maximum Norm
Measures the maximum absolute value of components of a vector
Diagonal Matrix
Consists mostly of zeros and have nonzero entries only along the main diagonal
Ex. Identity Matrix
Symmetric Matrix
Any matrix that is equal to its own transpose
Often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments
Unit Vector
A vector with unit norm
Unit Norm
Refers to a vector whose norm is equal to 1. Vector of length 1.
Matrix: Orthogonal
In linear algebra refers to two vectors that are perpendicular to each other. In other words, if the dot product of two vectors is zero, they are orthogonal.
Orthogonal matrixes are of interest because their inverse is very cheap to compute
Matrix: Orthonormal
Vectors that are not only orthogonal but also have unit norm
Eigendecomposition
One of the most widely used kinds of matrix decomposition
We decompose a matrix into a set of eigenvectors and eigenvalues
Eigenvalue
A fundamental concept in linear algebra that represent the scaling factor associated with a vector when a linear transformation is applied to it.
In other words, an eigenvalue tells you how much a vector is stretched or shrunk when it is multiplied by a matrix.
Eigenvector
Vectors that remain unchanged in direction when a linear transformation is applied to them.
In other words, when a matrix is multiplied by an eigenvector, the result is a scalar multiple of the original eigenvector.
Positive Definite
A matrix whose eigenvalues are all positive
Positive Semidefinite
A matrix whose eigenvalues are all positive or zero
Negative Definite
A matrix whose eigenvalues are all negative
Negative Semidefinite
A matrix whose eigenvalues are all negative or zero
Singular Value Decomposition (SVD)
Provides another way to factorize (decompose) a matrix, into singular vectors and singular values
Enables us to discover some of the same kind of information as the eigendecomposition reveals but the SVD is more generally applicable
Every real matrix has a SVD, but that is not true for the eigenvalue decomposition
Singular Vectors
Closely related to singular values and are obtained from the same decomposition. They provide information about the principal directions of the data represented by a matrix.
Singular Values
A fundamental concept in linear algebra that provide information about the scaling and rotation properties of a matrix. They are closely related to eigenvalues and eigenvectors.
Moore-Penrose Pseudoinverse
Generalization of the inverse matrix for rectangular or non-invertible square matrices.
Trace Operator
Gives the sum of all the diagonal entries of a matrix
Determinant
Scalar value associated with a square matrix
Function that maps matrices to real values. Equal to the product of all eigenvalues of the matrix.
Absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space
If determinant=0, space is contracted completely along at least one dimension
If determinant=1, then the transformation preserves volume
Agent Function
Describes the agent’s behavior - its mapping from percepts to actions
Policy
Agent Program
Concrete implementation the Agent Function.
In general, an agent’s choice of action at any given instant can depend on its built-in knowledge and on the entire percept sequence observed to date, but not on anything it hasn’t perceived
TODO
Performance Measure
Captures the notion of desirability by evaluating given sequences of environmental states
Return?
As a general rule, it is better to design performance measures according to what one actually wants to be achieved in the environment, rather than according to how one thinks the agent should behave
TODO
Omniscience
The ability to know, for certain, the actual outcome of actions and then act accordingly to achieve goal.
Q: Compare and contrast information gathering and exploration actions
Information Gathering actions are actions taken to acquire predefined information while explorative actions are taken to discover new/unknown information
Information Gathering Actions
Taking actions in order to modify future, predefined, percepts
Ex. Looking both ways before you cross the street
Explorative Actions
Actions taken to discover new ideas, concepts, or possibilities with a focus to broaden understanding.
Task Environment: Components
PEAS
- Performance
- Environment
- Actuators
- Sensors
Task Environment: Properties
- Observability: Fully vs. Partial
- Agent: Single vs Multi
Mulit-agent environments can be competitive or cooperative as well - State-Transition: Deterministic vs Nondeterministic
- Interaction: Episodic vs Sequential
- Stationarity: Static vs Dynamic
- State/Action Space: Discrete vs Continuous
- Model: Known vs Unknown
Q: Difference between Stochastic and Nondeterministic
Stochastic = specified probabilities
(“25% chance of rain tomorrow”)
vs
Nondeterministic = non-specified probabilities
(“chance of rain tomorrow”)
Fully- vs Partially-Observable
Fully-Observable = if an agent’s sensors given it access to the complete state of the environment at each point in time
Partially-Observable = if the sensors only give partial access
Unobservable = sensors are unable to see any of the environment
Single vs Multi Agent Environment
The key distinction is whether object B’s behavior is best described as maximizing a performance measure whose value depends on Agent A’s behavior?
If yes, Multi Agent
If no, Single Agent
Multi Agent can be competitive or cooperative
Deterministic vs Nondeterministic
Deterministic = the next state of the environment is completely determined by the current state and the action executed by the agent(s)
Nondeterministic = otherwise
Episodic vs Sequential
Episodic = Agent’s experience is divided into atomic episodes with 1 time-step each
Sequential = The current decision could affect all future decisions
Checking Defective Parts = Episodic
Chess = Sequential
Stationary (Static) vs. Nonstationary (Dynamic)
Nonstationary = environment can change while agent is operating within it
Stationary = otherwise
Discrete vs Continuous
Describes handling/datatypes of the environment/state, time, and actions of a task.
Known vs. Unknown
Known = Agent has a full transition model of the environment where the outcome of every action is KNOWN
Unknown = otherwise
Agent Architecture
Computing device with physical sensors and actuators that agent program will on
Agent Program: Types
- Simple Reflex
- Model-Based Reflex
- Goal-Based
- Utility-Based
Simple Reflex Agent
Agent that selects actions on the basis of the current percept only, ignoring the rest of the percept history.
Operates via condition-action rules
Model-Based Reflex Agent
Agent that maintains a transition model and sensor model
Goal-Based Agent
Agents that utilize a goal describing situations that are desirable - like being at a particular destination
Agent that has a reward signal of 1 only at the end of the episode.
Utility-Based Agent
Agents that utilize a goal but also a utility function to understand if one state is more desirable than another state on the way to achieving the goal
Value Function
Condition-Action Rule
if A then B
aka
situation-action rule
production
if-then rule
Transition Model
Models “how the world (environment) works”
Sensor Model
Models how the environment is reflected by the agent’s percepts of it
Learning Agent: Components
- Learning Element
- Performance Element
- Critic
- Problem Generator
Learning Element
Element responsible for making improvements to the agent
Uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future
Performance Element
Element responsible for selecting external actions.
Takes percepts and decides on actions
Critic
Determines how well the agent is doing
Problem Generator
Responsible for suggesting actions that will lead to new and informative experiences.
Responsible for exploration
Agent Components: Representation
- Atomic
- Factored
- Structured
- Distributed Representation
Atomic Representation
Each state of the world is indivisible - it has no internal structure
Factored Representation
Splits each state up into a fixed set of variables/attributes, each of which can have a value.
Structured Representation
Considers states but also the relationships between them.
Underlies relational databases and first-order logic, first-probability models, and much of natural language
Statistical Learning
Vast set of tools for understanding data. These tools can be supervised or unsupervised.
Supervised Learning
Involves building a statistical model for predicting, or estimating, an output based on one or more inputs
Unsupervised Learning
There are inputs but no supervising output (like supervised learning); nevertheless, we can learn relationships and structure from such data.
Regression
Predicting a continuous/quantitative output value
Supervised
Problems with a quantitative response
Classification
Predicting a categorical/qualitative output value
Supervised
Problems with a qualitative response
Clustering
Grouping elements together according to their observed characteristics
Unsupervised
The Four ISL Premises
- Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences
- Statistical learning should not be viewed as a series of black boxes
- While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box
- It is presumed that the reader is interested in applying statistical learning methods to real-world problems
Input Variables
aka predictors, independent variables, features, variables.
Notation: X
Output Variables
aka response, dependent variable
Notation: Y
General Form of Relationship
Y = f(X) + epsilon
Relationship: epsilon
random error term
Nonsystematic Information
Independent of X and has mean zero
Relationship: f(X)
Systematic Information
Q: Why estimate f?
- Prediction
- Inference
Reducible Error
Error introduced because of the chosen model.
Reducible because choosing a different model could yield more or less error if model is a better representation of the true relationship.
Irreducible Error
Even if our model was perfect, there would still be error because Y is a function of f(X), our model, AND epsilon, our nonsystematic error term.
Because of epsilon, error is always present in our model and is irreducible.
Parametric Models
Method where the problem of estimating our model is reduced to estimating a set of parameters.
Involve a two-step model-based approach:
1. We make an assumption about the functional form, or shape, of our model (for example linear for a linear model)
2. Select a procedure that uses our training data to fit/train our selected model.
Ex. (1) Linear Regression + (2) Ordinary Least Squares
Nonparametric Models
Methods that do not make explicit assumptions about the functional form our model will take
Instead, they seek an estimate that is as close to the data points while still being generalizable (not overfit)
Because there is no functional form assumption, a greater amount of data is required in order to obtain an accurate estimate for f.
Overfitting
Artifact of model and training process
If the model is too flexible, it could learn the noise/errors of the data and not the overall relationship and therefore not be a generalized solution
Trade-Off Between Prediction Accuracy and Model Interpretability
The more accurate the model, the less interpretable it is because it is usually nonparametric rather than parametric.
Semi-Supervised Learning
In this method, a model is trained on a dataset that contains a small amount of labeled data and a large amount of unlabeled data.
Essentially using a model to learn on a small batch of labeled data and then using said model to label the unlabeled data.
How can we measure how well model predictions actually match the observed data?
How can we measure the quality of fit?
Fundamentally, we need to measure the distance between unseen data (test data) and our model’s predictions on said data.
Regression:
- MSE = mean squared error
-
Classification:
- Error Rate
-
Mean Squared Error (MSE)
1/n*sum( (y_true-y_pred)^2 )
Classification: Error Rate
1/n*sum( I(y_true!=y_pred) )
Computes the fraction of incorrect classifications
We want this value to be minimized
Where I(y_true!=y_pred) = indicator variable and takes the form of 0 or 1 depending on y_true!=y_pred result.
I = 1 if y_true!=y_pred (misclassed)
I = 0 if y_true =y_pred (correct)
Bias1
Model Driven Error
There is a true relationship in the data that we are trying to model and every model we use will not model that relationship perfectly, but some will model it better than others.
If we have a nonlinear relationship in our data and try to build a model using linear regression, it will not model it very well and have high bias.
We could use a nonparametric model and estimate the relationship much better.
Variance
Data Driven Error
Variance refers to the amount by which our model would change if we estimated it using a different training data set.
Fundamentally, if we use different training data sets, we will generate a different model - and each of these models should be similar given similar training data
High Variance = small changes in training data result in large changes to our model
Bias-Variance Trade-Off
Bias = Model Driven Error
Variance = Data Driven Error
A model that estimates the relationship very well (low bias) will be more susceptible to small changes in the data (high variance)
While models that estimate the relationship more generally (high bias) will be less susceptible to small changes in the data (low variance)
As a general rule, as we use more flexible methods, the variance will increase, and the bias will decrease.