AI Flashcards

1
Q

what is an agent in a search problem?

A

An independent program or entity that interacts with its environment by perceiving its surroundings via sensors, then acting through actuators or effectors

2
Q

what does it mean for a search problem to be observable?

A

The agent can perceive the complete state of the environment at any given time

3
Q

what does it mean for a search problem to be discrete?

A

There are a finite number of actions for any state in the environment.

4
Q

what does it mean for a search problem to be known?

A

the rules of the environment are known to the agent: the constraints and criteria are well-defined, so the agent can determine which state is reached by each action

5
Q

what does it mean for a search problem to be deterministic?

A

Each action has exactly one outcome: the next state is fully determined by the current state and the action taken.

6
Q

what are the main components of a search problem?

A

Initial state, Actions, Transition model (which actions take one state to another), Goal test (is a state the goal or not) and Path cost
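
A minimal Python sketch of how these components might be bundled together (the structure and names are illustrative, not a fixed convention):

    # a sketch of a search problem's five components (names are hypothetical)
    class SearchProblem:
        def __init__(self, initial_state, actions, transition, is_goal, step_cost):
            self.initial_state = initial_state
            self.actions = actions        # state -> list of legal actions
            self.transition = transition  # (state, action) -> resulting state
            self.is_goal = is_goal        # state -> bool (the goal test)
            self.step_cost = step_cost    # (state, action) -> cost; path cost is the sum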

7
Q

what 4 characteristics help us evaluate the performance of a search algorithm?

A

completeness (if there is a solution, the algorithm will always find it), optimality (if there are many solutions, the algorithm will always find the best one), time complexity, space complexity

8
Q

what are 2 slightly better variations of depth-first search?

A

DFS with less memory usage:
delete fully explored subtrees from memory at runtime

DFS with depth-limited search:
only search the top n layers of the tree (sketched below)
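
A minimal sketch of depth-limited search, assuming a `children` function that returns a node's successors (empty at the leaves) and an `is_goal` test:

    # depth-limited search: explore only the top `limit` layers of the tree
    def depth_limited_search(state, is_goal, children, limit):
        if is_goal(state):
            return [state]
        if limit == 0:
            return None                   # cutoff reached: stop descending
        for child in children(state):
            path = depth_limited_search(child, is_goal, children, limit - 1)
            if path is not None:
                return [state] + path
        return None                       # no goal found within the depth limit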

9
Q

what is best-first search?

A

an informed search algorithm that checks the node of lowest estimated cost next, using an evaluation function f(n): node → cost estimate. the function contains a heuristic, an estimate of how close the node is to the goal state.

10
Q

what is a greedy best-first search?

A

a best-first search algorithm where f(n) = h(n): the node with the lowest heuristic value is always checked next.

11
Q

what is A* search algorithm?

A

a best-first search algorithm that keeps two sets: an open set of nodes still to be explored (initially just the starting node) and a closed set of nodes already explored.
While the open set is not empty:
select the node of lowest cost from the open set (the current node) and move it to the closed set. If the current node is the goal, exit the loop. Generate its non-closed neighbours, putting them in the open set and calculating their costs; if a neighbour is already in the open set, update it if the new cost is lower. If the open set becomes empty before the goal is reached, there is no path; otherwise trace back from the goal to find the path (a sketch follows).
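
A sketch of that loop in Python, assuming a `neighbours` function yielding (neighbour, step cost) pairs, a heuristic `h`, and nodes that are hashable and comparable:

    import heapq

    def a_star(start, goal, neighbours, h):
        open_set = [(h(start), start)]            # priority queue ordered on f = g + h
        g = {start: 0}                            # cheapest known cost to each node
        parent = {start: None}
        closed = set()
        while open_set:
            _, current = heapq.heappop(open_set)
            if current in closed:
                continue                          # stale entry: a cheaper copy was handled
            if current == goal:                   # goal reached: trace back to find the path
                path = []
                while current is not None:
                    path.append(current)
                    current = parent[current]
                return path[::-1]
            closed.add(current)
            for nbr, cost in neighbours(current):
                new_g = g[current] + cost
                if nbr not in closed and new_g < g.get(nbr, float("inf")):
                    g[nbr] = new_g                # cheaper route found: update the open set
                    parent[nbr] = current
                    heapq.heappush(open_set, (new_g + h(nbr), nbr))
        return None                               # open set emptied before the goal: no path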

12
Q

what is the evaluation function for the A* algorithm?

A

the A* algorithm uses the evaluation function f(n) = g(n)+h(n) where g(n) is the cost to reach the node and h(n) is the heuristic from the node to the goal.

13
Q

How do you formulate an optimisation problem?

A

design variable: a variable that represents an attempt at solving the problem, for example a list representing a path in a shortest-path algorithm
objective function: a function that takes a candidate solution (a value of the design variable) and outputs its cost/quality
constraints: functions that must be in a range for the solution to be valid, such as f(x)<100 where f(x) calculates distance and 100 is the maximum distance that a lorry driver can travel in a day
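
A tiny sketch of that formulation for the lorry example above (the route and distances are illustrative):

    # design variable: a route, e.g. ["depot", "A", "B"]; leg distances are made up
    def dist(route):
        leg = {("depot", "A"): 40, ("A", "B"): 35, ("depot", "B"): 70}
        return sum(leg[(a, b)] for a, b in zip(route, route[1:]))

    def objective(route):
        return dist(route)                # cost to be minimised

    def feasible(route):
        return dist(route) < 100          # constraint: within a day's driving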

14
Q

what is hill climbing?

A

an optimisation algorithm that starts from a random solution, then checks possible solutions close to the current one (its neighbours). if any of them is of higher quality, it becomes the new current solution. this repeats until the current solution is the best among its neighbours.
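
A minimal sketch, assuming problem-specific `random_solution`, `neighbours` (non-empty) and `quality` functions:

    # hill climbing: repeatedly move to the best neighbour until none is better
    def hill_climb(random_solution, neighbours, quality):
        current = random_solution()
        while True:
            best = max(neighbours(current), key=quality)
            if quality(best) <= quality(current):
                return current            # current beats all neighbours: a (local) maximum
            current = best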

15
Q

what is the disadvantage of hill climbing?

A

the algorithm may discover a local maximum rather than the global maximum. it can also get stuck on plateaus.

16
Q

what is a greedy algorithm?

A

A greedy algorithm selects the best available option at each decision point without considering the overall future consequences or examining the entire search space.

17
Q

what is simulated annealing?

A

A hill-climbing variation that compares a random neighbour to the current solution and, with some probability, makes it the new current solution even if it is of lower quality. this probability, e^(ΔE/T), decreases as time goes on: ΔE is the quality of the neighbour minus that of the current solution, and T is the temperature, which decreases each iteration by a factor such as 0.95. terminates when the minimum T is reached or the current solution stops changing.
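
A sketch of that acceptance rule, assuming problem-specific `random_neighbour` and `quality` functions (only the minimum-temperature stopping condition is shown):

    import math, random

    def simulated_annealing(initial, random_neighbour, quality,
                            t=1.0, t_min=1e-3, alpha=0.95):
        current = initial
        while t > t_min:
            candidate = random_neighbour(current)
            delta_e = quality(candidate) - quality(current)
            # always accept improvements; accept worse moves with probability e^(ΔE/T)
            if delta_e > 0 or random.random() < math.exp(delta_e / t):
                current = candidate
            t *= alpha                    # cool the temperature each iteration
        return current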

18
Q

what must be defined in a hill-climbing problem?

A

Representation and Initialisation: define what form the candidate solution comes in and the boundaries that it is set by
Define Neighbours: give an algorithm that randomly generates a neighbour that still follows the constraints

19
Q

how does gradient descent work with a function of only one parameter f(x)?

A

the algorithm finds a local minimum of the function f(x). starting from a point x₀, find the gradient f′(x₀). the next value is x₁ = x₀ - α·f′(x₀), where α is a constant called the learning rate: a positive gradient moves x down, a negative gradient moves it up. this is repeated until the step size (learning rate × gradient) gets too small.
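
A minimal sketch using a numerical central-difference gradient (the learning rate and tolerances are illustrative):

    def gradient_descent_1d(f, x0, lr=0.01, tol=1e-6, eps=1e-6):
        x = x0
        while True:
            grad = (f(x + eps) - f(x - eps)) / (2 * eps)  # estimate f'(x)
            step = lr * grad
            if abs(step) < tol:
                return x                  # step size too small: stop
            x -= step                     # move against the gradient

    # e.g. gradient_descent_1d(lambda x: (x - 3) ** 2, x0=0.0) converges towards 3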

20
Q

how does gradient descent work with functions of more than one parameter f(x,z,…)?

A

the gradient is a vector that can be calculated by differentiating the function with respect to each variable x,z,…, creating the vector [ ∂f(x,z,…)/∂x, ∂f(x,z,…)/∂z, … ]. so for example the gradient vector for f(x,z) = x⁴ - 3z would be [4x³, -3], and the next point would be (x₁,z₁) = (x₀,z₀) - α[4x₀³, -3]
where α is the learning rate

21
Q

how do you find an optimal linear regression line on bivariate data?

A

the loss function that will be minimised is the sum of squared residuals (SSR), equal to Σ(actual - predicted)² = Σ(y - (mx + c))². this can be pictured as a 3-dimensional surface with m and c on the x and z axes and the SSR on the y-axis. the gradient vector is found by differentiating the SSR with respect to m and c. then m and c can be set initially (as 1 and 0, for example) and refined using the gradient vector and a learning rate until the step size gets small enough, at which point the refinement is complete (sketched below).
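
A minimal sketch of that refinement, assuming a suitably small learning rate (m and c start at 1 and 0 as above):

    # fit y = mx + c by gradient descent on the sum of squared residuals (SSR)
    def fit_line(xs, ys, lr=0.001, tol=1e-9):
        m, c = 1.0, 0.0
        while True:
            # dSSR/dm = -2 Σ x(y - (mx + c));  dSSR/dc = -2 Σ (y - (mx + c))
            dm = sum(-2 * x * (y - (m * x + c)) for x, y in zip(xs, ys))
            dc = sum(-2 * (y - (m * x + c)) for x, y in zip(xs, ys))
            if max(abs(lr * dm), abs(lr * dc)) < tol:
                return m, c               # step size small enough: refinement complete
            m -= lr * dm
            c -= lr * dc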

22
Q

what is the sigmoid function?

A

σ(x) = 1/(1+e^-x), where x is a linear combination of the input's features, in the same number of dimensions as the data; for example, if the data is 3D vectors, x could be a+bx₁+cx₂. the function outputs a number from 0 to 1, interpreted as the likelihood of the data having a certain characteristic (if >0.5 we predict that the input is 1, otherwise we predict 0). this can also be used for logistic regression, where 0 represents one group and 1 represents another: for example, if the data is tumours, 0 is benign and 1 is cancerous. the training points (tumours known to be cancerous or not) are mapped onto either y=1 or y=0.

23
Q

what loss function is used for logistic regression?

A

maximum likelihood: the likelihood is the product over all points of the probability that each is mapped onto the correct side of the function (equivalently, the sum of the log-likelihoods), and the fit that maximises it is chosen.

24
Q

how do you write the probability of a point being 1 mathematically?

A

h(x;w) = σ(wᵀx) = σ(w₀+w₁x₁+w₂x₂+…) = P(y=1|x;w) where σ is the sigmoid function
the number of terms in wᵀx varies with the number of variables in the data. in h(x;w), x is the variables (x₁,x₂,…) and w is the constants (w₀,w₁,…)

25
Q

what is the cost function for logistic regression?

A

g(w) = 1/N Σ (n=1→N) Cost(h(xⁿ;w),yⁿ) where Cost(h(x;w),y) = { -log(h(x;w)) if y=1,
-log(1-h(x;w)) if y=0 }
This can be simplified to:
g(w) = -1/N Σ (n=1→N) [ yⁿ·log(h(xⁿ;w)) + (1-yⁿ)·log(1-h(xⁿ;w)) ]
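
A direct transcription of the simplified form, assuming h(x) returns σ(wᵀx) strictly between 0 and 1:

    import math

    # average cost over N labelled examples (xs are inputs, ys are 0/1 labels)
    def logistic_cost(h, xs, ys):
        n = len(xs)
        return -sum(y * math.log(h(x)) + (1 - y) * math.log(1 - h(x))
                    for x, y in zip(xs, ys)) / n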

26
Q

what is the gradient vector of the cost function in logistic regression?

A

∇g(w) = 1/N Σ (n=1→N) (h(xⁿ;w)-yⁿ)·xⁿ

27
Q

what is a parametric model?

A

a model that uses a finite set of coefficients that can be changed in order to optimise the cost.

28
Q

what is the Minkowski Distance (Lᵖ Norm)?

A

measures distance between two points and is defined by:
Lᵖ(xᵃ,xᵇ) = ᵖ√ [ Σ(j=1→n) |xⱼᵃ - xⱼᵇ|ᵖ ]
In general, p=2 (the Euclidean distance) when using it for k-NN
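
A one-function sketch for two equal-length vectors:

    # Lᵖ norm between points a and b; p=2 gives the Euclidean distance
    def minkowski(a, b, p=2):
        return sum(abs(aj - bj) ** p for aj, bj in zip(a, b)) ** (1 / p)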

29
Q

what is the k Nearest Neighbours (k-NN) model?

A

find the (normalised) Lᵖ norm between the given point and all the points in the training data, then keep the k training items with the least distance. finally, for classification, put the input in the category that most of the k neighbours are in; for regression, predict their average.
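
A minimal sketch using the minkowski() function above on already-normalised data, where `train` is a list of (vector, label) pairs:

    from collections import Counter

    def knn_classify(train, query, k=3):
        nearest = sorted(train, key=lambda item: minkowski(item[0], query))[:k]
        labels = [label for _, label in nearest]
        # majority vote of the k nearest; for regression, average the labels instead
        return Counter(labels).most_common(1)[0][0]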

30
Q

why is it necessary to normalise data before applying the k-NN algorithm?

A

independent variables might have different scales, meaning some would impact the result more than others.

31
Q

how can distance be calculated with binary data (true/false) and categorical data (samsung, apple, nokia)?

A

binary data: map yes/no to 1 and 0 before calculating distance
categorical data: if the values are the same, the distance is 0, otherwise it is 1

32
Q

what is holdout-validation?

A

use a random 70% of the training set to train multiple models. then assess the validation error of each model on the remaining 30% (the validation set) and keep the one with the lowest error: mean squared error for linear regression, and the misclassification rate (# wrong predictions / # predictions) for classification.

33
Q

what is k-fold cross-validation?

A

randomly split the training data into k parts. train each model on every combination of k-1 parts and use the remaining part to assess the validation error. find the average error across the k runs and repeat for each model. take the model with the lowest average error.
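
A sketch of the splitting logic, assuming a hypothetical helper `train_and_error(train, test)` that fits one model and returns its validation error:

    import random

    def k_fold_error(data, k, train_and_error):
        data = data[:]
        random.shuffle(data)              # randomly split the data into k parts
        folds = [data[i::k] for i in range(k)]
        errors = []
        for i in range(k):                # each fold takes a turn as the validation set
            test = folds[i]
            train = [x for j, f in enumerate(folds) if j != i for x in f]
            errors.append(train_and_error(train, test))
        return sum(errors) / k            # average validation error for this model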

34
Q

what is leave-one-out validation?

A

same as k-fold cross-validation but where k= cardinality of the training set

35
Q

what is unsupervised learning?

A

doesn’t use labelled training data but finds patterns in the data with no prior knowledge about the output or target variable.

36
Q

what is dimensionality reduction?

A

removing some parameters that are deemed less important/ unimportant

37
Q

why is normalisation necessary and what are the two methods and their drawbacks?

A

normalisation ensures attributes contribute equally to the similarity measure.
min-max normalisation: xⱼⁱ := (xⱼⁱ - xⱼ,ₘᵢₙ) / (xⱼ,ₘₐₓ - xⱼ,ₘᵢₙ) (drawback: sensitive to outliers)
z-score standardisation: xⱼⁱ := (xⱼⁱ - μⱼ) / σⱼ (drawback: doesn’t ensure the values are bounded by a range)
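
A sketch of both methods on a single attribute (a list of numbers):

    def min_max(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) for x in xs]   # bounded in [0, 1]; outlier-sensitive

    def z_score(xs):
        mu = sum(xs) / len(xs)
        sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
        return [(x - mu) / sigma for x in xs]       # mean 0, sd 1; not bounded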

38
Q

what is a distance matrix?

A

a 2D matrix with all data points along both dimensions and the distances in the entries

39
Q

what makes a set of clusters on data good quality?

A

high intra-cluster similarity (WCSS): points in a cluster are similar/close to each other
low inter-cluster similarity (BCSS): points are dissimilar/ far away from points in other clusters

Total Sum of Squares (TSS) = BCSS + WCSS. maximising one minimises the other

40
Q

how is intra-cluster dissimilarity (Within Cluster Sum of Squares/ WCSS) measured?

A

variability(C) = Σ (e∈C) d(e,Centroid(C)) where the centroid of a cluster is the average of all the points in the cluster. distance is commonly measured with the squared Euclidean distance: (x₂ - x₁)² + (y₂ - y₁)² + … + (z₂ - z₁)² for points (x,y,…,z)
dissimilarity(𝐂) = Σ (C∈𝐂) variability(C), which needs to be minimised
so: dissimilarity(𝐂) = Σ (C∈𝐂) Σ (e∈C) d(e,Centroid(C))
SUM OF DISTANCES OF EACH POINT TO ITS RESPECTIVE CENTROID
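
A direct transcription, as a sketch (clusters are lists of points, points are equal-length tuples):

    # squared Euclidean distance between two points
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def centroid(cluster):
        return tuple(sum(c) / len(cluster) for c in zip(*cluster))

    # WCSS: sum of distances of each point to its respective centroid
    def wcss(clusters):
        return sum(sq_dist(e, centroid(c)) for c in clusters for e in c)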

41
Q

how is inter-cluster dissimilarity (Between Cluster Sum of Squares/ BCSS) measured?

A

BCSS = Σ (C∈𝐂) n꜀·d(Centroid(C),Centroid(data))
where n꜀ is the number of points in cluster C
SUM OF DISTANCES OF CENTROIDS TO CENTRE OF DATA

42
Q

how does the k-means algorithm work?

A

a partitional clustering algorithm that selects k data points as initial centroids and assigns every other point to its nearest centroid. then repeat the following steps until convergence:
calculate the new centroid of each cluster, then reassign the points to the nearest centroid (sketched below).
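
A sketch of those steps on points given as equal-length tuples, using squared Euclidean distance (the iteration cap is only a safety net):

    import random

    def k_means(points, k, iters=100):
        centroids = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:              # assign each point to its nearest centroid
                i = min(range(k), key=lambda j: sum(
                    (a - b) ** 2 for a, b in zip(p, centroids[j])))
                clusters[i].append(p)
            # recompute each centroid as the mean of its cluster (keep the old one if empty)
            new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
                   for i, cl in enumerate(clusters)]
            if new == centroids:
                return clusters, centroids  # converged: centroids stopped moving
            centroids = new
        return clusters, centroids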

43
Q

what are 2 challenges the k-means algorithm faces and how can they be solved?

A

might converge at a local minimum: to overcome, do multiple runs with different randomised initial centroids and choose the best result. alternatively, randomly select the first centroid and then choose each remaining centroid with probability proportional to its squared distance from the nearest centroid already chosen (the k-means++ initialisation).

choosing the wrong value for k changes the result dramatically: to overcome, either use knowledge of the dataset or run the algorithm with multiple k values and choose the k at the elbow of the k/dissimilarity graph, where returns begin to diminish.

44
Q

what are the two types of Hierarchical Clustering?

A

agglomerative: start with each entry as its own cluster and repeatedly merge the two clusters with the smallest inter-cluster dissimilarity until one is left. output a dendrogram
divisive: start with all entries in one cluster and keep splitting (maximising inter-cluster dissimilarity) until each entry is its own cluster. output a dendrogram

45
Q

how is inter-cluster dissimilarity measured between two clusters (linkages)?

A

single linkage: the smallest distance between any member of one cluster and any member of the other
complete linkage: the largest distance between any member of one cluster and any member of the other
group average: the average of the distances between members of the two clusters
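
A compact sketch of agglomerative clustering with single linkage on points given as tuples; it returns the sequence of merge heights, the raw data behind a dendrogram:

    def euclid(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def single_linkage(c1, c2):
        return min(euclid(p, q) for p in c1 for q in c2)

    # merge the two closest clusters until one is left, recording each merge height
    def agglomerative(points):
        clusters = [[p] for p in points]
        merges = []
        while len(clusters) > 1:
            i, j = min(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
            merges.append(single_linkage(clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return merges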

46
Q

what is a dendrogram and how can it be used to obtain a result from hierarchical clustering?

A

a dendrogram is a tree with dissimilarity on one axis and the discrete entries on the other. when two clusters are merged, they are joined at the height that represents their dissimilarity. drawing a horizontal line represents a cut-off, used to obtain a certain number of clusters or level of similarity.

47
Q

what is a validation criterion?

A

an index used to measure the adequacy of clusters: it assesses how well the clusters provide true information about the data, or reflect the intrinsic character of the data.

48
Q

what are the different types of validation criteria?

A

unsupervised (internal indices): don’t use external information about the data, for example WCSS
supervised (external indices): measure the extent to which a clustering matches some external structure, for example entropy
relative: compares two clusterings using either type of validation above

49
Q

what unsupervised criteria can be used to evaluate partitional algorithms (k-means)?

A

variability and separation measures such as WCSS and BCSS, or the silhouette coefficient:
the SC of one example is (bᵢ-aᵢ)/max(aᵢ,bᵢ) where aᵢ is the average distance of the iᵗʰ example to all the other entries in its cluster and bᵢ is the minimum, over the other clusters, of its average distance to that cluster's entries
the SC of a cluster/clustering is the average SC of the examples in the cluster/clustering (sketched below)
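
A sketch of the per-example coefficient, where `own` is the example's cluster, `others` the remaining clusters (lists of points) and `dist` any distance function:

    def silhouette(point, own, others, dist):
        # aᵢ: average distance to the other members of the example's own cluster
        a = sum(dist(point, q) for q in own if q is not point) / max(len(own) - 1, 1)
        # bᵢ: smallest average distance to the members of any other cluster
        b = min(sum(dist(point, q) for q in c) / len(c) for c in others)
        return (b - a) / max(a, b)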

50
Q

how can the silhouette coefficient be used to find the optimal number of clusters?

A

take the number of clusters at the peak of the graph with number of clusters on the x-axis and average SC on the y-axis

51
Q

what unsupervised criterion can be used to evaluate hierarchical algorithms?

A

cophenetic correlation coefficient (CPCC): [∑ (Dᵢ,ⱼ-d)(Pᵢ,ⱼ-p)] / √[ ∑(Dᵢ,ⱼ-d)² · ∑(Pᵢ,ⱼ-p)² ]
where Dᵢ,ⱼ is the original distance between examples i and j, Pᵢ,ⱼ is the cophenetic distance between i and j, found by reading off the dendrogram the dissimilarity at which i and j are first linked, and d and p are the averages of the distances and cophenetic distances respectively

52
Q

what are the two types of supervised validity criteria for partitional algorithms?

A

classification-oriented: entropy, purity and precision, recall and F-measure
similarity-oriented: Jaccard measure and Rand statistic
they both use external information about what classes the data actually splits into

53
Q

what is precision, recall and F-measure?

A

precision p(i,j): #examples of class j in cluster i / #examples in cluster i
recall r(i,j): #examples of class j in cluster i / #examples in class j
F-measure f(i,j): 2p(i,j)r(i,j) / (p(i,j)+r(i,j))
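
A direct transcription from raw counts, as a sketch (the counts come from the external class labels):

    # n_ij: examples of class j in cluster i; n_i: size of cluster i; n_j: size of class j
    def precision_recall_f(n_ij, n_i, n_j):
        p = n_ij / n_i                    # precision p(i, j)
        r = n_ij / n_j                    # recall r(i, j)
        f = 2 * p * r / (p + r)           # F-measure: harmonic mean of p and r
        return p, r, f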

54
Q

what is the entropy of a cluster and of a clustering?

A

Entropy of the iᵗʰ cluster: eᵢ = -∑ⱼ p(i,j)·log₂ p(i,j)
Total entropy: e = ∑ᵢ (#examples in cluster i / #examples)·eᵢ
low entropy means low uncertainty and disorder
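
A direct transcription, as a sketch, where each cluster is given as its per-class counts:

    import math

    # entropy of one cluster from its class counts, e.g. {"benign": 8, "cancerous": 2}
    def cluster_entropy(counts):
        n = sum(counts.values())
        return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

    # total entropy: weight each cluster's entropy by its share of the examples
    def total_entropy(clusters):
        total = sum(sum(c.values()) for c in clusters)
        return sum(sum(c.values()) / total * cluster_entropy(c) for c in clusters)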

55
Q

what is the purity of a cluster and of a clustering?

A

Purity of the iᵗʰ cluster: pᵢ = maxⱼ p(i,j)
Total Purity: p = ∑ pᵢ* (#examples in cluster i / #examples)
high purity means better separation of classes and likelihood that entries are in the right class

56
Q

what is the Jaccard measure and Rand statistic?

A

fₓᵧ: the number of pairs of examples, where the first index is 1 if the pair share a class (0 if not) and the second index is 1 if the pair share a cluster (0 if not)
Jaccard measure: (f₁₁)/(f₀₁+f₁₀+f₁₁)
Rand statistic: (f₀₀+f₁₁)/(f₀₀+f₀₁+f₁₀+f₁₁)
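
A direct transcription from the four pair counts, as a sketch:

    def jaccard(f00, f01, f10, f11):
        return f11 / (f01 + f10 + f11)

    def rand(f00, f01, f10, f11):
        return (f00 + f11) / (f00 + f01 + f10 + f11)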