Artificial Intelligence and Natural Language Processing (CS404) Flashcards
What is subject-verb agreement and how is it enforced on a parse tree? How
does subject-verb agreement contribute to the process of parsing the
following sentences?
(i) I am a man
(ii) I are a man *
Subject-verb agreement is agreement in number and person between the
subject and verb of a sentence.
The parse tree identifies the subject and the verb of a sentence as separate nodes, so their feature sets - number (singular or plural) and person (first, second, or third) - can be intersected to check that they are consistent.
(i) I am a man: {1S} ∩ {1S} = {1S} - the features agree, so the parse succeeds.
(ii) *I are a man: {1S} ∩ {2S, 1P, 2P, 3P} = Ø - the intersection is empty, so the parse is rejected.
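Below is a minimal sketch, in Python, of how this feature-intersection check might be coded; the feature sets are illustrative and not taken from any particular grammar.

    # Hypothetical agreement check: represent each word's person/number
    # features as a set and require a non-empty intersection.
    FEATURES = {
        "I":   {"1S"},
        "he":  {"3S"},
        "am":  {"1S"},
        "is":  {"3S"},
        "are": {"2S", "1P", "2P", "3P"},
    }

    def agrees(subject: str, verb: str) -> bool:
        # Agreement holds iff the subject and verb share at least one feature.
        return bool(FEATURES[subject] & FEATURES[verb])

    print(agrees("I", "am"))   # True  -> "I am ..." parses
    print(agrees("I", "are"))  # False -> "*I are ..." is rejected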
What is a synset – such as those found in WordNet? Describe how WordNet
makes use of any two relations to structure its knowledgebase. Make use
of examples to illustrate your answer.
A synset is a set of words that share a single sense: each synset represents one distinct concept, and a word with several meanings belongs to several synsets.
The hyponymy relation expresses "is a kind of":
a polar bear is a kind of bear.
hypernym (general term), e.g. bear
hyponym (specific term), e.g. polar bear
The instance hyponymy relation expresses "is an instance of":
"Einstein is an instance of a physicist."
This differs from ordinary hyponymy because the hyponym is a specific individual (an example), not a subcategory.
The meronymy relation expresses part-whole membership ("is a member/part of"):
the meronym is a member (or part) of the holonym,
e.g. a tree is a member of a forest.
Describe any two similarity metrics used to compare WordNet synsets.
Path similarity is based on the number of links between the two synsets in the taxonomy. It is calculated as 1 / (path length + 1).
The Wu & Palmer (WuP) method is based on the least common subsumer (LCS), the closest common ancestor of the two synsets. It is calculated as 2 × depth(LCS) / (depth(synset1) + depth(synset2)).
The Lin method uses information content (IC), which is based on the probability of encountering an instance of a synset in a specific corpus. It is calculated as 2 × IC(LCS) / (IC(synset1) + IC(synset2)).
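As a rough illustration, NLTK exposes these metrics directly on WordNet synsets (assuming the wordnet and wordnet_ic corpora have been downloaded):

    # Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')

    print(dog.path_similarity(cat))   # 1 / (path length + 1)
    print(dog.wup_similarity(cat))    # 2*depth(LCS) / (depth(s1) + depth(s2))

    # Lin similarity needs an information-content file built from a corpus.
    brown_ic = wordnet_ic.ic('ic-brown.dat')
    print(dog.lin_similarity(cat, brown_ic))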
Write down Google’s PageRank formula. What do each of the terms in this
formula signify?
The PageRank of a page is the chance that
a random surfer will land on that page. The formula is:
PR(A) = (1 - d) + d × ( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )
PR(A): PageRank of page A. This is what the algorithm is trying to calculate - the 'importance' or 'authority' of page A on the web.
d: The damping factor, a constant usually set around 0.85. It represents the probability that the random surfer follows one of the outbound links on the current page - a way to model a user sticking to a chain of links.
(1-d): The complementary probability (about 0.15) that the surfer ignores the links and jumps to a random page instead; this term gives every page a small baseline amount of PageRank.
PR(Ti): PageRank of pages Ti which link to page A. These are the pages that have outbound links pointing to page A. Their PageRank contributes to the PageRank of page A.
C(Ti): The number of outbound links on page Ti. This is used to divide the PageRank of Ti. The idea is that the ‘vote’ of a page is split between all the links it contains; the more links, the less ‘voting power’ each link has.
The sum: The formula sums the contributions of all pages linking to page A. Each contribution is the PageRank of the linking page divided by the number of links on that page.
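A minimal sketch of computing PageRank by iterating this formula over a small, made-up link graph (the graph, damping factor, and iteration count are purely illustrative):

    # Tiny PageRank sketch: pages A, B, C with a hypothetical link structure.
    links = {
        "A": ["B", "C"],   # A links out to B and C
        "B": ["C"],        # B links out to C
        "C": ["A"],        # C links out to A
    }
    d = 0.85
    pr = {page: 1.0 for page in links}           # initial guess

    for _ in range(50):                          # iterate until the values settle
        new_pr = {}
        for page in links:
            # Sum PR(Ti)/C(Ti) over all pages Ti that link to this page.
            incoming = sum(pr[t] / len(links[t])
                           for t in links if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr

    print(pr)   # C collects the most 'votes', so it ends up with the highest rank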
What is the relevance of a random walk to Google’s PageRank function?
Make use of an appropriate diagram to illustrate your answer.
The relevance of a random walk to PageRank is that it represents the likelihood that a person randomly clicking on links will arrive at a particular page. The pages that are more ‘reachable’ in this random walk (i.e., more frequently encountered) are deemed to be more important, and thus receive a higher PageRank score.
(add diagram)
Consider the following sentence
“John is a pig.” Compare a literal interpretation of this sentence to a metaphoric
interpretation. What are the main differences between these two
interpretations?
A literal interpretation of “John is a pig” means John is actually the animal, a pig. A metaphoric interpretation suggests John has characteristics associated with pigs, like messiness or greed. The main difference is that the literal interpretation takes the words at face value, while the metaphoric interpretation uses “pig” symbolically to describe John’s traits.
What is meant by the “features” of a word, and what is the relation
between features and “agreement”? Briefly describe three different
categories of agreement and how they may be enforced on a parse tree.
Illustrate your answer using appropriate examples in English, Irish,
French or other natural language.
The "features" of a word are its grammatical attributes - part of speech, tense, number, person, gender - together with the semantic properties that define its role and meaning in language.
Agreement is the requirement that related words in a sentence match on certain features, such as number, gender, or case.
Subject-verb agreement: agreement in number and person between the
subject and verb of a sentence.
The parse tree identifies the subject and the verb of a sentence as separate nodes, so their feature sets - number (singular or plural) and person (first, second, or third) - can be intersected to check that they are consistent.
I am tired
{1S} ∩ {1S} = {1S}
He is tired
{3S} ∩ {3S} = {3S}
*He am tired
{3S} ∩ {1S} = Ø
*I is tired
{1S} ∩ {3S} = Ø
Quantifier-Noun agreement
Agreement in number between the quantifier and the noun:
– This man
*This men
– These men
*These man
Determiner-Noun agreement (in French, etc.)
Gender agreement between the determiner and the noun:
Le stylo
*La stylo
– La femme
*Le femme
Cardinal-number Noun agreement (in Irish)
Agreement between the numeral and the noun: personal numerals (e.g. beirt) are used with people, ordinary cardinals (e.g. dhá) with things.
Beirt {daoine}
*Dhá daoine
– Dhá sciain
*Beirt {sciain}
Discuss the use of WordNet to represent semantic information. Describe
any two relationships used by WordNet to organize its taxonomy.
WordNet is a widely-used lexical database that represents semantic information of English words. It organizes words into sets of synonyms (synsets), each representing a unique concept. These synsets are linked by various types of semantic relationships, creating a rich network that reflects the structure of human language.
The hyponymy relation expresses "is a kind of":
a polar bear is a kind of bear.
hypernym (general term), e.g. bear
hyponym (specific term), e.g. polar bear
The instance hyponymy relation expresses "is an instance of":
"Einstein is an instance of a physicist."
This differs from ordinary hyponymy because the hyponym is a specific individual (an example), not a subcategory.
The meronymy relation expresses part-whole membership ("is a member/part of"):
the meronym is a member (or part) of the holonym,
e.g. a tree is a member of a forest.
Describe in detail any one similarity metric for comparing two terms,
which is based on the WordNet lexical database.
Path similarity is based on the number of links between the two synsets in the taxonomy. It is calculated as 1 / (path length + 1).
The Wu & Palmer (WuP) method is based on the least common subsumer (LCS), the closest common ancestor of the two synsets. It is calculated as 2 × depth(LCS) / (depth(synset1) + depth(synset2)).
The Lin method uses information content (IC), which is based on the probability of encountering an instance of a synset in a specific corpus. It is calculated as 2 × IC(LCS) / (IC(synset1) + IC(synset2)).
What is meant by metaphoric language and how does metaphoric
language differ from more standard (literal) language? Illustrate your
answer with suitable examples.
Metaphoric language uses words symbolically, unlike literal language which uses words in their standard, direct sense. For example, saying “He has a heart of stone” metaphorically suggests he is unemotional or unsympathetic, while literally it would mean his heart is made of stone. In contrast, a literal statement like “He is unkind” uses words in their usual, direct meaning.
Write down the PageRank function, explaining each of the terms.
Describe the distinction between inlinks and outlink, using an appropriate
diagram to illustrate your answer.
The PageRank of a page is the chance that
a random surfer will land on that page. The formula is:
PR(A) = (1 - d) + d × ( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )
PR(A): PageRank of page A. This is what the algorithm is trying to calculate - the 'importance' or 'authority' of page A on the web.
d: The damping factor, a constant usually set around 0.85. It represents the probability that the random surfer follows one of the outbound links on the current page - a way to model a user sticking to a chain of links.
(1-d): The complementary probability (about 0.15) that the surfer ignores the links and jumps to a random page instead; this term gives every page a small baseline amount of PageRank.
PR(Ti): PageRank of pages Ti which link to page A. These are the pages that have outbound links pointing to page A. Their PageRank contributes to the PageRank of page A.
C(Ti): The number of outbound links on page Ti. This is used to divide the PageRank of Ti. The idea is that the ‘vote’ of a page is split between all the links it contains; the more links, the less ‘voting power’ each link has.
The sum: The formula sums the contributions of all pages linking to page A. Each contribution is the PageRank of the linking page divided by the number of links on that page.
In PageRank, inlinks are links from other pages to a given page, contributing to its rank. Outlinks are links from that page to others, distributing its PageRank among them. Inlinks boost a page’s rank, while outlinks share it.
(add a diagram)
What is heuristic search? Describe a simple heuristic that might be useful in
solving the Farmer-Fox-Goose problem (or the 8-tile puzzle). Briefly
describe what problems are presented by a local minimum and a plateau.
To solve problems like the 8-tile puzzle or the Farmer-Fox-Goose problem, a search mechanism typically involves defining a state space, where each state represents a possible configuration of the problem. The search algorithm then systematically explores this space, moving from one state to another via legal moves or actions, until it reaches a goal state.
For example, in the 8-tile puzzle, states represent different tile arrangements, and actions are tile moves. In the Farmer-Fox-Goose problem, states are different distributions of the characters across the river, and actions are the farmer’s choices of who to ferry across the river.
Common search algorithms used include depth-first search, breadth-first search, or more advanced strategies like A* search, which can be guided by heuristics to find the optimal solution efficiently.
A simple heuristic for the 8-tile puzzle is the number of tiles out of place (or the total Manhattan distance of tiles from their goal positions); for the Farmer-Fox-Goose problem, the number of items still on the starting bank.
A local minimum is a state whose neighbours all have a worse (higher) heuristic value, so a hill-climbing-style search gets stuck there even though the goal has not been reached; a plateau is a flat region where all neighbours score the same, giving the search no indication of which direction to move in.
What is meant by a zero-sum game? Describe how the MiniMax algorithm
can be used to solve adversary-search problems, such as chess or the OXO
game.
A zero-sum game is a situation in which one participant’s gain or loss is exactly balanced by the losses or gains of the other participants. The MiniMax algorithm is used in zero-sum games like chess or tic-tac-toe to find the optimal move: it evaluates the possible moves and their outcomes to maximize the player’s advantage while minimizing the opponent’s, assuming both players play optimally.
(Refers to a game-tree diagram from the original exam paper, not reproduced here: node D, with a value of 1.)
Describe each of the steps in the A* search algorithm.
Initialization: Start with an open list containing only the initial node and a closed list that is empty.
Loop: Repeat the following steps until the open list is empty:
Choose Node: Select the node with the lowest score from the open list. The score is calculated as
f(n)=g(n)+h(n), where
g(n) is the cost to reach the node, and
h(n) is the estimated cost from the node to the goal (heuristic).
Goal Check: If the chosen node is the goal, backtrace the path from the goal to the start and return this path.
Generate Successors: Identify all the neighbors of the current node and calculate their scores.
Update Lists: For each neighbor, if it is not in the open list, add it. If it is already in the open or closed list with a higher score, update it with the lower score and change its parent to the current node.
Move Node: Move the current node from the open list to the closed list.
Completion: The algorithm terminates when either the goal is found (and the path is returned) or the open list is empty (indicating no path).
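A compact sketch of these steps in Python, using a made-up 5x5 grid with Manhattan distance as h(n); the grid, start, and goal are assumptions for illustration only:

    import heapq

    def a_star(start, goal, neighbours, h):
        # Open list as a priority queue of (f, node); g holds the best known costs.
        open_heap = [(h(start), start)]
        g = {start: 0}
        parent = {start: None}
        closed = set()

        while open_heap:
            _, node = heapq.heappop(open_heap)        # node with the lowest f = g + h
            if node == goal:                          # goal check: rebuild the path
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            if node in closed:
                continue
            closed.add(node)
            for nxt, cost in neighbours(node):        # generate successors
                new_g = g[node] + cost
                if nxt not in g or new_g < g[nxt]:    # found a cheaper route to nxt
                    g[nxt] = new_g
                    parent[nxt] = node
                    heapq.heappush(open_heap, (new_g + h(nxt), nxt))
        return None                                   # open list exhausted: no path

    # Illustrative 4-connected 5x5 grid with no obstacles.
    def grid_neighbours(p):
        x, y = p
        return [((x + dx, y + dy), 1)
                for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
                if 0 <= x + dx < 5 and 0 <= y + dy < 5]

    goal = (4, 4)
    manhattan = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    print(a_star((0, 0), goal, grid_neighbours, manhattan))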
What is WordNet and how does it represent information? Make reference to the relationships that are used by WordNet in your answer.
WordNet is a large lexical database of English, linking words into semantic relations including synonyms, antonyms, hypernyms, hyponyms, meronyms, and holonyms. It groups English words into sets of synonyms called synsets, providing short definitions and usage examples, and records the various semantic relations between these synonym sets. The most common relations are:
Synonyms: Words that have the same or similar meanings are grouped into synsets.
Hypernyms: Terms that are more generic or abstract than the given word.
Hyponyms: More specific terms derived from a generic term.
Meronyms: Words that denote a part of something.
Holonyms: Words that denote a whole of which the given word is a part.
Antonyms: Words that have opposite meanings.
WordNet’s structure makes it a useful tool for computational linguistics and natural language processing, providing a rich network of meaningfully related words.
Describe any two similarity metrics used to compare WordNet synsets.
Why would one attempt to quantify the similarity (or difference) between
any two concepts?
Two common similarity metrics used to compare WordNet synsets are:
Path Similarity: Measures the similarity based on the shortest path that connects the synsets in the hypernym/hyponym taxonomy. A shorter path implies higher similarity.
Wu-Palmer Similarity (WUP): Measures the similarity by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of their least common subsumer (most specific ancestor node).
Quantifying the similarity (or difference) between two concepts is important for various natural language processing tasks, such as word sense disambiguation, semantic similarity assessment, information retrieval, and machine translation. It helps in understanding the semantic and contextual relationships between words, thereby improving the performance of these tasks.
Define the terms precision and recall, explaining their relevance to
information retrieval.
In information retrieval, precision and recall are two critical metrics:
Precision: Measures the fraction of retrieved documents that are relevant to the query. It answers the question, “Of all the documents the system retrieved, how many were actually relevant?”
Recall: Measures the fraction of relevant documents that were successfully retrieved. It answers the question, “Of all the relevant documents that exist, how many did the system successfully retrieve?”
These metrics are crucial in evaluating the effectiveness of information retrieval systems, balancing the need for retrieving all relevant documents (recall) while minimizing the retrieval of irrelevant ones (precision).
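A tiny worked sketch with invented retrieval results:

    # Hypothetical query: the system retrieves 8 documents, 5 of which are
    # actually relevant, and there are 10 relevant documents in total.
    retrieved = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
    relevant  = {"d1", "d2", "d3", "d4", "d5", "d9", "d10", "d11", "d12", "d13"}

    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved)   # 5 / 8  = 0.625
    recall    = true_positives / len(relevant)    # 5 / 10 = 0.5
    print(precision, recall)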
Write down the PageRank function, explaining each of the terms. Describe
the distinction between inlinks and outlinks, using an appropriate diagram to
illustrate your answer.
The PageRank of a page is the chance that
a random surfer will land on that page. The formula is:
PR(A) = (1 - d) + d × ( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )
PR(A): PageRank of page A. This is what the algorithm is trying to calculate - the 'importance' or 'authority' of page A on the web.
d: The damping factor, a constant usually set around 0.85. It represents the probability that the random surfer follows one of the outbound links on the current page - a way to model a user sticking to a chain of links.
(1-d): The complementary probability (about 0.15) that the surfer ignores the links and jumps to a random page instead; this term gives every page a small baseline amount of PageRank.
PR(Ti): PageRank of pages Ti which link to page A. These are the pages that have outbound links pointing to page A. Their PageRank contributes to the PageRank of page A.
C(Ti): The number of outbound links on page Ti. This is used to divide the PageRank of Ti. The idea is that the ‘vote’ of a page is split between all the links it contains; the more links, the less ‘voting power’ each link has.
The sum: The formula sums the contributions of all pages linking to page A. Each contribution is the PageRank of the linking page divided by the number of links on that page.
Inlinks: Links from other pages pointing to pi. They contribute to the PageRank of pi as other pages ‘vote’ or ‘endorse’ it.
Outlinks: Links from page pj pointing to other pages. The PageRank value of pj is shared among these outlinks.
What problem is presented by encountering a novel word-pair when
calculating the joint probability of a word sequence? How can these
problems be overcome?
Encountering a novel word-pair when calculating the joint probability of a word sequence presents the problem of data sparsity, where the word-pair has not been seen in the training data, leading to a probability estimate of zero. This makes the model unable to handle unseen word combinations effectively.
To overcome these problems, techniques like smoothing and back-off models are used. Smoothing assigns a small probability to unseen word-pairs by redistributing some probability mass from more frequent pairs. Back-off models, on the other hand, fall back to calculating the probability of smaller units (like single words) when the data for larger units (like word pairs) is insufficient or unavailable. These methods help in dealing with the zero-probability issue in sparse data scenarios.
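A small sketch of add-one (Laplace) smoothing for bigram probabilities on a toy corpus; the corpus and vocabulary are invented for illustration:

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()  # toy training data
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    V = len(unigrams)                                       # vocabulary size

    def p_bigram(w1, w2):
        # Unsmoothed estimate: zero for any unseen pair such as ("cat", "slept").
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

    def p_laplace(w1, w2):
        # Add-one smoothing: every possible pair gets a small non-zero probability.
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

    print(p_bigram("cat", "slept"), p_laplace("cat", "slept"))   # 0.0 vs a small value > 0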
Describe how a search mechanism can be adapted to solve problems like the
8-tile puzzle or the Farmer-Fox-Goose problem.
To solve problems like the 8-tile puzzle or the Farmer-Fox-Goose problem, a search mechanism typically involves defining a state space, where each state represents a possible configuration of the problem. The search algorithm then systematically explores this space, moving from one state to another via legal moves or actions, until it reaches a goal state.
For example, in the 8-tile puzzle, states represent different tile arrangements, and actions are tile moves. In the Farmer-Fox-Goose problem, states are different distributions of the characters across the river, and actions are the farmer’s choices of who to ferry across the river.
Common search algorithms used include depth-first search, breadth-first search, or more advanced strategies like A* search, which can be guided by heuristics to find the optimal solution efficiently.
Describe how the MiniMax algorithm can be used to solve adversary-search
problems, such as chess or the OXO game. What are the main limitations of
the MiniMax algorithm?
The MiniMax algorithm solves adversarial search problems like chess or OXO by simulating all possible moves and their outcomes for both players. It assumes that both players play optimally and alternates between maximizing the player’s advantage and minimizing the opponent’s advantage, thereby choosing the best possible move.
Main limitations of the MiniMax algorithm include:
Computational Complexity: The number of possible moves in games like chess is enormous, leading to a vast search tree that is computationally expensive to explore fully.
Requirement for Effective Heuristics: For deeper search trees, effective heuristics are needed to evaluate game positions and make practical decisions without exploring the entire tree.
(Refers to a game-tree diagram from the original exam paper, not reproduced here: node D, with a value of 3.)
Describe how α−β pruning improves the performance of adversary search
algorithms. You may use an appropriate example to illustrate your answer.
Alpha-Beta (α-β) pruning improves the performance of adversarial search algorithms by reducing the number of nodes evaluated in the search tree. It works by eliminating branches that don’t influence the final decision.
In this process, two values, alpha (α) and beta (β), are maintained. Alpha is the best value that the maximizer currently guarantees at that level or above, and beta is the best value that the minimizer currently guarantees. If at any point alpha becomes greater than or equal to beta (α ≥ β), the branch is pruned, as it won’t affect the final outcome.
For example, in a game like chess, if the maximizer has found a move with a value of +3, and on another branch, the minimizer can force a move with a value of -2, the maximizer will never choose this branch, so further exploration of it is unnecessary. This way, α-β pruning skips evaluating many nodes, significantly speeding up the search process.
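A minimal sketch of MiniMax with α-β pruning over a hand-built game tree; the tree and its leaf values are invented for illustration:

    import math

    def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
        if isinstance(node, (int, float)):          # leaf: return its static value
            return node
        if maximizing:
            value = -math.inf
            for child in node:
                value = max(value, alphabeta(child, False, alpha, beta))
                alpha = max(alpha, value)
                if alpha >= beta:                   # minimizer already has a better option
                    break                           # prune the remaining children
            return value
        else:
            value = math.inf
            for child in node:
                value = min(value, alphabeta(child, True, alpha, beta))
                beta = min(beta, value)
                if alpha >= beta:                   # maximizer already has a better option
                    break
            return value

    # Nested lists as a toy game tree: the inner lists are opponent (min) nodes.
    tree = [[3, 5], [-2, 9], [0, 1]]
    print(alphabeta(tree, True))                    # 3: the best value the maximizer can force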
Discuss how the WordNet represents information. Make reference to the
relationships that are used by WordNet in your answer.
WordNet represents information by organizing English words into sets of synonyms called synsets, each expressing a distinct concept. Synsets are interlinked through various semantic relationships, including:
Synonyms: Words with similar meanings grouped in the same synset.
Hypernyms: Words representing broader concepts, linking more specific words (hyponyms) to general ones.
Hyponyms: More specific words linked to broader ones (hypernyms).
Meronyms: Words denoting a part of something (the part-to-whole relationship).
Holonyms: Words representing the whole to which a part (meronym) belongs.
Antonyms: Words that have opposite meanings.
These relationships form a rich network, facilitating an understanding of word meanings and connections in the English language.
Describe any two similarity metrics used to compare WordNet synsets.
Two similarity metrics commonly used to compare WordNet synsets are:
Path Similarity: This metric measures the similarity based on the shortest path in the hypernym/hyponym hierarchy between two synsets. A shorter path indicates greater similarity. It is computed as the inverse of the path length between the synsets.
Wu-Palmer Similarity (WUP): This metric evaluates the similarity by considering the depths of the two synsets in the WordNet taxonomy, as well as the depth of their most specific common ancestor (the least common subsumer). The similarity is calculated based on the depth of the synsets and their least common subsumer relative to the overall taxonomy depth.
What is meant by a “random walk” of the internet? How does the
random walk help identify important documents?
A “random walk” of the internet refers to a process where a hypothetical “surfer” randomly follows links from one web page to another. This concept is a key part of algorithms like Google’s PageRank. In a random walk, the surfer randomly chooses one of the outbound links on each page to navigate to the next page, continuing this process indefinitely.
The random walk helps identify important documents through the principle that more important or relevant pages are more likely to be visited frequently over many random walks. The idea is that pages with many inlinks (links from other pages) or links from other important pages are more likely to be reached in a random walk. Thus, the frequency of visiting a page in random walks reflects its importance or authority.
PageRank, for example, uses this concept to rank web pages. Pages frequently visited in random walks have higher PageRank scores, implying they are more important or authoritative. This approach helps search engines prioritize relevant and authoritative pages in their search results.
Write down the PageRank function, explaining each of its terms.
What is the difference between inlinks and outlinks? Illustrate your
answer with an appropriate example.
The PageRank of a page is the chance that
a random surfer will land on that page. The formula is:
PR(A) = (1 - d) + d × ( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )
PR(A): PageRank of page A. This is what the algorithm is trying to calculate - the 'importance' or 'authority' of page A on the web.
d: The damping factor, a constant usually set around 0.85. It represents the probability that the random surfer follows one of the outbound links on the current page - a way to model a user sticking to a chain of links.
(1-d): The complementary probability (about 0.15) that the surfer ignores the links and jumps to a random page instead; this term gives every page a small baseline amount of PageRank.
PR(Ti): PageRank of pages Ti which link to page A. These are the pages that have outbound links pointing to page A. Their PageRank contributes to the PageRank of page A.
C(Ti): The number of outbound links on page Ti. This is used to divide the PageRank of Ti. The idea is that the ‘vote’ of a page is split between all the links it contains; the more links, the less ‘voting power’ each link has.
The sum: The formula sums the contributions of all pages linking to page A. Each contribution is the PageRank of the linking page divided by the number of links on that page.
Inlinks: Links from other pages pointing to pi. They contribute to the PageRank of pi as other pages ‘vote’ or ‘endorse’ it.
Outlinks: Links from page pj pointing to other pages. The PageRank value of pj is shared among these outlinks.
Briefly explain what is meant by the “generate and test” approach to
problem solving. Briefly describe how an heuristic search might
differ from blind search when solving problems such as the 8-tile
puzzle or the 15-tile puzzle. What are the main problems with
developing an heuristic search for these games?
The “generate and test” approach to problem-solving involves generating potential solutions and then testing them to see if they meet the criteria for a solution. This method is straightforward but can be inefficient if there are many possible solutions.
In the context of solving puzzles like the 8-tile or 15-tile puzzle, heuristic search differs from blind search in the following ways:
Heuristic Search: Uses heuristics (smart guesses or rules of thumb) to guide the search towards the goal more efficiently. For example, a common heuristic for tile puzzles is the number of tiles out of place or the total distance of tiles from their goal positions. It prioritizes moves that seem to lead more directly to the solution.
Blind Search: Also known as uninformed search, it explores the solution space without any information about the likelihood of a path leading to the goal. It methodically checks all possible moves, which can be highly inefficient for complex puzzles.
The main problems with developing a heuristic search for these puzzles are:
Finding Effective Heuristics: Identifying heuristics that consistently lead to the solution quickly can be challenging. Effective heuristics should estimate the “closeness” to the goal accurately.
Balance Between Efficiency and Accuracy: A heuristic must strike a balance between guiding the search efficiently and not misleading it away from the goal.
Computational Overhead: Complex heuristics can add computational overhead, potentially negating the benefits of a more directed search.
Describe the problem that is presented by a local maximum or a
plateau of the heuristic function. You may use a diagram to illustrate
your answer.
Local Maximum: This is a point where the heuristic function reaches a value higher than its immediate neighbors but not necessarily the highest possible value (the global maximum). The problem arises when a search algorithm, aiming to maximize the heuristic, reaches a local maximum and erroneously concludes that the best solution has been found, even though a better solution might exist elsewhere in the search space.
Plateau: A plateau is an area of the search space where the heuristic function has the same value across multiple states. This flatness in the landscape makes it difficult for the search algorithm to determine which direction to move in, as all neighboring states appear equally promising (or unpromising). The algorithm can get “stuck” on a plateau, unable to find a path that leads to a better solution.
Describe in detail how you would construct an heuristic for one of the
following board games: chess, checkers (draughts), connect-4.
Piece Value: Assign values to each piece based on its importance (e.g., Pawn = 1, Knight/Bishop = 3, Rook = 5, Queen = 9).
Board Evaluation: Calculate the total value of pieces for each player. The board’s score could be the difference in total values, favoring the player with the higher total.
Positional Play: Include bonuses or penalties for specific strategic elements, such as pawn structure, control of the center, king safety, and piece mobility.
Endgame Adjustments: Modify evaluations for the endgame phase, where the king’s activity and pawn structure may become more crucial.
This heuristic helps a chess AI to evaluate board positions, guiding move decisions to maximize its advantage and minimize the opponent’s. The simplicity or complexity of the heuristic can vary based on the desired balance between accuracy and computational efficiency.
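A sketch of a simple material-count evaluation along these lines; the board representation (a list of piece/colour pairs) is an assumption made for illustration:

    PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}   # King excluded

    def evaluate(board):
        # Material balance from White's point of view. `board` is a hypothetical
        # list of (piece, colour) tuples, e.g. [("Q", "white"), ("R", "black")].
        # Positional bonuses (centre control, king safety, ...) could be added here.
        score = 0
        for piece, colour in board:
            value = PIECE_VALUES.get(piece, 0)
            score += value if colour == "white" else -value
        return score

    print(evaluate([("Q", "white"), ("R", "black"), ("P", "black")]))   # 9 - 5 - 1 = 3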
Describe the A* search algorithm in detail.
The A* search algorithm is a pathfinding and graph traversal method. It finds the shortest path from a start node to a goal node in a weighted graph. A* uses a best-first search approach, prioritizing paths that seem promising. It combines the actual cost from the start node to a given node (G-cost) with a heuristic estimate of the cost from that node to the goal (H-cost). The sum of these costs (F-cost = G-cost + H-cost) guides the search, with lower F-cost paths explored first. A* terminates when the goal node is reached, ensuring an optimal path if the heuristic is admissible (never overestimates the true cost).
How does WordNet represent lexico-semantic information? Briefly
describe any two metrics for estimating the similarity between terms,
based on WordNet. Refer to specific WordNet relations in your
answer.
WordNet represents lexico-semantic information by organizing words into sets of cognitive synonyms (synsets), each expressing a unique concept. Synsets are interconnected through semantic relations like hypernyms (more general terms), hyponyms (more specific terms), meronyms (part-whole relations), and antonyms (opposites).
Two metrics for estimating similarity between terms in WordNet are:
Path Similarity: Measures similarity based on the shortest path in the hypernym/hyponym hierarchy between synsets. A shorter path indicates greater similarity.
Wu-Palmer Similarity (WUP): Measures similarity by the depths of the synsets and their least common subsumer (most specific common ancestor) in the hypernym/hyponym hierarchy. Higher similarity is indicated by a closer common ancestor.
Discuss, in general, how the use of heuristic functions improves
the efficiency of the blind search process.
Heuristic functions improve the efficiency of blind search processes by providing guidance on which paths to follow, thus reducing the search space. While blind search methods like Depth-First or Breadth-First Search explore paths indiscriminately, heuristic functions estimate the “closeness” to the goal, allowing the search to prioritize more promising paths. This targeted approach often leads to faster discovery of the goal state without exhaustively searching all possibilities, significantly improving computational efficiency and reducing time complexity.
Describe the difference between the ordinary search process
used to solve problems and adversary search. Why are these
differences relevant and under what circumstances?
Ordinary search processes, used in solving standard problems, involve finding a path or solution in a defined space, where the outcome is determined solely by the searcher’s actions. Adversary search, on the other hand, is used in competitive settings like games, where the outcome depends on the actions of multiple agents or players, often with opposing objectives.
These differences are relevant in contexts where an “opponent” or adversarial element affects the decision-making process. In ordinary search, the focus is on overcoming environmental challenges, while in adversary search, the focus is on outmaneuvering an opponent who actively counters the searcher’s moves. Adversary search is crucial in games like chess or checkers, where each player must anticipate and react to the other’s strategies.
Describe in detail one specific heuristic function and its use in
some relevant problem domain. You may choose any problem,
from either ordinary search or adversary search problems.
Discuss any good and bad points on the heuristic function you
have described. (Topics may be selected from, but are not
restricted to, any one of the following: Farmer-Fox-Goose
problem, 8 or 16 Tile puzzle, Blast, OXO, Chess, Draughts etc.)
Heuristic Function: Manhattan Distance in the context of the 8-tile puzzle.
Description:
The Manhattan Distance heuristic calculates the total distance each tile is from its goal position, using only vertical and horizontal moves. For each tile, the distance is the sum of the horizontal and vertical distances to its target position.
In the 8-tile puzzle, where the goal is to move tiles within a grid to reach a specific end configuration, this heuristic estimates the minimum number of moves required to get each tile to its correct position.
Good Points:
Admissibility: The Manhattan Distance never overestimates the cost to reach the goal, making it an admissible heuristic for A* search.
Efficiency: It provides a good balance between accuracy and computational simplicity, guiding the search effectively without significant processing overhead.
Bad Points:
Not Always Perfectly Accurate: It assumes tiles can move independently without considering the interactions between them, which can sometimes lead to underestimating the true number of moves required.
Does Not Account for Sequential Constraints: In scenarios where tiles block each other, the Manhattan Distance might not accurately reflect the complexity of rearranging the tiles.
In summary, while the Manhattan Distance heuristic is efficient and generally effective for solving the 8-tile puzzle, its simplifications can sometimes lead to suboptimal performance in more complex scenarios.
Briefly describe how MiniMax search improves the efficiency of
adversary search.
MiniMax search improves the efficiency of adversary search by systematically exploring the game tree to identify the optimal move, assuming both players act optimally. It does this by alternating between maximizing the player’s advantage and minimizing the opponent’s advantage at each level of the tree. This approach narrows down the choices to the most strategically sound moves, thus reducing the need to evaluate every possible move in games like chess or tic-tac-toe. It allows for a focused and effective search strategy in competitive, turn-based games.
How is the Nearest Neighbours approach used to solve
problems? How does k-Nearest Neighbours approach improve
upon the 1-Nearest Neighbours approach? You may make use
of an example to illustrate your answer.
The Nearest Neighbours approach is used to solve problems, particularly in classification and regression, by finding the closest data points (neighbours) to a query point and making predictions based on these neighbours. In the 1-Nearest Neighbour approach, the prediction is based solely on the single closest data point.
The k-Nearest Neighbours (k-NN) approach improves upon this by considering the ‘k’ closest data points, where ‘k’ is a predefined number greater than 1. This allows for a more robust prediction as it reduces the impact of noise and outliers present in the dataset. For example, in a classification task, k-NN classifies a new point based on the majority class among its ‘k’ nearest neighbours, providing a more representative and reliable decision than basing the classification on a single neighbour.
Discuss the use of categories in WordNet. What relationships are
used to impose a structure on the WordNet hierarchy? Make
use of specific examples in your answer.
In WordNet, categories are represented by sets of synonyms called synsets, each expressing a distinct concept. The relationships used to impose structure on the WordNet hierarchy include:
Hypernyms: These indicate a general category or class to which a term belongs. For example, the word “canine” is a hypernym of “dog.”
Hyponyms: These signify more specific instances of a general category. For example, “dog” is a hyponym of “canine.”
Meronyms: These denote parts of a whole. For instance, “wheel” is a meronym of “car.”
Holonyms: These represent the whole of which a term is a part. For example, “car” is a holonym of “wheel.”
These relationships create a hierarchical structure in WordNet, allowing navigation from more general to more specific terms and vice versa, and understanding the part-whole relationships among terms.
How is the nearest neighbours approach used to solve problems?
How does K nearest neighbours improve upon the 1 nearest
neighbours approach?
The nearest neighbours approach solves problems, especially in classification and regression, by identifying the closest data point(s) to a query point and making predictions based on these neighbouring points.
The K nearest neighbours (K-NN) approach improves upon the 1 nearest neighbour method by considering ‘k’ closest data points instead of just one. This makes the prediction more robust, as it reduces the influence of noise and outliers. For example, in classification, K-NN classifies a query point based on the majority vote of its ‘k’ nearest neighbours, providing a more reliable decision than relying on a single neighbour.
Describe in detail how the nearest neighbours approach can be
used to estimate the sale value of a house. You may assume you
have access to a large database of recent house sale records –
detailing all relevant details of each sales transaction.
The nearest neighbours approach can estimate the sale value of a house by analyzing a database of recent house sale records. Here’s how it can be used:
Feature Selection: Choose relevant features from the database that influence house prices, such as location, size, number of bedrooms, age of the house, and amenities.
Find Nearest Neighbours: For a house to be valued, find ‘k’ houses in the database that are most similar to it based on the chosen features. Similarity is typically measured using a distance metric like Euclidean distance.
Average Sale Values: Calculate the average sale value of these ‘k’ nearest houses. This average is used as the estimated sale value of the house in question.
This method assumes that houses similar in key characteristics will have similar sale values, making it a practical tool for real estate valuation.
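A minimal sketch of this procedure; the features, records, and query house are all invented, and in practice the features would be normalised before measuring distance:

    import math

    # Hypothetical sale records: (size_m2, bedrooms, distance_to_city_km, price)
    sales = [
        (120, 3, 5.0, 350_000),
        (90, 2, 3.0, 300_000),
        (150, 4, 10.0, 380_000),
        (85, 2, 2.5, 310_000),
        (130, 3, 6.0, 355_000),
    ]

    def estimate_price(query, k=3):
        # Euclidean distance over the chosen features (unscaled for simplicity).
        def dist(record):
            return math.dist(query, record[:3])
        nearest = sorted(sales, key=dist)[:k]
        return sum(price for *_, price in nearest) / k   # average of the k neighbours

    print(estimate_price((110, 3, 4.0)))   # estimate for a hypothetical house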
Describe in detail how you would go about the task of building up a
large index of documents, such as that used by internet search
engines. Describe how anchor text can be used to index pages
under terms that do not even appear on the listed page. Make use
of appropriate examples to illustrate your answer.
Crawling: Use web crawlers or spiders to systematically browse the web and gather documents or web pages.
Parsing and Processing: Extract relevant information from these documents, including text, links, and metadata. This step involves tasks like removing HTML tags and dealing with different formats and languages.
Indexing: Create an index to store and retrieve information efficiently. This usually involves creating a database of words and their occurrences in documents. Data structures like inverted indices are commonly used, mapping terms to their locations in documents.
Use of Anchor Text: Anchor text, the clickable text in a hyperlink, is crucial for indexing. Search engines use anchor text to index pages under terms that might not appear on the page itself. For example, if many sites link to a certain webpage with the anchor text “best coffee recipes,” the search engine might index that page under “coffee recipes” even if those exact words don’t appear on the page.
Ranking Algorithm: Develop algorithms to rank the indexed pages based on relevance and authority. This often involves complex algorithms like PageRank, which considers factors such as the number and quality of inbound links to a page.
Updating: Regularly update the index to include new pages, remove obsolete ones, and refresh existing ones to reflect content changes.
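A small sketch of an inverted index that also files pages under the anchor text of links pointing at them; the pages and anchors are invented for illustration:

    from collections import defaultdict

    # Hypothetical crawled pages and the anchor text of links pointing to them.
    pages = {
        "page1": "espresso and filter brewing techniques",
        "page2": "our favourite beans",
    }
    anchors = {                       # target page -> anchor texts used by linking pages
        "page1": ["best coffee recipes"],
    }

    index = defaultdict(set)          # term -> set of pages
    for url, text in pages.items():
        for term in text.split():
            index[term].add(url)
    for url, texts in anchors.items():
        for text in texts:
            for term in text.split():
                index[term].add(url)  # page1 is now indexed under "coffee",
                                      # even though the word never appears on it

    print(sorted(index["coffee"]))    # ['page1']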
Briefly characterize the difference between ordinary heuristic
search and adversary search. In what way does the MiniMax
search algorithm cater for this difference in adversary search
problems?
Ordinary heuristic search is used in problems where the outcome depends solely on the choices of a single agent in a static environment. In contrast, adversary search is used in competitive scenarios where multiple agents (like players in a game) make decisions that affect each other, often with opposing goals.
The MiniMax search algorithm caters to adversary search by considering the strategy of both players. It alternates between maximizing the advantage for the player whose turn it is (the maximizer) and minimizing it for the opponent (the minimizer). This approach allows for predicting the opponent’s best possible moves and choosing the best counter-strategy, making it well-suited for two-player games like chess or tic-tac-toe.
(Refers to a game-tree diagram from the original exam paper, not reproduced here: node D, with a value of 2.)
Briefly outline how α-β pruning further improves the efficiency of
the MiniMax algorithm. Use a diagram to illustrate your answer.
Alpha-Beta (α-β) pruning improves the efficiency of the MiniMax algorithm by reducing the number of nodes it needs to evaluate in the game tree. It does this by pruning (cutting off) branches that cannot possibly influence the final decision.
The process involves two parameters, alpha (α) and beta (β), which represent the minimum score that the maximizing player is assured of and the maximum score that the minimizing player is assured of, respectively. When the algorithm determines that a certain branch cannot improve the outcome for the current player (when α ≥ β), it stops exploring that branch, as it will not affect the final decision. This pruning significantly reduces the number of nodes visited, leading to faster decision-making in games.
Briefly describe how N-grams are used in the parsing process.
How do they distinguish between valid and invalid syntactic forms?
N-grams are used in the parsing process as a statistical tool to predict the likelihood of a sequence of words or characters. An N-gram model uses the probability of a word or character occurring based on the occurrence of its preceding N-1 words or characters.
In parsing, N-grams help distinguish between valid and invalid syntactic forms by analyzing the probability of word sequences. Sequences that occur more frequently in the training corpus are considered more likely to be syntactically valid. Conversely, sequences with low probability, which rarely or never appear in the training data, are more likely to be syntactically invalid.
For example, in a bigram (2-gram) model, the model checks the probability of each word occurring after its immediate predecessor. If a particular word sequence like “the cat” appears often in the training data, but a sequence like “cat the” rarely appears, the model will favor “the cat” as a valid syntactic form over “cat the.”
This statistical approach does not guarantee perfect syntactic validity but provides a practical method for approximating language patterns based on observed data.
Write down the PageRank formula, explaining each of its terms.
Describe the relevance of a random walk to this formula
The PageRank of a page is the chance that
a random surfer will land on that page. The formula is:
PR(A) = (1 - d) + d × ( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )
PR(A): PageRank of page A. This is what the algorithm is trying to calculate - the 'importance' or 'authority' of page A on the web.
d: The damping factor, a constant usually set around 0.85. It represents the probability that the random surfer follows one of the outbound links on the current page - a way to model a user sticking to a chain of links.
(1-d): The complementary probability (about 0.15) that the surfer ignores the links and jumps to a random page instead; this term gives every page a small baseline amount of PageRank.
PR(Ti): PageRank of pages Ti which link to page A. These are the pages that have outbound links pointing to page A. Their PageRank contributes to the PageRank of page A.
C(Ti): The number of outbound links on page Ti. This is used to divide the PageRank of Ti. The idea is that the ‘vote’ of a page is split between all the links it contains; the more links, the less ‘voting power’ each link has.
The sum: The formula sums the contributions of all pages linking to page A. Each contribution is the PageRank of the linking page divided by the number of links on that page.
Describe how you would use a heuristic function to identify a solution
to the 16-tile puzzle. Describe your heuristic function in detail.
For the 16-tile puzzle, a heuristic function can be used to estimate the cost (or steps) remaining to reach the goal state from the current state. A popular and effective heuristic is the “Manhattan Distance” heuristic:
Manhattan Distance: This heuristic calculates the total distance each tile is from its goal position, considering only vertical and horizontal moves. For each tile, the distance is the sum of the horizontal and vertical distances to its correct location.
Implementation: For each tile in the current state, compute how many rows and columns it is away from its target position in the goal state. Sum these distances for all tiles (except the blank or zero tile).
Advantages: The Manhattan Distance heuristic is admissible (never overestimates the true cost to reach the goal) and consistent, making it suitable for use in A* or similar search algorithms. It’s effective because it gives a clear idea of the minimum number of moves needed to get each tile into its correct position.
This heuristic helps guide the search algorithm by favoring moves that bring tiles closer to their correct positions, thereby leading to a more efficient search for the solution.
(Refers to a game-tree diagram from the original exam paper, not reproduced here: node D, with a value of 6.)
Briefly describe α-β pruning. How does it improve upon MiniMax
search? Make use of an appropriate example to illustrate your
answer.
Alpha-Beta (α-β) pruning improves MiniMax search by eliminating branches in the game tree that don’t affect the final decision. It uses two parameters, alpha (α) and beta (β), to track the minimum score the maximizer is assured and the maximum score the minimizer can achieve. If a move is found to be worse than a previously evaluated move, further exploration of that branch is unnecessary and is pruned. This reduces the number of nodes the algorithm needs to evaluate, significantly enhancing efficiency without compromising the accuracy of the result.
What is meant by a random walk of a graph? In what way does
this model the behaviour of some “random surfer” and why is
this useful?
A random walk of a graph is a process where steps are taken from one vertex to another randomly, following the edges of the graph. This models the behavior of a “random surfer” on the internet, who clicks on links and moves from one webpage to another without a specific goal.
This concept is useful, particularly in algorithms like Google’s PageRank, as it simulates how an average user might navigate the web. It helps in determining the importance or ranking of webpages based on the probability of landing on them through random navigation. Pages frequently visited in random walks are deemed more important, thus are ranked higher in search engine results.
Write down the PageRank formula, explaining each of its terms.
Outline the significance of the damping factor (d) to the random
walk.
The PageRank of a page is the chance that
a random surfer will land on that page. The formula is:
PR(A) = (1 - d) + d × ( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )
PR(A): PageRank of page A. This is what the algorithm is trying to calculate - the 'importance' or 'authority' of page A on the web.
d: The damping factor, a constant usually set around 0.85. It represents the probability that the random surfer follows one of the outbound links on the current page - a way to model a user sticking to a chain of links.
(1-d): The complementary probability (about 0.15) that the surfer ignores the links and jumps to a random page instead; this term gives every page a small baseline amount of PageRank.
PR(Ti): PageRank of pages Ti which link to page A. These are the pages that have outbound links pointing to page A. Their PageRank contributes to the PageRank of page A.
C(Ti): The number of outbound links on page Ti. This is used to divide the PageRank of Ti. The idea is that the ‘vote’ of a page is split between all the links it contains; the more links, the less ‘voting power’ each link has.
The sum: The formula sums the contributions of all pages linking to page A. Each contribution is the PageRank of the linking page divided by the number of links on that page.
Briefly outline the generate-and-test approach to problem
solving. How does heuristic search compare with blind search,
when finding solutions to problems?
The generate-and-test approach involves creating potential solutions and testing them until a satisfactory one is found. Heuristic search uses rule-of-thumb strategies to guide the search for solutions efficiently, while blind search explores all possibilities systematically without guidance, potentially taking more time and resources.
How would you adopt the generate-and-test approach to finding
solution(s) to the 16-tile puzzle? Describe your heuristic function
in detail. Make use of an appropriate example to illustrate your
answer.
To solve the 16-tile puzzle using the generate-and-test approach:
Generate possible moves.
Use the Manhattan distance heuristic to estimate how close each state is to the goal.
Select the state with the lowest heuristic value.
Repeat steps 1-3 until the goal state is reached.
For example, from the current board we generate each legal move (such as sliding tile 1 up), calculate the Manhattan distance of each resulting state, and choose the move whose state has the lowest value.
(Refers to a game-tree diagram from the original exam paper, not reproduced here: node D, with a value of -2.)
Using an example or otherwise, illustrate the difference
between the surface structure and deep structure of a
sentence. What is the relevance of phrase structured grammars
to deep structure?
Surface structure pertains to the grammatical form of a sentence, while deep structure relates to its underlying meaning. Phrase-structured grammars are important for understanding deep structure by dissecting the sentence’s syntactic structure to uncover its core semantic content.
Describe the operation of an heuristic search process. In what way
does the heuristic function improve the search process?
A heuristic search process is a problem-solving technique that uses heuristic functions to guide the search for solutions efficiently. It works by evaluating each potential solution based on the heuristic function’s estimate of its desirability or proximity to the goal. The heuristic function improves the search process by providing a “rule of thumb” to prioritize the most promising solutions, reducing the search space and making it more efficient. This helps focus the search on paths that are more likely to lead to the desired solution, saving time and resources compared to blind search methods.
Describe in detail the specific heuristic you would use to help solve
the 15/16 tile puzzle. Make reference to the specific features of the
board-state that you would look for and how you might identify them.
(You do not need to write code, but code should be very easy to write
from your description of the heuristic). Justify your heuristic.
Manhattan Distance: For each tile, calculate the distance from its current position to its goal position in terms of the number of moves (up, down, left, right). The sum of these distances for all tiles provides a heuristic estimate of the minimum number of moves needed to solve the puzzle.
Linear Conflict: This adds to the Manhattan Distance. If two tiles are in their goal row or column but are in the wrong order, they are in a linear conflict. This conflict means additional moves are required. For each pair of conflicting tiles, add 2 to the heuristic (since at least two moves are needed to resolve the conflict).
Identify these features as follows:
For Manhattan Distance, calculate the absolute difference in the row and column indices of a tile’s current and goal positions.
For Linear Conflict, check each row and column. If two tiles are in their goal row or column but out of order, count this as a conflict.
Justification:
The Manhattan Distance is a lower bound of the moves needed, as it assumes no other tiles are in the way.
The Linear Conflict accounts for the fact that tiles can block each other, making it a more accurate and often more efficient heuristic.
This heuristic is admissible (never overestimates the cost to reach the goal) and takes into account both the individual tile positions and their relative positions, leading to efficient puzzle solving.
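A sketch of this heuristic for the 15-tile puzzle, assuming the board is encoded as a flat tuple with 0 for the blank (the encoding and example state are illustrative):

    SIZE = 4                                  # 4x4 board for the 15-tile puzzle

    def manhattan(state, goal):
        # Sum of row + column distances of every tile from its goal position.
        total = 0
        for idx, tile in enumerate(state):
            if tile == 0:                     # skip the blank
                continue
            g = goal.index(tile)
            total += abs(idx // SIZE - g // SIZE) + abs(idx % SIZE - g % SIZE)
        return total

    def linear_conflicts(state, goal):
        # Add 2 for each pair of tiles in their goal row but in reversed order.
        extra = 0
        for row in range(SIZE):
            tiles = [state[row * SIZE + c] for c in range(SIZE)]
            in_row = [t for t in tiles if t != 0 and goal.index(t) // SIZE == row]
            for i in range(len(in_row)):
                for j in range(i + 1, len(in_row)):
                    if goal.index(in_row[i]) > goal.index(in_row[j]):
                        extra += 2
            # (a full implementation would repeat the same check for columns)
        return extra

    goal = tuple(range(1, 16)) + (0,)
    state = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 11, 13, 14, 15, 0)  # 11 and 12 swapped
    print(manhattan(state, goal) + linear_conflicts(state, goal))    # 2 + 2 = 4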
Describe how MiniMax differs from ordinary heuristic-guided search.
MiniMax is a decision-making algorithm commonly used in two-player, zero-sum games like chess and tic-tac-toe. It differs from ordinary heuristic-guided search by considering both players’ strategies. MiniMax evaluates possible moves by assuming that one player aims to maximize their outcome while the other aims to minimize it. It recursively explores game states by alternating between maximizing and minimizing players, selecting the move that leads to the best possible outcome for the maximizing player. In contrast, ordinary heuristic-guided search uses heuristics to evaluate positions without considering the opponent’s counter-strategy and doesn’t necessarily guarantee optimal decisions in adversarial settings.
(Refers to a game-tree diagram from the original exam paper, not reproduced here: node C, with a value of 2.)
What is a term-document index? Describe the structure of a term
document index such as might be used for the WWW. In particular,
discuss some of the information that you might like this index to
contain.
A term-document index is a data structure used in information retrieval systems to efficiently store and retrieve information about the relationships between terms (words or phrases) and documents (web pages, articles, etc.). In the context of the World Wide Web (WWW), a term-document index would contain information like:
Term Frequencies: For each term, it would store the number of times that term appears in each document, helping to assess the relevance of documents to a search query.
Inverse Document Frequencies (IDF): It would include IDF scores for each term, which represent how unique or common a term is across all documents. This helps in ranking the importance of terms in a document collection.
Document Metadata: Information about each document, such as its title, URL, publication date, and author, to provide context and facilitate document retrieval.
Term Positions: Optionally, it might store the positions of terms within documents, which can be useful for proximity-based searches.
Links: Information about hyperlinks between documents, allowing for the analysis of link structures and the development of algorithms like PageRank.
Query History: It may also track user search queries and click-through patterns to improve search relevance and user experience.
A well-structured term-document index enhances the efficiency and accuracy of information retrieval on the web, enabling search engines to quickly identify and present relevant documents to users based on their queries
What is meant by the joint probability of a word sequence? Briefly
state the problem that novel words present when calculating the joint
probability of a word sequence? Briefly outline one technique that can
overcome this problem with novel words.
The joint probability of a word sequence refers to the likelihood of a particular sequence of words occurring together in a given context or document. The problem with novel words is that they are words not previously encountered in the training data, making it challenging to estimate their probabilities accurately using traditional language models. One technique to address this issue is to use smoothing methods like Laplace smoothing or add-one smoothing, which assign a small, non-zero probability to unseen words. This helps prevent zero probabilities for novel words and improves the overall robustness of language models when calculating joint probabilities.
Detail each of the steps of the A* search algorithm. Briefly state
the objective behind the A* algorithm.
Initialization: Start with an open list containing only the initial node and a closed list that is empty.
Loop: Repeat the following steps until the open list is empty:
Choose Node: Select the node with the lowest score from the open list. The score is calculated as
f(n)=g(n)+h(n), where
g(n) is the cost to reach the node, and
h(n) is the estimated cost from the node to the goal (heuristic).
Goal Check: If the chosen node is the goal, backtrace the path from the goal to the start and return this path.
Generate Successors: Identify all the neighbors of the current node and calculate their scores.
Update Lists: For each neighbor, if it is not in the open list, add it. If it is already in the open or closed list with a higher score, update it with the lower score and change its parent to the current node.
Move Node: Move the current node from the open list to the closed list.
Completion: The algorithm terminates when either the goal is found (and the path is returned) or the open list is empty (indicating no path).
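A compact sketch of these steps in Python for a grid world, assuming 4-connected movement, unit step costs and a Manhattan-distance heuristic; the grid itself is illustrative:

import heapq

def a_star(grid, start, goal):
    # grid is a set of walkable (row, col) cells; start and goal are members of it
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])   # heuristic h(n)

    open_list = [(h(start), 0, start)]          # entries are (f, g, node)
    parent, g_cost, closed = {start: None}, {start: 0}, set()
    while open_list:
        f, g, node = heapq.heappop(open_list)   # choose node with lowest f = g + h
        if node == goal:                        # goal check: backtrack via parents
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        if node in closed:
            continue
        closed.add(node)                        # move node to the closed list
        r, c = node
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):   # successors
            if nb in grid and g + 1 < g_cost.get(nb, float("inf")):
                g_cost[nb], parent[nb] = g + 1, node                  # keep lower score
                heapq.heappush(open_list, (g + 1 + h(nb), g + 1, nb))
    return None                                 # open list empty: no path exists

grid = {(r, c) for r in range(3) for c in range(3)} - {(1, 1)}
print(a_star(grid, (0, 0), (2, 2)))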
What is adversary search and how does it differ from (ordinary)
heuristic search?
Adversary search is a problem-solving technique commonly used in two-player, zero-sum games like chess or tic-tac-toe. It involves considering the strategies of both players, one trying to maximize their outcome while the other aims to minimize it. Adversary search algorithms, such as MiniMax, explore potential game states by alternating between maximizing and minimizing players to find the optimal move for one player while taking into account the counter-strategy of the opponent.
In contrast, ordinary heuristic search focuses on finding optimal solutions in domains without a competing adversary. It uses heuristic functions to guide the search by estimating the desirability of each state, aiming to reach a goal state efficiently. Unlike adversary search, ordinary heuristic search does not consider an opponent’s strategy and is typically used in single-player problem-solving scenarios.
C with a value of 1.
Briefly state what is the Turing test? What is the significance of
the Chinese Room argument in relation to artificial intelligence?
In what way, if any, do CAPTCHAs relate to the Turing test?
The Turing test is a test of a machine’s ability to exhibit human-like intelligence in natural language conversation, where a human evaluator interacts with both a machine and a human without knowing which is which.
The Chinese Room argument, proposed by John Searle, challenges the notion of true understanding in artificial intelligence systems by depicting a scenario where a person inside a room follows instructions to manipulate symbols in a foreign language without actually comprehending it.
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) relates to the Turing test as it serves as a practical application of the Turing test’s principles in online security by distinguishing between humans and automated programs through challenges that are easy for humans but difficult for machines.
What is meant by the joint probability of a word sequence? Why
does approximating the joint probability of a sequence often
produce better results than calculating the actual probability?
Make use of an appropriate example to illustrate your answer.
Joint probability of a word sequence refers to the likelihood of a specific sequence of words occurring together in a given context or document.
Approximating the joint probability often produces better results because the exact chain-rule calculation conditions each word on its entire preceding history; most long histories never appear in the training data, so the exact estimates suffer from severe data sparsity and collapse to vanishingly small or zero values. N-gram models approximate the joint probability by conditioning each word on only the previous one or two words (the Markov assumption).
For example, for the sentence “The quick brown fox jumps over the lazy dog,” the exact calculation needs an estimate of P(dog | The quick brown fox jumps over the lazy), a history that will almost never be seen in a training corpus, whereas a bigram approximation only needs P(dog | lazy), which can be estimated reliably from counts. The approximation therefore yields more robust and more useful probabilities for language modelling and other natural language processing tasks.
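A minimal sketch of the bigram (Markov) approximation for that sentence, using raw counts from a toy corpus; in practice the counts would come from a large corpus and be smoothed:

from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, word):
    # P(word | prev): the Markov approximation of P(word | entire history)
    return bigrams[(prev, word)] / unigrams[prev]

def joint_approx(words):
    # Approximate the joint probability as a product of bigram probabilities
    p = unigrams[words[0]] / len(corpus)
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(joint_approx("the quick brown fox".split()))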
What is tf-idf? Describe how this term can influence the results
returned by a search engine. Illustrate your answer with a brief
example.
TF-IDF stands for Term Frequency-Inverse Document Frequency.
It is a numerical statistic used in information retrieval and text mining to evaluate the importance of a term within a document relative to a collection of documents.
TF measures how frequently a term appears in a document, while IDF assesses how unique or common a term is across the entire document collection.
Higher TF-IDF scores indicate that a term is both frequent in a document and relatively rare across the collection, making it more important.
For example, in a search engine, if a user queries “machine learning,” the search engine calculates the TF-IDF scores for documents containing these terms. A document discussing machine learning extensively (high TF) but not mentioned frequently in other documents (high IDF) will have a high TF-IDF score, indicating it’s highly relevant and should be prioritized in search results.
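A minimal sketch of tf-idf scoring for the query “machine learning” over a toy collection; the documents are illustrative, and real engines normalise and combine these scores with many other signals:

import math

docs = {
    "d1": "machine learning improves machine translation",
    "d2": "the stock market rose again today",
    "d3": "learning about financial markets",
}
tokens = {d: text.lower().split() for d, text in docs.items()}
N = len(docs)

def tf_idf(term, doc):
    tf = tokens[doc].count(term)                                  # term frequency
    df = sum(1 for words in tokens.values() if term in words)     # document frequency
    return tf * math.log(N / df) if df else 0.0                   # weight = tf * idf

query = ["machine", "learning"]
scores = {d: sum(tf_idf(t, d) for t in query) for d in tokens}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))  # d1 ranks first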
Outline an algorithm to interpret the analogy between two given
concepts. Make reference to any specific representation
requirements of your algorithm. Illustrate your answer by use of
an appropriate example.
Algorithm to Interpret Analogy:
Input: The algorithm takes two pairs of concepts, A (A1, A2) and B (B1, B2), where A1:A2 :: B1:?, and it aims to find the missing concept B2.
Representation Requirements: Concepts must be represented in a way that allows for comparison. This could be through vector embeddings or symbolic representations.
Step 1: Find Concept Vectors: If using vector embeddings, represent A1, A2, and B1 as vectors in a high-dimensional space.
Step 2: Calculate Analogy Vector: Calculate the analogy vector as (A2 - A1) + B1.
Step 3: Find Nearest Neighbor: Search for the concept in the dataset that is closest to the analogy vector using cosine similarity or another similarity measure. This concept represents B2.
Example:
A1: Man
A2: Woman
B1: King
We want to find B2, which corresponds to “Queen.”
Represent “Man” and “Woman” as vectors in a word embedding space.
Calculate the analogy vector: (Woman - Man) + King.
Search for the word whose embedding lies nearest to this analogy vector (excluding the input words); the nearest neighbour is “Queen”, so B2 = Queen.
This algorithm interprets analogies by leveraging vector representations to identify the missing concept in the analogy, demonstrating how word embeddings can capture semantic relationships.
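A minimal sketch of the vector-arithmetic step with hand-made 2-dimensional embeddings; the numbers are purely illustrative assumptions, whereas real systems use learned, high-dimensional vectors:

import math

vecs = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.2],
    "queen": [3.0, 1.2],
    "apple": [0.1, 0.3],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Analogy vector: (woman - man) + king
target = [w - m + k for m, w, k in zip(vecs["man"], vecs["woman"], vecs["king"])]

# The nearest neighbour to the analogy vector, excluding the input words, is taken as B2
candidates = {w: v for w, v in vecs.items() if w not in {"man", "woman", "king"}}
print(max(candidates, key=lambda w: cosine(vecs[w], target)))   # queen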
What is the Turing test? Discuss the relevance of the Chinese
Room argument in relation to the Turing test? What is the
relevance, if any, of CAPTCHA to the Turing test?
Turing Test:
Proposed by Alan Turing in 1950.
A test of machine intelligence.
Involves a human evaluator interacting with both a machine and a human, trying to determine which is which based on their responses.
If the evaluator cannot reliably distinguish the machine from the human, the machine is said to have passed the Turing test, demonstrating human-like intelligence.
Relevance of the Chinese Room Argument:
The Chinese Room argument, proposed by John Searle, challenges the idea that passing the Turing test equates to true understanding or consciousness.
In the Chinese Room scenario, someone inside a room follows instructions to manipulate Chinese symbols, producing coherent responses without understanding Chinese.
This argument suggests that purely symbolic or syntactic manipulation, as in traditional AI, may not imply true comprehension or consciousness.
It raises questions about whether the Turing test truly measures machine intelligence or merely surface-level behavior.
Relevance of CAPTCHA:
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is essentially a reversed, automated Turing test: a computer sets a challenge that is easy for humans but hard for current programs, in order to distinguish human users from bots.
Unlike the original test, the judge is a machine and the aim is to detect humans rather than to demonstrate machine intelligence.
Briefly state the benefits offered by the A* algorithm, when
compared to simpler search processes? List and describe each
step of the A* search algorithm.
Benefits of A*: compared with simpler search processes, A* uses a heuristic to concentrate effort on promising paths, so it typically expands far fewer nodes than blind search; with an admissible heuristic it is still guaranteed to return an optimal (lowest-cost) path, combining the completeness of breadth-first search with the efficiency of greedy best-first search.
Initialization: Start with an open list containing only the initial node and a closed list that is empty.
Loop: Repeat the following steps until the open list is empty:
Choose Node: Select the node with the lowest score from the open list. The score is calculated as
f(n)=g(n)+h(n), where
g(n) is the cost to reach the node, and
h(n) is the estimated cost from the node to the goal (heuristic).
Goal Check: If the chosen node is the goal, backtrace the path from the goal to the start and return this path.
Generate Successors: Identify all the neighbors of the current node and calculate their scores.
Update Lists: For each neighbor, if it is not in the open list, add it. If it is already in the open or closed list with a higher score, update it with the lower score and change its parent to the current node.
Move Node: Move the current node from the open list to the closed list.
Completion: The algorithm terminates when either the goal is found (and the path is returned) or the open list is empty (indicating no path).
How does adversary search differ from standard (ordinary)
heuristic search? What impact does this difference have on the
search mechanism used to solve adversary search problems?
Adversary Search vs. Standard Heuristic Search:
Adversary Search:
Used in two-player, zero-sum games.
Considers strategies of both players.
One player maximizes while the other minimizes.
Examples: Chess, Tic-Tac-Toe.
Standard Heuristic Search:
Used in single-player problem-solving.
Focuses on finding optimal solutions.
Uses heuristics to estimate desirability.
Examples: Route planning, puzzle solving.
Impact on Search Mechanism:
Adversary Search:
Alternates between maximizing and minimizing players.
Typically employs the MiniMax algorithm.
Requires evaluating possible outcomes of both players’ moves.
Aims to find the best move for the maximizing player while considering the opponent’s counter-strategy.
Standard Heuristic Search:
Focuses on finding the optimal solution.
Utilizes heuristics to guide the search.
Typically uses algorithms like A* or IDA*.
Doesn’t involve an adversary; its goal is to reach a desired state efficiently without considering an opponent’s strategy.
D with a value of 3.
What is a random walk of the World Wide Web (WWW)? What
is the relevance of the random walk algorithm to the ranking of
web documents?
Random Walk of the WWW:
A random walk of the World Wide Web involves navigating the web by following hyperlinks from one web page to another in a random or probabilistic manner.
It simulates a user browsing the web by making random decisions to click on links and visit different web pages.
Relevance of the Random Walk Algorithm to Ranking Web Documents:
The random walk algorithm, particularly the PageRank algorithm developed by Larry Page and Sergey Brin, plays a crucial role in ranking web documents.
PageRank assigns importance scores to web pages based on their connectivity and the links they receive from other authoritative pages.
Pages with higher PageRank scores are considered more relevant and authoritative, impacting their position in search engine rankings.
It helps search engines like Google provide more accurate and valuable search results by prioritizing pages with higher PageRank scores.
What is tf-idf? How does it help identify those web-pages that
are most relevant to a given web query.
TF-IDF (Term Frequency-Inverse Document Frequency):
A numerical statistic used in information retrieval and text mining.
Measures the importance of a term within a document relative to a collection of documents.
Combines two components: TF (Term Frequency) and IDF (Inverse Document Frequency).
How TF-IDF Helps Identify Relevant Web Pages:
Term Frequency (TF):
Measures how frequently a term appears in a document.
High TF indicates the term’s significance in the document.
Inverse Document Frequency (IDF):
Measures how unique or common a term is across the entire document collection.
High IDF indicates a term’s rarity and importance.
Combining TF and IDF:
TF-IDF calculates a score for each term in a document.
Terms with high TF and high IDF are considered more relevant.
Relevance to Web Queries:
When a user enters a web query, search engines calculate the TF-IDF scores of terms in both the query and web pages.
Web pages with higher TF-IDF scores for query terms are considered more relevant.
This helps search engines identify and rank web pages that best match the user’s query by considering both term frequency within documents and term rarity across the entire web collection.
Describe what is meant by heuristic search. Describe in detail a
specific heuristic to efficiently solve one of the following
problems: 15-tile puzzle, Farmer-Fox-Goose problem, Sudoku
puzzle or other relevant puzzle. Make use of specific examples
of relevant search states in your answer.
Heuristic Search:
Problem-solving technique using heuristic functions to guide search.
Heuristics estimate the desirability of states to reach a goal efficiently.
15-Tile Puzzle Heuristic: Manhattan Distance:
Heuristic function: the sum, over all tiles, of the horizontal and vertical distances between each tile’s current position and its goal position.
Example search states: the goal state scores 0; a state in which tile 11 and tile 15 are each one square away from their goal positions scores 2; a heavily scrambled starting state scores much higher. The search expands states with lower Manhattan-distance totals first, moving steadily towards the goal configuration.
Outline how the A* search algorithm operates. In your answer,
address how (and why) A* imposes a tree structure on its
solution space.
Initialization: Start with an open list containing only the initial node and a closed list that is empty.
Loop: Repeat the following steps until the open list is empty:
Choose Node: Select the node with the lowest score from the open list. The score is calculated as
f(n)=g(n)+h(n), where
g(n) is the cost to reach the node, and
h(n) is the estimated cost from the node to the goal (heuristic).
Goal Check: If the chosen node is the goal, backtrace the path from the goal to the start and return this path.
Generate Successors: Identify all the neighbors of the current node and calculate their scores.
Update Lists: For each neighbor, if it is not in the open list, add it. If it is already in the open or closed list with a higher score, update it with the lower score and change its parent to the current node.
Move Node: Move the current node from the open list to the closed list.
Completion: The algorithm terminates when either the goal is found (and the path is returned) or the open list is empty (indicating no path).
Tree structure: A* imposes a tree structure on the solution space because each expanded node records a single parent, namely the node on the cheapest route found to it so far. The explored states therefore form a search tree rooted at the start node, which lets the final path be reconstructed by backtracking through parent pointers and prevents the same state from being retained along more expensive routes.
Describe how the A* algorithm might be adapted to help in pathfinding (path planning) for a real or simulated mobile robot. Make
reference to how the A* algorithm performs optimally
Adapting A* for Pathfinding in Mobile Robots:
A* is a popular choice for path planning in mobile robotics as it combines the advantages of both breadth-first and greedy best-first search algorithms.
Adaptations for mobile robots:
Define a grid-based or graph-based representation of the robot’s environment where nodes represent discrete locations or states.
Implement a heuristic function that estimates the cost from the current robot position to the goal.
Incorporate movement constraints, such as obstacles and permissible directions.
Utilize a priority queue to explore states with lower estimated costs first.
Optimality of A*:
A* guarantees optimality when:
The heuristic is admissible (never overestimates true costs).
The robot moves in discrete steps with a finite number of states.
The heuristic is consistent (monotone): h(x) ≤ c(x, y) + h(y) for every step from x to a neighbour y, where c is the step cost and h is the heuristic.
A* maintains two lists: one for open nodes to be explored and one for closed nodes that have been evaluated.
It expands nodes with the lowest cost (g(n) + h(n)) first, where g(n) is the cost to reach the node, and h(n) is the heuristic cost to reach the goal from the node.
By considering both the cost to reach a state and the estimated cost to the goal, A* explores the most promising paths early, ensuring an optimal solution when the conditions are met.
Adapting A* for path planning in mobile robots involves incorporating the robot’s specific movement capabilities and environmental constraints into the search algorithm while maintaining A*’s optimality under appropriate conditions.
In what way does adversary search differ from traditional
heuristic search processes? Illustrate your answer with
reference to appropriate examples.
Adversary Search vs. Traditional Heuristic Search:
Adversary Search:
Used in two-player games (e.g., chess).
Considers both players’ strategies.
Examples: Chess, Tic-Tac-Toe.
Traditional Heuristic Search:
Used in single-player problem-solving.
Focuses on finding optimal solutions.
Examples: Route planning, puzzle solving.
Illustration: Chess - Adversary Search vs. Traditional Heuristic Search:
Adversary Search: In chess, considers White’s and Black’s moves, optimizing for one player while considering the opponent’s counter-strategy.
Traditional Heuristic Search: In chess, aims to find the best move for one player independently without considering an opponent’s response.
Write down the PageRank formula, explaining each of its terms.
Briefly comment on the relevance of iterative convergence to the
calculations of this formula.
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
The PageRank of a page is the chance that a random surfer will land on that page.
PR(A): PageRank of page A. This is what the algorithm is trying to calculate - the ‘importance’ or ‘authority’ of page A on the web.
(1-d): The probability (about 0.15) that the random surfer abandons the chain of links and jumps to a random page instead of following an outbound link. This term guarantees that every page receives some minimum PageRank, even a page with no in-links.
d: The damping factor, a constant usually set to around 0.85. It is the probability that the surfer continues browsing by following an outbound link from the current page; it models a user sticking to a chain of links.
PR(Ti): PageRank of pages Ti which link to page A. These are the pages that have outbound links pointing to page A. Their PageRank contributes to the PageRank of page A.
C(Ti): The number of outbound links on page Ti. This is used to divide the PageRank of Ti. The idea is that the ‘vote’ of a page is split between all the links it contains; the more links, the less ‘voting power’ each link has.
The sum: The formula sums the contributions of all pages linking to page A. Each contribution is the PageRank of the linking page divided by the number of links on that page.
Iterative convergence: because every page’s PageRank depends on the PageRank of the pages that link to it, the formula cannot be evaluated in a single pass. All pages are given an initial value, the formula is applied repeatedly, and the values converge towards a stable fixed point after a number of iterations; those converged values are the final PageRanks.
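A minimal sketch of the iterative calculation on a toy three-page link graph; the graph, the damping value and the fixed iteration count are illustrative assumptions:

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # page -> pages it links out to
d = 0.85
pr = {page: 1.0 for page in links}                  # initial guess

for _ in range(50):                                 # repeat until the values settle
    new = {}
    for page in links:
        incoming = sum(pr[t] / len(links[t]) for t in links if page in links[t])
        new[page] = (1 - d) + d * incoming          # PR(A) = (1-d) + d * sum PR(Ti)/C(Ti)
    pr = new

print(pr)   # C, linked to by both A and B, ends up with the highest rank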
How can a heuristic function be combined with a simple iterative
search process to generate solutions to seemingly complex
problems? Focus your answer on the role played by the heuristic
function with a search process.
Heuristic function estimates state or action desirability.
Guides iterative search by prioritizing states.
Enhances efficiency in complex problems.
Example: In the Traveling Salesman Problem, estimates distances to unvisited cities, guiding the search towards shorter routes.
Crucial for focusing search on promising solutions.
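A minimal sketch of such a heuristic-guided iterative search for the Travelling Salesman Problem, where the heuristic is simply the distance to the nearest unvisited city; the city coordinates are illustrative assumptions:

import math

cities = {"A": (0, 0), "B": (1, 5), "C": (4, 1), "D": (6, 4)}   # toy coordinates

def dist(p, q):
    return math.dist(cities[p], cities[q])

def nearest_neighbour_tour(start):
    tour, unvisited = [start], set(cities) - {start}
    while unvisited:
        # Heuristic: extend the tour with the closest unvisited city
        nxt = min(unvisited, key=lambda city: dist(tour[-1], city))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

print(nearest_neighbour_tour("A"))   # a short (though not necessarily optimal) tour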
List each of the steps of the A* algorithm. If you wish, you may
use diagrams to illustrate your answer.
Initialization: Start with an open list containing only the initial node and a closed list that is empty.
Loop: Repeat the following steps until the open list is empty:
Choose Node: Select the node with the lowest score from the open list. The score is calculated as
f(n)=g(n)+h(n), where
g(n) is the cost to reach the node, and
h(n) is the estimated cost from the node to the goal (heuristic).
Goal Check: If the chosen node is the goal, backtrace the path from the goal to the start and return this path.
Generate Successors: Identify all the neighbors of the current node and calculate their scores.
Update Lists: For each neighbor, if it is not in the open list, add it. If it is already in the open or closed list with a higher score, update it with the lower score and change its parent to the current node.
Move Node: Move the current node from the open list to the closed list.
Completion: The algorithm terminates when either the goal is found (and the path is returned) or the open list is empty (indicating no path).
What do you think are the most significant differences between
(ordinary) heuristic search and adversary search? Justify your
answer.
Significant Differences Between Heuristic Search and Adversary Search:
Objective:
Heuristic Search: Seeks optimal solutions or states in single-player problems.
Adversary Search: Focuses on competing with an adversary to maximize or minimize outcomes in two-player games.
Players’ Strategies:
Heuristic Search: Assumes no opponent; it’s a single-player search process.
Adversary Search: Considers strategies of both players, one aiming to maximize and the other to minimize outcomes.
Algorithm Types:
Heuristic Search: Uses algorithms like A* or hill climbing.
Adversary Search: Utilizes MiniMax, alpha-beta pruning, or other game-specific algorithms.
Scenarios:
Heuristic Search: Applicable to a wide range of single-player problems like route planning, puzzle solving.
Adversary Search: Specifically designed for two-player, zero-sum games such as chess, checkers, or tic-tac-toe.
Justification:
The primary difference lies in their objectives and the consideration of opponent strategies. Heuristic search aims for optimal solutions in single-player problems, while adversary search deals with competitive scenarios in two-player games. This distinction leads to differences in algorithms, players’ strategies, and problem domains where each is most applicable.
Discuss how you would go about the process of measuring the
accuracy of an information retrieval process.
To measure the accuracy of an information retrieval process:
Define what constitutes “relevant” documents.
Choose metrics like precision, recall, and F1-score.
Create a test collection with queries and relevance judgments.
Run the retrieval system on the test collection.
Calculate and analyze the chosen metrics.
Iterate to refine the system and report findings.
What role does the document-term index play in information
retrieval on the web?
Document-Term Index in Information Retrieval:
Role:
The document-term index, often referred to as an inverted index, plays a central role in information retrieval on the web.
It serves as a data structure that efficiently stores and manages the mapping between terms (words or phrases) and the documents in which those terms appear.
Functions:
Efficient Searching: It enables quick and efficient searching for documents containing specific terms, reducing the search time significantly.
Ranking Relevance: The index can be used to calculate relevance scores for documents based on the frequency and location of terms, which aids in ranking search results.
Filtering: It supports the elimination of stop words (common words like “the” or “and”) and stemming (reducing words to their root form) to improve search precision.
Facilitating Retrieval: By maintaining term-document associations, the index helps retrieve relevant documents in response to user queries.
Example:
When you perform a web search, the document-term index is used to quickly identify web pages or documents containing the keywords from your query, enabling the search engine to return relevant results.
In essence, the document-term index is a fundamental component of web search engines, making the retrieval of relevant information from vast amounts of web content fast and efficient.
What is meant tf-idf in the context of information retrieval? Briefly
describe how you would expect the tf-idf value to differ for the
terms “the” and “aardvark” for a collection of English language
web-pages. You may assume that stop words (like “the”) have
not been removed from the document-term index.
TF-IDF in Information Retrieval:
TF-IDF stands for “Term Frequency-Inverse Document Frequency.”
It’s a numerical statistic used in information retrieval to assess the importance of a term within a document relative to a collection of documents.
TF-IDF combines two components: “Term Frequency” (TF), measuring how often a term appears in a document, and “Inverse Document Frequency” (IDF), estimating how unique or common a term is across the entire document collection.
Difference in TF-IDF for “the” and “aardvark”:
For a collection of English language web pages:
“the” is an extremely common word appearing frequently in nearly all documents.
“aardvark” is a relatively uncommon word, appearing in very few documents (if any) compared to “the.”
Expected TF-IDF Difference:
The TF-IDF value for “the” will be very low: its term frequency is high, but because it occurs in virtually every document its inverse document frequency is close to zero, so the product TF × IDF is tiny.
The TF-IDF value for “aardvark” will be much higher in the few documents where it appears: its raw frequency may be modest, but its rarity across the collection gives it a high IDF and therefore a large weight.
In summary, the TF-IDF value for “the” will be much lower than that for “aardvark” in a collection of English language web pages, reflecting their differing frequencies and document specificities.
Why does Google’s PageRank algorithm approximate the
behaviour of a theoretical surfer, rather than focusing on the
behaviour of an actual surfer of the web?
Google’s PageRank algorithm approximates the behavior of a theoretical surfer, rather than focusing on the behavior of an actual surfer of the web, for several reasons:
Scalability: The web is vast and constantly changing, with billions of pages and hyperlinks. It’s impractical to track the real-time behavior of every individual surfer due to the sheer volume of web activity.
Privacy: Tracking the actual behavior of web users would raise significant privacy concerns. Collecting and analyzing user data without consent would violate privacy norms and regulations.
Data Availability: Google does not have access to the complete browsing history of individual users. Even if they did, such data would be insufficient to comprehensively analyze and rank all web pages effectively.
Consistency: The behavior of individual surfers varies widely, making it challenging to establish a consistent and reliable ranking system. People’s interests, preferences, and actions online are diverse.
Quality Control: Using the behavior of a theoretical surfer allows for controlled and consistent ranking criteria, making it easier to identify and penalize manipulative practices like link farms and keyword stuffing.
By approximating the behavior of a theoretical surfer, PageRank provides a scalable, privacy-respecting, and consistent method for ranking web pages, which is essential for effective and reliable search engine performance.
Briefly outline the generate-and-test approach to problem
solving. How does heuristic search compare with blind search,
when generating solutions to complex problems?
Generate-and-Test Approach:
In the generate-and-test approach to problem-solving:
Potential solutions are generated systematically.
Each generated solution is then tested to determine if it meets the problem’s criteria or goals.
The process continues until a satisfactory solution is found or the search space is exhausted.
Heuristic Search vs. Blind Search:
Heuristic Search:
Uses heuristics (rules or guiding principles) to prioritize the generation of solutions.
Tends to be more efficient in finding solutions, especially in complex problems, by focusing on promising paths.
Examples include A* search and hill climbing.
Blind Search (or Uninformed Search):
Searches without specific guidance or heuristics.
Tends to explore the search space exhaustively, which can be inefficient in complex problems.
Examples include breadth-first search and depth-first search.
In summary, heuristic search employs heuristics to guide the generation of solutions, making it more efficient and effective in complex problems compared to blind search, which explores the search space without specific guidance.
Describe how you might go about using the A* algorithm to
calculate the navigation path of a NPC (non-player character) in
some ‘first person’ computer game
To use the A* algorithm for NPC navigation in a first-person computer game:
Define the game environment, start, and goal.
Create a grid or graph representing the game world.
Design a heuristic function estimating distances to the goal.
Set up data structures (open and closed lists) for exploration.
Implement A* to find the optimal path.
Reconstruct the path from start to goal.
Use the path to guide NPC movement in the game world, considering dynamic changes and optimizing for real-time performance.
How does changing from “ordinary” heuristic search to adversary
search impact on the search processes used to play these
games? What is the relevance of ‘zero sum games’ to adversary
search?
Changing from ordinary heuristic search to adversary search impacts game search processes as follows:
Objective:
Ordinary Heuristic Search: Seeks optimal solutions in single-player problems.
Adversary Search: Focuses on competitive gameplay in two-player, zero-sum games.
Opponent Consideration:
Ordinary Heuristic Search: Assumes no opponent.
Adversary Search: Considers both players’ strategies, one maximizing and the other minimizing outcomes.
Algorithm Type:
Ordinary Heuristic Search: Uses A* or hill climbing.
Adversary Search: Utilizes game-specific algorithms like MiniMax and alpha-beta pruning.
Zero-sum games are relevant to adversary search because they create a balanced framework for competitive gameplay, where one player’s gain corresponds to the opponent’s loss, making it crucial in evaluating strategies and outcomes.
Briefly outline the MiniMax search algorithm.
The MiniMax search algorithm is a decision-making algorithm used in two-player, zero-sum games. Here’s a brief outline of how it works:
Objective: Find the optimal move for a player while considering the opponent’s best responses.
Key Concepts:
Maximizing Player: The player trying to maximize their outcome.
Minimizing Player: The opponent trying to minimize the maximizing player’s outcome.
Game Tree: Represents all possible moves and their consequences, branching out from the current game state.
Terminal Nodes: End points of the game tree, representing game outcomes (win, lose, or draw).
Algorithm Steps:
Start at the current game state.
Generate all possible legal moves for the maximizing player.
For each move, recursively apply the MiniMax algorithm to determine the opponent’s best response.
The maximizing player selects the move that leads to the maximum possible outcome (highest score) based on the opponent’s best responses.
The algorithm continues to explore and evaluate moves until reaching a terminal node.
At terminal nodes, assign scores to reflect the game outcome (e.g., +1 for a win, -1 for a loss, 0 for a draw).
Backpropagate these scores up the tree, alternating between maximizing and minimizing players, until the root node is reached.
The maximizing player chooses the move at the root node with the highest score as the optimal move.
Outcome: MiniMax provides a strategy for the maximizing player to make decisions that maximize their chances of success, considering the opponent’s best responses. It assumes that both players play optimally, resulting in a best-case scenario for the maximizing player and a worst-case scenario for the minimizing player.
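A minimal sketch of MiniMax over a hand-built game tree, where internal nodes are lists of children and leaves are outcome scores from the maximizing player’s point of view; the tree is an illustrative assumption:

def minimax(node, maximizing):
    if isinstance(node, (int, float)):     # terminal node: return its score
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Root (max) -> two opponent (min) nodes -> leaf outcomes
tree = [[3, 5], [2, 9]]
print(minimax(tree, maximizing=True))      # 3 = max(min(3, 5), min(2, 9))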
Describe any two selection operators. Briefly compare and
contrast their advantages and/or disadvantages.
Here are two selection operators used in genetic algorithms, along with a brief comparison of their advantages and disadvantages:
Roulette Wheel Selection:
Advantages:
Simple and computationally efficient.
Maintains diversity by giving a chance to less fit individuals.
Disadvantages:
Biased towards fitter individuals, leading to premature convergence.
Inefficient for highly imbalanced fitness landscapes.
Tournament Selection:
Advantages:
Control over selection pressure with tournament size.
Provides diversity by allowing less fit individuals to be chosen.
Disadvantages:
Introduces randomness due to random tournament selection.
Smaller tournaments may not exploit high-fitness individuals, while larger ones reduce exploration.
In summary, Roulette Wheel Selection is simple but biased, while Tournament Selection offers control and diversity but introduces randomness. Choice depends on problem characteristics and desired trade-offs.
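A minimal sketch of both operators, assuming a small population with precomputed (illustrative) fitness values:

import random

population = ["p1", "p2", "p3", "p4"]
fitness = {"p1": 1.0, "p2": 4.0, "p3": 2.0, "p4": 3.0}   # illustrative fitness scores

def roulette_wheel(pop):
    # Selection probability proportional to fitness
    return random.choices(pop, weights=[fitness[i] for i in pop], k=1)[0]

def tournament(pop, size=2):
    # Pick `size` individuals at random and keep the fittest of them
    return max(random.sample(pop, size), key=fitness.get)

print(roulette_wheel(population), tournament(population))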
What is meant by a “random walk” of a graph?
A “random walk” of a graph refers to a stochastic process where a walker moves from one vertex (node) to another within the graph.
Each step in the walk is determined randomly based on the edges connecting the current node to its neighbors.
It can model various scenarios, including exploring networks, simulating diffusion, or analyzing Markov chains.
Random walks help analyze graph properties, like connectivity and centrality, and can be used in algorithms such as PageRank for web page ranking.
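A minimal sketch of a random walk over a small directed graph; the graph is an illustrative assumption, and visit counts over a long walk approximate each node’s importance, which is the idea PageRank builds on:

import random

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A", "B"]}   # adjacency lists

def random_walk(start, steps):
    node, visited = start, []
    for _ in range(steps):
        visited.append(node)
        node = random.choice(graph[node])   # follow a random outgoing edge
    return visited

walk = random_walk("A", 1000)
print({n: walk.count(n) for n in graph})    # frequently visited nodes are "important"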
What is tf-idf? What is its relevance to the indexing and retrieval
processes of a document search engine?
TF-IDF (Term Frequency-Inverse Document Frequency):
A numerical statistic in information retrieval.
Combines term frequency (TF) and inverse document frequency (IDF) to evaluate a term’s importance in a document within a collection.
Relevance to Indexing and Retrieval:
Indexing: TF-IDF helps create an index by assigning scores to terms in documents.
Retrieval: It aids in ranking documents based on query relevance; documents with higher TF-IDF scores for query terms are considered more relevant.
Balances term frequency (importance within a document) and document frequency (importance across the collection).
Improves search engine precision by identifying relevant documents while reducing the impact of common terms.
Describe how you might use the A* algorithm to plan the
navigation path for a real (or a simulated) mobile robot.
A* Algorithm for Robot Path Planning:
Define the robot’s current position and the goal location.
Create a grid or graph representing the environment with obstacles.
Assign costs to grid cells or graph nodes based on distance to the goal and obstacle avoidance.
Initialize open and closed lists to manage explored and unexplored positions.
Apply the A* algorithm to iteratively expand nodes, prioritizing lower-cost paths.
Generate a path from the start to the goal by backtracking from the goal node to the start node using parent pointers.
Execute robot movements according to the calculated path while avoiding obstacles.
Periodically update the path in dynamic environments or when re-planning is necessary.
Describe, in detail, one specific heuristic function for one of the
following problems: Farmer-Fox-Goose problem, 8 tile puzzle,
15 tile puzzle, Blast-search, OXO, Chess or Draughts. How
would you represent a solution? For what features would your
heuristic return good values? How can these features be
detected from your representation?
Specific Heuristic Function for 15 Tile Puzzle:
Problem Description: In the 15 tile puzzle, you have a 4x4 grid with 15 numbered tiles and one empty space. The goal is to rearrange the tiles from an initial configuration to a target configuration by sliding them into the empty space.
Heuristic Function:
One heuristic function is the “Manhattan Distance”:
Calculate the sum of the Manhattan distances (also called L1 distances) between each tile’s current position and its goal position.
The Manhattan distance for a tile is the sum of the horizontal and vertical distances between its current and goal positions.
Representation of Solution:
The solution is represented as a sequence of moves (up, down, left, right) to transform the initial configuration into the target configuration.
Features for Good Heuristic Values:
The heuristic returns good values based on the following features:
The sum of the Manhattan distances of each tile to its goal position.
The number of misplaced tiles (tiles not in their goal positions).
Detecting Features:
To calculate the Manhattan distances, measure the horizontal and vertical distances between each tile’s current and goal positions.
Count the number of tiles that are not in their goal positions to detect the number of misplaced tiles.
Advantages:
The Manhattan Distance heuristic provides admissible and consistent values, ensuring it never overestimates the number of moves required.
It effectively guides the search towards optimal solutions.
This heuristic function for the 15 tile puzzle is based on measuring the displacement of tiles from their goal positions, allowing it to estimate the remaining effort needed to reach the solution efficiently.
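A minimal sketch of the Manhattan-distance heuristic, representing a board state as a tuple of 16 entries read row by row with 0 marking the blank; the sample state is illustrative:

def manhattan(state, goal):
    total = 0
    for tile in range(1, 16):                      # the blank (0) is not counted
        i, j = state.index(tile), goal.index(tile)
        total += abs(i // 4 - j // 4) + abs(i % 4 - j % 4)
    return total

goal = tuple(range(1, 16)) + (0,)
state = (1, 2, 3, 4,
         5, 6, 7, 8,
         9, 10, 0, 12,
         13, 14, 11, 15)                           # tiles 11 and 15 each one square off
print(manhattan(state, goal))                      # 2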
How does the MiniMax search algorithm produce better
solutions to adversarial search problems, than “ordinary”
heuristic search?
MiniMax vs. Ordinary Heuristic Search:
Adversarial Search Focus:
MiniMax specifically designed for two-player, zero-sum games with opponents.
Ordinary heuristic search primarily for single-player optimization problems.
Opponent Consideration:
MiniMax considers the opponent’s actions and minimizes their potential gains.
Ordinary heuristic search assumes no opponent or external factors.
Optimal Strategies:
MiniMax aims to find optimal strategies by minimizing the worst-case scenario for the maximizing player.
Ordinary heuristic search may not focus on game theory concepts like optimal strategies.
Depth of Search:
MiniMax often explores deeper into the game tree, considering various moves and counter-moves.
Ordinary heuristic search may not delve as deep into the search space.
Relevance to Adversarial Games:
MiniMax is specifically tailored for competitive gameplay, making it a better choice for adversarial search problems.
Ordinary heuristic search is better suited for non-competitive, single-player scenarios.
In summary, MiniMax is designed for adversarial games, where considering opponents and finding optimal strategies is crucial. It explores deeper into the game tree, making it more effective in such scenarios compared to ordinary heuristic search.
Describe in detail how you would go about the task of building
up a large index of web documents, starting from just one URL.
Describe each of the main processes and structures used in
your answer.
Web Crawling:
Start with the initial URL.
Use web crawling algorithms to retrieve web pages linked from the initial URL.
Follow hyperlinks to explore additional web pages.
Document Parsing:
Retrieve the content of each web page.
Parse the HTML or other document formats to extract text and metadata (e.g., title, keywords).
Remove HTML tags, scripts, and other non-text elements.
Text Preprocessing:
Tokenize the extracted text into words or phrases.
Remove stop words (common words like “the” and “and”).
Apply stemming or lemmatization to reduce words to their base form.
Indexing:
Create an inverted index data structure.
Map terms (words or phrases) to the documents they appear in.
Store metadata (e.g., document ID, URL) for each indexed document.
Assign term weights using techniques like TF-IDF.
Storing and Managing Documents:
Store the indexed documents in a database or file system.
Keep track of document metadata, such as last modified date.
Implement a mechanism to handle document updates and deletions.
Web Crawler Control:
Implement policies to control the crawling process (e.g., politeness, rate limiting).
Avoid crawling duplicate or already indexed content.
Respect robots.txt files to adhere to website access guidelines.
Scaling and Distributed Computing:
To handle a large volume of web documents, use distributed computing and storage solutions.
Implement load balancing to distribute crawling tasks efficiently.
Error Handling:
Develop error handling mechanisms to deal with broken links, inaccessible websites, and network errors.
Retry policies may be necessary to ensure thorough coverage.
Continuous Updating:
Implement periodic re-crawling and updating of indexed documents to maintain freshness.
Monitor changes in web content and update the index accordingly.
Search Interface:
Create a user-friendly search interface to query the indexed documents.
Implement ranking algorithms to retrieve relevant results.
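A minimal sketch of the crawl-parse-index loop, assuming the third-party requests and beautifulsoup4 packages are available; a production crawler would add politeness delays, robots.txt handling, deduplication and persistent storage:

from collections import defaultdict, deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def build_index(seed_url, max_pages=20):
    index = defaultdict(set)                 # term -> set of URLs (inverted index)
    frontier, seen = deque([seed_url]), set()
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                          # error handling: skip unreachable pages
        soup = BeautifulSoup(html, "html.parser")
        for term in soup.get_text().lower().split():
            index[term].add(url)              # index the page's text
        for link in soup.find_all("a", href=True):
            frontier.append(urljoin(url, link["href"]))   # follow hyperlinks outward
    return index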
Using an example or otherwise, outline how heuristic search
can produce superior results to “blind” search.
Heuristic Search vs. Blind Search:
Heuristic Search:
Utilizes heuristics or rules to guide the search towards promising solutions.
Incorporates domain-specific knowledge to estimate the desirability of states.
Can significantly improve search efficiency.
Blind Search (Uninformed Search):
Searches without specific guidance or heuristics.
Explores the search space exhaustively, often leading to inefficiency.
Example: Finding a Path in a Maze:
Heuristic Search (A*):
Utilizes heuristics such as Manhattan distance to estimate the distance from the current state to the goal.
Prioritizes exploring states with lower estimated distances.
Efficiently finds the optimal path while avoiding unnecessary exploration.
Blind Search (Depth-First Search):
Explores paths without guidance, often backtracking and revisiting states.
Inefficient for large mazes, leading to slower convergence and suboptimal paths.
Superior Results of Heuristic Search:
Heuristic search, like A*, uses domain knowledge to make informed decisions.
It focuses on promising paths, avoiding wasted exploration.
This results in faster convergence to optimal or near-optimal solutions compared to blind search in complex problem spaces.
Describe in detail one heuristic you would use for the 15-tile
puzzle. Make reference to specific features of the solution
space that your solution method would focus upon.
Heuristic for the 15-Tile Puzzle:
Heuristic Function: The Manhattan Distance Heuristic
Features of the Solution Space Focused Upon:
Tile Misplacement: It calculates the sum of Manhattan distances (L1 distances) between each tile’s current position and its goal position.
Distance Estimation: The heuristic estimates the total distance to move each tile to its goal location.
Localization: It considers the spatial arrangement of tiles and how far they are from their intended positions.
Operation:
For each tile, calculate the Manhattan distance by summing the horizontal and vertical distances between its current position and its goal position.
Sum the Manhattan distances for all tiles in the puzzle.
This sum represents an estimate of the minimum number of moves required to reach the goal state.
The heuristic guides the search algorithm to prioritize moves that reduce the total Manhattan distance.
Advantages:
It provides an admissible heuristic, never overestimating the required moves.
Effectively guides the search towards optimal solutions.
Focuses on tile displacement and alignment with the goal state, which are critical features of the solution space.
The Manhattan Distance Heuristic for the 15-Tile Puzzle is an effective and widely used heuristic that leverages the spatial relationships between tiles to estimate the distance to the goal state accurately, aiding in the efficient exploration of the solution space.
D value of 2.
Outline the main differences between MinMax search and the
addition of α-β pruning (alpha-beta pruning).
MinMax Search:
Algorithm for two-player, zero-sum games.
Evaluates all possible game states to find the best move.
Considers the entire game tree, which can be computationally expensive.
Guarantees optimal play but can be inefficient for large trees.
Alpha-Beta Pruning:
Enhancement to MinMax search.
Reduces the number of evaluated nodes in the game tree.
Introduces alpha (lower bound) and beta (upper bound) values to cut off branches of the tree that cannot influence the final decision.
Significantly reduces search space, leading to faster and more efficient evaluation.
Preserves the same optimal results as MinMax but with fewer node evaluations.
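A minimal sketch of alpha-beta pruning over a small hand-built game tree (internal nodes are lists of children, leaves are scores for the maximizing player); the tree is illustrative, and with it the pruning step skips one leaf entirely:

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    if isinstance(node, (int, float)):       # leaf: return its score
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                # cut-off: the minimizer avoids this branch
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:                    # cut-off: the maximizer avoids this branch
            break
    return value

tree = [[3, 5], [2, 9]]                      # the leaf 9 is pruned and never evaluated
print(alphabeta(tree, maximizing=True))      # 3, the same answer plain MinMax gives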
Compare and contrast the two information retrieval metrics of
Precision and Recall.
Precision and Recall Comparison:
Definition:
Precision: Measures the relevancy of retrieved documents. It calculates the ratio of relevant documents retrieved to the total documents retrieved.
Recall: Measures the coverage of relevant documents. It calculates the ratio of relevant documents retrieved to the total relevant documents in the collection.
Focus:
Precision: Focuses on the quality of retrieved results, aiming to minimize false positives.
Recall: Focuses on the quantity of relevant results, aiming to minimize false negatives.
Trade-off:
Precision-Recall Trade-off: There is often an inverse relationship between precision and recall. Increasing precision may decrease recall and vice versa.
Use Cases:
Precision: Important in scenarios where false positives can have severe consequences (e.g., medical diagnoses).
Recall: Important in scenarios where missing relevant results is costly (e.g., information retrieval systems).
Formula:
Precision: (True Positives) / (True Positives + False Positives)
Recall: (True Positives) / (True Positives + False Negatives)
Objective:
Precision: Maximize precision when false positives are undesirable.
Recall: Maximize recall when missing relevant results is unacceptable.
Interplay:
Balancing precision and recall is critical, and the F1-score is often used to find a compromise between the two metrics.
In summary, precision and recall are both essential metrics in information retrieval, with different focuses and trade-offs. Precision emphasizes result quality, while recall emphasizes result quantity, and they are often considered together to evaluate the performance of retrieval systems.
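A minimal sketch computing the two metrics (and the F1-score) from sets of retrieved and relevant document IDs; the IDs are illustrative:

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}

tp = len(retrieved & relevant)           # true positives: relevant documents retrieved
precision = tp / len(retrieved)          # 2 / 4 = 0.5
recall = tp / len(relevant)              # 2 / 3 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 2))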
What is the significance of differentiating between in-links and
out-links for document ranking on the web? Illustrate your
answer using an appropriate example.
Significance of In-Links and Out-Links:
In-Links (Backlinks):
Represent links from other web pages pointing to a specific page.
Indicate the page’s authority and relevance within the web ecosystem.
Used to assess a page’s importance and trustworthiness.
Out-Links (Outbound Links):
Represent links from a specific page to other web pages.
Indicate the content’s context and reference to related topics.
Can influence the ranking of linked pages and contribute to information flow.
Illustration with Example:
Example:
Consider two web pages, Page A and Page B.
Page A has numerous high-quality in-links from authoritative websites in the same domain.
Page B has numerous out-links to relevant, reputable sources on the topic it covers.
Significance:
Page A is likely considered an authoritative source due to its valuable in-links, potentially ranking higher in search results.
Page B, while not an authority itself, contributes to the web’s information flow by referencing authoritative sources. This can indirectly benefit its ranking by enhancing its context and credibility.
In summary, differentiating between in-links and out-links is significant in web document ranking. In-links indicate a page’s authority, while out-links contribute to context and information flow, influencing a page’s relevance and credibility in search engine rankings.
What is tf-idf? How does it help identify those web-pages that
are most relevant to a given web query.
TF-IDF (Term Frequency-Inverse Document Frequency):
A numerical statistic used in information retrieval.
Combines term frequency (TF) and inverse document frequency (IDF) to evaluate a term’s importance in a document within a collection.
Relevance to Web Page Ranking:
Term Frequency (TF):
Measures how often a term appears in a web page.
High TF for query terms indicates their importance within a page.
Inverse Document Frequency (IDF):
Measures how rare a term is across the entire collection of web pages.
High IDF for query terms indicates their uniqueness.
Ranking Relevance:
TF-IDF assigns higher scores to web pages where query terms have high TF and IDF.
Pages with query terms occurring frequently within them but not too common across the entire collection are considered more relevant.
Example:
For a query containing “artificial intelligence” and “machine learning,” a web page mentioning these terms frequently (high TF) but not overly common across the web (high IDF) is ranked higher, as it is more relevant to the query.
In summary, TF-IDF helps identify the most relevant web pages for a given query by considering both the term’s frequency within a page (TF) and its rarity across the entire collection (IDF), allowing it to prioritize pages with contextually important terms.
Why does Google’s PageRank algorithm approximate the
behaviour of a theoretical surfer, rather than focusing on the
behaviour of an actual surfer of the web?
PageRank Approximates a Theoretical Surfer:
Scalability: Modeling the behavior of every actual web surfer is impractical due to the immense scale of the web.
Consistency: A theoretical surfer consistently explores links and provides a more predictable framework for ranking web pages.
Algorithmic Efficiency: Computationally, simulating a theoretical surfer is more efficient than tracking the diverse behaviors of real users.
Focus on Link Structure: PageRank’s focus is on analyzing the web’s link structure and identifying authoritative pages, which aligns with a theoretical surfer’s goal.
Relevance to Web Ranking:
PageRank’s simplified model still yields valuable insights into web page importance and relevance based on link analysis, helping rank web pages effectively.
In summary, PageRank approximates a theoretical surfer’s behavior for scalability, consistency, and computational efficiency while still providing valuable insights for web page ranking based on link structure.
What is K nearest neighbours (KNN) and how are they used to
solve problems?
K Nearest Neighbors (KNN):
A supervised machine learning algorithm.
Used for classification and regression tasks.
Operation:
Classifies or predicts a data point based on the majority class or average value of its K nearest data points in the training dataset.
Utilizes a distance metric (e.g., Euclidean distance) to determine proximity.
Usage:
Classification: Assigns a class label to a data point based on the class labels of its K nearest neighbors.
Regression: Predicts a numerical value based on the average of the values of its K nearest neighbors.
Commonly used for pattern recognition, recommendation systems, and anomaly detection.
Parameters:
K value determines the number of neighbors considered.
Choice of K impacts algorithm performance and robustness.
Pros:
Simple and intuitive.
Works well for small to medium-sized datasets.
Non-parametric and adaptable to different data distributions.
Cons:
Sensitive to noisy data.
May require careful preprocessing and tuning.
Slower for large datasets due to calculating distances for all data points.
In summary, KNN is a versatile algorithm used for classification and regression tasks by considering the class or values of the nearest neighbors of a data point, making it suitable for various machine learning applications.
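A minimal sketch of KNN classification in two dimensions; the training points, labels and choice of K are illustrative assumptions:

import math
from collections import Counter

# (feature vector, class label) training pairs
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((4.5, 3.8), "B")]

def knn_classify(x, k=3):
    # Sort training points by distance to x, then take a majority vote among the k nearest
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((1.1, 0.9)))   # 'A': two of its three nearest neighbours are class A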
The solutions space used by KNN can be thought of a
tessellation of that space, creating a Voronoi diagram. Justify
this claim using a diagram to illustrate your answer.
Tessellation of Solution Space:
In KNN, the solution space is divided into regions based on the K nearest neighbors.
Each region is centered around a data point from the training set.
These regions collectively create a tessellation of the solution space.
Voronoi Diagram Illustration:
A Voronoi diagram consists of polygons (Voronoi cells) where each polygon represents an area closest to a specific data point.
These polygons correspond to the regions in KNN, centered around the training data points.
The Voronoi diagram visually represents how KNN classifies or predicts based on the closest neighbors.
Justification:
KNN assigns a data point to the class or value most prevalent among its K nearest neighbors.
This corresponds to selecting the Voronoi cell associated with the closest training data point.
The tessellation created by Voronoi cells captures the influence regions of each training data point on the solution space.
In summary, the tessellation of the solution space in KNN resembles a Voronoi diagram, where each region corresponds to a Voronoi cell centered around a training data point. This diagram illustrates how KNN makes classification or regression decisions based on proximity to training data.
Outline how the KNN algorithm might be applied to the problem
of retrieving source code, for possible reuse.
Applying KNN to Source Code Retrieval:
Data Representation:
Represent source code files as numerical feature vectors.
Features can include code structure, comments, variable names, and more.
Training Data:
Build a training dataset containing labeled source code snippets.
Labels indicate the purpose, functionality, or category of each snippet.
Distance Metric:
Choose an appropriate distance metric (e.g., cosine similarity or Euclidean distance) to measure similarity between code snippets.
KNN Configuration:
Determine the value of K (number of nearest neighbors) based on experimentation and dataset size.
Retrieval Process:
Given a new source code snippet for reuse, convert it into a feature vector.
Calculate distances to all snippets in the training dataset.
Select the K nearest neighbors (most similar code snippets).
Decision Making:
For classification purposes, determine the majority class among the K neighbors to classify the new code snippet.
For retrieval, return the K most similar snippets to the user.
Evaluation:
Measure the algorithm’s performance using evaluation metrics like precision, recall, or F1-score.
Refine the model and parameters as needed for better retrieval results.
Advantages:
KNN can capture code similarity based on various features, facilitating code reuse.
It can accommodate different programming languages and coding styles.
Challenges:
Choosing relevant features and an appropriate distance metric is crucial for accurate retrieval.
Handling large codebases may require efficient data structures and search algorithms.
Handling semantic understanding and contextual code relevance is a more complex problem.
In summary, applying KNN to source code retrieval involves feature representation, training with labeled data, and using a distance metric to find the most similar code snippets in a training dataset for reuse or classification.
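A minimal sketch of the retrieval step, representing each snippet as a bag of tokens and ranking stored snippets by cosine similarity to the query snippet; the snippets and the crude tokenisation are illustrative assumptions:

import math
from collections import Counter

snippets = {
    "sort_list": "def sort_list(xs): return sorted(xs)",
    "read_file": "def read_file(path): return open(path).read()",
    "sum_list": "def sum_list(xs): return sum(xs)",
}

def vectorise(code):
    # Crude tokenisation: split on whitespace after removing brackets
    return Counter(code.replace("(", " ").replace(")", " ").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query_code, k=2):
    q = vectorise(query_code)
    ranked = sorted(snippets, key=lambda name: cosine(q, vectorise(snippets[name])), reverse=True)
    return ranked[:k]                 # the k most similar stored snippets

print(retrieve("def total(xs): return sum(xs)"))   # ['sum_list', 'sort_list']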
Briefly describe how heuristic search can produce superior
results to “blind” search.
Heuristic Search vs. Blind Search:
Guidance vs. No Guidance:
Heuristic search uses heuristics or rules to guide the search towards promising solutions.
Blind search explores without specific guidance, often in a brute-force manner.
Efficiency vs. Inefficiency:
Heuristic search focuses on relevant paths, reducing unnecessary exploration.
Blind search can be inefficient for large or complex search spaces.
Optimal vs. Suboptimal Solutions:
Heuristic search aims for optimal or near-optimal solutions using domain-specific knowledge.
Blind search may find solutions but doesn’t prioritize optimality.
Example: Maze Solving:
In a maze, heuristic search may use distance to the goal as a heuristic, guiding towards the shortest path.
Blind search may explore all possible paths, including longer ones, before finding a solution.
In summary, heuristic search’s guidance and domain-specific knowledge lead to more efficient exploration and the potential for optimal solutions compared to blind search in complex problem spaces.
Describe in detail the A* search algorithm. You may find it
useful to use examples or diagrams to present your answer
Initialization: Start with an open list containing only the initial node and a closed list that is empty.
Loop: Repeat the following steps until the open list is empty:
Choose Node: Select the node with the lowest score from the open list. The score is calculated as
f(n)=g(n)+h(n), where
g(n) is the cost to reach the node, and
h(n) is the estimated cost from the node to the goal (heuristic).
Goal Check: If the chosen node is the goal, backtrace the path from the goal to the start and return this path.
Generate Successors: Identify all the neighbors of the current node and calculate their scores.
Update Lists: For each neighbor, if it is not in the open list, add it. If it is already in the open or closed list with a higher score, update it with the lower score and change its parent to the current node.
Move Node: Move the current node from the open list to the closed list.
Completion: The algorithm terminates when either the goal is found (and the path is returned) or the open list is empty (indicating no path).