Text Representation Flashcards
What are information retrieval systems? What are their typical issues?
• Information retrieval deals with the problem of locating relevant documents with respect to the
user input or preference
• Typical systems
– Online library catalogs
– Online document management systems
• Typical issues
– Management of unstructured documents
– Approximate search
– Relevance
What are the two types of information retrieval systems?
• Pull Mode (search engines)
– Users take initiative
– Ad hoc information need
• Push Mode (recommender systems)
– Systems take initiative
– Stable information need or system has good knowledge about a user’s need
What are the methods used for information retrieval systems?
• Document Selection (keyword-based retrieval)
– Query defines a set of requisites
– Only the documents that satisfy the query are returned
– A typical approach is the Boolean Retrieval Model
• Document Ranking (similarity-based retrieval)
– Documents are ranked on the basis of their relevance with respect to the user query
– For each document a “degree of relevance” is computed with respect to the query
– A typical approach is the Vector Space Model
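For concreteness, here is a minimal Python sketch of keyword-based document selection in the spirit of the Boolean Retrieval Model mentioned above; the toy documents, the inverted index, and the query are invented for illustration.

```python
# Minimal sketch of Boolean (keyword-based) retrieval on a toy corpus.
# Documents and query terms are illustrative only.
docs = {
    1: "information retrieval with boolean queries",
    2: "ranking documents in the vector space model",
    3: "boolean model returns exact matches only",
}

# Build an inverted index: word -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# Boolean query "boolean AND model": intersect the posting sets,
# so only documents satisfying every requisite are returned.
result = index.get("boolean", set()) & index.get("model", set())
print(sorted(result))  # -> [3]
```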
How could we calculate the precision, recall, and F-score of an information retrieval system?
The precision is the size of the intersection between the relevant documents and the retrieved documents, as a percentage of the total number of documents retrieved.
The recall is the size of the same intersection, but relative to the total number of relevant documents (the percentage of relevant information that is retrieved).
F-score = (2 × precision × recall) / (precision + recall)
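A small worked example of these three quantities, assuming made-up sets of relevant and retrieved document ids:

```python
# Toy precision / recall / F-score computation for one retrieval run.
# The relevant and retrieved sets below are illustrative only.
relevant = {1, 2, 3, 5, 8}
retrieved = {2, 3, 4, 8, 9, 10}

hits = relevant & retrieved                     # relevant documents that were retrieved
precision = len(hits) / len(retrieved)          # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)              # fraction of relevant docs that were retrieved
f_score = 2 * precision * recall / (precision + recall)

print(precision, recall, f_score)               # -> 0.5, 0.6, ~0.545
```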
What is the Bag-of-words representation? What is it used for?
• Text is represented as a table
– Columns (features/variables) identify vocabulary words
– Rows (examples/data points) identify documents
• The values can represent
– The presence of the word
– The frequency of the word inside the document
– …
• It is a sparse representation: a document usually contains far fewer words than the entire vocabulary, so most values in the vector are zeros
• Vector space model
– Documents thus become vectors in the feature space
The Bag-of-Words (BoW) representation is a simple and widely used technique in natural language processing (NLP) and text analysis. It represents text data as a collection of words, disregarding grammar, word order, and context, while keeping track of word occurrences.
In this representation, a document (or text) is converted into a “bag” containing the unique words that appear in it. Each word is treated as an independent feature, and the frequency of each word in the document is recorded.
Uses of Bag-of-Words:
1. Text Classification: BoW is often used in applications like spam detection, sentiment analysis, and topic categorization, where understanding word presence or frequency helps classify text.
2. Information Retrieval: In search engines, it helps match documents to user queries by comparing the words present.
3. Feature Extraction for Machine Learning: BoW transforms text into numerical data, which can then be used as input for machine learning models.
4. Clustering and Similarity: It can be used to measure similarity between documents, such as in recommendation systems or plagiarism detection.
Despite its simplicity, BoW can overlook the context or meaning of words, which is why it is often supplemented or replaced by more advanced techniques like TF-IDF, word embeddings, or transformer-based models.
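A minimal sketch of building a bag-of-words table by hand, with word counts as the values; the two example sentences are illustrative only:

```python
from collections import Counter

# Minimal hand-built bag-of-words table (no external libraries).
docs = ["the cat sat on the mat", "the dog sat on the log"]

# The vocabulary (the columns) is the set of all words seen in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document (a row) becomes a vector of word counts aligned with the vocabulary.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts.get(word, 0) for word in vocab])

print(vocab)
for row in rows:
    print(row)
# Most entries are zero for realistic vocabularies, which is why
# the bag-of-words representation is sparse.
```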
How could we represent texts in a way that we could compare them? What are some of the issues of this approach?
• A document is a vector in a high-dimensional space whose dimensions correspond to the keywords
• Relevance is measured with an appropriate similarity measure defined over the vector space
• Issues
– How to select keywords to capture “basic concepts”?
– How to assign weights to each term?
– How to measure the similarity?
What is inverse document frequency?
• It measures how much information the word provides (how surprising it is to find it in a document)
• Penalizes words that frequently occur in many documents
• IDF(w) = log(M / k), where k is the number of documents in which w appears and M is the total number of documents
• When w does not appear in the corpus, the formula leads to a division by zero, so the adjusted version with (k + 1) in the denominator is used: IDF(w) = log(M / (k + 1))
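A short sketch of this computation, assuming a natural logarithm and the smoothed (k + 1) variant described above; the toy corpus is invented for illustration:

```python
import math

# Illustrative IDF computation: IDF(w) = log(M / k),
# with the smoothed variant log(M / (k + 1)) for words that may be absent.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a quiet evening at home",
]
M = len(docs)  # number of documents in the corpus

def idf(word, smoothed=False):
    k = sum(1 for doc in docs if word in doc.split())  # documents containing the word
    return math.log(M / (k + 1)) if smoothed else math.log(M / k)

print(idf("the"))           # frequent word -> low IDF (little information)
print(idf("mat"))           # rare word -> higher IDF
print(idf("zebra", True))   # unseen word -> smoothing avoids division by zero
```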
How could we calculate the similarity between documents?
To calculate the similarity between documents, you can use various techniques depending on how the text is represented. Common methods include:
- Cosine Similarity
  • How it works: Treat each document as a vector in a multidimensional space (e.g., using a Bag-of-Words or TF-IDF representation). Cosine similarity measures the cosine of the angle between two vectors.
  • Why it’s useful: It focuses on the orientation (word distribution) rather than the magnitude (length) of the vectors, making it effective for comparing text documents.
- Jaccard Similarity
  • How it works: Compares the overlap between two sets of words. The similarity is calculated as the size of the intersection of the sets divided by the size of their union.
  • Why it’s useful: This method works well when you are interested in how much the documents overlap in terms of the words they contain.
- Euclidean Distance
  • How it works: Measures the straight-line distance between two document vectors in a multidimensional space.
  • Why it’s useful: Simpler but sensitive to document length, so it’s often used with normalized data.
- Manhattan Distance
  • How it works: Sums the absolute differences between corresponding elements of two document vectors.
  • Why it’s useful: Like Euclidean distance, it measures the literal difference, often used in sparse data.
- Soft Similarity
  • How it works: Extends cosine or Jaccard similarity by considering semantic similarity between words (e.g., synonyms or related terms). This often involves embeddings like Word2Vec or GloVe.
  • Why it’s useful: Captures contextual meaning rather than just raw word matches.
- Dynamic Time Warping (DTW)
  • How it works: Used when comparing sequences of words or sentences. It calculates the optimal alignment between two sequences, minimizing the “distance” between them.
  • Why it’s useful: Effective for documents with a temporal or sequential aspect.
Choosing the Method:
• Simple Word Matches: Use cosine or Jaccard similarity for basic document representations (e.g., BoW or TF-IDF).
• Semantic Meaning: Use methods involving word embeddings or neural models if word context or meaning is critical.
• Sequence Alignment: Use DTW or similar methods for sequential comparisons.
For real-world applications, cosine similarity with TF-IDF is a common starting point because it balances simplicity and effectiveness.
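Here is a sketch of that common starting point (TF-IDF plus cosine similarity), assuming scikit-learn is available; the example sentences are made up and the library choice is an assumption rather than something prescribed by these notes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity over a TF-IDF representation of a toy corpus.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # rows = documents, columns = terms
sims = cosine_similarity(tfidf)                 # 3 x 3 matrix of document similarities

print(sims.round(2))
# The first two documents share vocabulary, so they score much higher
# against each other than against the third.
```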
What are some important preprocessing methods to apply before text representation?
• Preliminary Steps
– Remove non-content related information, e.g. <HTML> tags
– Lowercase the text (can lose information: e.g. ‘WHO’ vs ‘who’)
– Remove punctuation
• Stop words elimination
– Stop words are elements that are considered uninteresting with respect to the retrieval and thus are
eliminated. For instance, “a”, “the”, “always”, “along”
• Word stemming/Lemmatization
– Different words that share a common prefix are simplified and replaced by their common prefix
– For instance, “computer”, “computing”, “computerize” are replaced by “comput”; “fishing” with “fish”, etc.
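A rough sketch of this pipeline on one invented sentence, assuming NLTK is available for Porter stemming and using a tiny hand-written stop-word list for illustration:

```python
import re
from nltk.stem import PorterStemmer   # assumes NLTK is installed

# Illustrative preprocessing: tag removal, lowercasing, punctuation removal,
# stop-word elimination, and stemming.
text = "<p>Fishing and Computing are ALWAYS fun along the river!</p>"

text = re.sub(r"<[^>]+>", " ", text)           # remove HTML tags
text = text.lower()                            # lowercase (may lose 'WHO' vs 'who')
text = re.sub(r"[^\w\s]", " ", text)           # remove punctuation

stop_words = {"and", "are", "always", "along", "the", "a"}   # tiny illustrative list
tokens = [t for t in text.split() if t not in stop_words]    # stop-word elimination

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])       # -> ['fish', 'comput', 'fun', 'river']
```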
What are word embeddings? How are they obtained?
• Word Embeddings
– Dense representations of words rather than documents
– Stores each word as a point in a continuous space
– A word is represented by a vector with a fixed number of dimensions (typically between 25 and 1000)
– Generated from a huge corpus using supervised methods
– Dimensions are basically projections along different axes
• Embeddings are supervised models
– Trained to predict a missing word based on the surrounding context
– Note that only a few words fit a given context, and those that do are semantically related to each other
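A small sketch of training word embeddings on a toy corpus, assuming gensim (4.x) is available; the sentences, vector size, and other parameters are illustrative, since real embeddings are trained on huge corpora:

```python
from gensim.models import Word2Vec   # assumes gensim >= 4.0 is installed

# Tiny illustrative corpus: each sentence is a list of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "cat", "chased", "the", "dog"],
]

# Each word is mapped to a dense vector of fixed size (here 25 dimensions),
# learned by predicting words from their surrounding context.
model = Word2Vec(sentences, vector_size=25, window=2, min_count=1, epochs=50)

print(model.wv["cat"].shape)                  # fixed-size dense vector, here (25,)
print(model.wv.most_similar("cat", topn=2))   # words used in similar contexts
```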
What are entity embeddings?
• Entity embeddings map categorical variables into Euclidean spaces (the entity embeddings)
• The mapping is learned by a neural network using a supervised training process
• Reduces memory usage and speeds up neural networks compared with one-hot encoding
• Maps similar values close to each other in the embedding space, possibly highlighting intrinsic properties of the categorical variables
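A minimal sketch of an entity embedding learned inside a small PyTorch model; the category names, embedding size, and network are invented for illustration:

```python
import torch
import torch.nn as nn

# Illustrative entity embedding for one categorical variable (day of week).
categories = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
cat_to_idx = {c: i for i, c in enumerate(categories)}

class TinyModel(nn.Module):
    def __init__(self, n_categories, emb_dim=3):
        super().__init__()
        # Replaces one-hot encoding: each category gets a dense emb_dim vector,
        # learned jointly with the rest of the network during supervised training.
        self.embedding = nn.Embedding(n_categories, emb_dim)
        self.out = nn.Linear(emb_dim, 1)

    def forward(self, cat_idx):
        return self.out(self.embedding(cat_idx))

model = TinyModel(len(categories))
idx = torch.tensor([cat_to_idx["sat"], cat_to_idx["sun"]])
print(model(idx).shape)                  # predictions for two categorical values
print(model.embedding.weight.shape)      # the learned embedding table: (7, 3)
```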