Text Mining Flashcards by Phil Pieper

What are bag-of-tokens approaches?

Counting how often each word (token) occurs in a text
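
A minimal sketch of such a count, using only the Python standard library. The helper name `bag_of_tokens` and the simple regex tokenizer are assumptions made for illustration; real pipelines usually use a proper tokenizer.

```python
from collections import Counter
import re

def bag_of_tokens(text):
    """Return a token -> count mapping; word order is discarded."""
    # Lowercase the text and keep runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

counts = bag_of_tokens("The cat sat on the mat. The mat was warm.")
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]
```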

What is the problem with bag-of-tokens approaches?

Loses all order-specific information!
Reduces context information.
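
A small illustration of the lost order information: two sentences with opposite meanings can produce exactly the same bag. The example sentences are made up for this sketch.

```python
from collections import Counter

a = "the dog bites the man".split()
b = "the man bites the dog".split()

# The token counts are identical, although the sentences mean different things.
print(Counter(a) == Counter(b))  # True
# The token sequences themselves differ.
print(a == b)  # False
```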

What is syntax?

ordering of words and its possible effect on meaning

What is semantics?

concerns the (literal) meaning of words, phrases, and sentences

What is pragmatics?

concerns the overall communicative and social context and its effect on interpretation

How does one get from flat text to structure and meaning?

Describe word-level ambiguity

Describe semantics and anaphora resolution

Describe syntactic ambiguity

Describe presupposition and pragmatic inferences

What is syntactic parsing?

Produces the correct syntactic parse tree for a sentence

How many syntactic interpretations does a sentence ending in n prepositional phrases have?

over 2^n
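
The card gives only the bound. As a rough illustration, the sketch below compares 2^n with the Catalan numbers, which are commonly cited as the count of attachment ambiguities for n trailing prepositional phrases. Using C(n+1) here is an assumption that is not stated in the card, and the exact index offset depends on the grammar.

```python
from math import comb

def catalan(k):
    """k-th Catalan number: C(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)

def pp_attachment_parses(n):
    # Assumption: the number of parses for n trailing PPs is Catalan(n + 1).
    return catalan(n + 1)

# Print n, the bound 2^n, and the assumed Catalan count side by side.
for n in range(1, 7):
    print(n, 2 ** n, pp_attachment_parses(n))
```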

What is a context-free grammar?

What is probabilistic structure parsing?

What is shallow natural language processing?

What is morphology?

the field of linguistics that studies the internal structure of words

What is a morpheme?

the smallest linguistic unit that has semantic meaning

What is morphological analysis?

What is Part-of-Speech (POS) Tagging?

What is phrase chunking?

What is semantic role labeling?

What is semantic information extraction (IE)?

What does shallow NLP help with?

e.g.:
* Question Answering
* Text Summarization

What is the difference between information retrieval and information extraction?

# Information Retrieval Models Describe the Boolean model

# Information Retrieval Models Describe the vector space model

What does the vector space model not specify?

Which frequency-based methods exist for assigning weights to words?

How is Raw_TF determined?

Raw_TF = f(t,d): how many times term t appears in doc d
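
A minimal sketch of this definition, assuming the document is already split into tokens. The function name `raw_tf` is chosen for illustration.

```python
def raw_tf(term, doc_tokens):
    """Raw term frequency f(t, d): number of occurrences of term in the document."""
    return doc_tokens.count(term)

doc = "to be or not to be".split()
print(raw_tf("to", doc))  # 2
```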

How is Raw_TF normalized?

How is the inverse document frequency (IDF) computed?

How does TF-IDF weighting work?

What are stop words?

Words that are irrelevant to the meaning of a sentence (e.g. a, the, of, ...)
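
A minimal sketch of stop word filtering. The tiny `STOP_WORDS` set is an illustrative assumption; real stop word lists are much longer and depend on the language and task.

```python
# Tiny illustrative stop word list (assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The history of the city is long".split()
print(remove_stop_words(tokens))  # ['history', 'city', 'long']
```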

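What is stemming?

Stemming is a text preprocessing technique used in natural language processing (NLP) to reduce inflected or derived words to their base or root form, e.g. running, runs -> run. Irregular forms such as ran are generally not mapped to run by suffix stripping; that requires a lemmatizer.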
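
A toy suffix-stripping stemmer to illustrate the idea. This is not the Porter algorithm; the suffix list and the doubled-consonant rule are simplifications assumed for this sketch. Regular inflections are reduced, while an irregular form such as "ran" is left unchanged.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    """Strip one common English suffix; a toy rule set, not the Porter stemmer."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

print([toy_stem(w) for w in ["running", "runs", "ran"]])  # ['run', 'run', 'ran']
```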

Why are stop words removed?

Why is stemming used?

Describe the Porter algorithm

A basic rule-based stemming algorithm for English that strips suffixes in a sequence of steps

What are lemmatizers?

How does Word2Vec (CBOW) work with an MLP?

What is a language model?

Which model is suitable for language models?

How can Elman's model be extended so that long-term dependencies are taken into account? Which additional problem does this solve?

Long Short-Term Memory (LSTM)

What is Long Short-Term Memory in recurrent networks?

What is the vanishing gradient problem?

The problem arises when gradients shrink exponentially as they propagate backward through the layers of the network. As a result, the gradients in the early layers become extremely small or vanish entirely.
* Impaired learning: weight updates in the early layers become negligibly small, which makes learning difficult or impossible
* Difficulty with long-term dependencies: in sequential data, relationships between distant elements cannot be learned effectively
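
A minimal numeric sketch of the effect. It assumes a chain of sigmoid units in which each layer contributes one factor (weight times sigmoid derivative) to the backpropagated gradient; the function names and the choice of weight 1.0 are illustrative. Since the sigmoid derivative is at most 0.25, the product shrinks exponentially with depth.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def backprop_factor(num_layers, pre_activation=0.0, weight=1.0):
    """Product of the per-layer factors that scale a gradient on its way back."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= weight * sigmoid_derivative(pre_activation)
    return factor

# At pre-activation 0 the sigmoid derivative takes its maximum, 0.25.
for depth in (1, 5, 10, 20):
    print(depth, backprop_factor(depth))
```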

What is the main feature of LSTMs?

The main feature of LSTMs is their ability to store and use information over long periods of time, which is made possible by a special cell structure.

What are the main components of an LSTM?

* Cell: stores the cell state, which retains information across time steps.
* Input gate: determines which new information is added to the cell state.
* Forget gate: decides which information is deleted from the cell state.
* Output gate: controls which information from the cell state is used for the next output.
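
How these components interact can be sketched as one time step of a scalar LSTM cell. This follows the standard textbook formulation rather than anything specified in the cards; the parameter layout `p` (gate name mapped to input weight, recurrent weight, bias) and the parameter values are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # forget gate: how much old cell state to keep
    i = sigmoid(pre("input"))         # input gate: how much new information to admit
    g = math.tanh(pre("candidate"))   # candidate values for the cell state
    o = sigmoid(pre("output"))        # output gate: how much of the cell to expose
    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # new hidden state / output
    return h, c

# Illustrative parameters: the same weights for every gate.
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```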

# LSTMs Which activation function does the forget gate use?

# LSTMs Which functions does the input gate use?

# LSTMs How is the cell state updated?

# LSTMs Which functions does the output gate use?

Describe the encoder-decoder architecture for sequence-to-sequence transduction

What is a transformer block in transformer encoder-decoder networks?

What is the attention mechanism?

The attention mechanism assigns different weights to different parts of the input, based on their relevance to the current task. It does so by computing "soft" weights for the numerical representations (embeddings) of the input elements.
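
A minimal sketch of such soft weights in plain Python. It assumes scaled dot-product scoring between one query and a set of keys, which is one common choice and is not specified in the card; the vectors are made-up toy values.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return the soft attention weights and the weighted sum of the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, context

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = attend([1.0, 0.0], keys, values)
print(weights)   # keys aligned with the query receive the larger weights
print(context)
```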

How is a feedforward encoder-decoder architecture structured?

What is masked language modelling?

MLM trains a model to predict masked (hidden) tokens in an input sequence. Randomly selected words in a text are replaced by a special [MASK] token, and the model must learn to reconstruct these masked words from the context.
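
A minimal sketch of the masking step only, without any model. The function name `mask_tokens`, the masking probability, and the seed are assumptions for illustration; the labels record the original token at each masked position, which is what the model would be trained to predict.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace randomly chosen tokens by [MASK]; return masked tokens and labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # prediction target at this position
        else:
            masked.append(tok)
            labels.append(None)   # position is not part of the training loss
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```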

What does masked language modelling remedy?

What is contrastive learning?

Does contrastive learning constrain the positive pairs to be similar?

No!

When is an ability emergent?

The Word2Vec model is a representation of not only syntactic, but also semantic meanings of words. Is that true?

Which of the following techniques can be used as a preprocessing step for text classification? 1. Random Sampling 2. Stopword Removal 3. Feature Scaling 4. Dimensionality Reduction

All of them

Stemmers are generally faster than lemmatizers, but may not always produce a proper dictionary word. Is that true?

What is Latent Semantic Indexing (LSI)?

LSI is a statistical technique that identifies latent relationships between terms in a document corpus by applying a truncated singular value decomposition (SVD) to the term-document matrix.

If we write an information retrieval algorithm that tends to retrieve as many documents as possible for a given query, it usually has a rather low precision and a rather high recall. Is that true?