P3 - Linguistic Nuances & Natural Language Processing Flashcards
What is Natural Language Processing (NLP)?
Definition: The ability of computers to understand and process human language.
Example: Chatbots, voice assistants, and translation tools.
NLP enables chatbots to interpret user input and generate meaningful responses.
Why do Linguistic Nuances Matter?
Challenges: Ambiguity, emotion, slang, and context.
Real-world issues: Misunderstanding user intent or tone (e.g., sarcasm).
The Five Stages of NLP
- Lexical Analysis
- Syntactic Analysis (Parsing)
- Semantic Analysis
- Discourse Integration
- Pragmatic Analysis
What is lexical analysis?
Breaking text into individual words (tokens) and identifying their roles.
Ex: Sentence: “I’ve just been in a car accident.”
- Break the sentence down into individual words (tokens).
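A minimal sketch of this stage (standard-library Python only; the regex is just one illustrative choice, not a prescribed tokeniser): split the example sentence into word tokens while keeping the contraction together and separating the full stop.

```python
import re

sentence = "I've just been in a car accident."

# Keep contractions such as "I've" together, but treat punctuation as its own token.
tokens = re.findall(r"[\w']+|[.,!?]", sentence)
print(tokens)
# ["I've", 'just', 'been', 'in', 'a', 'car', 'accident', '.']
```

Real systems use far more robust tokenisers; this only illustrates the idea of breaking the input into lexical units.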
What is syntactic analysis?
Understanding the structure of sentences (e.g., subject, verb, object).
Ex: Sentence: “I’ve just been in a car accident.”
- Identify grammar and relationships (e.g., subject = “I”).
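A hedged sketch of this stage using the spaCy library (an assumption; the flashcards do not prescribe a tool, and the `en_core_web_sm` model must be installed separately) showing how a dependency parser assigns grammatical roles:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("I've just been in a car accident.")

for token in doc:
    # dep_ is the grammatical relation, e.g. "nsubj" (subject) for "I";
    # head is the word it attaches to, typically the verb "been".
    print(token.text, token.pos_, token.dep_, token.head.text)
```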
What is semantic analysis?
Interpreting the meaning of words and sentences.
Ex: Sentence: “I’ve just been in a car accident.”
- Determine the meaning of the words and sentence (the user is making a claim that an accident has just occurred).
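A toy sketch of the semantic step (the intent names and keyword lists are hypothetical, not from the flashcards): map the tokenised sentence to the most likely meaning for an insurance-style chatbot.

```python
# Hypothetical intents and cue words, purely for illustration.
INTENT_KEYWORDS = {
    "report_accident": {"accident", "crash", "collision", "damage"},
    "request_quote": {"quote", "price", "premium"},
}

def infer_meaning(tokens):
    words = {t.lower() for t in tokens}
    best = max(INTENT_KEYWORDS, key=lambda intent: len(INTENT_KEYWORDS[intent] & words))
    return best if INTENT_KEYWORDS[best] & words else "unknown"

print(infer_meaning(["I've", "just", "been", "in", "a", "car", "accident"]))
# report_accident
```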
What is discourse integration?
Understanding context within a conversation.
Ex: Sentence: “I’ve just been in a car accident.”
- Relate the statement to previous chatbot interactions (e.g., is the user an existing customer?)
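A sketch of discourse integration (the dialogue state and replies are hypothetical): the chatbot keeps track of what earlier turns established, so the accident statement is interpreted differently once the user is known to be a customer.

```python
# Minimal dialogue state carried between turns (illustrative only).
state = {"is_customer": False}

def handle_turn(utterance, state):
    text = utterance.lower()
    if "policy" in text:
        state["is_customer"] = True
    if "car accident" in text:
        if state["is_customer"]:
            return "Sorry to hear that. Shall I start a claim on your policy?"
        return "Sorry to hear that. Are you an existing customer?"
    return "How can I help?"

print(handle_turn("My policy number is 12345.", state))        # How can I help?
print(handle_turn("I've just been in a car accident.", state)) # claim offer, thanks to the earlier turn
```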
What is pragmatic analysis?
Considering cultural, social, and legal contexts.
Ex: Sentence: “I’ve just been in a car accident.”
- Consider the wider context (is the user the driver? are they upset? is there a legal situation?)
What is a core challenge in lexical analysis regarding punctuation and whitespace?
Punctuation and whitespace may or may not be treated as separate tokens. This decision affects how the next stage - syntactic analysis - interprets sentence structure.
How can hyphenated words, contractions, emoticons, and URLs complicate tokenisation?
Different tokenisation strategies might split these constructs in various ways. For instance, hyphenated words or contractions may be split into multiple tokens or kept intact, leading to potential inconsistencies in how the input is later processed.
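A small sketch (standard library only; the example sentence is made up) showing how two reasonable strategies tokenise the same input very differently:

```python
import re

text = "Don't worry, the state-of-the-art model is at https://example.com :-)"

# Strategy A: whitespace only -- contraction, hyphenated word, URL and emoticon
# each survive as a single token.
print(text.split())

# Strategy B: split on every non-alphanumeric character -- the same constructs
# are shattered into fragments ('Don', 't', 'state', 'of', 'the', 'art', ...),
# and the emoticon disappears entirely.
print([t for t in re.split(r"[^A-Za-z0-9]+", text) if t])
```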
Why is tokenisation especially challenging in languages like Chinese or agglutinative languages?
Languages written in a continuous script (like Chinese) lack clear word boundaries, making it difficult to define a “word.” Similarly, agglutinative or fusional languages (such as Korean or Spanish) have complex word formations—like varied verb conjugations or suffixes—that challenge standard tokenisation rules.
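A brief illustration of the boundary problem (the Chinese sentence, chosen for this sketch, roughly means “I had a car accident”): whitespace splitting has nothing to split on, and a character-level fallback breaks up multi-character words such as 车祸 (“car accident”).

```python
chinese = "我出了车祸"  # "I had a car accident"

print(chinese.split())  # ['我出了车祸'] -- no spaces, so no word boundaries
print(list(chinese))    # ['我', '出', '了', '车', '祸'] -- character-level fallback
```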
What problems do user typos and proper nouns introduce in lexical analysis?
Typos can distort the intended meaning of words, while proper nouns that include spaces, apostrophes, or hyphens may be incorrectly segmented, resulting in tokenisation errors that affect subsequent analysis stages.
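A short sketch (made-up sentence, simple illustrative regex) of how a naive tokeniser mishandles a typo and multi-word or apostrophised proper nouns:

```python
import re

text = "I had an accidnet near O'Connell Street in New York"

# "accidnet" fails any dictionary lookup, "O'Connell" is split at the apostrophe,
# and nothing marks "New York" as a single name.
print(re.findall(r"\w+|[^\w\s]", text))
# ['I', 'had', 'an', 'accidnet', 'near', 'O', "'", 'Connell', 'Street', 'in', 'New', 'York']
```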
What are some strategies to address tokenisation challenges in lexical analysis?
- Context-Aware Tokenisation
- Complex Heuristics
- Single-Character Tokenisation
How does Context-Aware Tokenisation address tokenisation challenges in lexical analysis?
Using language models (e.g., the Charformer model by Tay et al.) that learn from context to split tokens effectively.
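Charformer itself learns its tokenisation end to end; as a rough stand-in (an assumption, not something the flashcards prescribe), a pretrained subword tokeniser from the Hugging Face transformers library shows the general idea of data-driven token splitting:

```python
from transformers import AutoTokenizer

# Requires: pip install transformers (model files are downloaded on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The WordPiece vocabulary was learned from a large corpus, so unfamiliar words
# are split into subword units the model has seen before rather than by fixed rules.
print(tokenizer.tokenize("I've just been in a car accident."))
# e.g. ['i', "'", 've', 'just', 'been', 'in', 'a', 'car', 'accident', '.']
```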
How does Complex Heuristics address tokenisation challenges in lexical analysis?
Employing techniques like finite state machines (e.g., using capital letters as cues for proper nouns) despite their limitations.
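A simple heuristic in this spirit (illustrative only, not a full finite state machine): treat runs of capitalised words as a single proper-noun token, accepting that sentence-initial words will sometimes be caught by mistake.

```python
def merge_proper_nouns(tokens):
    merged, run = [], []
    for tok in tokens:
        if tok[:1].isupper():
            run.append(tok)                   # stay in the "proper noun" state
        else:
            if run:
                merged.append(" ".join(run))  # leave the state, emit the name
                run = []
            merged.append(tok)
    if run:
        merged.append(" ".join(run))
    return merged

print(merge_proper_nouns(["I", "crashed", "near", "New", "York", "yesterday"]))
# ['I', 'crashed', 'near', 'New York', 'yesterday']
```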
How does Single-Character Tokenisation address tokenisation challenges in lexical analysis?
For numbers, tokenising each character separately may improve mathematical processing, though it demands more processing power.
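A quick illustration of the trade-off (standard library only): digit-level tokens make every number representable, at the cost of more tokens to process.

```python
price = "1499.99"

word_level = [price]       # ['1499.99'] -- one opaque token
char_level = list(price)   # ['1', '4', '9', '9', '.', '9', '9'] -- seven explicit tokens

print(word_level, char_level)
```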
What are common syntactic analysis failures as seen in real-world chatbot applications?
Failures often arise from misinterpreting sentence structure. For example, small punctuation errors or ambiguous phrasing can cause chatbots to jumble words or misassign meaning—resulting in orders being misinterpreted, as seen in documented drive-thru chatbot mishaps.
How can syntactic analysis issues be mitigated?
Improvements can be achieved by:
- Training on diverse voice data and incorporating a wider range of tonal variations.
- Using autocorrection and grammar correction algorithms to pre-process text.
- Designing systems with restricted inputs (e.g., a fixed menu) where syntactic variability is reduced (see the sketch below).
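A sketch of the restricted-input idea (the menu, function name, and matching threshold are all hypothetical): with a fixed menu, noisy or misspelled input only has to be matched against a handful of known items.

```python
import difflib

MENU = ["cheeseburger", "chicken nuggets", "french fries", "vanilla milkshake"]

def match_order(utterance):
    words = utterance.lower().replace(",", " ").split()
    order = []
    for item in MENU:
        # Accept the item if any of its words roughly matches a word the user said.
        if any(difflib.get_close_matches(w, words, n=1, cutoff=0.6) for w in item.split()):
            order.append(item)
    return order

print(match_order("Can I have a cheesburger and some frys please"))
# ['cheeseburger', 'french fries']
```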
What semantic analysis challenges impact natural language understanding?
Idioms and Non-Literal Language: Phrases like “break a leg” that cannot be understood literally.
Homonyms: Words like “bass” (referring to a fish or a musical tone) require context to disambiguate (see the sketch after this list).
Ambiguous Company Names: Terms like “Apple” or “Target” that have common as well as corporate meanings.
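A toy disambiguation sketch (the sense labels and cue-word lists are hypothetical): choose the sense of “bass” whose cue words overlap most with the surrounding sentence.

```python
# Hypothetical senses and cue words, for illustration only.
SENSES = {
    "bass (fish)": {"fish", "river", "lake", "caught", "fishing"},
    "bass (music)": {"guitar", "band", "amp", "music", "drummer"},
}

def disambiguate(sentence):
    words = set(sentence.lower().split())
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

print(disambiguate("He caught a huge bass in the lake"))  # bass (fish)
print(disambiguate("She plays bass in a rock band"))      # bass (music)
```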
What additional semantic challenges are highlighted in linguistic analysis?
Lexical Disambiguation: Determining the correct meaning of a word with multiple definitions based on sentence context.
Lexical Reversibles: Sentences where object positions may be interchanged yet retain meaning.
Neologisms: The need to interpret newly coined or unfamiliar words.
Idiomatic Usage and Metaphors: Understanding non-literal language relies on common cultural knowledge rather than direct interpretation.
What constitutes pragmatic analysis failures in language processing?
Pragmatic failures occur when systems misunderstand context, tone, or cultural nuances. This can lead to misinterpretations in customer support or other real-world scenarios, where the intended meaning (or sentiment) of a message is lost or misconstrued.