NLP - Introduction to Natural Language Processing Flashcards
TEXT DATA, TWITTER EXAMPLE
{"id": "1579502253039550466", "conversation_id": "1579238912039403521",
 "entities": {"mentions": [{"start": 0, "end": 14, "username": "SaucyforTesla",
 "id": "1334784052386119680"}], "urls": [{"start": 171, "end": 194,
 "url": "https://t.co/qrfOZy6wQe",
 "expanded_url": "https://twitter.com/messages/compose?recipient_id=10850192",
 "display_url": "twitter.com/messages/compo\u2026", "status": 404,
 "unwound_url": "https://twitter.com/messages/compose?recipient_id=10850192"}]},
 "author_id": "10850192",
 "public_metrics": {"retweet_count": 0, "reply_count": 0, "like_count": 0, "quote_count": 0},
 "text": "@SaucyforTesla Hi, Larry. Our team would be happy to discuss your ideas in more detail. When you have a moment, please feel free to DM us with additional details. -Rachel https://t.co/qrfOZy6wQe",
 "referenced_tweets": [{"type": "replied_to", "id": "1579238912039403521"}],
 "created_at": "2022-10-10T16:00:57.000Z",
 "edit_history_tweet_ids": ["1579502253039550466"],
 "full_text": "@SaucyforTesla Hi, Larry. Our team would be happy to discuss your ideas in more detail. When you have a moment, please feel free to DM us with additional details. -Rachel https://t.co/qrfOZy6wQe",
 "unixTime": 1665410457.0, "twitter_name": "GM"}
TEXT DATA, AMAZON REVIEW EXAMPLE
{
 'overall': 5.0,
 'verified': True,
 'reviewTime': '03 21, 2014',
 'reviewerID': 'A2HQAG6N6J6WF7',
 'asin': '6073894996',
 'reviewerName': 'Speed3',
 'reviewText': 'Plug it into your accessory outlet and charge your USB cabled device. I use it for charging my iPhone while I drive to and from work.',
 'summary': 'It works',
 'unixReviewTime': 1395360000,
}
Corpora and Documents
* Corpora are typically thought of as meaningfully cohesive collections of Documents.
Often a class of documents, e.g.
* From the same source (e.g. amazon reviews)
* From different sources, but on the same topic (different historical sources about
WW2)
* Filtered by some set of criteria (tweets containing the word "ham", or being a 5-star review)
Documents are uniform collections of:
- Words & sentences (potential features)
- Auxiliary data, e.g. time, place, and other metainfo (potential features)
DERIVING DATA FROM DOCUMENTS
From unstructured data to meaningful features
Often we want more information than what is in the document text or metadata. E.g.
* How many words/sentences are in it?
* How long ago was it written?
* How long is the text?
* Is the text generally happy or unhappy? What is the sentiment of the text?
This information can often be derived from the documents. We do this by “applying
functions to each document”.
TYPICAL FUNCTIONS WE USE TO EXPLORE TEXT DATA
How many words do they write?
How many sentences?
What is the reading difficulty?
Are people generally happy or unhappy?
Anything else anyone can think of?
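Questions like these can be answered by applying a small function to each document. A minimal sketch, assuming naive whitespace and punctuation splitting (a real tokenizer would handle edge cases better):

```python
import re

def describe(document_text):
    """Derive simple descriptive features from a raw document string."""
    words = document_text.split()
    # Rough sentence heuristic: split on runs of ., !, ? plus trailing whitespace.
    sentences = [s for s in re.split(r"[.!?]+\s*", document_text) if s]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "char_count": len(document_text),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }
```

For example, `describe("It works. I use it daily.")` reports 6 words and 2 sentences.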
LIX SCORE
LIX is a useful measure for estimating how difficult a text is to read. It measures:
* How long the sentences are
* What percentage of the words are longer than 6 characters
Formally,
LIX = word_count / sentence_count + (words_longer_than_6 * 100) / word_count
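The LIX formula can be sketched in code; the splitting heuristics here are assumptions, not part of the formula itself:

```python
def lix(text):
    """Compute the LIX readability score:
    words/sentences + (long words * 100)/words,
    where long words have more than 6 characters."""
    # Naive sentence splitting: treat !, ?, and . as sentence boundaries.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = text.split()
    long_words = [w for w in words if len(w.strip(".,!?")) > 6]
    return len(words) / len(sentences) + len(long_words) * 100 / len(words)
```

For instance, `lix("The cat sat. The dog ran.")` gives 3.0: three words per sentence and no long words.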
SENTIMENT ANALYSIS
Sentiment Analysis is an evaluation of how positive or negative a text is.
The simple way to do it is to have two lists of words:
* Positive words
* Negative words
And then simply count how many of each there are in a text, and add them up.
Alternatively, some words are more negative or positive than others. If we give each of
them a negativity score or a positivity score, we can do a more nuanced analysis of the
sentiment
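A toy sketch of the word-list approach with per-word scores. The word lists and scores here are invented for illustration; real lexicons such as AFINN assign each word a score:

```python
# Invented mini-lexicons; a real lexicon would contain thousands of entries.
POSITIVE = {"happy": 2, "good": 1, "great": 3, "love": 3}
NEGATIVE = {"sad": -2, "bad": -1, "terrible": -3, "hate": -3}

def sentiment(text):
    """Sum the scores of all known sentiment words in a text."""
    scores = {**POSITIVE, **NEGATIVE}
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(scores.get(w, 0) for w in words)
```

With these lists, `sentiment("I love this great charger!")` gives 6 and `sentiment("Terrible, I hate it.")` gives -6; setting every score to +1 or -1 recovers the simple counting approach.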
FROM DESCRIPTION TO CLASSIFICATION (AND PREDICTION)
We have a variety of features now. How do we think of features as belonging to a class,
and therefore potentially helpful in terms of making predictions?
The data-related questions:
How do we articulate regularities/relevant features?
How do we spot them?
How do we test them?
The learning-related questions:
How do we make predictions based on "known" feature-class relationships?
To what extent does each feature predict belonging to a specific class?
THE FUNDAMENTAL QUESTIONS IN
MACHINE LEARNING (INCL NLP):
- What makes something something, in a way that sets it apart from all the other things it
could be?
* What makes a duck a duck, and not a goose?
* What makes a positive review a positive review, and not a lukewarm review?
* What makes a customer review negative about service, but positive about food, and
not vice versa? - How can we use those particularities about the things we are looking at, and make
them useful? - Either to better understand the things we are looking at,
- And/or to make predictions about when something is indeed something, and not
something else
CORE CONCEPTS: FEATURES AND CLASSES
Features: Specific (processed) data that define a thing both in its own right, and in contrast to other things.
Bird features:
* Length of beak
* Shape of beak
* Color of beak
* Body posture
* Color of feathers
* Size
Etc
Classes: collections of things with shared features that we want to think of as a "something".
Bird classes:
* Hen
* Duck
* Goose
etc
WHAT’S HARD ABOUT THAT?
Features and classes seem easy
enough, right?
For birds, biologists have already done most of the work for us and defined the features and classes. They have structured the data for us.
When we work with unstructured
text data, we often need to identify
and define these ourselves.
HOW?
The process of
1. Defining
2. Identifying, and
3. Extracting
features and classes from your
data and structuring it yourself is
often the most difficult part of the
process.
And it is what we will mostly talk
about today.
BAYES LAW
P(A | B) = P(B | A)P(A) / P(B)
P(A | B) = What is the probability that a document belongs to class A, given that feature B appears?
P(B | A) = How often do we see the feature in class A?
P(A) = How often do we see class A in the corpus in general?
P(B) = How frequent is the feature in general?
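Plugging numbers into Bayes' law makes the terms concrete. The corpus counts below are invented for illustration:

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' law: P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical corpus: 100 reviews, 60 of them positive; the word "great"
# appears in 30 of the positive reviews and in 40 reviews overall.
p_pos_given_great = bayes(p_b_given_a=30 / 60,  # P(great | positive)
                          p_a=60 / 100,         # P(positive)
                          p_b=40 / 100)         # P(great)
# (0.5 * 0.6) / 0.4 = 0.75: a review containing "great" is positive 75% of the time.
```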
USING BAYES LAW ON A DATASET
- We take our input and extract features from it.
- We add KNOWN class label(s) to those features.
- We tell our Bayesian classifier, "these features indicate that the item that produced the featureset belongs to class X".
- The machine updates the model according to Bayes' law.
If we give it a new featureset, it makes predictions based on what it knows.
This is a SUPERVISED machine
learning approach.
It is supervised because:
1. Classes are known in advance
2. We know which documents
belong to which class
3. We tell the model how documents
and features are related.
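The supervised steps above can be sketched from scratch; this is a minimal Naive Bayes over word features, not a production classifier, and the add-one smoothing is an extra assumption beyond the slides:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal Naive Bayes classifier over word features (add-one smoothing)."""

    def train(self, labeled_docs):
        # labeled_docs: list of (list_of_feature_words, KNOWN class_label) pairs
        self.class_counts = Counter(label for _, label in labeled_docs)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in labeled_docs:
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, words):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # log P(class) + sum over words of log P(word | class), smoothed
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Training on invented labeled featuresets:
nb = NaiveBayes()
nb.train([(["great", "love"], "pos"),
          (["great", "good"], "pos"),
          (["bad", "hate"], "neg")])
```

After training, `nb.predict(["great"])` returns "pos": the classes and document-class assignments were given in advance, which is what makes this supervised.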
We will do some unsupervised learning as well.
EVALUATING A MODEL
A confusion matrix
Evaluation metrics: precision, recall, and accuracy
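A sketch of how the confusion-matrix cells relate to the three metrics, for a binary case; the label names are hypothetical:

```python
def evaluate(true_labels, predicted_labels, positive="pos"):
    """Compute confusion-matrix cells, then precision, recall, and accuracy."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == p == positive)              # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive) # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive) # false negatives
    tn = sum(1 for t, p in pairs if t == p != positive)              # true negatives
    return {
        "precision": tp / (tp + fp),  # of predicted positives, how many were right?
        "recall": tp / (tp + fn),     # of actual positives, how many were found?
        "accuracy": (tp + tn) / len(pairs),
    }
```

For example, `evaluate(["pos", "pos", "neg", "neg"], ["pos", "neg", "pos", "neg"])` yields 0.5 for all three metrics: one true positive, one false positive, one false negative, one true negative.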