NLP - Introduction to Natural Language Processing Flashcards
TEXT DATA, TWITTER EXAMPLE
{"id": "1579502253039550466", "conversation_id": "1579238912039403521",
 "entities": {"mentions": [{"start": 0, "end": 14, "username": "SaucyforTesla",
 "id": "1334784052386119680"}], "urls": [{"start": 171, "end": 194,
 "url": "https://t.co/qrfOZy6wQe",
 "expanded_url": "https://twitter.com/messages/compose?recipient_id=10850192",
 "display_url": "twitter.com/messages/compo\u2026", "status": 404,
 "unwound_url": "https://twitter.com/messages/compose?recipient_id=10850192"}]},
 "author_id": "10850192",
 "public_metrics": {"retweet_count": 0, "reply_count": 0, "like_count": 0, "quote_count": 0},
 "text": "@SaucyforTesla Hi, Larry. Our team would be happy to discuss your ideas in more detail. When you have a moment, please feel free to DM us with additional details. -Rachel https://t.co/qrfOZy6wQe",
 "referenced_tweets": [{"type": "replied_to", "id": "1579238912039403521"}],
 "created_at": "2022-10-10T16:00:57.000Z",
 "edit_history_tweet_ids": ["1579502253039550466"],
 "full_text": "@SaucyforTesla Hi, Larry. Our team would be happy to discuss your ideas in more detail. When you have a moment, please feel free to DM us with additional details. -Rachel https://t.co/qrfOZy6wQe",
 "unixTime": 1665410457.0, "twitter_name": "GM"}
TEXT DATA, AMAZON REVIEW EXAMPLE
{
 'overall': 5.0,
 'verified': True,
 'reviewTime': '03 21, 2014',
 'reviewerID': 'A2HQAG6N6J6WF7',
 'asin': '6073894996',
 'reviewerName': 'Speed3',
 'reviewText': 'Plug it into your accessory outlet and charge your USB cabled device. I use it for charging my iPhone while I drive to and from work.',
 'summary': 'It works',
 'unixReviewTime': 1395360000,
}
Corpora and Documents
* Corpora are typically thought of as meaningfully cohesive collections of Documents.
Often a class of documents, e.g.
* From the same source (e.g. amazon reviews)
* From different sources, but on the same topic (different historical sources about
WW2)
* Filtered by some set of criteria (tweets containing the word "ham", or being a 5-star review)
Documents are uniform collections of:
- Words & sentences (potential features)
- Auxiliary data, e.g. time, place, and other metainfo (potential features)
DERIVING DATA FROM DOCUMENTS
From unstructured data to meaningful features
Often we want more information than what is in the document text or metadata. E.g.
* How many words/sentences are in it?
* How long ago was it written?
* How long is the text?
* Is the text generally happy or unhappy? What is the sentiment of the text?
This information can often be derived from the documents. We do this by “applying
functions to each document”.
TYPICAL FUNCTIONS WE USE TO EXPLORE TEXT DATA
How many words do they write?
How many sentences?
What is the reading difficulty?
Are people generally happy or unhappy?
Anything else anyone can think of?
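Questions like these can be answered by applying a small function to each document. A minimal sketch, assuming naive whitespace and punctuation splitting (a real tokenizer would handle edge cases better):

```python
import re

def describe(document_text):
    """Derive simple descriptive features from a raw document string."""
    words = document_text.split()
    # Rough sentence heuristic: split on runs of ., !, ? plus trailing whitespace.
    sentences = [s for s in re.split(r"[.!?]+\s*", document_text) if s]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "char_count": len(document_text),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }
```

For example, `describe("It works. I use it daily.")` reports 6 words and 2 sentences.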
LIX SCORE
LIX is a useful measure for estimating how difficult a text is to read. It measures:
* How long the sentences are
* What percentage of the words are longer than 6 characters
Formally,
LIX = word_count / sentence_count + (words_longer_than_6 * 100) / word_count
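The LIX formula can be sketched in code; the splitting heuristics here are assumptions, not part of the formula itself:

```python
def lix(text):
    """Compute the LIX readability score:
    words/sentences + (long words * 100)/words,
    where long words have more than 6 characters."""
    # Naive sentence splitting: treat !, ?, and . as sentence boundaries.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = text.split()
    long_words = [w for w in words if len(w.strip(".,!?")) > 6]
    return len(words) / len(sentences) + len(long_words) * 100 / len(words)
```

For instance, `lix("The cat sat. The dog ran.")` gives 3.0: three words per sentence and no long words.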
SENTIMENT ANALYSIS
Sentiment Analysis is an evaluation of how positive or negative a text is.
The simple way to do it is to have two lists of words:
* Positive words
* Negative words
And then simply count how many of each there are in a text, and add them up.
Alternatively, some words are more negative or positive than others. If we give each of
them a negativity score or a positivity score, we can do a more nuanced analysis of the
sentiment
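A toy sketch of the word-list approach with per-word scores. The word lists and scores here are invented for illustration; real lexicons such as AFINN assign each word a score:

```python
# Invented mini-lexicons; a real lexicon would contain thousands of entries.
POSITIVE = {"happy": 2, "good": 1, "great": 3, "love": 3}
NEGATIVE = {"sad": -2, "bad": -1, "terrible": -3, "hate": -3}

def sentiment(text):
    """Sum the scores of all known sentiment words in a text."""
    scores = {**POSITIVE, **NEGATIVE}
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(scores.get(w, 0) for w in words)
```

With these lists, `sentiment("I love this great charger!")` gives 6 and `sentiment("Terrible, I hate it.")` gives -6; setting every score to +1 or -1 recovers the simple counting approach.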
FROM DESCRIPTION TO CLASSIFICATION (AND PREDICTION)
We have a variety of features now. How do we think of features as belonging to a class,
and therefore potentially helpful in terms of making predictions?
The data-related questions:
How do we articulate regularities/relevant features?
How do we spot them?
How do we test them?
The learning-related questions:
How do we make predictions based on "known" feature-class relationships?
To what extent does each feature predict belonging to a specific class?
THE FUNDAMENTAL QUESTIONS IN
MACHINE LEARNING (INCL NLP):
- What makes something something, in a way that sets it apart from all the other things it
could be?
* What makes a duck a duck, and not a goose?
* What makes a positive review a positive review, and not a lukewarm review?
* What makes a customer review negative about service, but positive about food, and
not vice versa? - How can we use those particularities about the things we are looking at, and make
them useful? - Either to better understand the things we are looking at,
- And/or to make predictions about when something is indeed something, and not
something else
CORE CONCEPTS: FEATURES AND CLASSES
Features: Specific (processed) data that define a thing both in its own right, and in contrast to other things.
Bird features:
* Length of beak
* Shape of beak
* Color of beak
* Body posture
* Color of feathers
* Size
Etc
Classes: collections of things with shared features that we want to think of as a "something".
Bird classes:
* Hen
* Duck
* Goose
etc
WHAT’S HARD ABOUT THAT?
Features and classes seem easy
enough, right?
For birds, biologists have already done most of the work for us and defined the features and classes. They have structured the data for us.
When we work with unstructured
text data, we often need to identify
and define these ourselves.
HOW?
The process of
1. Defining
2. Identifying, and
3. Extracting
features and classes from your
data and structuring it yourself is
often the most difficult part of the
process.
And it is what we will mostly talk
about today.
BAYES LAW
P(A | B) = P(B | A)P(A) / P(B)
P(A | B) = What is the probability that a document belongs to class A, given that feature B appears?
P(B | A) = How often do we see the feature in class A?
P(A) = How often do we see class A in the corpus in general?
P(B) = How frequent is the feature in general?
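Plugging numbers into Bayes' law makes the terms concrete. The corpus counts below are invented for illustration:

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' law: P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical corpus: 100 reviews, 60 of them positive; the word "great"
# appears in 30 of the positive reviews and in 40 reviews overall.
p_pos_given_great = bayes(p_b_given_a=30 / 60,  # P(great | positive)
                          p_a=60 / 100,         # P(positive)
                          p_b=40 / 100)         # P(great)
# (0.5 * 0.6) / 0.4 = 0.75: a review containing "great" is positive 75% of the time.
```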
USING BAYES LAW ON A DATASET
- We take our input and extract features from it.
- We add KNOWN class label(s) to those features.
- We tell our Bayesian classifier, "these features indicate that the item that produced the featureset belongs to class X".
- The machine updates the model according to Bayes' law.
If we give it a new featureset, it makes predictions based on what it knows.
This is a SUPERVISED machine
learning approach.
It is supervised because:
1. Classes are known in advance
2. We know which documents
belong to which class
3. We tell the model how documents
and features are related.
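The supervised steps above can be sketched from scratch; this is a minimal Naive Bayes over word features, not a production classifier, and the add-one smoothing is an extra assumption beyond the slides:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal Naive Bayes classifier over word features (add-one smoothing)."""

    def train(self, labeled_docs):
        # labeled_docs: list of (list_of_feature_words, KNOWN class_label) pairs
        self.class_counts = Counter(label for _, label in labeled_docs)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in labeled_docs:
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, words):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # log P(class) + sum over words of log P(word | class), smoothed
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Training on invented labeled featuresets:
nb = NaiveBayes()
nb.train([(["great", "love"], "pos"),
          (["great", "good"], "pos"),
          (["bad", "hate"], "neg")])
```

After training, `nb.predict(["great"])` returns "pos": the classes and document-class assignments were given in advance, which is what makes this supervised.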
We will do some unsupervised learning as well.
EVALUATING A MODEL
A confusion matrix
Evaluation metrics: precision, recall, and accuracy
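A sketch of how the confusion-matrix cells relate to the three metrics, for a binary case; the label names are hypothetical:

```python
def evaluate(true_labels, predicted_labels, positive="pos"):
    """Compute confusion-matrix cells, then precision, recall, and accuracy."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == p == positive)              # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive) # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive) # false negatives
    tn = sum(1 for t, p in pairs if t == p != positive)              # true negatives
    return {
        "precision": tp / (tp + fp),  # of predicted positives, how many were right?
        "recall": tp / (tp + fn),     # of actual positives, how many were found?
        "accuracy": (tp + tn) / len(pairs),
    }
```

For example, `evaluate(["pos", "pos", "neg", "neg"], ["pos", "neg", "pos", "neg"])` yields 0.5 for all three metrics: one true positive, one false positive, one false negative, one true negative.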