Natural Language Processing Flashcards
Pipeline of natural language processing
1) Text processing
2) Feature extraction
3) Modeling
Text processing
- When reading HTML, eliminate the tags
- Convert all letters to lowercase
- Sometimes it is a good idea to remove punctuation
- Sometimes it is a good idea to remove words like “are, for, a, the”
Feature extraction
- Letters in Unicode are represented by numbers, but if they are treated as plain numbers that can mislead the models
- There are many ways to represent text info
- If you want a graph-based model to discover insights, represent words as nodes with relations
- If you want to recognize spam or text sentiment, use a bag of words
- For text generation or translation, use word2vec
Modeling
- Create a statistical or machine learning model
How to read a file in python
with open("hola.txt", "r") as f:
    text = f.read()
How to read tabular data or csv
- You can use pandas
- import pandas as pd
- df = pd.read_csv("hola.csv")
How to get a website or a file in the web?
import requests

# Fetch a web page
r = requests.get("https://www.udacity.com/courses/all")
How to clean the text from a website?
- Use a library
from bs4 import BeautifulSoup

# Remove HTML tags using the Beautiful Soup library
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())
Tips for text cleaning
- All letters to lower case
- In document classification or clustering, eliminate punctuation
How to eliminate punctuation
- Use the regular expressions library "re"
- import re
- Replace punctuation with a space
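A minimal sketch of those steps with the standard-library re module (the sample sentence is just an illustration):

```python
import re

text = "Dr. Smith's talk (NLP, 2020) was great!"
# Replace every run of non-word characters with a single space
cleaned = re.sub(r"\W+", " ", text).strip()
print(cleaned)  # Dr Smith s talk NLP 2020 was great
```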
Useful libraries
- NLTK
- BeautifulSoup
- re
What is a token
- A token is a unit of text that represents a single concept, e.g. the word "dog"
Tokenization with NLTK
- Tokenize words with word_tokenize. It is smarter than split()
- Tokenize sentences with sent_tokenize
- To tokenize tweets, use TweetTokenizer
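The NLTK functions above are the real tools; as a rough standard-library approximation of what they do (note that NLTK also handles tricky cases like the period in "Dr." that this regex would get wrong):

```python
import re

text = "The dog barked. It ran away!"
# Rough stand-in for sent_tokenize: split after sentence-ending punctuation
sentences = re.split(r"(?<=[.!?])\s+", text)
# Rough stand-in for word_tokenize: words and punctuation as separate tokens
tokens = re.findall(r"\w+|[^\w\s]", sentences[1])
print(sentences)  # ['The dog barked.', 'It ran away!']
print(tokens)     # ['It', 'ran', 'away', '!']
```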
Stop word removal
- Eliminate words like "are", "the" that don't add information
- nltk has a list of stop words
- [word for word in querywords if word.lower() not in stopwords]
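The list comprehension above can be sketched with a tiny hand-made stop-word list (NLTK's full English list comes from nltk.corpus.stopwords.words("english")):

```python
# Tiny illustrative stop-word set; NLTK provides a much longer one
stopwords = {"are", "the", "a", "for", "is"}

querywords = ["The", "dog", "is", "in", "the", "house"]
filtered = [word for word in querywords if word.lower() not in stopwords]
print(filtered)  # ['dog', 'in', 'house']
```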
Part-of-speech tagging
- In some applications it is helpful to classify words as verbs, nouns, etc.
- Use NLTK pos_tag
Named entity recognition
- Classify a noun by the type of entity: person, organization, government, etc
- Use NLTK ne_chunk
Stemming and Lemmatization
- Methods to reduce variations of a word to a stem or root word. Example: Started → Start
- In stemming the result is not always a real word, but it is more efficient
- NLTK has a stemming method PorterStemmer.stem()
- Lemmatization is more computationally expensive because it uses a dictionary, but the result is a real word
- NLTK has a lemmatization method WordNetLemmatizer.lemmatize(), which treats words as nouns by default
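The contrast can be sketched with a toy suffix-stripping stemmer and a tiny hand-made lemma dictionary; both are hypothetical stand-ins for NLTK's PorterStemmer and WordNetLemmatizer, just to show the difference in behavior:

```python
def toy_stem(word):
    # Crude suffix stripping: the result need not be a real word
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer consults a dictionary, so the result is a real word
lemma_dict = {"started": "start", "mice": "mouse", "studies": "study"}

def toy_lemmatize(word):
    return lemma_dict.get(word, word)

print(toy_stem("started"))       # start
print(toy_stem("studies"))       # studi  (not a real word)
print(toy_lemmatize("studies"))  # study
print(toy_lemmatize("mice"))     # mouse
```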
Lesson summary
1) Normalize
2) Tokenize
3) Remove stop words
4) Stem / Lemmatize
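The four steps above can be sketched end to end with the standard library; the stop-word list and the suffix-stripping stemmer are toy stand-ins for NLTK's versions:

```python
import re

STOPWORDS = {"the", "a", "are", "for", "is", "was"}  # tiny illustrative list

def toy_stem(word):
    # Stand-in for PorterStemmer.stem(): crude suffix stripping
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                 # 1) normalize
    tokens = re.findall(r"[a-z0-9]+", text)             # 2) tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # 3) remove stop words
    return [toy_stem(t) for t in tokens]                # 4) stem

print(preprocess("The dogs are barking loudly!"))  # ['dog', 'bark', 'loudly']
```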
Bag of words
- Interpret each document as an unordered group of words
- To compare how similar two bags of words are, use the dot product and cosine similarity
- The document is a vector with the frequency of each word
- To compare documents, build a matrix: rows are the documents, columns are word frequencies
- Limitation: each word has the same importance
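A minimal sketch of comparing two bag-of-words vectors with cosine similarity, using only the standard library:

```python
import math
from collections import Counter

def bow(text):
    # Bag of words: word frequencies, order discarded
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    words = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in words)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

d1 = bow("the dog chased the cat")
d2 = bow("the cat chased the mouse")
print(round(cosine_similarity(d1, d2), 3))  # 0.857 (= 6/7)
```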
TF-IDF
- Highlight words that are more unique to a document
- tfidf(t, d) = tf(t, d) * idf(t) = count(d, t)/|d| * log(|D| / |{d ∈ D : t ∈ d}|)
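The formula can be checked on a toy corpus (the documents are just short token lists for illustration):

```python
import math

corpus = [
    ["dog", "barks"],
    ["cat", "meows"],
    ["dog", "runs"],
]

def tf(term, doc):
    # Term frequency: count(d, t) / |d|
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(|D| / number of docs containing t)
    matches = sum(1 for d in docs if term in d)
    return math.log(len(docs) / matches)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "dog" appears in 2 of 3 documents, so its idf (and tf-idf) is low
print(tfidf("dog", corpus[0], corpus))    # 0.5 * log(3/2)
print(tfidf("barks", corpus[0], corpus))  # 0.5 * log(3/1)
```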
Word2Vec
- The idea is a model able to predict a word given its neighboring words (Continuous Bag of Words) or to predict the neighbors given a word (Continuous Skip-gram)
GloVe
- Uses a co-occurrence probability matrix of the words in a corpus
- Example: P(water | ice) = 0.2
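A sketch of building those co-occurrence statistics with a sliding window over a toy corpus; GloVe then fits word vectors to these counts, so this only shows the counting step:

```python
from collections import Counter, defaultdict

tokens = "ice is cold ice water is solid water".split()
window = 2  # how many neighbors on each side count as co-occurring

cooc = defaultdict(Counter)
for i, word in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            cooc[word][tokens[j]] += 1

# P(water | ice): fraction of "ice"'s co-occurrences that are "water"
p = cooc["ice"]["water"] / sum(cooc["ice"].values())
print(round(p, 3))  # 0.167 (= 1/6 on this toy corpus)
```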
Language model
A language model captures the distributional statistics of words. In its most basic form, we take each unique word in a corpus and count how many times it occurs.
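A minimal sketch of those unigram counts on a toy corpus:

```python
from collections import Counter

corpus = "the dog chased the cat and the cat ran".split()
counts = Counter(corpus)
total = len(corpus)

# Unigram probability: P(word) = count(word) / total words
print(counts["the"], total)   # 3 9
print(counts["the"] / total)  # ≈ 0.333
```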
Bigram Model
- A matrix over a corpus that gives the probability of a word occurring given the previous word
- It is used for generating text
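A sketch of estimating bigram probabilities from a toy corpus and sampling text from them:

```python
import random
from collections import Counter, defaultdict

tokens = "the dog barks the dog runs the cat sleeps".split()

# Count each (previous word, next word) pair
bigrams = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigrams[prev][nxt] += 1

# P(next | prev) = count(prev, next) / count(prev, anything)
p = bigrams["the"]["dog"] / sum(bigrams["the"].values())
print(p)  # 2/3: "the" is followed by "dog" twice and "cat" once

# Generate text by repeatedly sampling the next word
random.seed(0)
word, out = "the", ["the"]
for _ in range(3):
    nexts = list(bigrams[word].elements())
    if not nexts:  # dead end: nothing ever followed this word
        break
    word = random.choice(nexts)
    out.append(word)
print(" ".join(out))
```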
Regex eliminate punctuation
- "\W" matches non-word characters (anything other than letters, digits, and the underscore); replace them with a space