Natural Language Processing Flashcards by Alvaro Pinzon Cortes

Pipeline of natural language processing

1) Text processing
2) Feature extraction
3) Modeling

Text processing

When reading HTML, strip the tags
Convert all letters to lowercase
Removing punctuation is sometimes a good idea
Removing common words such as “are, for, a, the” is sometimes a good idea

Feature extraction

Characters in Unicode are represented by numbers, but treating those numbers as numeric values can mislead a model
There are many ways to represent text information
For a graph-based model that discovers insights, represent words as nodes with relations between them
To recognize spam or classify text sentiment, use bag of words
For text generation or translation, use Word2Vec
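
The first point above can be illustrated with a small sketch. The function naive_word_number is a hypothetical encoding invented for this example (it is not a real feature-extraction method); it only shows that arithmetic on code points does not carry meaning.

```python
# Code points are identifiers, not measurements: arithmetic on them
# says nothing about how related two characters or words are.
code_points = {ch: ord(ch) for ch in "abz"}
print(code_points)

def naive_word_number(word):
    # Hypothetical encoding for illustration only: sum of the code points.
    return sum(ord(ch) for ch in word)

# Two unrelated words that happen to be anagrams get the same number.
print(naive_word_number("dog"), naive_word_number("god"))
```

This is why the representations listed above (graphs, bag of words, word vectors) are used in place of raw character codes.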

Modeling

Create a statistical or machine learning model

How to read a file in Python

with open("hola.txt", "r") as f:
    text = f.read()

How to read tabular data or csv

You can use pandas

import pandas as pd
df = pd.read_csv("hola.csv")

How to get a web page or a file from the web?

import requests
# Fetch a web page
r = requests.get("https://www.udacity.com/courses/all")

How to clean the text from a website?

Use a library such as Beautiful Soup
from bs4 import BeautifulSoup

# Remove HTML tags using Beautiful Soup library
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())
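
If Beautiful Soup is not installed, the same idea can be sketched with Python's standard library. This is a swapped-in alternative, not the card's method: the class TextExtractor, the function strip_tags, and the sample HTML string are all assumptions made for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical sample page, used in place of a real download.
sample_html = "<html><body><h1>Courses</h1><p>Learn <b>NLP</b> today.</p></body></html>"
clean_text = strip_tags(sample_html)
print(clean_text)
```

Note that this simple version also keeps the contents of script and style elements, so a dedicated library is usually the better choice for real pages.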

Tips for text cleaning

Convert all letters to lowercase
In document classification or clustering, remove punctuation

How to eliminate punctuation

Use Python's regular expression library, re
import re
Replace each punctuation character with a space, for example with re.sub
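
A minimal sketch of this step, assuming the text is already in a string. The sample sentence and the helper name normalize are made up for illustration.

```python
import re

def normalize(text):
    # Lowercase, then replace every run of non-word characters
    # (punctuation and whitespace) with a single space.
    return re.sub(r"\W+", " ", text.lower()).strip()

normalized = normalize("Hello, World! NLP is fun... isn't it?")
print(normalized)
```

Replacing with a space instead of an empty string keeps neighboring words apart, but it also splits contractions such as "isn't" into two pieces.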

Useful libraries

NLTK
BeautifulSoup
re

What is a token

A symbol that represents a unique concept, such as the word “dog”

Tokenization with NLTK

Tokenize words with word_tokenize. It is smarter than str.split because it separates punctuation from words
Tokenize sentences with sent_tokenize
To tokenize tweets, use TweetTokenizer
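
To see why plain splitting falls short, here is a standard-library sketch that compares str.split with a simple regular-expression tokenizer. The regex tokenizer is only a stand-in written for this example; it is not how NLTK's tokenizers work internally, and the sample sentence is invented.

```python
import re

sentence = "Dr. Smith isn't here, is he?"

# str.split only cuts on whitespace, so punctuation stays glued to words.
split_tokens = sentence.split()

# Stand-in tokenizer: words (keeping inner apostrophes) or single
# punctuation marks become separate tokens.
regex_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

print(split_tokens)
print(regex_tokens)
```

With split, "here," and "he?" are treated as different tokens from "here" and "he", which inflates the vocabulary a model has to learn.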

Stop word removal

Remove words such as “are” and “the” that add little information
NLTK provides a list of stop words: nltk.corpus.stopwords.words("english")
[word for word in querywords if word.lower() not in stopwords]
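
The comprehension above can be run end to end as follows. To keep the sketch self-contained, stopwords here is a tiny hand-written set assumed for the example, not NLTK's actual list, and querywords comes from an invented sentence.

```python
# Tiny illustrative stop word set (an assumption; NLTK's list is much longer).
stopwords = {"a", "are", "for", "in", "the"}

querywords = "The dogs are in the garden".split()

# Compare in lowercase so that "The" is also removed.
filtered = [word for word in querywords if word.lower() not in stopwords]
print(filtered)
```

Using a set for the stop words keeps each membership check fast, which matters when filtering large documents.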

Part-of-speech tagging

In some applications it helps to label each word with its part of speech: verb, noun, etc.
Use NLTK pos_tag

Named entity recognition

Classify a noun by the type of entity: person, organization, government, etc
Use NLTK ne_chunk on tokens that have already been tagged with pos_tag

Stemming and Lemmatization

Methods that reduce variations of a word to a stem or root word. Example: Started » Start
In stemming, the result is not always a real word, but the process is more efficient
NLTK has a stemming method, PorterStemmer.stem()
Lemmatization is more computationally expensive because it uses a dictionary, but the resulting word is a real word
NLTK has a lemmatization method, WordNetLemmatizer.lemmatize(), which treats words as nouns by default
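
The trade-off between the two approaches can be sketched without NLTK. Both functions below are toy stand-ins invented for this example: toy_stem is a crude suffix stripper (not the Porter algorithm) and toy_lemmatize uses a three-entry dictionary (not WordNet).

```python
def toy_stem(word):
    # Rule-based: strip a common suffix without checking the result.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary standing in for a real lexical database.
LEMMAS = {"started": "start", "studies": "study", "mice": "mouse"}

def toy_lemmatize(word):
    # Dictionary-based: the result is always a real word when an entry exists.
    return LEMMAS.get(word, word)

for w in ("started", "studies", "mice"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer is cheap but turns "studies" into "studie" and cannot handle "mice", while the lookup returns real words at the cost of needing a dictionary.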

Lesson summary

1) Normalize
2) Tokenize
3) Remove stop words
4) Stem / Lemmatize

Bag of words

Treat each document as an unordered collection of words
To measure how similar two bags of words are, use the dot product or cosine similarity
Each document is a vector of word frequencies
To compare documents, build a matrix whose rows are documents and whose columns are word frequencies
Every word carries the same importance
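
A minimal standard-library sketch of these ideas. The three sample documents and the helper names bag_of_words and cosine_similarity are assumptions made for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Word order is discarded; only the counts remain.
    return Counter(text.lower().split())

def cosine_similarity(bag_a, bag_b):
    dot = sum(bag_a[w] * bag_b[w] for w in bag_a)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = ["the dog chased the cat", "the cat chased the dog", "birds fly south"]
bags = [bag_of_words(d) for d in docs]

# Document-term matrix: one row per document, one column per vocabulary word.
vocab = sorted(set().union(*bags))
matrix = [[bag[w] for w in vocab] for bag in bags]

same_words = cosine_similarity(bags[0], bags[1])
no_overlap = cosine_similarity(bags[0], bags[2])
print(vocab)
print(matrix)
print(same_words, no_overlap)
```

The first two documents mean different things but have a similarity of about 1 because they contain the same words, which shows what is lost when order is ignored.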

TF-IDF

Highlight words that are more unique to a document

- tfidf = tf * idf = count(d, t) / |d| * log(|D| / |{d ∈ D : t ∈ d}|)
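
The formula can be sketched directly in code: term frequency is the count of the term divided by the document length, and inverse document frequency is the log of the number of documents divided by the number of documents containing the term. The sample documents are invented, and the sketch assumes each term appears in at least one document (otherwise idf would divide by zero).

```python
import math

def tf(term, doc):
    # count(d, t) / |d|
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(|D| / number of documents that contain the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical tokenized documents.
docs = [
    ["the", "dog", "barks"],
    ["the", "cat", "meows"],
    ["the", "dog", "sleeps"],
]

for term in ("the", "dog", "barks"):
    print(term, tfidf(term, docs[0], docs))
```

A word that appears in every document, such as "the", scores zero, while a word unique to one document scores highest.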

Word2Vec

The idea is to train a model that predicts a word from its neighboring words (Continuous Bag of Words, CBOW) or predicts the neighboring words from a given word (Continuous Skip-gram)
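
The difference between the two variants shows up in how training examples are built from a sentence. The sketch below only generates those examples; it does not train any model. The function names, the window size, and the sample tokens are assumptions for illustration.

```python
def context_indices(i, length, window):
    # Positions within `window` words of position i, excluding i itself.
    return [j for j in range(max(0, i - window), min(length, i + window + 1)) if j != i]

def cbow_pairs(tokens, window=1):
    # CBOW: (list of neighboring words, word to predict)
    return [
        ([tokens[j] for j in context_indices(i, len(tokens), window)], target)
        for i, target in enumerate(tokens)
    ]

def skipgram_pairs(tokens, window=1):
    # Skip-gram: (given word, one neighbor to predict)
    return [
        (center, tokens[j])
        for i, center in enumerate(tokens)
        for j in context_indices(i, len(tokens), window)
    ]

tokens = ["the", "quick", "brown", "fox"]
print(cbow_pairs(tokens))
print(skipgram_pairs(tokens))
```

CBOW produces one example per word with all its neighbors as input, while skip-gram produces one example per (word, neighbor) pair.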

GloVe

Uses a matrix of word co-occurrence probabilities computed from a corpus.
Example entry: P(Water | Ice) = 0.2
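
A rough sketch of how such a co-occurrence probability could be computed from raw text. It covers only the counting step, not the GloVe training objective, and the window size, function names, and four-word corpus are assumptions, so the resulting value differs from the 0.2 quoted above.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    # counts[word][context] = how often `context` appears near `word`.
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

def cooccurrence_probability(counts, context, word):
    # P(context | word) = co-occurrences of the pair / all co-occurrences of word.
    total = sum(counts[word].values())
    return counts[word][context] / total if total else 0.0

tokens = ["ice", "melts", "into", "water"]
counts = cooccurrence_counts(tokens, window=3)
p_water_given_ice = cooccurrence_probability(counts, "water", "ice")
print(p_water_given_ice)
```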

Language model

A language model captures the distributional statistics of words. In its most basic form, we take each unique word in a corpus, i, and count how many times it occurs.
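
In code, this most basic form is a word count turned into probabilities. The corpus below is a made-up sentence used only for illustration.

```python
from collections import Counter

# Hypothetical tiny corpus.
corpus = "the dog saw the cat and the cat ran".split()

# Count how many times each unique word occurs.
unigram_counts = Counter(corpus)

# Divide by the corpus size to get a probability for each word.
total_words = sum(unigram_counts.values())
unigram_probs = {word: count / total_words for word, count in unigram_counts.items()}

print(unigram_counts)
print(unigram_probs["the"])
```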

Bigram Model

A matrix built from a corpus that gives the probability of a word occurring given the previous word
It is used for generating text
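
A small sketch of a bigram model stored as nested counts instead of a dense matrix. The corpus, the function names, and the greedy generation strategy (always pick the most frequent next word) are assumptions chosen to keep the example short; real generators usually sample from the probabilities.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    # counts[prev][nxt] = how often `nxt` directly follows `prev`.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def bigram_probability(counts, prev, nxt):
    # P(nxt | prev)
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

def generate(counts, start, length):
    # Greedy generation: repeatedly append the most frequent follower.
    words = [start]
    while len(words) < length and counts[words[-1]]:
        words.append(counts[words[-1]].most_common(1)[0][0])
    return words

tokens = "i like nlp i like python i love nlp".split()
model = train_bigram_model(tokens)
print(bigram_probability(model, "i", "like"))
print(generate(model, "i", 2))
```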

Regex to eliminate punctuation

- "\W" matches any character that is not a letter, digit, or underscore

Natural Language Processing Flashcards

(25 cards)