Natural Language Processing Flashcards
1
Q
Pipeline of natural language processing
A
1) Text processing
2) Feature extraction
3) Modeling
2
Q
Text processing
A
- When reading HTML, remove the tags
- Put all letters in lowercase
- Sometimes it is a good idea to remove punctuation
- Sometimes it is a good idea to remove stop words like “are”, “for”, “a”, “the”
3
Q
Feature extraction
A
- Characters are encoded as numbers (for example, in Unicode), but those numbers say nothing about meaning; if a model treats them as numeric quantities, its results can be misleading
- There are many ways to represent text information
- If you want a graph-based model to discover insights, represent words as nodes with relations between them
- If you want to detect spam or classify text sentiment, use bag of words
- For text generation or translation, use word embeddings such as word2vec
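A minimal bag-of-words sketch, using only the standard library. The two example documents and the helper name bag_of_words are made up for illustration; real projects usually rely on a library vectorizer instead.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercase word in `text` to the number of times it occurs."""
    return Counter(text.lower().split())

# Hypothetical example documents
docs = ["the cat sat on the mat", "the dog sat"]
bags = [bag_of_words(d) for d in docs]

# Shared vocabulary, sorted so every vector uses the same column order
vocab = sorted(set().union(*bags))

# One count vector per document; words missing from a document count as 0
vectors = [[bag[word] for word in vocab] for bag in bags]
```

Each document becomes a vector of word counts over the shared vocabulary. Word order is discarded, which is why this representation suits tasks like spam detection better than text generation.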
4
Q
Modeling
A
- Create a statistical or machine learning model
5
Q
How to read a file in Python
A
with open("hola.txt", "r") as f:
    text = f.read()
6
Q
How to read tabular data or CSV
A
- You can use pandas
- import pandas as pd
- df = pd.read_csv("hola.csv")
7
Q
How to get a web page or a file from the web?
A
import requests
# Fetch a web page
r = requests.get("https://www.udacity.com/courses/all")
8
Q
How to clean the text from a website?
A
- Use a library
from bs4 import BeautifulSoup
# Remove HTML tags using Beautiful Soup library
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())
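If Beautiful Soup is not available, the same idea can be sketched with the standard library's html.parser. This is a swapped-in alternative, not the Beautiful Soup approach above; the class name TextExtractor, the helper strip_tags, and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found outside of a tag
        self.chunks.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

# Hypothetical page content
page = "<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"
plain = strip_tags(page)
```

This sketch keeps every text node, including the contents of script and style elements, so it is rougher than a full HTML library.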
9
Q
Tips for text cleaning
A
- Convert all letters to lowercase
- In document classification or clustering, eliminate punctuation
10
Q
How to eliminate punctuation
A
- Use the regular expressions library “re”
- import re
- Replace punctuation with a space
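The steps above can be sketched as follows. The pattern is one possible choice, assuming plain English text: it treats anything that is not an ASCII letter, a digit, or whitespace as punctuation, so accented letters would be replaced too. The helper name remove_punctuation is made up for illustration.

```python
import re

def remove_punctuation(text):
    """Replace every character that is not an ASCII letter, digit, or whitespace with a space."""
    return re.sub(r"[^a-zA-Z0-9\s]", " ", text)

cleaned = remove_punctuation("Hello, world! It's 2024.")

# Splitting afterwards collapses the extra spaces left by the replacement
words = cleaned.split()
```

Replacing with a space rather than an empty string keeps words such as “end.Start” from being glued together.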
11
Q
Useful libraries
A
- NLTK
- BeautifulSoup
- re
12
Q
What is a token
A
- A token is a unit of text, usually a word, that represents a single concept, like “dog”
13
Q
Tokenization with NLTK
A
- Tokenize words with word_tokenize. It is smarter than str.split
- Tokenize sentences with sent_tokenize
- To tokenize tweets, use TweetTokenizer
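A rough illustration of why plain splitting falls short, using only the standard library. The regex tokenizer here is a simplified stand-in, not NLTK's algorithm; word_tokenize handles more cases, such as contractions.

```python
import re

text = "Hello, world! How are you?"

# Naive approach: punctuation stays glued to the neighbouring word
split_tokens = text.split()

# Rough regex tokenizer: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
```

With split, “Hello,” and “Hello” would count as different tokens; separating punctuation avoids that.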
14
Q
Stop word removal
A
- Eliminate words like “are”, “the” that don’t add extra information
- nltk has a list of stop words
- [word for word in querywords if word.lower() not in stopwords]
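A self-contained sketch of the same list comprehension. The tiny STOP_WORDS set is hand-picked for illustration and stands in for NLTK's list, which is available from nltk.corpus.stopwords after downloading the corpus.

```python
# Tiny hand-picked stop word set, for illustration only
STOP_WORDS = {"a", "are", "for", "the", "is", "of"}

def remove_stop_words(words):
    """Keep only the words whose lowercase form is not a stop word."""
    return [word for word in words if word.lower() not in STOP_WORDS]

query_words = "The dogs are waiting for a walk".split()
filtered = remove_stop_words(query_words)
```

Lowercasing before the lookup means “The” and “the” are both removed, while the surviving words keep their original case.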
15
Q
Part-of-speech tagging
A
- In some applications it is helpful to classify words as verbs, nouns, etc.
- Use NLTK pos_tag