NLTK Flashcards

Question 1

Q

What is spaCy?

Answer

A

spaCy is an advanced Natural Language Processing (NLP) library designed for efficiency and production use.

🔹 Why spaCy?
Typically faster than NLTK when running a full processing pipeline
Can wrap deep learning models from PyTorch and TensorFlow (through its Thinc library)
Handles large text data efficiently

🔹 Features:
Tokenization
Named Entity Recognition (NER)
Part-of-Speech (POS) tagging
Dependency Parsing

Question 2

Q

How to import nltk?

Answer

A

import nltk

Question 3

Q

How to import stopwords?

Answer

A

from nltk.corpus import stopwords

# Remove stopwords (the corpus must be downloaded once: nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]  # words is a list of tokens
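NLTK's English stopword list is all lowercase, so the filter above keeps capitalized stopwords such as "The" unless the tokens were lowercased first. Below is a minimal sketch of a case-insensitive variant. The small STOP_WORDS set and the remove_stopwords helper are hypothetical stand-ins (not part of NLTK) so the snippet runs without downloading the corpus; in practice you would pass set(stopwords.words('english')).

```python
# Hypothetical stand-in for set(stopwords.words('english')), which is all lowercase.
STOP_WORDS = {"the", "is", "a", "on", "and"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stopword; keep the original casing."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "cat", "is", "on", "a", "mat"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['cat', 'mat']
```

Comparing on tok.lower() leaves the surviving tokens untouched, which helps if a later step (for example named entity recognition) still needs the original capitalization.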

Question 4

Q

How to import the tokenizer?

Answer

A

from nltk.tokenize import word_tokenize
# Tokenization (text is the input string)
words = word_tokenize(text)  # returns a list of word and punctuation tokens
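word_tokenize relies on the Punkt tokenizer data, which has to be downloaded once with nltk.download (the resource is named 'punkt' or 'punkt_tab', depending on the NLTK version). To show what a token list looks like without that download, the sketch below swaps in wordpunct_tokenize, a regex-based NLTK tokenizer. It is not the same algorithm as word_tokenize and splits contractions differently.

```python
from nltk.tokenize import wordpunct_tokenize

text = "Hello, world! It's fine."

# Splits the string into runs of word characters and runs of punctuation.
words = wordpunct_tokenize(text)
print(words)  # ['Hello', ',', 'world', '!', 'It', "'", 's', 'fine', '.']
```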

Question 5

Q

How to import lemmatizer?

Answer

A

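from nltk.stem import WordNetLemmatizer

# Lemmatization (requires the WordNet data: nltk.download('wordnet'))
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]  # treats every word as a noun unless pos is given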

Question 6

Q

How to lowercase text?

Answer

A

text = text.lower() # Lowercasing

Question 7

Q

How to remove numbers from text?

Answer

A

import re

text = re.sub(r'\d+', '', text)  # Remove numbers

Question 8

Q

How to remove punctuation from text?

Answer

A

import string

text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation

Question 9

Q

How to remove extra spaces from text?

Answer

A

text = ' '.join(text.split())  # Remove extra spaces
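The four cleaning steps from Questions 6 to 9 are usually applied together. The sketch below chains them in one function. The name clean_text and the sample message are illustrative assumptions, not part of any library.

```python
import re
import string

def clean_text(text):
    """Lowercase, then strip digits, ASCII punctuation, and extra whitespace."""
    text = text.lower()                                               # Lowercasing
    text = re.sub(r"\d+", "", text)                                   # Remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    text = " ".join(text.split())                                     # Remove extra spaces
    return text

cleaned = clean_text("  WIN $1000 NOW!!!   Call 555-0199 today.  ")
print(cleaned)  # win now call today
```

Whitespace is collapsed last because removing digits and punctuation leaves gaps behind. Note that string.punctuation covers ASCII punctuation only, so characters such as curly quotes pass through unchanged.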

Question 10

Q

How to import TF-IDF Vectorizer?

Answer

A

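from sklearn.feature_extraction.text import TfidfVectorizer

# df is a pandas DataFrame with 'cleaned_message' and 'label_spam' columns
vectorizer = TfidfVectorizer(max_features=5000)  # keep at most the 5000 highest-frequency terms
X = vectorizer.fit_transform(df['cleaned_message']).toarray()  # dense TF-IDF feature matrix
y = df['label_spam']  # target labels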