NLTK Flashcards
What is spaCy?
spaCy is an advanced Natural Language Processing (NLP) library designed for efficiency and production use.
🔹 Why spaCy?
Generally faster than NLTK when processing large volumes of text
Integrates with deep learning frameworks such as TensorFlow & PyTorch
Handles large text data efficiently
🔹 Features:
Tokenization
Named Entity Recognition (NER)
Part-of-Speech (POS) tagging
Dependency Parsing
How to import NLTK?
import nltk
How to import stopwords?
from nltk.corpus import stopwords
Remove Stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
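One thing to watch with the filter above: NLTK's English stop-word list is all lowercase, so capitalized tokens such as "The" slip through unless the tokens are lowercased first. A minimal sketch of that effect follows. The set `demo_stop_words` is a tiny hand-written stand-in (an assumption, not the real NLTK list) so the snippet runs without downloading the corpus.

```python
# Stand-in for set(stopwords.words('english')).
# Assumption: a tiny hand-written subset, used only for illustration.
demo_stop_words = {"the", "is", "a", "of", "and"}

words = ["The", "cat", "is", "a", "friend", "of", "the", "dog"]

# Case-sensitive filter: "The" survives because only "the" is in the set.
filtered_as_is = [w for w in words if w not in demo_stop_words]

# Lowercase each token before the membership test.
filtered_lower = [w for w in words if w.lower() not in demo_stop_words]

print(filtered_as_is)
print(filtered_lower)
```

Lowercasing the whole text before tokenizing gives the same result as the second filter.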
How to import the tokenizer?
from nltk.tokenize import word_tokenize
# Tokenization
words = word_tokenize(text) # text contains text data
How to import lemmatizer?
from nltk.stem import WordNetLemmatizer
Lemmatization
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
How to lowercase text?
text = text.lower() # Lowercasing
How to remove numbers from text?
import re
text = re.sub(r'\d+', '', text) # Remove numbers
How to remove punctuation from text?
import string
text = text.translate(str.maketrans('', '', string.punctuation)) # Remove punctuation
How to remove extra spaces from text?
text = ' '.join(text.split()) # Remove extra spaces
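The four cleaning cards above (lowercasing, numbers, punctuation, extra spaces) can be chained into one helper. This is a minimal sketch using only the standard library. The function name `clean_text` and the sample message are illustrative assumptions, not part of the original deck.

```python
import re
import string


def clean_text(text: str) -> str:
    """Lowercase, then strip digits, punctuation, and extra whitespace."""
    text = text.lower()                      # Lowercasing
    text = re.sub(r"\d+", "", text)          # Remove numbers
    text = text.translate(                   # Remove punctuation
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(text.split())            # Remove extra spaces


# Hypothetical sample message for illustration.
print(clean_text("  Call NOW!!  Win $1000 in 24 hours...  "))
```

Collapsing whitespace last matters here: removing digits and punctuation can leave double spaces behind, and the final `split`/`join` cleans those up.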
How to import the TF-IDF vectorizer?
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000) # initialize the vectorizer, keeping at most 5000 terms
X = vectorizer.fit_transform(df['cleaned_message']).toarray()
y = df['label_spam']
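To see what kind of numbers end up in `X`, here is a simplified pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is not the scikit-learn implementation. The helper name `tfidf_rows`, the whitespace tokenization, and the three toy messages are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) of smoothed-IDF, L2-normalized TF-IDF weights."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalize the row; fall back to 1.0 for an all-zero row.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Hypothetical toy corpus.
docs = ["free prize now", "meeting at noon", "free meeting"]
vocab, X_demo = tfidf_rows(docs)
print(vocab)
print([round(v, 3) for v in X_demo[0]])
```

Each row is one document and each column one vocabulary term. Terms that appear in fewer documents (here "prize") get a larger weight than terms shared across documents (here "free").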