NLP Flashcards
How to create a Doc object in spaCy?
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is some text')
What is a span in spaCy?
A Span is a slice of a Doc object: doc[start:end]
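For example (a minimal sketch; the sentence and indices are illustrative):
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Life is what happens to us while we are making other plans')
span = doc[4:8]   # Span covering tokens 4-7: 'to us while we'
print(span.text)  # a Span supports .text, iteration, etc., much like a Doc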
What are noun_chunks in spaCy?
Base noun phrases, i.e. flat phrases with a noun as their head, exposed by doc.noun_chunks
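A quick illustration (sketch; the sample sentence is made up and the output depends on the model):
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
for chunk in doc.noun_chunks:
    print(chunk.text, '->', chunk.root.text)  # each base noun phrase and its head noun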
How to visualize in spaCy?
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})  # style='dep' or 'ent'
displacy.serve(doc, style='dep')  # view at http://127.0.0.1:5000 (default port)
How to get a list of stopwords in spaCy?
import spacy
nlp = spacy.load('en_core_web_sm')
print(nlp.Defaults.stop_words)
How to check if a word is a stop word in spaCy?
nlp.vocab['word'].is_stop
How to add a stop word in spaCy?
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True
How to remove a stop word in spaCy?
nlp.Defaults.stop_words.remove('btw')
nlp.vocab['btw'].is_stop = False
How to build a library of token patterns in spaCy?
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern1 = [{'LOWER': 'solarpower'}]  # matches 'solarpower' in any casing
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]  # 'solar power', 'solar-power', ...
matcher.add('SolarPower', None, pattern1, pattern2)  # spaCy 2.x signature
found_matches = matcher(doc)
print(found_matches)  # list of (match_id, start, end) tuples
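Each match is a (match_id, start, end) tuple; to make it readable, decode the hash and slice the Doc (sketch, assuming the matcher set up above):
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # decode the hash back to 'SolarPower'
    span = doc[start:end]                    # the matched tokens
    print(match_id, string_id, start, end, span.text)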
How to use a matcher for terminology lists in spaCy?
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']
phrase_patterns = [nlp(text) for text in phrase_list]
matcher.add('VoodooEconomics', None, *phrase_patterns)  # spaCy 2.x signature
matches = matcher(doc)
How to count POS frequency in a text in spaCy?
POS_counts = doc.count_by(spacy.attrs.POS)
for k, v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')
How to add a named entity in spaCy?
from spacy.tokens import Span
ORG = doc.vocab.strings[u'ORG']  # hash value for the entity label 'ORG'
new_ent = Span(doc, start, end, label=ORG)  # start/end are token indices into doc
doc.ents = list(doc.ents) + [new_ent]
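A concrete run-through (minimal sketch; the sentence and indices are illustrative, and the new Span must not overlap an entity the model already found, or the assignment raises an error):
from spacy.tokens import Span
doc = nlp(u'Tesla to build a U.K. factory for $6 million')
ORG = doc.vocab.strings[u'ORG']
new_ent = Span(doc, 0, 1, label=ORG)  # tag token 0, 'Tesla', as ORG
doc.ents = list(doc.ents) + [new_ent]
print([(ent.text, ent.label_) for ent in doc.ents])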
How to add named entities to all matching spans in spaCy?
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]
matcher.add('newproduct', None, *phrase_patterns)  # spaCy 2.x signature
matches = matcher(doc)
from spacy.tokens import Span
PROD = doc.vocab.strings[u'PRODUCT']
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]  # match = (match_id, start, end)
doc.ents = list(doc.ents) + new_ents
How to add a new rule to the pipeline in spaCy?
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before='parser')
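A quick check of the new rule (sketch; the sample text is made up, and the rule must be added before the text is processed):
doc = nlp(u'This is a sentence; this is another.')
for sent in doc.sents:
    print(sent.text)
# the text after ';' now starts a new sentence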
How to change segmentation rules in spaCy?
from spacy.pipeline import SentenceSegmenter

def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'):  # handles multiple occurrences
            seen_newline = True
    yield doc[start:]  # handles the last group of tokens

sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)
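And a quick check (sketch, assuming spaCy 2.x, where the SentenceSegmenter replaces the default strategy so 'sentences' become newline-delimited token groups):
doc = nlp(u'First group.\nSecond group.\n\nThird group.')
for sent in doc.sents:
    print([token.text for token in sent])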