Lecture 2 Flashcards
Tokenization
Process of breaking down the text into tokens (smaller chunks: words, including punctuation)
Each language has its own stop words
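A minimal sketch of tokenization using Python's built-in re module. This is an assumption for illustration; the lecture may use a library tokenizer instead. The helper name tokenize is hypothetical.

```python
import re

def tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Hello, world! It works.")
print(tokens)  # ['Hello', ',', 'world', '!', 'It', 'works', '.']
```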
true
Stemming
Chopping off letters from the end of words until the stem is reached
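A toy sketch of suffix chopping, only to illustrate the idea. It is not the Porter or Lancaster algorithm, and the function name naive_stem and its suffix list are hypothetical.

```python
def naive_stem(word, suffixes=("ing", "ed", "ly", "es", "s")):
    # Strip the first matching suffix, keeping at least three letters of stem
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["jumping", "jumped", "jumps", "cat"]]
print(stems)  # ['jump', 'jump', 'jump', 'cat']
```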
Lancaster stemming
Concerned with chopping off as much of the word as possible (sacrifices accuracy for speed)
Lemmatization
Takes into consideration the root of the word
A lemmatizer allows you to consider the part of speech of a word, unlike a stemmer (lemmatizing a noun is different from lemmatizing a verb)
True
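A toy sketch of why part of speech matters for lemmatization. The lookup table LEMMAS and the helper toy_lemmatize are hypothetical; a real lemmatizer relies on a full dictionary rather than a hand-written table.

```python
# Toy lemma table keyed by (word, part of speech); entries are illustrative only
LEMMAS = {
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the table has no entry
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("saw", "verb"))  # see
print(toy_lemmatize("saw", "noun"))  # saw
```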
A text is a list of sentences, each of which is a sequence of words
true
A pattern is something that repeats itself: find and analyze the repetitions
true
If you want to find what a text is about, first you need to find the words that are repeated and that are not stop words
true
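A minimal sketch of counting repeated words while skipping stop words. The tiny STOP_WORDS set and the helper top_words are hypothetical; real stop-word lists are much longer and language-specific.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; not a complete one
STOP_WORDS = {"the", "a", "is", "on", "and", "of", "to"}

def top_words(text, n=1):
    # Lowercase, keep alphabetic runs, drop stop words, then count
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

result = top_words("The cat sat on the mat and the cat slept")
print(result)  # [('cat', 2)]
```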
Patterns are about the words around which the conversation will be centered
true
Regular expressions
A language for finding patterns
Disjunction uses square brackets; the characters inside are treated as alternatives (or)
true
[A-Z]
Any uppercase letter of the alphabet
[^abc] means any character that is not lowercase a, not lowercase b, and not lowercase c
true
[a^b]
a or ^ or b (the caret is a literal character when it is not first inside the brackets)
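A short sketch of the bracket patterns on these cards, using Python's re module. The sample strings are made up for illustration.

```python
import re

upper = re.findall(r"[A-Z]", "Hello World")    # uppercase letters only
not_abc = re.findall(r"[^abc]", "aXbYc")       # anything except a, b, c
caret_inside = re.findall(r"[a^b]", "a^b c")   # a, a literal ^, or b

print(upper)         # ['H', 'W']
print(not_abc)       # ['X', 'Y']
print(caret_inside)  # ['a', '^', 'b']
```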
^the
Searching for the pattern the only at the start of the string
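A small sketch of the start anchor with Python's re module; the sample strings are made up.

```python
import re

at_start = re.search(r"^the", "the cat")          # "the" is at the start
not_at_start = re.search(r"^the", "see the cat")  # "the" appears later, so no match

print(at_start is not None)  # True
print(not_at_start is None)  # True
```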
abc?
means abc or ab
[abc]?
means a or b or c, or nothing (zero or one character from the set)
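A sketch of the optional marker ? using re.fullmatch, which succeeds only when the whole string fits the pattern. The test strings are made up.

```python
import re

# abc? : the final c is optional, so "abc" and "ab" fit but "a" does not
whole = [bool(re.fullmatch(r"abc?", s)) for s in ["abc", "ab", "a"]]

# [abc]? : at most one character from the set, so the empty string also fits
single = [bool(re.fullmatch(r"[abc]?", s)) for s in ["a", "c", "", "ab"]]

print(whole)   # [True, True, False]
print(single)  # [True, True, True, False]
```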
* 0 or more occurrences
+ 1 or more occurrences
. wildcard (any single character)
True
+ is greedy
* is greedy
True
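A sketch of the quantifiers and their greedy behavior in Python's re module. The non-greedy form `.*?` is shown only for contrast and is not part of the cards above.

```python
import re

# * allows zero occurrences, + requires at least one
star_zero = bool(re.fullmatch(r"ab*", "a"))
plus_zero = bool(re.fullmatch(r"ab+", "a"))

text = "<b>bold</b>"
greedy_star = re.search(r"<.*>", text).group()  # takes as much as possible
greedy_plus = re.search(r"<.+>", text).group()
lazy_star = re.search(r"<.*?>", text).group()   # ? after * makes it non-greedy

print(star_zero, plus_zero)  # True False
print(greedy_star)           # <b>bold</b>
print(greedy_plus)           # <b>bold</b>
print(lazy_star)             # <b>
```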
$
Matches at the end of the string
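A small sketch of the end anchor with Python's re module; the sample strings are made up.

```python
import re

at_end = re.search(r"cat$", "the cat")          # "cat" is the last thing in the string
not_at_end = re.search(r"cat$", "the cat sat")  # "cat" is followed by more text

print(at_end is not None)  # True
print(not_at_end is None)  # True
```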
search
Scans the string until it finds the first match
match and search must be called with group() to show the matched text
true
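A sketch of re.search and re.match with group(). Both return a match object, or None when nothing matches, so it is safer to check before calling group(). The sample strings are made up.

```python
import re

m = re.search(r"\d+", "room 42 and room 7")  # scans forward, stops at first match
found = m.group() if m else None

# re.match only tries at the very start of the string
no_match = re.match(r"\d+", "room 42")
start_match = re.match(r"\d+", "42 rooms")

print(found)                # 42
print(no_match)             # None
print(start_match.group())  # 42
```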
split splits a string based on a character
true
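A sketch of splitting on a character, first with the string method and then with re.split, which accepts a pattern. The sample strings are made up.

```python
import re

parts = "a,b,c".split(",")              # split on a single character
re_parts = re.split(r"[,;]", "a,b;c")   # split on either , or ;

print(parts)     # ['a', 'b', 'c']
print(re_parts)  # ['a', 'b', 'c']
```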