w1 L2 the nlp pipeline and tokenization flashcards
*note lecture 1 is just an intro and nothing too important is covered
recall the steps in the nlp pipeline
analyze the task + define framework
preprocess data + get insights
define relevant information + extract from data
select appropriate algorithm + implement
apply it in practice + test and evaluate
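the steps above could be sketched as stub functions (function names and bodies here are illustrative placeholders, not from the course):

```python
# Toy sketch of the NLP pipeline stages; each stand-in is far simpler
# than a real implementation.

def preprocess(raw_docs):
    # lowercase and strip whitespace as a stand-in for real cleaning
    return [d.lower().strip() for d in raw_docs]

def extract_features(docs):
    # bag-of-words counts as a stand-in for feature extraction
    feats = []
    for d in docs:
        counts = {}
        for tok in d.split():
            counts[tok] = counts.get(tok, 0) + 1
        feats.append(counts)
    return feats

def evaluate(predictions, gold):
    # simple accuracy as a stand-in for testing/evaluation
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```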
what is tokenization
the task of separating raw text into words; note that what counts as a word is language dependent
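a naive regex tokenizer (a toy sketch; real tokenizers handle clitics, hyphens, etc. differently per language):

```python
import re

# Split text into runs of word characters or single punctuation marks.
# Note how the apostrophe splits "it's" -- word boundaries are not trivial.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

tokenize("it's fine.")  # → ['it', "'", 's', 'fine', '.']
```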
what are regular expressions
a way of matching patterns in text,
i.e. split, replace, find… based on a pattern
these patterns can be complex
what are some regular expressions
Special characters: . ^ $ * ? {m} \ [ ] | (?:)
Sequences: \b \B \d \D \s \w \W
Flags: re.I, re.IGNORECASE
Functions: re.compile, re.search, re.match, re.split, re.sub
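quick demo of the listed functions and flags (strings here are made-up examples):

```python
import re

pattern = re.compile(r"\d+")                        # re.compile: build a pattern object
m = pattern.search("order 42 shipped")              # re.search: first match anywhere
parts = re.split(r"\s*,\s*", "a, b ,c")             # re.split: split on a pattern
clean = re.sub(r"\s+", " ", "too   many   spaces")  # re.sub: replace matches
anchored = re.match(r"hello", "HELLO world", re.I)  # re.match: anchored at start; re.I ignores case
```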
what are the differences between tokenization in LLMs and PLMs
LLM tokenization is purely statistical
larger vocab size
pretrained models look at text in a pre-training paradigm and try to understand language cues
LLM tokenization is more generalizable, scalable, and robust, so LLMs can handle more different applications
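LLM tokenizers are typically subword tokenizers learned statistically from corpus frequencies (e.g. byte-pair encoding); the card doesn't name the algorithm, but a toy sketch of one BPE-style merge step looks like:

```python
from collections import Counter

# Find the most frequent adjacent symbol pair across a list of words --
# the purely statistical signal BPE-style tokenizers merge on.
def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        syms = list(w)
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

# Apply one merge: replace every occurrence of the pair with a single symbol.
def merge_pair(words, pair):
    merged = pair[0] + pair[1]
    out = []
    for w in words:
        syms = list(w)
        i, new = 0, []
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
                new.append(merged)
                i += 2
            else:
                new.append(syms[i])
                i += 1
        out.append(new)
    return out
```

repeating merge steps grows the vocabulary of subword units, which is why LLM vocab sizes end up larger than word-level PLM vocabularies.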
what is zipf's law and what does it imply about stop words
a statistical principle that describes the distribution of elements in a dataset
the frequency of any word is inversely proportional to its rank in the frequency table
the zipf curve shows that the most frequent words, which make up the vast majority of tokens, are stop words; these words are uninformative, so they can be dropped
*might be worth it to remember the formula
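the formula referred to is zipf's law: f(r) ∝ 1/r, i.e. a word's frequency times its rank is roughly constant. a quick rank-frequency table over a toy corpus:

```python
from collections import Counter

# Build (rank, word, frequency) triples; under Zipf's law, rank * frequency
# is roughly constant for the top-ranked words (only approximate on a toy corpus).
def rank_frequency(tokens):
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

rank_frequency("the cat the dog the".split())
# → [(1, 'the', 3), (2, 'cat', 1), (3, 'dog', 1)]
```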