w1 L2 the nlp pipeline and tokenization flashcards
*note lecture 1 is just an intro and nothing too important is covered
recall the steps in the nlp pipeline
analyze the task + define framework
preprocess data + get insights
define relevant information + extract from data
select appropriate algorithm + implement
apply it in practice + test and evaluate
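the steps above could be sketched as stub functions (function names and bodies here are illustrative placeholders, not from the course):

```python
# Toy sketch of the NLP pipeline stages; each stand-in is far simpler
# than a real implementation.

def preprocess(raw_docs):
    # lowercase and strip whitespace as a stand-in for real cleaning
    return [d.lower().strip() for d in raw_docs]

def extract_features(docs):
    # bag-of-words counts as a stand-in for feature extraction
    feats = []
    for d in docs:
        counts = {}
        for tok in d.split():
            counts[tok] = counts.get(tok, 0) + 1
        feats.append(counts)
    return feats

def evaluate(predictions, gold):
    # simple accuracy as a stand-in for testing/evaluation
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```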
what is tokenization
the task of separating raw text into words; note that what counts as a word is language dependent
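a naive regex tokenizer (a toy sketch; real tokenizers handle clitics, hyphens, etc. differently per language):

```python
import re

# Split text into runs of word characters or single punctuation marks.
# Note how the apostrophe splits "it's" -- word boundaries are not trivial.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

tokenize("it's fine.")  # → ['it', "'", 's', 'fine', '.']
```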
what are regular expressions
a way of matching patterns in text,
i.e. split, replace, find… based on a pattern
these patterns can be complex
what are some regular expressions
Special characters: . ^ $ * ? {m} \ [ ] | (?:)
Sequences: \b \B \d \D \s \w \W
Flags: re.I, re.IGNORECASE
Functions: re.compile, re.search, re.match, re.split, re.sub
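quick demo of the listed functions and flags (strings here are made-up examples):

```python
import re

pattern = re.compile(r"\d+")                        # re.compile: build a pattern object
m = pattern.search("order 42 shipped")              # re.search: first match anywhere
parts = re.split(r"\s*,\s*", "a, b ,c")             # re.split: split on a pattern
clean = re.sub(r"\s+", " ", "too   many   spaces")  # re.sub: replace matches
anchored = re.match(r"hello", "HELLO world", re.I)  # re.match: anchored at start; re.I ignores case
```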
what are the differences between tokenization in LLMs and PLMs
LLM tokenization is purely statistical
larger vocab size
pretrained models look at text in a pre-training paradigm and try to understand language cues
LLM tokenization is more generalizable, scalable, and robust, so LLMs can handle more different applications
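LLM tokenizers are typically subword tokenizers learned statistically from corpus frequencies (e.g. byte-pair encoding); the card doesn't name the algorithm, but a toy sketch of one BPE-style merge step looks like:

```python
from collections import Counter

# Find the most frequent adjacent symbol pair across a list of words --
# the purely statistical signal BPE-style tokenizers merge on.
def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        syms = list(w)
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

# Apply one merge: replace every occurrence of the pair with a single symbol.
def merge_pair(words, pair):
    merged = pair[0] + pair[1]
    out = []
    for w in words:
        syms = list(w)
        i, new = 0, []
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
                new.append(merged)
                i += 2
            else:
                new.append(syms[i])
                i += 1
        out.append(new)
    return out
```

repeating merge steps grows the vocabulary of subword units, which is why LLM vocab sizes end up larger than word-level PLM vocabularies.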
what is zipf's law and what does it imply about stop words
a statistical principle that describes the distribution of elements in a dataset
the frequency of any word is inversely proportional to its rank in the frequency table
the zipf curve shows that the most frequent words, which make up the vast majority of tokens, are stop words; these words are uninformative, so they can be dropped
*might be worth it to remember the formula
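the formula referred to is zipf's law: f(r) ∝ 1/r, i.e. a word's frequency times its rank is roughly constant. a quick rank-frequency table over a toy corpus:

```python
from collections import Counter

# Build (rank, word, frequency) triples; under Zipf's law, rank * frequency
# is roughly constant for the top-ranked words (only approximate on a toy corpus).
def rank_frequency(tokens):
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

rank_frequency("the cat the dog the".split())
# → [(1, 'the', 3), (2, 'cat', 1), (3, 'dog', 1)]
```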