W1 L2: The NLP Pipeline and Tokenization Flashcards

*note: lecture 1 is just an intro, so nothing too important is covered there

1
Q

recall the steps in the NLP pipeline

A

analyze the task + define framework

preprocess data + get insights

define relevant information + extract from data

select appropriate algorithm + implement

apply it in practice + test and evaluate

2
Q

what is tokenization

A

the task of separating out words in raw text; note that what counts as a word is language-dependent
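A minimal sketch of the idea in Python, assuming English-like text where a word is a run of word characters. The function name `simple_tokenize` and its pattern are illustrative assumptions, not something from the lecture. It compares plain whitespace splitting, which leaves punctuation glued to words, with a simple regex tokenizer.

```python
import re

def simple_tokenize(text):
    """Split raw text into word and punctuation tokens (English-like text only)."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Hello, world! It's 9am."

# Naive approach: split on whitespace only
whitespace_tokens = sentence.split()

# Regex approach: punctuation becomes its own token
regex_tokens = simple_tokenize(sentence)
```

Note that this sketch splits "It's" into three tokens; whether that is the right call depends on the task and the language, which is the point of the flashcard.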

3
Q

what are regular expressions

A

a way of matching patterns in text,

i.e. split, replace, find… based on a pattern
these patterns can be complex
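A minimal sketch of the three operations named above, using Python's standard `re` module. The sample string and patterns are made up for illustration.

```python
import re

text = "cats, dogs;  birds"

# split: break the string wherever the pattern matches
# (a comma or semicolon followed by any amount of whitespace)
parts = re.split(r"[,;]\s*", text)

# replace: substitute every run of whitespace with a single space
collapsed = re.sub(r"\s+", " ", text)

# find: collect all non-overlapping runs of lowercase letters
words = re.findall(r"[a-z]+", text)
```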

4
Q

what are some regular expressions

A

Special characters: . ^ $ * ? {m} \ [] | ?:
Sequences: \b \B \d \D \s \w \W
Flags: re.I, re.IGNORECASE
Functions: re.compile, re.search, re.match, re.split, re.sub
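A short sketch that puts a few of these pieces together in Python. The sample text and variable names are illustrative, and reading `?:` as the non-capturing group `(?:...)` is an assumption about what the card means.

```python
import re

# \b marks a word boundary, \d a digit, {4} means exactly four repetitions
year_pattern = re.compile(r"\b\d{4}\b")

# re.IGNORECASE (short form: re.I) makes the match case-insensitive
nlp_pattern = re.compile(r"nlp", re.IGNORECASE)

text = "NLP course, 2024 edition"

# search scans the whole string and returns the first match
year = year_pattern.search(text).group()

# match only succeeds if the pattern matches at the start of the string
starts_with_nlp = nlp_pattern.match(text) is not None
starts_with_year = year_pattern.match(text) is not None

# (?:...) groups alternatives without capturing them; | means "or"
course_or_edition = re.findall(r"(?:course|edition)", text)
```

The difference between `re.search` and `re.match` is the one most likely to trip people up: `starts_with_year` is false here even though the year appears later in the string.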

5
Q

what are the differences between tokenization in LLMs and PLMs

A

LLM tokenization is purely statistical
larger vocab size

pretrained language models (PLMs) look at text within the pre-training paradigm and try to pick up on language cues

LLM tokenization is more generalizable, scalable, and robust, so LLMs can handle a wider range of applications
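To make "purely statistical" concrete, here is a toy sketch of one merge step in the style of byte-pair encoding, a widely used frequency-based subword method. The card does not name a specific algorithm, so the choice of BPE, the helper names, and the word counts below are all assumptions for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word as a tuple of characters, with a made-up count
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
          tuple("bun"): 4, tuple("hugs"): 5}

best = most_frequent_pair(corpus)            # chosen by frequency alone
corpus_after_merge = merge_pair(corpus, best)
```

No linguistic rule decides the merge: the pair is picked only because it occurs most often in the counts.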

6
Q

what is Zipf's law and what does it imply about stop words

A

a statistical principle that describes the distribution of elements in a dataset

the frequency of any word is inversely proportional to its rank in the frequency table

the Zipf curve shows that the most frequent words are stop words, and that these few words make up a large share of all word occurrences in a text; they are uninformative, so they can be dropped

*might be worth it to remember the formula
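The law is commonly written as f(r) ≈ C / r^s, where f(r) is the frequency of the word at rank r, C is a corpus-dependent constant, and s is close to 1; the lecture's notation may differ. Below is a minimal sketch of building the rank/frequency table in Python. The toy sentence and variable names are illustrative assumptions, and a text this small will not follow the curve closely.

```python
from collections import Counter

text = "the cat and the dog and the bird saw the cat"
counts = Counter(text.split())

# Rank 1 is the most frequent word; ties keep first-seen order
ranked = counts.most_common()

# With s = 1, Zipf's law predicts rank * freq stays roughly constant
# on a large corpus, so the last column is a quick sanity check
table = [(rank, word, freq, rank * freq)
         for rank, (word, freq) in enumerate(ranked, start=1)]
```

Even in this tiny example the top-ranked word is a stop word ("the"), which is the pattern the card describes.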
