Words Flashcards
What are some problems with natural language?
There is lots of ambiguity from identical word forms
There is also dependency on punctuation or intonation
What types of texts exist?
Formal News
Polemic News (argumentative)
Speech
Historic, Poetic, Musical
Social Media
What is a sentence?
A unit of written language
What is an utterance?
It is a unit of spoken language
What is a word form?
It is the inflected form as it appears in the corpus
What is a lemma?
It is an abstract form shared by word forms that have the same stem, part of speech (POS), and word sense
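As a minimal sketch of the word form versus lemma distinction, the snippet below groups word forms under a shared (lemma, POS) pair. The `LEMMA_TABLE` lookup and the `group_by_lemma` helper are hypothetical and hand-written for illustration; a real lemmatiser would use a dictionary and morphological rules, and would also account for word sense.

```python
from collections import defaultdict

# Hypothetical hand-written lookup: word form -> (lemma, POS).
# A real lemmatiser would derive this from a lexicon, not a tiny table.
LEMMA_TABLE = {
    "run": ("run", "VERB"),
    "runs": ("run", "VERB"),
    "ran": ("run", "VERB"),
    "running": ("run", "VERB"),
    "mouse": ("mouse", "NOUN"),
    "mice": ("mouse", "NOUN"),
}

def group_by_lemma(word_forms):
    """Group word forms as they appear in a corpus under their lemma."""
    groups = defaultdict(list)
    for form in word_forms:
        # Unknown forms fall back to themselves with an UNKNOWN tag.
        key = LEMMA_TABLE.get(form.lower(), (form.lower(), "UNKNOWN"))
        groups[key].append(form)
    return dict(groups)

groups = group_by_lemma(["ran", "mice", "running", "mouse"])
print(groups)
```

Here "ran" and "running" are different word forms that end up under the same lemma "run".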
What are function words?
They indicate the grammatical relationships between terms but carry little topical meaning
What are types?
They are the distinct words in a corpus; the number of types is the vocabulary size
What are tokens?
They are all the word occurrences in a corpus, counting repeats
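A small sketch of the difference between types and tokens, assuming whitespace splitting and case folding are good enough to separate words. The function name `count_types_and_tokens` is made up for this example.

```python
def count_types_and_tokens(text):
    """Return (number of tokens, number of types) for a text."""
    # Every word occurrence is a token.
    tokens = text.lower().split()
    # Each distinct word is a type, however often it occurs.
    types = set(tokens)
    return len(tokens), len(types)

n_tokens, n_types = count_types_and_tokens("the cat sat on the mat")
print(n_tokens, n_types)
```

In this sentence "the" occurs twice, so it contributes two tokens but only one type.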
What are some lexical analysis steps we can take?
Stripping punctuation, folding case, removing function words, lemmatising or stemming the text, and building an index of the words
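The steps above can be sketched as one small pipeline. This is an illustration under loud assumptions: `FUNCTION_WORDS` is a tiny made-up list, and `toy_stem` is a crude suffix stripper standing in for a real stemmer or lemmatiser.

```python
import string

# Illustrative, far from exhaustive list of function words (an assumption).
FUNCTION_WORDS = {"the", "a", "an", "of", "on", "in", "and", "to", "is"}

def toy_stem(word):
    """Crude suffix stripping; NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyse(text):
    # 1. Strip punctuation.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Fold case and split on whitespace.
    words = no_punct.lower().split()
    # 3. Remove function words.
    content = [w for w in words if w not in FUNCTION_WORDS]
    # 4. Stem what is left.
    stems = [toy_stem(w) for w in content]
    # 5. Assign each distinct stem an integer index, in order of first use.
    index = {}
    for stem in stems:
        index.setdefault(stem, len(index))
    return stems, index

stems, index = analyse("The cats sat on the mats, and the dogs barked.")
print(stems)
print(index)
```

The order of the steps matters: punctuation is stripped before the function-word check so that "the," and "the" are treated alike.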
How much of text is function words?
They account for up to 60 percent of text
What does repetition signal?
It signals intention
What do wordclouds provide?
They provide a visual representation of a statistical summary
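As a rough sketch, the statistical summary behind a word cloud can be taken to be a table of word frequencies, with more frequent words drawn larger. The helper `word_frequencies` is a made-up name, and whitespace splitting is an assumption; drawing the cloud itself is left out.

```python
from collections import Counter

def word_frequencies(text):
    """Count how often each case-folded word occurs."""
    return Counter(text.lower().split())

freqs = word_frequencies("to be or not to be")
# The most frequent words would be drawn largest in a word cloud.
top = freqs.most_common(2)
print(top)
```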
What is tokenization?
It is the process of turning a stream of characters into a sequence of words
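A minimal sketch of tokenization using a regular expression. The pattern is an assumption chosen for illustration: it keeps word-internal apostrophes together and emits each punctuation mark as its own token. Real tokenizers handle many more cases, such as hyphens, numbers, and URLs.

```python
import re

# Words (optionally with one internal apostrophe part) or single
# non-space, non-word characters such as punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Turn a stream of characters into a sequence of tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Don't stop, it's fine.")
print(tokens)
```

Whether "Don't" should stay as one token or be split into two is a design choice; this sketch keeps it whole.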
What is a token?
A token is a lexical construct that can be assigned grammatical and semantic roles