Human and sentiment Flashcards
Tokenization
“The process of splitting text into meaningful elements is called tokenization.”
(e.g. splitting strings into lists of tokens)
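A minimal sketch in Python of a regex-based tokenizer (the pattern here is just one illustrative choice, not a standard):

    import re

    def tokenize(text):
        # Match runs of word characters, or single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text.lower())

    print(tokenize("Brexit was happening, it seemed."))
    # ['brexit', 'was', 'happening', ',', 'it', 'seemed', '.']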
Lemmatization
“Converting each token into a representative lemma. For example, ‘go’ is the English lemma for words such as ‘gone’, ‘going’, and ‘went’.”
(The root of the word)
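A small sketch using NLTK's WordNetLemmatizer (assumes the WordNet data has been fetched with nltk.download('wordnet')):

    from nltk.stem import WordNetLemmatizer  # needs nltk.download('wordnet')

    lemmatizer = WordNetLemmatizer()
    # Lemmatize as verbs (pos='v') so the inflected forms map to the base form 'go'.
    for word in ("gone", "going", "went"):
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))
    # gone -> go
    # going -> go
    # went -> go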
Part of Speech (POS): Syntactic categories of a word
Example (a POS-tagged tweet): rt/NN polakpolly/RB yesterday/NN i/FW had/VBD to/TO teach/VB my/PRP$ students/NNS in/IN under/IN hours/NNS what/WP the/DT eu/NN was/VBD and/CC why/WRB brexit/NN was/VBD happening/VBG it/PRP seemed/VBD like/IN an/DT im/NN
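A sketch of POS tagging with NLTK's default tagger (assumes the NLTK tokenizer and tagger data packages have been downloaded; the tags it assigns may differ from the example above):

    import nltk  # needs the punkt tokenizer and averaged perceptron tagger data

    tokens = nltk.word_tokenize("why brexit was happening")
    print(nltk.pos_tag(tokens))
    # e.g. [('why', 'WRB'), ('brexit', 'NN'), ('was', 'VBD'), ('happening', 'VBG')]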
Coreference
When two or more expressions in a text refer to the same entity, e.g. in “Anna said she was late”, both ‘Anna’ and ‘she’ refer to the same person.
tf–idf
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
Notation of tf–idf
https://monkeylearn.com/static/dc103a13ad766591be11bca8774dfc02/e3135/image3.png
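A common formulation, written in LaTeX (the linked image may show a slightly different variant); here N is the number of documents in the corpus D and tf(t, d) is the raw count of term t in document d:

    \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D),
    \qquad
    \mathrm{idf}(t, D) = \log \frac{N}{\lvert \{\, d \in D : t \in d \,\} \rvert}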
Information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections.
The bag-of-words model
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
A vocabulary of known words.
A measure of the presence of known words.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
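A minimal pure-Python sketch of the two components (the vocabulary and the word counts); the example documents are invented for illustration:

    from collections import Counter

    # Toy documents, invented for illustration.
    docs = ["the eu was the topic", "why brexit was happening"]

    # 1. Build the vocabulary of known words.
    vocabulary = sorted({word for doc in docs for word in doc.split()})

    # 2. Count how often each known word occurs in each document (order is discarded).
    vectors = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

    print(vocabulary)  # ['brexit', 'eu', 'happening', 'the', 'topic', 'was', 'why']
    print(vectors)     # [[0, 1, 0, 2, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1]]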
Stylometric analysis
Characterising an author's writing style, e.g. for authorship attribution.
Topic models
Discovering the abstract topics that occur in a collection of documents.
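A short sketch using latent Dirichlet allocation (LDA), a common topic-modelling algorithm, via scikit-learn; the toy documents are invented, and get_feature_names_out assumes scikit-learn 1.0 or newer:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Toy documents, invented for illustration.
    docs = ["the eu and brexit", "teaching students about the eu",
            "students going to class", "teaching a class about words"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)  # bag-of-words counts
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    # Print the three highest-weighted words for each of the two topics.
    words = vectorizer.get_feature_names_out()
    for topic in lda.components_:
        print([words[i] for i in topic.argsort()[-3:]])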