C1 Flashcards
3 types of text mining tasks
- text classification/clustering: assign a category or cluster per document
- sequence labelling: assign a category per word in a text
- text-to-text generation: input is text, output is text
4 challenges of text data
- text data is unstructured
- text data can be multi-lingual
- text data is noisy
- language is infinite
bag of words model
- the text (document) is the object being classified
- each word becomes a feature
- each term in collection becomes a dimension in the vector space
- only a few of all words occur in a given document => high dimensional, sparse vectors
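A minimal sketch of the bag of words model, assuming whitespace tokenization and lowercasing (both simplifications). The function names and the three toy documents are made up for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct term in the collection to one vector dimension."""
    terms = sorted({term for doc in documents for term in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def bag_of_words(document, vocabulary):
    """Return the term-count vector of one document over the vocabulary."""
    counts = Counter(document.lower().split())
    # Dicts keep insertion order, so positions match the vocabulary indices.
    return [counts.get(term, 0) for term in vocabulary]


# Toy collection, invented for this example.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply today",
]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

for doc, vector in zip(documents, vectors):
    nonzero = sum(1 for count in vector if count > 0)
    print(f"{doc!r}: {nonzero} of {len(vector)} dimensions are non-zero")
```

Even in this tiny collection each vector is mostly zeros. With a realistic vocabulary of tens of thousands of terms the sparsity becomes far more pronounced.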
word embeddings
- lower dimensional and dense vector space
- dimensions are learnt from data (not individually interpretable)
- similar words are close to each other in the space
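"Close to each other" is commonly measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors as a stand-in for learnt embeddings; the numbers are hypothetical and only serve to show the computation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, NOT taken from a trained model.
# Real embeddings are learnt from data and have many more dimensions.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat-dog: {sim_cat_dog:.2f}, cat-car: {sim_cat_car:.2f}")
```

With these toy vectors "cat" ends up much closer to "dog" than to "car", which is the behaviour expected of embeddings for similar words.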
evaluation metrics
- precision: proportion of the assigned labels that are correct
- recall: proportion of the relevant labels that were assigned
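The two definitions can be sketched as set operations. The label sets are invented, and returning 0.0 when a set is empty is an assumed convention, not something the definitions prescribe.

```python
def precision_recall(assigned, relevant):
    """Compute precision and recall from assigned and relevant label sets."""
    assigned, relevant = set(assigned), set(relevant)
    correct = assigned & relevant
    # Assumed convention: score 0.0 when the denominator would be zero.
    precision = len(correct) / len(assigned) if assigned else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: 4 labels assigned, 3 labels relevant, 2 in common.
assigned = {"sports", "politics", "finance", "health"}
relevant = {"sports", "politics", "science"}
precision, recall = precision_recall(assigned, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here 2 of the 4 assigned labels are correct (precision 0.50) and 2 of the 3 relevant labels were assigned (recall 0.67).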
precision versus recall when predicting terrorists
- precision: of the people flagged as terrorists, how many really are terrorists (low precision means many innocent people are flagged)
- recall: of the actual terrorists, how many were flagged (low recall means many terrorists are missed)
text mining
automatic extraction of knowledge from text
text mining pipeline for discovering side effects for hypertension medications
- Filter the data (retrieve relevant messages)
- Process the data (clean, anonymize)
- Create training data (human labelling)
- Identify medication names (named entity recognition)
- Identify side effects (named entity recognition)
- External knowledge needed (ontology)
- Relations between medications and side effects (relation extraction)
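The steps above can be sketched as a toy pipeline. Everything here is a simplifying assumption: dictionary lookup stands in for trained named entity recognition (so the human labelling step is skipped), a small mapping stands in for the ontology, and co-occurrence in one message stands in for relation extraction. The lexicons, function names, and messages are invented for illustration.

```python
import re

# Hypothetical toy lexicons, standing in for NER models and an ontology.
MEDICATIONS = {"lisinopril", "amlodipine"}
SIDE_EFFECTS = {  # surface form -> normalized concept (the "ontology")
    "dry cough": "cough",
    "coughing": "cough",
    "swollen ankles": "edema",
}


def is_relevant(message):
    """Filter step: keep messages that mention a known medication."""
    return any(med in message.lower() for med in MEDICATIONS)


def clean(message):
    """Process step: lowercase, mask user handles, collapse whitespace."""
    text = re.sub(r"@\w+", "@USER", message.lower())
    return re.sub(r"\s+", " ", text).strip()


def find_entities(text, lexicon):
    """Entity step: naive substring lookup in a lexicon."""
    return [term for term in lexicon if term in text]


def extract_relations(messages):
    """Relation step: pair entities that co-occur in the same message."""
    relations = set()
    for message in messages:
        if not is_relevant(message):
            continue
        text = clean(message)
        for med in find_entities(text, MEDICATIONS):
            for effect in find_entities(text, SIDE_EFFECTS):
                relations.add((med, SIDE_EFFECTS[effect]))
    return relations


# Invented example messages.
messages = [
    "@anna Started lisinopril last week and now I have a dry cough",
    "Lovely weather today!",
    "On amlodipine I get swollen ankles every evening",
]
print(sorted(extract_relations(messages)))
```

A real pipeline would replace the lexicon lookups with models trained on the human-labelled data, since substring matching misses spelling variants and cannot tell a reported side effect from a negated one.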
Zipf’s law
Given a text collection, the frequency of any word is inversely proportional to its rank in the frequency table
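Put differently, frequency times rank is roughly constant. The sketch below builds a rank-frequency table; the token list is constructed by hand so that the product is exactly constant, whereas real collections only follow the law approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) rows, most frequent word first."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


# Artificial token list, built so that frequency * rank is always 12.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3

table = rank_frequency(tokens)
for rank, word, freq in table:
    print(f"rank {rank}: {word!r} occurs {freq} times, rank * frequency = {rank * freq}")
```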
extrinsic evaluation
evaluation of complete application
- human vs. automatic
- are humans helped/satisfied by the results?
intrinsic evaluation
evaluation of the components: ground truth labels needed
- existing labels in the data
- human-assigned labels in the data