C5 Flashcards
How to get example data?
- use existing labelled data
- create new labelled data
3 types of existing labelled data
- benchmark data
- existing human labels
- labelled user-generated content
benchmark data and pros and cons
used to evaluate and compare methods, often created in the context of shared tasks
advantages:
- high-quality
- re-usable
- compare results to others
disadvantages:
- not available for every specific problem and data type
existing human labels with pros and cons
labels that were added to items by humans but not originally created for training machine learning models, e.g.
the International Patent Classification system (millions of patents manually classified into a hierarchical classification system by patent experts)
advantages:
- high-quality
- potentially large
- often freely available
disadvantages:
- not available for every specific problem and data type
- not always directly suitable for training classifiers
labelled user generated content with pros and cons
- hashtags on Twitter (e.g. #not for learning sarcasm; see the sketch after this card)
- scores and aspects in customer reviews (to learn sentiment and opinion)
- likes of posts to learn which comments are the most interesting
advantages:
- potentially large
- human-created
- freely available, depending on platform
disadvantages:
- noisy: often inconsistent
- may be low-quality
- indirect signal
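A minimal Python sketch of how such an indirect signal could be turned into (noisy) labels, here using the presence of a hashtag like #not as a sarcasm label; the tweets and the helper name label_from_hashtag are made up for illustration:

# Hypothetical sketch: derive noisy sarcasm labels from a hashtag (#not),
# then strip the hashtag so a classifier cannot simply memorise it.
import re

def label_from_hashtag(tweet, tag="#not"):
    # returns (text_without_tag, label), where label = 1 if the hashtag was present
    label = 1 if tag.lower() in tweet.lower() else 0
    text = re.sub(re.escape(tag), "", tweet, flags=re.IGNORECASE).strip()
    return text, label

tweets = ["Great, another Monday #not",
          "Really enjoyed the concert last night"]
print([label_from_hashtag(t) for t in tweets])
# [('Great, another Monday', 1), ('Really enjoyed the concert last night', 0)]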
5 steps to create new labelled data
- create a sample of items
- define a set of categories
- write annotation guidelines (version 1)
- test and revise the guidelines with new annotators until the guidelines are sufficiently clear (the task should be clearly defined, but not trivial)
- human annotation (by experts or via crowdsourcing); compare the labels given by different annotators to estimate the reliability of the data (inter-rater agreement)
crowdsourcing
useful for tasks that humans are typically good at (but computers need many examples to learn properly) and that do not require expert knowledge
main challenge: quality control =>
- don’t pay too little
- include a check in the task set-up (see the sketch after this card)
- tell workers that their answers will be compared to expert annotations
- measure inter-rater agreement
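A minimal sketch of one possible check in the task set-up, assuming a hypothetical design where a few items with known expert ("gold") labels are mixed into each worker's batch and workers below an accuracy threshold are discarded:

# Hypothetical quality check: score each crowd worker on a few items with
# known expert ("gold") labels and keep only sufficiently accurate workers.
def reliable_workers(worker_answers, gold_labels, min_accuracy=0.8):
    # worker_answers: {worker_id: {item_id: label}}, gold_labels: {item_id: label}
    keep = []
    for worker, answers in worker_answers.items():
        checked = [item for item in answers if item in gold_labels]
        if not checked:
            continue  # worker answered no gold items, so we cannot judge them here
        accuracy = sum(answers[i] == gold_labels[i] for i in checked) / len(checked)
        if accuracy >= min_accuracy:
            keep.append(worker)
    return keep

gold = {"t1": "pos", "t2": "neg"}
answers = {"w1": {"t1": "pos", "t2": "neg", "t3": "pos"},
           "w2": {"t1": "neg", "t2": "neg", "t3": "pos"}}
print(reliable_workers(answers, gold))  # ['w1']: w2 is only 50% correct on the gold items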
why would we compute the inter-rater agreement?
- human-labelled example data serves as the ground truth for the classifier
- but two human annotators never fully agree
- so we always have part of the example data labelled by 2 or 3 raters and compute the inter-rater agreement => this tells us how reliable the example data is and how difficult the task is
- a common measure: Cohen's Kappa
Cohen’s Kappa
κ = (Pr(a) - Pr(e)) / (1 - Pr(e)) (example on slide 54)
Pr(a) = observed agreement between the two raters; Pr(e) = agreement expected by chance
Pr(e) = Pr(e, c1) + Pr(e, c2) + ..., where Pr(e, c) = Pr(rater 1 picks c) · Pr(rater 2 picks c) for each category c
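A minimal Python sketch of this computation for two raters (the label names and annotations below are made up; sklearn.metrics.cohen_kappa_score computes the same unweighted kappa):

# Cohen's Kappa for two annotators who labelled the same items.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Pr(a): observed agreement = fraction of items both annotators labelled identically
    pr_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pr(e): chance agreement = sum over categories c of
    #        Pr(rater A picks c) * Pr(rater B picks c)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    pr_e = sum((counts_a[c] / n) * (counts_b[c] / n)
               for c in set(labels_a) | set(labels_b))
    return (pr_a - pr_e) / (1 - pr_e)

rater_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
rater_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]
print(round(cohens_kappa(rater_1, rater_2), 2))
# 0.57, since Pr(a) = 0.8 and Pr(e) = 0.6*0.7 + 0.4*0.3 = 0.54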