NER Flashcards

1
Q

What does NER stand for?

A

Named Entity Recognition

2
Q

What is a Named Entity?

A

It is anything with a proper name

People, Locations, Organisations, Events, Dates, Time, Money

3
Q

What is Named Entity Recognition?

A

It is the task of labelling text spans with named entity types

4
Q

What are some popular NER tags?

A

People (PER)

Organisations (ORG)

Locations (LOC)

Geo-Political Entity (GPE)

5
Q

What are NER approaches based on?

A

They are based on tagsets

6
Q

What is a popular NER tagset? How many types does it define?

A

The Automatic Content Extraction (ACE) tagset is very popular; it defines 7 types.

7
Q

What are some challenges of NER?

A

Span segmentation: NER works with spans, so we must work out how large a phrase to label as a NE

Named entity type ambiguity, where the same NE can be a different type depending on the context (e.g. Washington can be a person, a location, or a GPE)

8
Q

What is the key method used for NER?

A

We treat it as a sequence labelling problem, so we use BIO tagging

9
Q

What is BIO tagging?

A

It is a common approach for sequence labelling tasks that require span recognition

10
Q

What does BIO stand for?

A

Begin, Inside, Outside

11
Q

What is the idea of BIO tagging?

A

We assign a tag to each word in our sequence; each tag marks the beginning of an entity (B), a word inside an entity (I), or a word outside any entity (O)
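
For illustration (the example sentence and tags are assumed, not from the card), a short sentence tagged under the BIO scheme:

Jane → B-PER
Smith → I-PER
flew → O
to → O
New → B-LOC
York → I-LOC

Here "Jane Smith" is one PER entity and "New York" is one LOC entity; every other word is outside any entity.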

12
Q

Explain what the image shows

A

It shows different tagging schemes. IO tagging is difficult to interpret, as it is hard to tell where one NE ends and an adjacent one begins. BIO tagging builds on this with Begin labels, which show where each NE starts. BIOES takes this even further, adding End and Single labels for the last word of an entity and for single-word entities.

13
Q

What model is used to learn to tag text according to the BIO scheme and identify NEs?

A

Conditional Random Fields (CRFs)

14
Q

What type of features are useful for NER?

A

Non-word features such as capitalisation.

15
Q

Why is the Hidden Markov Model (HMM) not a good model for NER?

A

As they are generative models, it is hard to add feature patterns

16
Q

What type of model is CRF?

A

It is a discriminative sequence model based on a log-linear model. It is widely used for this type of sequence labelling problem.

17
Q

What are the input and output of a CRF model, and what are their lengths?

A

The input is a sequence of words, the output is a sequence of BIO tags. The length of the input will always be the same as the length of the output

18
Q

What does CRF want to find?

A

It wants to find the most probable tag sequence among the set of all possible sequences for a given set of words.

19
Q

What is the equation for CRF?

A

Y hat is the most probable tag sequence, found using argmax: we compute the probability of each possible tag sequence given the input words and let argmax take the most probable.
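
In standard notation (assumed here, consistent with the card's description):

Ŷ = argmax_{Y ∈ 𝒴(X)} P(Y | X)

where 𝒴(X) is the set of all possible tag sequences for the input words X, and Ŷ is the sequence argmax selects.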

20
Q

What does CRF define?

A

A global feature function F, which takes as input the sequence of words and the sequence of BIO tags.

21
Q

What needs to be created in order to perform NER?

A

A set of K features

22
Q

What does each feature have in a CRF model?

A

Each feature has a corresponding weight

23
Q

What is the global feature vector in a CRF model?

A

It is the sum of all local features

24
Q

What are the local features in a CRF model?

A

Local features are features at a particular word index in the sentence. For each index position in the sentence, we use the local feature function to compute the features for that particular position. These local feature vectors can be summed to give a global feature vector for the entire sequence.
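
In standard linear-chain CRF notation (assumed here), the global feature vector is the sum of the local feature function applied at each position:

F(X, Y) = Σ_{i=1..n} f(y_{i-1}, y_i, X, i)

where n is the sentence length, y_i is the tag at position i, and f returns the local feature vector for position i.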

25
Q

What gives a linear-chain CRF its name?

A

The input is the set of words X, the index we are dealing with, the tag being predicted and the previous tag. It can only look one tag into the past; a full CRF could look further back, but it would be much more computationally expensive.

26
Q

How are local feature values created?

A

They are populated using a manually designed feature template for each word in the sequence.
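
A minimal Python sketch of such a feature template (the feature names and context window are illustrative assumptions, in the dict-of-features style used by CRF toolkits such as sklearn-crfsuite):

def word2features(sent, i):
    # sent is a list of (word, pos_tag) pairs; i is the current position
    word, pos = sent[i]
    features = {
        "word.lower": word.lower(),      # the word itself
        "word.istitle": word.istitle(),  # word shape: capitalised?
        "word.isdigit": word.isdigit(),  # word shape: numeric?
        "pos": pos,                      # POS tag
        "suffix3": word[-3:],            # word suffix
    }
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        features["prev.word.lower"] = prev_word.lower()  # previous word
        features["prev.pos"] = prev_pos                  # previous POS tag
    else:
        features["BOS"] = True           # beginning-of-sentence marker
    return features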

27
Q

Where can features get their information from?

A

They can get them from anywhere in the word sequence X, as we can define the context window to be whatever we want it to be.

28
Q

What value types can a feature set contain?

A

It can contain many different types (numeric, text, boolean)

29
Q

What are some common types of features?

A

Words

POS tags

Word shape type

Word prefix or suffix

Match to a lexicon (e.g. a list of names) or gazetteer (e.g. a list of cities)

30
Q

What happens to the feature values over a sentence?

A

They are summed over the entire sentence X, meaning that there are always K features regardless of the sentence length.

For example, if we have 7 local features and a sentence that is 5 words long, we compute a 7-dimensional local feature vector for each of the 5 words, so each word in the sequence is represented by a local feature vector. We then sum these five vectors to get a single feature vector of 7 features representing the entire sentence.
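
A toy Python sketch of this summation (the numbers are made up):

import numpy as np

# 5 words, each represented by a 7-dimensional local feature vector
local_features = np.array([
    [1, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
])

# the global feature vector sums over positions, so it is always length 7
global_features = local_features.sum(axis=0)
print(global_features)  # [2 2 1 2 1 1 2]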

31
Q

Explain what the equation shows.

A

The equation is the CRF decoding objective. For every feature type and every position in the sequence, we sum the weighted local feature values, and take the maximum over candidate tag sequences. This selects the most likely tag sequence for an input X.

Simply put, we compute the global feature vector for the input sentence, weight each of its features, and find the best tag sequence with argmax.
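
In standard linear-chain CRF notation (assumed here), the decoding objective is:

Ŷ = argmax_{Y ∈ 𝒴(X)} Σ_{k=1..K} w_k Σ_{i=1..n} f_k(y_{i-1}, y_i, X, i) = argmax_{Y ∈ 𝒴(X)} w · F(X, Y)

where w_k is the weight of feature k and F(X, Y) is the global feature vector.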

32
Q

How do we evaluate NER taggers?

A

We can use micro or macro Precision, Recall and F1 scores

33
Q

What do we apply the evaluation metrics to?

A

The entities as a whole rather than single words or tags, since NEs can be of different lengths
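
One way to compute such entity-level scores in Python is the seqeval library, which groups BIO tags into whole entities before scoring (a sketch; the example tags are made up):

from seqeval.metrics import classification_report, f1_score

# each inner list is the tag sequence for one sentence
y_true = [["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC", "O", "O"]]

# scores are computed over whole entities, not single tags:
# the truncated LOC span counts as a miss, and O tags are ignored
print(f1_score(y_true, y_pred))          # 0.5
print(classification_report(y_true, y_pred))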

34
Q

What needs to be avoided with NE?

A

Bias needs to be avoided when matching entities, e.g. is New York a good match for New York City, or do we want exactly New York City as the match?

35
Q

What happens to the O tags in NER evaluation?

A

The O-tag matches are removed: since the majority of the corpus is O tags, including them would bias results reported using mean scores towards the O-class performance.

36
Q

What are some alternatives to the CRF model?

A

Word Representations (CBOW, skip-gram, word2vec, GloVe, BERT)

Character Representations (LSTM, GRU, CNN)

Model (CNN, LSTM, GRU)

Decoder (softmax, CRF)

37
Q

How is CRF different to HMM?

A

In an HMM, we have to use Bayes' theorem and the likelihood P(X|Y) to calculate P(Y|X).

In a CRF, we can compute P(Y|X) directly, training the CRF to discriminate among the possible tag sequences.
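
In symbols (standard formulations, assumed here):

HMM (generative): Ŷ = argmax_Y P(Y|X) = argmax_Y P(X|Y) P(Y)

CRF (discriminative): P(Y|X) = exp(w · F(X, Y)) / Σ_{Y'} exp(w · F(X, Y'))

where the denominator normalises over all possible tag sequences Y'.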