NLU Training Data Format Flashcards
What are the nlu training data formats?
- Markdown Format
- JSON Format
Which is the better format to work with?
Markdown is usually easier to work with.
What is the structure of the markdown format?
Examples are listed using the unordered list syntax, e.g. minus -, asterisk *, or plus +.
Examples are grouped by intent, and entities are annotated as Markdown links, e.g. [entity text](entity_name).
"""
## intent:check_balance
- what is my balance
- how much do I have on my savings
- how much do I have on my savings account
- Could I pay in yen?

## intent:greet
- hey
- hello

## synonym:savings
- pink pig

## regex:zipcode
- [0-9]{5}

## lookup:currencies
- Yen
- USD
- Euro

## lookup:additional_currencies
path/to/currencies.txt
"""
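To make the section structure above concrete, here is an illustrative sketch of how the intent sections could be pulled out of such a file. This is not Rasa's own loader; the function name parse_intents is made up for this example.

```python
import re

def parse_intents(md_text):
    """Return a dict mapping intent name -> list of example strings."""
    intents = {}
    current = None
    for line in md_text.splitlines():
        line = line.strip()
        header = re.match(r"##\s*intent:(\S+)", line)
        if header:
            current = header.group(1)
            intents[current] = []
        elif line.startswith("-") and current:
            intents[current].append(line.lstrip("- ").strip())
        elif line.startswith("##"):
            current = None  # a synonym/regex/lookup section ends the intent
    return intents

sample = """
## intent:greet
- hey
- hello
## regex:zipcode
- [0-9]{5}
"""
print(parse_intents(sample))  # {'greet': ['hey', 'hello']}
```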
What are the different parts the Rasa NLU data is structured into?
- common examples
- synonyms
- regex features and
- lookup tables
What is an example of the JSON format?
The JSON format consists of a top-level object called rasa_nlu_data, with the keys common_examples, entity_synonyms and regex_features. The most important one is common_examples.
""" { "rasa_nlu_data": { "common_examples": [], "regex_features" : [], "lookup_tables" : [], "entity_synonyms": [] } } """ The common_examples are used to train your model. You should put all of your training examples in the common_examples array. Regex features are a tool to help the classifier detect entities or intents and improve the performance.
What is the structure of Common examples?
Common examples have three components: text, intent and entities. The first two are strings while the last one is an array.
- The text is the user message [required]
- The intent is the intent that should be associated with the text [optional]
- The entities are specific parts of the text which need to be identified [optional]
Entities are specified with a start and an end value, which together make a Python-style range to apply to the string, e.g. in the example below, with text="show me chinese restaurants", text[8:15] == 'chinese'. Entities can span multiple words, and in fact the value field does not have to correspond exactly to the substring in your example. That way you can map synonyms, or misspellings, to the same value.
## intent:restaurant_search
- show me [chinese](cuisine) restaurants
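The start/end convention above can be checked directly, since the two values form a Python-style slice into the message text. The dict below is a sketch of an entity entry in that shape, not output captured from Rasa itself:

```python
# start and end form a Python-style slice into the message text.
text = "show me chinese restaurants"
entity = {"start": 8, "end": 15, "value": "chinese", "entity": "cuisine"}

assert text[entity["start"]:entity["end"]] == "chinese"
```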
What is the structure of Regular expressions features?
Regular expressions can be used to support the intent classification and entity extraction.
For example, if your entity has a deterministic structure (like a zipcode or an email address), you can use a regular expression to ease detection of that entity. For the zipcode example it might look like this:
""" ## regex:zipcode - [0-9]{5}
regex:greet
- hey[^\s]*
“””
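As a quick sanity check, the two patterns above behave as expected when tried with Python's re module. Keep in mind that in the actual pipeline these serve as pattern features for the classifier, not as hard matching rules:

```python
import re

# The two regex features from the example above, tried directly.
zipcode = re.compile(r"[0-9]{5}")
greet = re.compile(r"hey[^\s]*")

assert zipcode.search("my zip is 94016 thanks").group() == "94016"
assert greet.search("heyyy there").group() == "heyyy"
```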
The name defines neither the entity nor the intent; it is just a human-readable description for you to remember what the regex is used for, and it becomes the title of the corresponding pattern feature.
As you can see in the above example, you can also use the regex features to improve the intent classification performance.
What is the purpose of regular expression features?
Regular expressions can be used to support the intent classification and entity extraction.
Regex features don’t define entities nor intents!
They simply provide patterns to help the classifier recognize entities and related intents.
Hence, you still need to provide intent & entity examples as part of your training data!
What pipeline components currently support regex features for entity extraction?
Regex features for entity extraction are currently only supported by the CRFEntityExtractor component!
Hence, other entity extractors, like MitieEntityExtractor or SpacyEntityExtractor won’t use the generated features and their presence will not improve entity recognition for these extractors.
Currently, all intent classifiers make use of available regex features.
How does Lookup Tables help Improve Intent Classification & Entity recognition?
When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples.
These regexes match over multiple tokens, so the lookup entry "lettuce wrap" would match "get me a lettuce wrap ASAP" as [0 0 0 1 1 0].
These regexes are processed identically to the regular regex patterns directly specified in the training data.
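The matching described above can be sketched as follows. This is an illustration of the idea, combining lookup entries into one case-insensitive pattern and flagging which tokens fall inside the match; Rasa's internal implementation may differ:

```python
import re

# Combine lookup-table entries into one case-insensitive pattern.
plates = ["tacos", "beef", "mapo tofu", "burrito", "lettuce wrap"]
pattern = re.compile("|".join(re.escape(p) for p in plates), re.IGNORECASE)

message = "get me a lettuce wrap ASAP"
match = pattern.search(message)
assert match.group().lower() == "lettuce wrap"

# Flag which tokens fall inside the match, yielding [0, 0, 0, 1, 1, 0].
flags = []
pos = 0
for token in message.split():
    start = message.index(token, pos)
    end = start + len(token)
    flags.append(1 if start >= match.start() and end <= match.end() else 0)
    pos = end
print(flags)  # [0, 0, 0, 1, 1, 0]
```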
Lookup tables in the form of external files or lists of elements may also be specified in the training data.
The externally supplied lookup tables must be in a newline-separated format.
For example, data/test/lookup_tables/plates.txt may contain:
"""
tacos
beef
mapo tofu
burrito
lettuce wrap
"""
And can be loaded as:
"""
## lookup:plates
data/test/lookup_tables/plates.txt
"""
Alternatively, lookup elements may be directly included as a list:
"""
## lookup:plates
- beans
- rice
- tacos
- cheese
"""
How does Normalizing Data help Improve Intent Classification & Entity recognition?
By defining Entity Synonyms
If you define entities as having the same value, they will be treated as synonyms. Here is an example of that:
"""
## intent:search
- in the center of [NYC](city:New York City)
- in the centre of [New York City](city)
"""
What pipeline component is required to use synonyms?
To use the synonyms defined in your training data, you need to make sure the pipeline contains the EntitySynonymMapper component
Alternatively, you can add an "entity_synonyms" array (or, in Markdown, a ## synonym: section) to define several synonyms for one entity value. Here is an example of that:
"""
## synonym:New York City
- NYC
- nyc
- the big apple
"""
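Conceptually, the mapper component replaces extracted entity values with their canonical value, similar to this dictionary lookup (an illustrative sketch only, not the component's actual code):

```python
# Illustrative sketch: map extracted values to a canonical entity value,
# as the synonym-mapping step conceptually does.
synonyms = {"nyc": "New York City", "the big apple": "New York City"}

def map_synonym(value):
    return synonyms.get(value.lower(), value)

assert map_synonym("NYC") == "New York City"
assert map_synonym("Boston") == "Boston"  # unknown values pass through
```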
How do you run the NLU model server only?
rasa run --enable-api --cors "*" --debug -m models/nlu-20191211-162304.tar.gz