Analyze text with the Language service Flashcards
Analyzing text
is a process where you evaluate different aspects of a document or phrase, in order to gain insights into the content of that text.
Text Analytics Techniques
Text analytics is a process where an artificial intelligence (AI) algorithm, running on a computer, evaluates these same attributes in text, to determine specific insights.
techniques that can be used to build software to analyze text
- Statistical analysis of terms used in the text. For example, removing common “stop words” (words like “the” or “a”, which reveal little semantic information about the text), and performing frequency analysis of the remaining words (counting how often each word appears) can provide clues about the main subject of the text.
- Extending frequency analysis to multi-term phrases, commonly known as N-grams (a two-word phrase is a bi-gram, a three-word phrase is a tri-gram, and so on).
- Applying stemming or lemmatization algorithms to normalize words before counting them - for example, so that words like “power”, “powered”, and “powerful” are interpreted as being the same word.
- Applying linguistic structure rules to analyze sentences - for example, breaking down sentences into tree-like structures such as a noun phrase, which itself contains nouns, verbs, adjectives, and so on.
- Encoding words or terms as numeric features that can be used to train a machine learning model, for example to classify a text document based on the terms it contains. This technique is often used to perform sentiment analysis, in which a document is classified as positive or negative.
- Creating vectorized models that capture semantic relationships between words by assigning them to locations in n-dimensional space. This modeling technique might, for example, assign values to the words “flower” and “plant” that locate them close to one another, while “skateboard” might be given a value that positions it much further away.
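The first two techniques in the list above (stop-word removal with frequency analysis, and N-gram counting) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how the Language service works internally; the stop-word list, the sample text, and the helper names `tokenize`, `remove_stop_words`, and `ngrams` are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little semantic information."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens joined into one phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text.
text = (
    "The language service detects the language of text. "
    "The language service extracts key phrases."
)

tokens = remove_stop_words(tokenize(text))

# Frequency analysis of the remaining single words.
word_counts = Counter(tokens)

# Frequency analysis of two-word phrases (bi-grams). This naive version
# ignores sentence boundaries, so some bi-grams span two sentences.
bigram_counts = Counter(ngrams(tokens, 2))

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

The most frequent words and bi-grams give a rough clue about the main subject of the text, which is the idea behind the frequency-based techniques described above.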
In Microsoft Azure, the Language cognitive service can help simplify application development by using pre-trained models that can:
- Determine the language of a document or text (for example, French or English).
- Perform sentiment analysis on text to determine a positive or negative sentiment.
- Extract key phrases from text that might indicate its main talking points.
- Identify and categorize entities in the text. Entities can be people, places, organizations, or even everyday items such as dates, times, quantities, and so on.
The Language service
is a part of the Azure Cognitive Services offerings that can perform advanced natural language processing over raw text.
Azure resources for the Language service
- A Language resource - choose this resource type if you only plan to use natural language processing services, or if you want to manage access and billing for the resource separately from other services.
- A Cognitive Services resource - choose this resource type if you plan to use the Language service in combination with other cognitive services, and you want to manage access and billing for these services together.
Language detection
Use the language detection capability of the Language service to identify the language in which text is written. You can submit multiple documents at a time for analysis. For each document submitted to it, the service will detect:
- The language name (for example, “English”).
- The ISO 639-1 language code (for example, “en”).
- A score indicating a level of confidence in language detection.
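As a rough sketch of how an application might read these three values for a batch of documents, the Python below walks over a hand-written sample result. The dictionary shape and the field names (`detectedLanguage`, `iso6391Name`, `confidenceScore`) are assumptions modeled on a typical JSON response, and the scores are made-up illustrative values; check the names against the API version you actually call.

```python
# Hand-written sample shaped like a language detection result.
# Field names and scores are assumptions for illustration only.
sample_response = {
    "documents": [
        {
            "id": "1",
            "detectedLanguage": {
                "name": "English",
                "iso6391Name": "en",
                "confidenceScore": 1.0,
            },
        },
        {
            "id": "2",
            "detectedLanguage": {
                "name": "French",
                "iso6391Name": "fr",
                "confidenceScore": 1.0,
            },
        },
    ],
    "errors": [],
}


def summarize(response):
    """Return (id, language name, ISO 639-1 code, confidence) per document."""
    rows = []
    for doc in response["documents"]:
        lang = doc["detectedLanguage"]
        rows.append(
            (doc["id"], lang["name"], lang["iso6391Name"], lang["confidenceScore"])
        )
    return rows


for row in summarize(sample_response):
    print(row)
```

Each submitted document carries its own identifier, so results can be matched back to the inputs when several documents are sent in one request.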
predominant language
If the text contains a mix of languages, for example English and French, the language detection service focuses on the predominant language in the text. The service uses an algorithm to determine the predominant language, based on factors such as the length of phrases or the total amount of text in that language compared to the other languages in the text.
Ambiguous or mixed language content
Some text is ambiguous in nature, or has mixed language content. These situations can present a challenge to the service. An example of ambiguous content is a document that contains very little text, or only punctuation. For example, using the service to analyze the text “:-)” results in a value of unknown for the language name and the language identifier, and a score of NaN (which stands for “not a number”).
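Because a score of NaN does not behave like an ordinary number (every comparison with NaN evaluates to false), application code should usually check for it explicitly before trusting a detection result. The sketch below shows one way to do that; the function name `is_usable` and the 0.5 threshold are hypothetical choices for the example, not part of the service.

```python
import math


def is_usable(language_name, score, threshold=0.5):
    """Decide whether a detection result is reliable enough to act on.

    The name check and the threshold are illustrative assumptions.
    """
    # Treat "unknown" (with or without parentheses) as no detection.
    if language_name.strip("()").lower() == "unknown":
        return False
    # NaN compares false against everything, so test for it explicitly.
    if math.isnan(score):
        return False
    return score >= threshold


print(is_usable("unknown", float("nan")))
print(is_usable("English", 0.98))
print(is_usable("French", 0.2))
```

Results that fail this check can be routed to a fallback, such as asking the user for their language or skipping language-specific processing.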