Natural Language Processing Flashcards
Why does Speech Recognition fall under NLP?
Application 2:
- A speech recognizer takes an audio signal as input, converts it to text, and recognizes what was spoken.
- A speech recognizer includes a language model that decides which of two similar-sounding words was probably spoken: given the word sequence heard so far, one candidate word is more probable than the other.
Application 3: Image Captioning - You give the computer an image and want it to generate a caption for it.
How does this fall under NLP?
An image is two-dimensional data, so a CNN is used to encode it. Once the image is encoded, an NLP model is needed to generate the caption text. NLP models exist to do exactly that.
Any product that converts text from one language to another falls under NLP, and such products go very deep into neural networks.
Machine Translation: we want to convert from English to French.
How is this done?
Neural Machine Translation - the most fundamental product and most researched area in NLP.
______ a language to specify the rules for the set of possible strings that you want to search in a corpus (large body) of text.
a. regular expression
b. natural expression
c. language
d. none of the above
a. regular expression
regular expression - a language to specify the rules for the set of possible strings that you want to search in a corpus (large body) of text.
Ex: “Your password must have at least 8 characters, at least 1 upper case letter, at least 1 lower case letter, at least 1 digit (number), and at least 1 symbol from the special symbols.”
This specifies the rules for a set of possible strings in a large body of text.
Why not write your own Python search program instead of using regular expressions?
a. it is slower
b. Regular expressions wrap a fast C implementation, so a hand-written search is not needed
c. Not wise; the regular expression package is already included.
d. only b and c
d. only b and c
Do not write your own Python search program because:
b. Regular expressions wrap a fast C implementation, so a hand-written search is not needed
c. Not wise; the regular expression package is already included.
Why use regular expressions instead of string functions?
a. string functions are not customized
b. regular expressions are much more abstract and faster
c. regular expressions are not a language-specific thing and can do almost anything
d. only b and c
d. only b and c
Use regular expressions instead of string functions because:
b. regular expressions are much more abstract and faster
c. regular expressions are not a language-specific thing and can do almost anything
______ is a simple program that uses pattern matching to recognize phrases like “I need X” and translate them into suitable outputs like “What would it mean to you if you got X?”
a. Regular Expression
b. ELIZA chatbot
c. ChatGpt
d. None of the Above
b. ELIZA chatbot
ELIZA chatbot - a simple program that uses pattern matching to recognize phrases like “I need X” and translate them into suitable outputs like “What would it mean to you if you got X?”
Example:
User: I am unhappy
ELIZA: Do you think coming here will help you not be unhappy?
User: I need some help, that much seems certain. (matches the pattern “I need X”)
ELIZA: What would it mean to you if you got some help? (the output string built from the matched pattern)
User: Perhaps I could learn to get along with my mother?
ELIZA: Tell me more about your family.
what does import re mean?
a. import regular expression
b. import regular eliza
c. import return expression
d. import return eliza
import re means:
a. import the regular expression module
What does the following code mean?
re.findall('abc', 'askdfj;askabcdfahgfa;ghabc;hgabchkg;a')
a. regular expression, find all
b. find all, look for the string abc
c. regular expression, find all, look for the string abc in the longer string
The code
re.findall('abc', 'askdfj;askabcdfahgfa;ghabc;hgabchkg;a')
means:
c. regular expression, find all, look for the string abc in the longer string
re = the regular expression module
findall = find all matches
'abc' = the pattern we are looking for
'askdfj;askabcdfahgfa;ghabc;hgabchkg;a' = the string we are searching in
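The call on the card above can be run directly; a minimal sketch:

```python
import re

# Look for every occurrence of the literal pattern 'abc' in the longer string.
matches = re.findall('abc', 'askdfj;askabcdfahgfa;ghabc;hgabchkg;a')
print(matches)  # 'abc' occurs three times in the string
```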
Which are meta characters with Special Meaning?
a. . ^ $ * + ?
b. . ^ $ * + ? { } [ ]
c. . ^ $ * + ? { } [ ] \ |
d. . ^ $ * + ? { } [ ] \ | ( )
Are meta characters with Special Meaning
d. . ^ $ * + ? { } [ ] \ | ( )
_____ used for specifying a class, which is a set of characters that you wish to match. Characters can be listed individually like [abcdef] or in a range like [a-f]?
a. [ ]
b. { }
c. \
d. ( )
a. [ ]
[ ] This metacharacter is used for specifying a class, which is a set of characters that you wish to match. Characters can be listed individually like [abcdef] or in a range like [a-f]
EXAMPLE:
re.findall('[abcd]', 'kasdf')
output: ['a', 'd']
'k' in 'kasdf' is not returned because it is not in the class [abcd].
'a' is returned because it is in the class [abcd].
's' is not returned because it is not in the class [abcd].
Using [ ], count the number of digits below:
'2319ab4621acdz+*!'
i. # define your string
ii. # define a character class, [0-9]
iii. # find all occurrences of this class using re.findall
iv. # use len() to tell how many digits there are
a. 4,3,2,1
b. 1,2,3,4
c. 2,3,4,1
d. 3,2,4,1
s = '2319ab4621acdz+*!'
L = re.findall('[0-9]', s)
print(len(L))
b. 1,2,3,4
1) # define your string
2) # define a character class, [0-9]
3) # find all occurrences of this class using re.findall
4) # use len() to tell how many digits there are
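A runnable version of the snippet above (note that the variable holding the string must match the one passed to findall):

```python
import re

s = '2319ab4621acdz+*!'        # define your string
L = re.findall('[0-9]', s)     # find all characters in the class [0-9]
print(len(L))                  # number of digits in the string
```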
Verify that 4 digits appear consecutively.
if len(re.findall('[0-9][0-9][0-9][0-9]',
'asdfja;_+);1kj2306kjl891')) > 0:
print("Found")
else:
print("Not Found")
1) if len(re.findall('[0-9][0-9][0-9][0-9]',
* Finds 4 consecutive digits, which is why we have 4 separate [0-9] classes
2) 'asdfja;_+);1kj2306kjl891')) > 0:
* This is our string
3) print("Found")
else:
print("Not Found")
* This says “Based on the result, print Found or Not Found”
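The card above, assembled into a complete runnable snippet:

```python
import re

# Four [0-9] classes in a row match exactly four consecutive digits.
hits = re.findall('[0-9][0-9][0-9][0-9]', 'asdfja;_+);1kj2306kjl891')
if len(hits) > 0:
    print("Found")       # '2306' is a run of four consecutive digits
else:
    print("Not Found")
```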
_____ is a metacharacter symbol used in regular expressions for set complements.
a. ^
b. [ ]
c. ( )
d. $
a. ^
^ is a symbol used in regular expressions for set complements (when it appears first inside a character class).
[^023abf] what is the metacharacter ^ doing here?
a. Set complement: everything except 023abf
b. includes all 023abf
c. exponent
d. Ordinary character
a. Set complement: everything except 023abf
[023abf^] what is the metacharacter ^ doing here?
a. Set complement: everything except 023abf
b. includes all 023abf
c. exponent
d. Ordinary character
d. Ordinary character (when ^ is not the first character in the class, it has no special meaning)
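A short sketch contrasting the two positions of ^ inside a character class (the sample strings here are my own illustrations):

```python
import re

# ^ at the START of a class complements it: match everything EXCEPT 0,2,3,a,b,f
print(re.findall('[^023abf]', '0a2bz9'))   # only 'z' and '9' survive
# ^ anywhere ELSE in the class is an ordinary character
print(re.findall('[023abf^]', '0a^z'))     # matches '0', 'a', and '^'
```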
_______ is known as the “zero or more” quantifier. It indicates that the preceding character or expression can occur zero or more times in the input text. Here’s what it signifies:
Zero Occurrences: The character or expression preceding “*” may not occur at all in the text, and the pattern will still match.
Multiple Occurrences: If the character or expression occurs, it can occur any number of times (including zero).
a. $
b. *
c. ^
d. @
b. *
* is known as the “zero or more” quantifier. It indicates that the preceding character or expression can occur zero or more times in the input text. Here’s what it signifies:
Zero Occurrences: The character or expression preceding “*” may not occur at all in the text, and the pattern will still match.
Multiple Occurrences: If the character or expression occurs, it can occur any number of times (including zero).
What metacharacter is used in the following example?
The regular expression “cats*” matches both “cat” and “cats” in the input text.
a. $
b. *
c. ^
d. @
b. *
In this example, the regular expression “cats*” matches both “cat” and “cats” in the input text. The “*” allows for zero or more occurrences of the character “s”.
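The same pattern run on a small sample sentence (the sentence is my own illustration):

```python
import re

# 's*' allows zero or more 's' characters after 'cat'
print(re.findall('cats*', 'my cat and your cats'))  # ['cat', 'cats']
```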
_____ is defined as one that does not start with a digit, does not contain any special characters other than underscore, and can have an arbitrary number of characters.
a. proper variable name
b. regular expression
c. *
d. special sequence
a. proper variable name
A Proper Variable Name - defined as one that does not start with a digit, does not contain any special characters other than underscore, and can have an arbitrary number of characters.
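One way to encode the card's definition as a regex; the pattern and the helper function are my own sketch, not from the card:

```python
import re

# Starts with a letter or underscore, then any number of letters,
# digits, or underscores; anchored to cover the whole string.
pattern = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')

def is_proper_variable_name(name):
    return pattern.match(name) is not None

print(is_proper_variable_name('my_var2'))   # True
print(is_proper_variable_name('2my_var'))   # False: starts with a digit
print(is_proper_variable_name('my-var'))    # False: '-' is a special character
```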
____ applies a repetitive pattern for as long as it can go; the default behavior in most regex engines.
a. special sequence
b. proper variable name
c. greedy matching
d. *
c. greedy matching
Greedy Matching
1. applies repetitive pattern as long as it can go.
2. the default behavior in most regex engines.
These are the steps to do Greedy Matching
Example:
I have a cat named Saturn, and another cat named Saturnalia.
- Define Your Pattern: Start by defining the pattern you want to match in your regular expression. This pattern may include characters, groups, and quantifiers that specify how many times a character or group should be matched.
Our pattern will be “cat.*\.”, which means we’re looking for the word “cat” followed by any characters (.*) and ending with a literal period (\.).
- Apply Quantifiers: Use quantifiers like “*”, “+”, “{n,}”, etc., to specify how many times a character or group should be matched. These quantifiers determine the greediness of the matching.
The .* part of the pattern is a greedy quantifier. It means that it will try to match as many characters as possible before satisfying the next part of the pattern (the period).
- Apply the Regular Expression: Apply your regular expression pattern to the input text you want to search through.
- Find Matches: Use a function like findall() (in Python’s re module) to find all matches of the pattern in the input text.
Use a function like findall() to find all matches of the pattern in the input text.
- Greedily Match: The regex engine will attempt to match as much of the input text as possible while still satisfying the overall pattern. This means it will try to match as many repetitions of the quantified elements as it can.
- Keep Matching Until Satisfied: The regex engine will keep trying to match more characters until it cannot match anymore without violating the pattern.
The regex engine will start by finding the first occurrence of “cat” and then try to match as many characters as possible until it finds a period.
- Backtrack if Necessary: If greediness causes the pattern to fail, the regex engine will backtrack and try different possibilities until it finds a match. This may involve matching fewer repetitions of a quantified element or taking a different path through the input text.
It continues to match characters until it hits the final period, as it is trying to satisfy the pattern “cat.*\.”.
- Retrieve Matches: Once all matches are found, retrieve and process them as needed for your application.
Since our pattern is greedy, the regex engine won’t backtrack until it finds a period. So, if there are multiple occurrences of “cat” in the text, it will keep matching characters until it finds a period for each occurrence.
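The steps above can be run end-to-end on the card's example sentence, using the corrected pattern cat.*\. :

```python
import re

doc = 'I have a cat named Saturn, and another cat named Saturnalia.'
# 'cat' followed by as many characters as possible (.*), ending with a
# literal period (\.) -- the greedy .* runs to the LAST usable period,
# so the two occurrences of 'cat' collapse into one long match.
matches = re.findall(r'cat.*\.', doc)
print(matches)  # one match from the first 'cat' to the final period
```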
____ A metacharacter that ensures the preceding character occurs at least one time (one or more).
example:
doc = 'YahooYahoooYahooooYahooooYaho'
regExp = 'Yahoo+'
rs = re.findall(regExp, doc)
print(rs)
output: ['Yahoo', 'Yahooo', 'Yahoooo', 'Yahoooo']
a. *
b. +
c. ?
b. +
The difference between + and * is that + requires at least 1 occurrence of the pattern (and allows up to infinitely many), while * also allows zero occurrences.
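Running the card's example confirms that each match keeps all of its trailing o's, and that the final 'Yaho' fails because o+ needs at least one more 'o':

```python
import re

doc = 'YahooYahoooYahooooYahooooYaho'
# 'o+' requires at least one 'o' after 'Yaho'
rs = re.findall('Yahoo+', doc)
print(rs)
```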
What does this mean?
doc = 'YahooYaooYahoooo'
regExp = 'Yah?oo'
rs = re.findall(regExp, doc)
print(rs)
- doc = the string being searched
- regExp is a regular expression in which h? marks the h as optional
- rs = re.findall(regExp, doc)
means: find all matches of the regular expression in the document
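The same snippet, runnable; h? lets both 'Yahoo' and 'Yaoo' match:

```python
import re

doc = 'YahooYaooYahoooo'
# 'h?' makes the h optional; note the pattern consumes exactly two o's,
# so the trailing 'Yahoooo' yields a plain 'Yahoo' match.
rs = re.findall('Yah?oo', doc)
print(rs)
```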
_______A metacharacter in regular expressions that specifies repeated patterns and also defines lower and upper limits
a. {,}
b. (,)
c. +
d. \
a. {,}
{ } A metacharacter in regular expressions that specifies repeated patterns and also defines lower and upper limits
[a-z]{2,5} means we want at least 2 characters and at most 5 characters
- ab (2 characters) so valid
- abcd (4 characters) so valid
- dldgkd (6 characters) not valid
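The three cases above can be checked with fullmatch, which tests the whole string against the pattern:

```python
import re

# fullmatch checks the ENTIRE string against [a-z]{2,5}
print(bool(re.fullmatch('[a-z]{2,5}', 'ab')))      # True: 2 characters
print(bool(re.fullmatch('[a-z]{2,5}', 'abcd')))    # True: 4 characters
print(bool(re.fullmatch('[a-z]{2,5}', 'dldgkd')))  # False: 6 characters
```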
{0} is the same as what metacharacter?
a. {,}
b. *
c. +
d. \
b. *
{0} = *
{1} is the same as what metacharacter?
a. {,}
b. *
c. +
d. \
c. +
{1} = +
{0,1} is the same as what metacharacter?
a. ?
b. *
c. +
d. \
a. ?
{0,1} = ?
match() means what in regular expression?
a. determines if the RE matches at the beginning of the string
b. scans through a string, looking for any location where this RE matches
c. neither
match()
a. determines if the RE matches at the beginning of the string
search() means what in regular expression?
a. determines if the RE matches at the beginning of the string
b. scans through a string, looking for any location where this RE matches
c. neither
search()
b. scans through a string, looking for any location where this RE matches
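The two cards above side by side; the sample text is my own illustration:

```python
import re

text = 'the cat sat'
# match() only succeeds at the BEGINNING of the string
print(re.match('cat', text))          # None: 'cat' is not at the start
# search() scans the whole string for the first location that matches
print(re.search('cat', text).span())  # (4, 7): found at indices 4-7
```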
______ what regular expression means “logical or” used to join.
a. +
b. ()
c. |
d. *
c. |
| means logical OR, used to join alternatives.
______ what regular expression matches at the end of a string, or any location followed by a newline character?
a. +
b. $
c. |
d. *
b. $
$ matches at the end of a string, or any location followed by a newline character
______ what regular expression makes a group of characters to be treated just like a single character?
ex. want to find repetitions such as thethethethe
p=re.compile(‘(the)+’)
a. +
b. $
c. |
d. ()
d. ()
() is a regular expression construct that makes a group of characters be treated just like a single character.
() makes a group. If you want to find all repetitions of ‘the’, group it as shown below, then search in doc:
(the)+
m = p.search(doc)
(7, 22) thethethethethe
what does this mean?
It is the beginning and end of the match:
the match starts at index 7 and ends at index 22
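A runnable version; the doc string here is hypothetical, padded so the match starts at index 7 and reproduces the card's (7, 22):

```python
import re

p = re.compile('(the)+')            # group 'the', repeated one or more times
doc = 'xxxxxxxthethethethethezzz'   # 7 filler chars, then 'the' five times
m = p.search(doc)
print(m.span())    # (7, 22): match starts at index 7 and ends at index 22
print(m.group())   # 'thethethethethe'
```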
______ splits the string into a list, splitting it wherever the RE matches.
a. split()
b. sub()
c. subn()
a. split()
split()- splits the string into a list, splitting it wherever the RE matches.
ex: if we want to split the string 'abc, f12, 1349,a' wherever the RE (e.g. '\W+') matches
output: ['abc', 'f12', '1349', 'a']
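The split example as runnable code:

```python
import re

# Split on runs of non-word characters (anything other than
# letters, digits, and underscore).
p = re.compile(r'\W+')
print(p.split('abc, f12, 1349,a'))  # ['abc', 'f12', '1349', 'a']
```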
______ finds all substrings where RE matches and replaces them with a different string.
a. split()
b. sub()
c. subn()
b. sub()
(also known as substitute)
sub() - finds all substrings where RE matches and replaces them with a different string.
______Does the same thing as sub() but returns with a new string and the number of replacements.
a. split()
b. sub()
c. subn()
c. subn()
Does the same thing as sub() but returns with a new string and the number of replacements.
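A minimal sketch of sub() next to subn(); the pattern and replacement strings are my own illustration:

```python
import re

p = re.compile('blue|red')
# sub() replaces every match and returns the new string
print(p.sub('colour', 'blue socks and red shoes'))
# subn() does the same but also returns the number of replacements
print(p.subn('colour', 'blue socks and red shoes'))
```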
p = re.compile(‘\W+’)
This is an example of a
a. word tokenizer
b. word spacer
c. w+
d. word compiler
a. word tokenizer
p = re.compile(‘\W+’)
_____ a field that focuses on software’s ability to understand and process human languages
a. NLP (natural language processing)
b. language compiler
c. word tokenizer
a. NLP (natural language processing)
-a field that focuses on software’s ability to understand and process human languages
_____ Splitting the text into tokens, the minimal meaningful units. This can split text into words or sentences, or sentences into words.
a. tokenization
b. parts of speech
c. stemming
a. tokenization
Splitting the text into tokens, the minimal meaningful units. This can split text into words or sentences, or sentences into words.
______ Assigning parts of speech to text
ex. noun, verb, adverb, etc.
a. tokenization
b. parts of speech
c. stemming
b. parts of speech
Assigning parts of speech to text
ex. noun, verb, adverb, etc.
______ process of reducing words to their stem.
ex. walking -> walk
a. tokenization
b. parts of speech
c. stemming
c. stemming
process of reducing words to their stem.
ex. walking -> walk
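A toy suffix-stripping function to illustrate the idea only; real stemmers (e.g. NLTK's PorterStemmer) use much more careful rules:

```python
# Strip a common suffix if the remaining stem would still be long enough.
def toy_stem(word):
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(toy_stem('walking'))  # 'walk'
print(toy_stem('jumped'))   # 'jump'
print(toy_stem('cats'))     # 'cat'
```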
____ similar to stemming, but operates using word context; for example, it can map “better” to its base form “good”
a. tokenization
b. parts of speech
c. stemming
d. lemmatization
d. lemmatization
similar to stemming, but operates using word context; for example, it can map “better” to its base form “good”
_____ named entity recognition: labels sequences of words that are the names of things.
ex. person, company, or street
a. tokenization
b. NER
c. stemming
b. NER
named entity recognition - labels sequences of words that are the names of things.
ex. person, company, or street
____ analyzes the grammar of the text to extract its syntactic structure
a. tokenization
b. NER
c. stemming
d. parsing
d. parsing
analyzes the grammar of the text to extract its syntactic structure
spaCy - NER, tokenization, etc.
CoreNLP - Stanford's Java-based NLP toolkit
gensim - semantic analysis; emphasizes clarity and efficiency
NLTK - Natural Language Toolkit (mother of all NLP libraries)
____ used for filtering information in web search. Helps avoid SPAM emails by classification.
a. text classification
b. classification
c. nlp classification
a. text classification
text classification - used for filtering information in web search. Helps avoid SPAM emails by classification.
____ identify opinions and sentiments of the audience. Understand emotions of audience via social media.
a. sentiment analysis
b. chatbots
c. classification
d. advertisement
a. sentiment analysis
Sentiment Analysis- identify opinions and sentiments of the audience. Understand emotions of audience via social media.
____ helps in customer support and assistance through low priority tasks. Also used in HR Systems like how many vacation days left.
a. chatbots
b. customer service
c. sentiment analysis
a. chatbots
helps in customer support and assistance through low priority tasks. Also used in HR Systems like how many vacation days left.
_______ offers insights into audience preferences and helps improve customer satisfaction
a. customer service
b. chatbots
c. sentiment analysis
a. customer service
offers insights into audience preferences and helps improve customer satisfaction
____ offers document summarization, machine translation, and speech recognition.
a. customer service
b. chatbots
c. sentiment analysis
d. natural language processing
d. natural language processing
offers document summarization, machine translation, and speech recognition.
Natural Language is any language that has evolved naturally through use and repetition without conscious planning or premeditation
TRUE
FALSE
TRUE
Natural Language is any language that has evolved naturally through use and repetition without conscious planning or premeditation
Natural Language is what humans use to communicate and it has evolved with human evolution
NLP is a science that focuses on :
1 - Grammar
2 - Translation
3 - Speech Recognition
4 - Software's ability to understand and process human language
A. Software’s ability to understand and process human’s language
B. Speech Recognition
C. Grammar
D. Translation
NLP is a science that focuses on :
A. Software’s ability to understand and process human’s language
NLP has evolved as a science to build programs or software capable of understanding human language
NLTK stands for Natural Language Tool Kit
TRUE
FALSE
TRUE
NLTK stands for Natural Language Tool Kit
_____ process of breaking up text into smaller pieces (tokens)
a. tokenization
b. NER
c. stemming
d. parsing
a. tokenization
TOKENIZATION process of breaking up text into smaller pieces (tokens)
____ words that are commonly used. Language specific also. ex: ‘a’, ‘an’, ‘the’
a. stop words
b. tokenization
c. stemming
d. parsing
a. stop words
Stop words - words that are commonly used. Language specific also.
Tokenizing a sentence is assigning ids to each word
FALSE
TRUE
FALSE
Tokenizing a sentence is splitting it into tokens ( words )
POS tagging is the process of assigning tags to tokens (words) like nouns, verbs …
TRUE
FALSE
TRUE
POS tagging is the process of assigning part of speech tags to tokens (words)
Tags include noun, verb, adjective etc…
TF-IDF is used to :
A. Extract keywords or features from a Text
B. Find synonyms
C. Extract the root or lemma of a word
A. Extract keywords or features from a Text
TF-IDF is a technique used to find what are the dominant words or keywords in a text
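A minimal TF-IDF sketch in plain Python, illustrative only; libraries like scikit-learn use more refined weighting and smoothing, and the tiny corpus here is my own example:

```python
import math

docs = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'ran'],
    ['the', 'cat', 'ran'],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)         # term frequency in this document
    df = sum(1 for d in docs if term in d)  # how many documents contain it
    idf = math.log(len(docs) / df)          # inverse document frequency
    return tf * idf

# 'the' appears in every document, so its idf (and tf-idf) is 0: not a keyword.
print(tf_idf('the', docs[0], docs))
# 'sat' appears in only one document, so it scores highest there: a keyword.
print(tf_idf('sat', docs[0], docs))
```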
______ Process of computationally classifying and categorizing opinions expressed in a piece of text.
Helps understand the writers opinion about a topic, event, product etc.
A. Sentiment Analysis
B. Find synonyms
C. Extract the root or lemma of a word
A. Sentiment Analysis
Process of computationally classifying and categorizing opinions expressed in a piece of text.
________is the first layer of the neural network.
______ Allows words with similar meanings to have similar representation.
A. Sentiment Analysis
B. Find synonyms
C. Extract the root or lemma of a word
D. Word Embeddings
D. Word Embeddings
Word Embeddings are the first layer of the neural network.
Word Embeddings allow words with similar meanings to have similar representations.
what does this code mean?
network = Sequential()
This shows that you are using a neural network whose layers are added in sequence.
The closer the sentiment analysis is to 0 what does that mean?
a. more positive result
b. more negative result
b. more negative result
Sentiment analysis closer to 0 is a more negative result.
The closer the sentiment analysis is to 1 what does that mean?
a. more positive result
b. more negative result
a. more positive result
Sentiment analysis closer to 1 is a more positive result.
Sentiment Analysis is a process to classify text into topics
FALSE
TRUE
FALSE
Sentiment Analysis is used to classify text based on the opinion or the sentiment of the writer: Negative or Positive.
It is a best practice to use all the dataset to train models
FALSE
TRUE
FALSE
Dataset must be split into Training and Test data in order to avoid wrong performance calculations.
True or False:
The basic mechanics of machine learning is to make computers act without being explicitly programmed to do so?
True
The basic mechanics of machine learning is to make computers act without being explicitly programmed to do so.
_____ has given us:
-Fraud detection
-Web search
-Self-Driving cars
-Online shopping recommendations
A. Machine Learning
B. Deep Learning
C. Reinforcement Learning
d. None of above
A. Machine Learning
Machine Learning has given us:
-Fraud detection
-Web search
-Self-Driving cars
-Online shopping recommendations
______ helps us take a picture
of what someone else wrote in a board and convert it into text.
ex. scan a doc and want to convert to word document
A. OCR (Optical Character Recognition)
B. Machine Learning
C. Deep Learning
D. Reinforcement Learning
A. OCR (Optical Character Recognition)
OCR (Optical Character Recognition) -
helps us take a picture
of what someone else wrote in a board and convert it into text.
ex. scan a doc and want to convert to word document
_____ has applications such as:
-Facebook news feed
-Self-Driving cars
-Virtual personal assistant
-Email spams
-Online customer support
A. OCR (Optical Character Recognition)
B. Machine Learning
C. Deep Learning
D. Reinforcement Learning
B. Machine Learning
Machine Learning applications include:
-Facebook news feed
-Self-Driving cars
-Virtual personal assistant
-Email spams
-Online customer support
______ Mainly used for classification problems; picks the most significant attribute and then splits on it, creating a tree-like structure
a. Decision Tree
b. Logistic Regression
c. Linear Regression
d. Naive Bayes
a. Decision Tree
Decision Trees - Mainly used for classification problems; pick the most significant attribute and then split on it, creating a tree-like structure
______ Uses data we have learned in the past and applies what is learned to new data. It starts with a dataset and trains a model.
The output is also compared with the correct answer to improve the model.
a. supervised
b. unsupervised learning
a. supervised - Uses data we have learned in the past and applies what is learned to new data. It starts with a dataset and trains a model.
_________ if the dataset is not labeled, categorized, or configured. Finds a hidden pattern or structure in unlabeled data based on similarities.
a. supervised
b. unsupervised learning
unsupervised learning - if the dataset is not labeled, categorized, or configured. Finds a hidden pattern or structure in unlabeled data based on similarities.
_______ statistical approach to find the relationship between variables. Predicts an outcome from input based on the relationship between variables extracted or obtained from the dataset.
a. linear regression
b. logistic regression
a. linear regression
Linear Regression - statistical approach to find the relationship between variables. Predicts an outcome from input based on the relationship between variables extracted or obtained from the dataset.
________ also a statistical method, used to predict a binary outcome (Yes/No, 0/1, True/False) given independent variables; used when the outcome variable is categorical.
Ex. whether an email is spam or not.
a. linear regression
b. logistic regression
b. logistic regression
Logistic Regression - also a statistical method, used to predict a binary outcome (Yes/No, 0/1, True/False) given independent variables; used when the outcome variable is categorical.
Ex. whether an email is spam or not.
________ Useful for large datasets; can outperform even highly sophisticated classification methods. Forms a family of simple probabilistic classifiers. All attributes are assumed independent.
ex. An orange is round, a certain size, and a certain color, but the classifier does not consider these attributes jointly.
A. Decision tree
b Naive Bayes
b Naive Bayes
Naive Bayes - Useful for large datasets; can outperform even highly sophisticated classification methods. Forms a family of simple probabilistic classifiers. All attributes are assumed independent.
ex. An orange is round, a certain size, and a certain color, but the classifier does not consider these attributes jointly.
_____ - process of predicting the class or category of a given input / data. A program learns from the training dataset. It can be bi-class (ex. male or female). A sentiment analyzer is an example of a bi-class classifier.
ex. if person is male or female
or if email is spam or not.
a. classification
b. decision tree
c. naive bayes
a. classification
CLASSIFICATION - process of predicting the class or category of a given input / data. A program learns from the training dataset. It can be bi-class (ex. male or female). A sentiment analyzer is an example of a bi-class classifier.
True or False:
Classification can be
A. Bi-Class (ex.: male or female)
B. Multi-Class (ex.: what type of fruit is in a picture, or what topic an article is talking about)
True:
Classification can be
A. Bi-Class
B. Multi-Class
_______ classifies textual information into categories. We want to know what people are talking about and what their opinions are.
a. classification
b. decision tree
c. naive bayes
d. text classification
d. text classification
Text classification - classifies textual information into categories. We want to know what people are talking about and what their opinions are.
____ used to organize and structure text into classes.
a. classification
b. decision tree
c. naive bayes
d. text classification
d. text classification
Text Classification - used to organize and structure text into classes.
Provide steps in Classification
i. Feature extraction (ex. sentiment analyzer in keras) and transform into math representation in the form of vectors.
ii. Labels: (ex. sunny, machine, learning) represented as 1, 1, 0
iii. Goes into training and text is analyzed.
iv. Model created and tested.
a. 1,2,3,4
b. 4,3,2,1
c. 2,1,3,4
a. 1,2,3,4
Classification Steps:
i. Feature extraction (ex. sentiment analyzer in keras) and transform into math representation in the form of vectors.
ii. Labels: (ex. sunny, machine, learning) represented as 1, 1, 0
iii. Goes into training and text is analyzed.
iv. Model created and tested.
Steps to Pre-Process Dataset for Classification
i. pre-process the data to get Dataset
(use scikit learn)
ii. Get the training and test data subsets
iii. check out the category names
iv. printing a single post
v. extracting features
vi. calculating TF-IDF
a. 1,2,3,4,5,6
b. 2,4,6,1,3,5
c. 6,5,4,3,2,1
d. none of above
a. 1,2,3,4,5,6
Steps to Pre-Process Dataset for Classification
i. pre-process the data to get Dataset
(use scikit learn)
ii. Get the training and test data subsets
iii. check out the category names
iv. printing a single post
v. extracting features
vi. calculating TF-IDF
________ has multinomial and Gaussian variants. The multinomial variant is for multinomial data - used for text classification (ex. word counts for text classification)
a. naive bayes
b. SVM
c. Multinomial
a. naive bayes
The multinomial variant is for multinomial data - used for text classification (ex. word counts for text classification)
_____ for multinomial data - used for text classification (ex. word counts for text classification)
a. naive bayes
b. SVM
c. Multinomial
c. Multinomial
Multinomial is a Naive Bayes classifier for multinomial data - used for text classification (ex. word counts for text classification)
_____ A classifier used for classification or regression problems. Uses hyperplane separation. A discriminative classifier given labeled data: based on the labeled data, it outputs an optimal hyperplane to separate the input data or categorize potential new points.
a. naive bayes
b. SVM (support vector machines)
c. Multinomial
b. SVM (support vector machines)
SVM - A classifier used for classification or regression problems. Uses hyperplane separation. A discriminative classifier given labeled data: based on the labeled data, it outputs an optimal hyperplane to separate the input data or categorize potential new points.
Machine Learning is used in :
A. Recommendation engines
B. Self-Driving cars
C. Fraud Detection
D. All the above
D. All the above
Machine Learning is used in :
-Recommendation engines
-Self-Driving cars
-Fraud Detection
Machine learning can be supervised or unsupervised
TRUE
FALSE
TRUE
Machine Learning is divided into two categories : Supervised and Unsupervised
Text Classification is used to correct the grammar mistakes in a text
FALSE
TRUE
FALSE
Text Classification is used to classify text content based on topic or sentiment per example
_____ An Artificial Intelligence computer program that can hold a conversation with a human using natural language
ex. C3PO in Star Wars
A. Chatbots
B. AI
C. Deeplearning
D. NLP
A. Chatbots
An Artificial Intelligence computer program that can hold a conversation with a human using natural language
ex. C3PO in Star Wars
EX: How is the weather going to be tomorrow
___ a library used to build chatbots. An ML conversational dialogue engine built in Python. Provides automated responses to queries. Easy to use, letting you create a chatbot quickly. Uses ML algorithms to produce responses. It is multilingual and open source, available on GitHub
A. Chatbots
B. AI
C. Deeplearning
D. Chatterbot
D. Chatterbot
CHATTERBOT - a library used to build chatbots. An ML conversational dialogue engine built in Python. Provides automated responses to queries. Easy to use, letting you create a chatbot quickly. Uses ML algorithms to produce responses. It is multilingual and open source, available on GitHub
What is the correct Chatterbot Flow?
i. Input
ii. Process and Apply Adapters
iii. Response
a. 1,2,3
b. 3,2,1
c. 2,1,3
a. 1,2,3
i. Input
ii. Process and Apply Adapters
iii. Response
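The Input -> Process/Apply Adapters -> Response flow can be sketched with a toy rule-based bot. This is a stand-in for illustration, not the real ChatterBot API; the preprocessor only mimics `clean_whitespace` in spirit, and the keyword rules are made up:

```python
def clean_whitespace(statement):
    """Preprocessor step: collapse runs of whitespace in the input."""
    return " ".join(statement.split())

def logic_adapter(statement):
    """Adapter step: pick a canned response by simple keyword matching."""
    rules = {"hello": "Hi there!", "weather": "It should be sunny tomorrow."}
    for keyword, response in rules.items():
        if keyword in statement.lower():
            return response
    return "Sorry, I don't understand."

def chatbot(statement):
    # i. Input -> ii. Process and apply adapters -> iii. Response
    return logic_adapter(clean_whitespace(statement))

print(chatbot("How is the   weather tomorrow?"))  # -> It should be sunny tomorrow.
```

Real ChatterBot chains several logic adapters and picks the highest-confidence response, but the three-stage flow is the same.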
A chatbot is:
A. A program that can hold a conversation
B. HumanDroid
C. A dating app
A. A program that can hold a conversation
A chatbot is a computer program that can hold a human-like conversation
We use ChatterBot preprocessors to modify the input statement that a chatbot receives
TRUE
FALSE
TRUE
Preprocessors are used to modify the input; for example, 'chatterbot.preprocessors.clean_whitespace' removes extra white space
______ are used to modify the input, like 'chatterbot.preprocessors.clean_whitespace', which removes extra white space
A. pre-processors
b. GPU
c. classification
d. NLTK
A. pre-processors
Preprocessors are used to modify the input; for example, 'chatterbot.preprocessors.clean_whitespace' removes extra white space
Chatterbot supports only the English Language
FALSE
TRUE
FALSE
One of ChatterBot's advantages is that it supports multiple languages
How many words in the sentence?
” I always uh do the main um processing, I mean, the uh um data-processing.”
a. 15
b. 10
c. 11
uh and um are considered "DISFLUENCIES" (filled pauses).
Whether they count as words depends on the application you are working on.
a. 15
In this specific case, we are counting space-separated units.
____ this is task dependent and language dependent
a. word
b. vocabulary
c. correctors
a. word
Words are task dependent and language dependent
_____ set of unique words (word types)
a. word
b. vocabulary
c. correctors
b. vocabulary
Vocabulary - set of unique words (word types). Punctuations are not words.
I always uh do the main um processing, I mean, the uh um data-processing.
In NLP vocabulary what is not considered
a. punctuation
b. stop words
c. can’t
a. punctuation
In NLP, vocabulary, PUNCTUATION is NOT CONSIDERED
What is the vocabulary outcome of the following:
“I always uh do the main um processing, I mean, the uh um data-processing”
{I, always, uh, do, the, main, um, processing, mean, data} (unique word types only; repeated words and punctuation are dropped)
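The tokens-vs-vocabulary distinction for this exact sentence can be checked in a few lines of Python; the simple letters-only regex tokenizer is my own choice (it drops punctuation and splits the hyphenated word, matching the space-unit count above):

```python
import re

sentence = ("I always uh do the main um processing, "
            "I mean, the uh um data-processing.")

# Keep runs of letters only: punctuation is dropped, hyphens split words.
tokens = re.findall(r"[A-Za-z]+", sentence)
vocab = set(tokens)

print(len(tokens))  # -> 15 (every occurrence counts as a token)
print(len(vocab))   # -> 10 (unique word types only)
```

Tokens count every occurrence, so 15; the vocabulary collapses repeats (I, the, uh, um, processing each appear twice), leaving 10 types.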
______ a large body of text containing all the available documents
a. corpus
b. vocab
c. stop words
a. corpus
CORPUS - a large body of text containing all the available documents.
__________ the list of words (tokens) in a document, meaning every word occurrence in the document.
a. corpus
b. vocab
c. stop words
d. tokens
d. tokens
Tokens - the list of words in a document, meaning every word occurrence in the document.
What is the token in the following:
“I always uh do the main um processing, I mean, the uh um data-processing?”
a. 11
b. 15
c. 12
b. 15
Every word occurrence is a token, so every occurrence is counted (including repeats).
_______ an up-to-date package for NLP processing. Most modern NLP work uses this package.
a. Python
b. spaCy
c. sci-kit learn
b. spaCy
spaCy - an up-to-date package for NLP processing. Most modern NLP work uses this package.
The Text Processing Flow
i. build the vocabulary (from the corpus, recognize the words you need)
ii. represent different words by word encodings (also called word embeddings)
iii. classification pipeline
a. 1,2,3
b. 3,2,1
c. 2,1,3
a. 1,2,3
The Text Processing Flow
i. build the vocabulary (from the corpus, recognize the words you need)
ii. represent different words by word encodings (also called word embeddings)
iii. classification pipeline
True or False:
Every NLP task requires text normalization:
- Tokenizing words
- Normalizing word formats
- Segmenting sentences
True
Every NLP task requires text normalization:
- Tokenizing words
- Normalizing word formats
- Segmenting sentences
True or False:
In Space based Tokenization,
Many languages (like Chinese, Japanese, Thai) DO NOT use spaces to separate words.
True
In “Space based Tokenization”,
Many languages (like Chinese, Japanese, Thai) DO NOT use spaces to separate words.
i. receive the data
ii. learn from the data what kinds of units should be treated as tokens, whether they are whole words or not.
overall: USE THE DATA to tell us HOW TO TOKENIZE
a. Data Driven Approach
b. Data Tokenization
c. Subword Tokenization
a. Data Driven Approach
i. receive the data
ii. learn from the data what kinds of units should be treated as tokens, whether they are whole words or not.
overall: USE THE DATA to tell us HOW TO TOKENIZE
______ Rather than using whole words, you build the vocabulary from individual characters (all the letters in the corpus), then repeatedly merge the most frequent adjacent pairs into new vocabulary entries.
a. Data Driven Approach
b. Data Tokenization
c. Subword Tokenization
d. Byte Pair Encoding (BPE)
Visual: (A, B, C, D, … a, b, c, d, …)
The most frequent pair (A, B) is merged into 'AB' and added to the vocab. Add this to the corpus:
(A, B, C, D, … a, b, c, d, … AB)
Keep doing this until you have a lot of merges that make words: "k merges"
d. Byte Pair Encoding (BPE)
Rather than using whole words, you build the vocabulary from individual characters (all the letters in the corpus), then repeatedly merge the most frequent adjacent pairs into new vocabulary entries.
Visual: (A, B, C, D, … a, b, c, d, …)
The most frequent pair (A, B) is merged into 'AB' and added to the vocab. Add this to the corpus:
(A, B, C, D, … a, b, c, d, … AB)
Keep doing this until you have a lot of merges that make words: "k merges"
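The BPE merge loop described above can be sketched as a toy implementation (a teaching version, not a production tokenizer; the corpus and k = 2 are made-up examples):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged, new_sym = {}, "".join(pair)
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus as words split into characters, with word frequencies
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
for _ in range(2):  # k = 2 merges: first l+o -> "lo", then lo+w -> "low"
    words = merge_pair(words, most_frequent_pair(words))
print(words)
```

After two merges, "low" has become a single vocabulary symbol because its character pairs were the most frequent in the corpus.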
_______________ smallest meaning-bearing unit of a language
a. morpheme
b. byte pair encoding (BPE)
c. Data Driven Approach
d. Data Tokenization
a. morpheme
morpheme- smallest meaning- bearing unit of a language
_________ the core meaning bearing units
a. stems
b. affixes
c. morpheme
d. word normalization
a. stems
STEMS- the core meaning bearing units
___________ parts that attach to stems, often with grammatical functions.
ex. ING -> ∅ (the empty string; the suffix is deleted)
ex. SSES -> SS
ex. ATIONAL -> ATE (relational -> relate)
a. stems
b. affixes
c. morpheme
d. word normalization
b. affixes
AFFIXES - parts that attach to stems, often with grammatical functions.
ex. ING -> ∅ (the empty string; the suffix is deleted)
ex. SSES -> SS
ex. ATIONAL -> ATE (relational -> relate)
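The three Porter-style suffix rules above can be sketched directly; this toy stemmer implements only those three example rules, nothing more:

```python
def stem(word):
    """Strip one suffix using three Porter-style rules:
    ational -> ate, sses -> ss, ing -> empty string."""
    rules = [("ational", "ate"), ("sses", "ss"), ("ing", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(stem("relational"))  # -> relate
print(stem("classes"))     # -> class
print(stem("walking"))     # -> walk
```

The full Porter stemmer (available in NLTK) has many more rules and applies them in ordered passes, but each rule works exactly like this.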
__________ very useful in preprocessing. This is critical in chatbot systems and speech recognition systems.
a. sentence segmentation
b. word normalization
c. stemming
d. morpheme
a. sentence segmentation
SENTENCE SEGMENTATION - very useful in preprocessing. This is critical in chatbot systems and speech recognition systems.
Flow of Sentence Segmentation
i. tokenize first
ii. Use rules or ML to classify each period (.) as either (1) part of a word (e.g., an abbreviation) or (2) the end of a sentence
a. 1,2
b. 2,1
a. 1,2
Flow of Sentence Segmentation
i. tokenize first
ii. Use rules or ML to classify each period (.) as either (1) part of a word (e.g., an abbreviation) or (2) the end of a sentence
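The rule-based version of step ii can be sketched as: a period is a sentence boundary unless the token before it is a known abbreviation. The abbreviation list here is an illustrative assumption, not a standard resource:

```python
# Hypothetical abbreviation list for illustration only
ABBREVIATIONS = {"dr", "mr", "mrs", "etc"}

def segment(tokens):
    """Split a token stream into sentences: '.' ends a sentence
    unless the previous token is a known abbreviation."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok == "." and (i == 0 or tokens[i - 1].lower() not in ABBREVIATIONS):
            sentences.append(current)
            current = []
    if current:  # trailing material without a final period
        sentences.append(current)
    return sentences

tokens = ["Dr", ".", "Smith", "arrived", ".", "He", "sat", "down", "."]
print(len(segment(tokens)))  # -> 2
```

Note the flow matches the card: we tokenize first, then classify each period; the "." after "Dr" is treated as part of the abbreviation, so only two sentences come out. ML-based segmenters replace the lookup with a learned classifier.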
A model that predicts the probability of a sentence. Also known as a probabilistic model.
a. Language modeling
b. probabilistic modeling
c. machine translation
d. speech recognition
a. Language modeling
A model that predicts the probability of a sentence. Also known as a probabilistic model.
P(high winds tonight) > P(large wind tonight)
this is an ex of
a. speech recognition
b. spelling correction
c. machine translation
c. machine translation
"High winds" is a more probable English phrase than "large winds", so the translation model prefers it.
P(I saw a van) >> P(eyes awe of an)
this is an ex of
a. speech recognition
b. spelling correction
c. machine translation
a. speech recognition
The two word sequences sound the same, but "I saw a van" is far more probable, so the recognizer picks it.
p(about fifteen minutes from) >
p(about fifteen minuets from)
this is an ex of
a. speech recognition
b. spelling correction
c. machine translation
b. spelling correction
Because "minuets" in the second phrase is a misspelling of "minutes"
p(about fifteen minutes from) >
p(about fifteen minuets from)
P(w1,w2,w3,w4)
is an example of
a. probability of a sentence
b. probability of a next word
a. probability of a sentence
P(w1,w2,w3,w4)
P(wn| w1,w2…wn-1)
is an example of
a. probability of a sentence
b. probability of a next word
b. probability of a next word
P(wn| w1,w2…wn-1)
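The two quantities connect via the chain rule: P(w1, …, wn) is the product of P(wi | w1, …, wi-1) over all positions. A bigram approximation, which conditions each word only on the previous one, can be sketched as follows (the toy corpus and function names are my own, and start/end-of-sentence markers are omitted for brevity):

```python
from collections import Counter

corpus = "the cat sat . the cat ran . the dog sat .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """Bigram estimate of P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(words):
    """Chain rule with the bigram approximation:
    P(w1..wn) ~= P(w1) * product of P(wi | wi-1)."""
    p = unigrams[words[0]] / len(corpus)
    for prev, word in zip(words, words[1:]):
        p *= p_next(word, prev)
    return p

# "the cat" is more probable than "the dog" in this toy corpus
print(p_next("cat", "the") > p_next("dog", "the"))  # -> True
```

This is exactly the pattern behind the flashcard examples: the model scores competing word sequences and the application (translation, recognition, correction) picks the higher-probability one.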