Chapter 2 Flashcards
What is the topic of Chapter 2?
Chapter 2 covers accessing text corpora and lexical resources with NLTK.
What is a text corpus?
A text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres.
What is the Gutenberg Corpus?
○ NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/
Importing words from NLTK using different forms of the Python import statement
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
(Let's pick out the first of these texts, whose fileid is austen-emma.txt.)
Python provides another version of the import statement, as follows:
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')
Write a program to display other information about each text, by looping over all the values of fileid in the Gutenberg corpus and computing statistics.
>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid))
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
...     print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
...
5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
○ This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it has a recurrent value of 4. (In fact, the average word length is really 3, not 4, since the num_chars variable counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
What does the raw() function do?
The raw() function gives us the contents of the file without any linguistic processing. For example, len(gutenberg.raw('blake-poems.txt')) tells us how many letters occur in the text, including the spaces between words.
What is the sents() function?
The sents() function divides the text up into its sentences, where each sentence is a list of words:
What is the Brown Corpus?
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. We can access the corpus as a list of words, or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:
How do we find the categories in the Brown Corpus?
>>> from nltk.corpus import brown
>>> brown.categories()
In which cases is the Brown Corpus important?
The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.
Write a program to compare genres in their usage of modal verbs.
>>> import nltk
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print(m + ':', fdist[m], end=' ')
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
We need to include end=' ' in order for the print function to put its output on a single line.
If you want to use FreqDist directly, what do you need to import?
from nltk import FreqDist
What is the Reuters Corpus?
The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test"; thus, the text with fileid 'test/14826' is a document drawn from the test set.
Inaugural Address Corpus
Earlier we looked at the Inaugural Address Corpus, but treated it as a single text. However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension:
Example: Inaugural corpus
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]
Annotated Text Corpora for teaching and learning
Many text corpora contain linguistic annotations, representing POS tags, named entities, syntactic structures, semantic roles, and so forth. NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research.
Some of the corpora:
- Movie Reviews (Pang, Lee): 2k movie reviews with sentiment polarity classification
- SentiWordNet (Esuli, Sebastiani): sentiment scores for 145k WordNet synonym sets
- Wordlist Corpus (OpenOffice.org et al.): 960k words and 20k affixes for 8 languages
- WordNet 3.0 (English) (Miller, Fellbaum): 145k synonym sets
Loading your own Corpus
If you have your own collection of text files that you would like to access using the above methods, you can easily load them with the help of NLTK's PlaintextCorpusReader.
- The second parameter of the PlaintextCorpusReader initializer can be a list of fileids, like ['a.txt', 'test/b.txt'], or a regular expression pattern that matches all fileids, like '[abc]/.*\.txt'.
Example of importing files
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
Generating Random Text with Bigrams: Example
The bigrams() function takes a list of words and builds a list of consecutive word pairs. Remember that, in order to see the result and not a cryptic "generator object", we need to use the list() function:
>>> import nltk
>>> sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven']
>>> list(nltk.bigrams(sent))
[('In', 'the'), ('the', 'beginning'), ('beginning', 'God'), ('God', 'created'), ('created', 'the'), ('the', 'heaven')]
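These bigram pairs can drive a simple text generator: the NLTK book builds a ConditionalFreqDist over the bigrams of a corpus and repeatedly picks the most frequent successor of the current word. A minimal sketch of that idea, using a tiny made-up word list instead of a real corpus and returning a list rather than printing:

```python
import nltk

# Tiny stand-in text (the book uses a real corpus here)
words = ['the', 'cat', 'chased', 'the', 'dog', 'and', 'the', 'cat']

# Condition each word on the word that precedes it
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

def generate_model(cfdist, word, num=5):
    """Greedily follow the most frequent successor of each word."""
    out = []
    for _ in range(num):
        out.append(word)
        word = cfdist[word].max()
    return out

print(' '.join(generate_model(cfd, 'the')))  # prints "the cat chased the cat"
```

Because it always takes the most frequent successor, the output starts looping as soon as the chain revisits a word; the book observes the same behavior with its corpus-trained model.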
Advice on testing code with an editor
• It is often convenient to test your ideas using the interpreter, revising a line of code until it does what you expect. Once you're ready, you can paste the code (minus any >>> or ... prompts) into the text editor, continue to expand it, and finally save the program in a file so that you don't have to type it in again later.
How to name a file in Python?
Give the file a short but descriptive name, using all lowercase letters and separating words with underscores.
Functions
A function is just a named block of code that performs some well-defined task.
• We define a function using the keyword def, followed by the function name and any input parameters, followed by the body of the function.
A simple plural function?
def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

>>> plural('fairy')
'fairies'
>>> plural('woman')
'women'
The endswith() method
• The endswith() function is always associated with a string object (e.g., word in the plural function above). To call such a function, we give the name of the object, a period, and then the name of the function. These functions are usually known as methods.
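A quick plain-Python illustration of the method-call syntax:

```python
word = 'dishes'

# Method call: object name, a period, then the function name
print(word.endswith('es'))   # True
print(word.startswith('d'))  # True

# Contrast with an ordinary function call, where the object is an argument
print(len(word))             # 6
```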
What is module?
• A collection of variable and function definitions in a file is called a Python module.
What is Package ?
• A collection of related modules is called a package.
NLTK’s code for processing the Brown Corpus is an example of a module, and its collection of code for processing all the different corpora is an example of a package. NLTK itself is a set of packages, sometimes called a library.
What is a lexicon?
A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions.
What is a lexical resource?
○ Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. For example, if we have defined a text my_text, then vocab = sorted(set(my_text)) builds the vocabulary of my_text, while word_freq = FreqDist(my_text) counts the frequency of each word in the text. Both vocab and word_freq are simple lexical resources.
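A minimal sketch of the two lexical resources named above, using a tiny hand-made token list in place of a real text:

```python
from nltk import FreqDist

# Stand-in for a real tokenized text
my_text = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Vocabulary: sorted list of distinct words
vocab = sorted(set(my_text))
print(vocab)             # ['cat', 'mat', 'on', 'sat', 'the']

# Frequency distribution: how often each word occurs
word_freq = FreqDist(my_text)
print(word_freq['the'])  # 2
```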