Module 3: 1. Real world problem: Predict rating given product reviews on Amazon (Flashcards)
Why convert text to vector? Explain with an example.
We can leverage the power of linear algebra once we convert the text to a vector
Eg.
Review text ——> d-dim vector —–> Find a plane to separate the +ve and -ve points
if w^T xi > 0 then ri is +ve
else ri is -ve
review text 1 —–> r1 r2 r3 —–> v1 vector
review text 2 —–> r1 r2 —–> v2 vector
review text 3 —–> r3 —–> v3 vector
if English-similarity(r1,r2) > English-similarity(r1,r3),
then distance(v1,v2) < distance(v1,v3)
————-> if r1 and r2 are more similar, then v1 and v2 must be closer
————-> length(v1-v2) < length(v1-v3)
We can convert our text to vectors using tools like BOW, TF-IDF, W2V, Avg-W2V, TFIDF-W2V, etc., and then apply such linear-algebra techniques on the resulting vectors
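Eg. a minimal sketch of the plane-separation idea above with numpy (the review vectors and the plane's weights w are made-up toy values, purely for illustration):

```python
import numpy as np

# Toy illustration: once a review is a d-dim vector x, a linear separator
# labels it by the sign of w^T x. These numbers are made up.
w = np.array([0.4, -0.2, 0.7])       # normal vector of the separating plane
x1 = np.array([1.0, 0.5, 0.8])       # vector of review r1
x2 = np.array([-0.9, 1.2, -0.6])     # vector of review r2

for name, x in [("r1", x1), ("r2", x2)]:
    print(name, "-> +ve" if w.dot(x) > 0 else "-> -ve")
```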
Explain BOW with example.
BOW ——> Bag of words
Bag of words is a representation of text that describes the occurrence of words within a document
r1 : This pasta is very tasty and affordable
r2 : This pasta is not tasty and affordable
r3 : This pasta is delicious and cheap
r4 : Pasta is tasty and pasta tastes good
Steps:
1. Constructing a dictionary —> set of all the unique words in your reviews ——> {This, pasta, is, very, ……..}
2. Constructing the vector —> one dimension per dictionary word, whose value is the # of times that word occurs in ri
v1 : [ 0, 0, 0, 1, ……. , 1, ………… ]
     (dimensions: a, an, The, Pasta, …….., This, ……….)
v1 —–> sparse ——> most elements are zero
d —–> large
r1 : This pasta is very tasty and affordable
r2 : This pasta is not tasty and affordable
word : This  pasta  is  very  tasty  not  and  affordable
v1   :   1     1     1    1     1     0    1       1
v2   :   1     1     1    0     1     1    1       1
length(v1-v2) = ||v1 - v2|| = √(1^2 + 1^2) = √2
Here ||v1 - v2|| is small, i.e. the vectors are close, but r1 and r2 have opposite meanings (r2 says the pasta is NOT tasty). This shows BOW doesn't work well when a small change in words flips the meaning.
Another variation of BOW is Binary BOW
In binary BOW —–> 1 : if wi occurs at least once
—–> 0 : otherwise
Therefore here,
||v1 - v2|| = √(# differing words)
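A minimal sketch of BOW and Binary BOW with scikit-learn's CountVectorizer (assuming a recent scikit-learn; r1-r4 are the reviews above):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This pasta is very tasty and affordable",     # r1
    "This pasta is not tasty and affordable",      # r2
    "This pasta is delicious and cheap",           # r3
    "Pasta is tasty and pasta tastes good",        # r4
]

# Count BOW: each dimension = # of times a dictionary word occurs in ri
bow = CountVectorizer()
X = bow.fit_transform(reviews)              # sparse matrix of shape (4, d)
print(bow.get_feature_names_out())          # the dictionary
print(X.toarray())

# Binary BOW: 1 if the word occurs at least once, 0 otherwise
binary_bow = CountVectorizer(binary=True)
print(binary_bow.fit_transform(reviews).toarray())
```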
Explain Text Preprocessing and different steps/ways for doing it.
Text preprocessing is the step of cleaning the raw text data and making it ready to feed to the model.
Steps for text preprocessing:
1. Removing stopwords (this, is, not, and, …….) —-> not always the best idea, since it also removes ‘not’, which can be crucial for sentiment.
2. Lowercase (Eg. Pasta —-> pasta)
3. Stemming (Eg. tastes, tasty, tasteful ——> taste)
4. Tokenization (Breaking sentence into words)
5. Taking semantic meaning of words into consideration (Eg. Word2vec)
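A minimal sketch of steps 1-4 using NLTK (assuming nltk and its 'punkt'/'stopwords' data are installed); note that 'not' is deliberately kept, as per the caveat in step 1:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Assumes the required data has been downloaded once:
# nltk.download('punkt'); nltk.download('stopwords')

review = "This pasta is NOT tasty and it tastes awful"

tokens = word_tokenize(review.lower())                    # tokenization + lowercasing
stop_words = set(stopwords.words('english')) - {'not'}    # keep 'not' (it can flip meaning)
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]                 # stemming reduces words to a root form
print(stems)
```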
Explain Uni-grams, Bi-grams and n-grams.
r1 : [This] pasta [is] [very] tasty [and] affordable
r2 : [This] pasta [is] [not] tasty [and] affordable
After removing stopwords, v1 and v2 are exactly the same.
Hence, we are forced to conclude r1 and r2 are very similar.
But in reality they are not at all similar.
Uni-gram :
v1 : [ 0, 0, 0, 1, ……. , 1, ………… ] —-> # times each word occurs in ri
     (dimensions: a, an, The, Pasta, …….., This, ……….)
This is the same as BOW
Bi-grams :
Here we will be using pair of words instead of single words.
v1 : [ 1 , 1 , ……….] —-> # times each bi-gram occurs in ri
     (dimensions: [This pasta], [pasta is], ………….)
Tri-grams :
Similarly, in tri-grams we use 3 consecutive words.
Uni-gram BOW ———> discards the sequence information (word order)
# of distinct tri-grams >= # of distinct bi-grams >= # of distinct uni-grams (so the dictionary grows)
n-grams ——-> dimensionality ‘d’ increases
(n > 1)
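A minimal sketch of uni-gram + bi-gram BOW with scikit-learn's ngram_range (the bi-grams [very tasty] vs [not tasty] now distinguish r1 and r2):

```python
from sklearn.feature_extraction.text import CountVectorizer

r1 = "This pasta is very tasty and affordable"
r2 = "This pasta is not tasty and affordable"

# ngram_range=(1, 2): keep uni-grams AND bi-grams; dimensionality d grows
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform([r1, r2])
print(vec.get_feature_names_out())   # includes bi-grams like 'very tasty', 'not tasty'
print(X.toarray())
```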
Explain TF-IDF.
TF-IDF ——> Term Frequency - Inverse Document Frequency
TF(wi,rj) = (# of times wi occurs in rj)/(Total # of words in rj)
0 <= TF(wi,rj) <= 1
Dc (document corpus) = {r1, r2, …….., rN}
IDF(wi,Dc) = log(N/ni)
Here N —–> total # of docs and ni ——–> # of docs which contain wi
ni <= N ———> N/ni >= 1
Therefore, log(N/ni) >= 0 ——–> Since, log(1) = 0
if ni increases ; (N/ni) decreases ; log(N/ni) decreases
* IDF >= 0 ALWAYS
* If wi occurs in more documents of the corpus, the IDF DECREASES
Note : in IDF, we use log to avoid very large values because if IDF is large it will dominate TF-IDF
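A minimal sketch: a hand computation of TF and IDF with the formulas above, plus scikit-learn's TfidfVectorizer (note that sklearn uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ slightly from the plain formulas):

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "this pasta is very tasty and affordable",     # r1
    "this pasta is not tasty and affordable",      # r2
    "this pasta is delicious and cheap",           # r3
    "pasta is tasty and pasta tastes good",        # r4
]

# Hand computation for the word 'tasty' in r1, using the formulas above
N = len(reviews)                                            # total # of docs
ni = sum('tasty' in r.split() for r in reviews)             # # of docs containing 'tasty'
tf = reviews[0].split().count('tasty') / len(reviews[0].split())
print("TF =", tf, " IDF =", math.log(N / ni), " TF-IDF =", tf * math.log(N / ni))

# scikit-learn's TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(reviews)
print(X.toarray().round(2))
```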
Explain Word2vec.
Word2vec ——–> takes semantic meaning into consideration
word ———> dense d-dim vector
w1 : tasty —–> W2V(300-dim) —-> v1
w2 : delicious —–> W2V(300-dim) —-> v2
w3 : baseball —–> W2V(300-dim) —-> v3
w2v : 300-dim
1. If w1 and w2 are semantically similar, then v1 and v2 are closer
2. Relationships are captured (Eg. Vman - Vwoman ≈ Vking - Vqueen)
W2V ——-> learning relationships automatically from raw-text
large text corpus —-> W2V(300,200,100,50-dim) ——> word:vec
larger dimensions ———> the more information-rich the vector is
data corpus size increases ——–> dimensionality ‘d’ increases
If N(wi) ≈ N(wj), i.e. words wi and wj occur in similar neighborhoods (surrounding context words),
then vi ≈ vj
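A minimal sketch with gensim's Word2Vec (gensim 4.x API assumed). A corpus this tiny will not give meaningful vectors; in practice one trains on a large corpus or loads pre-trained 300-dim vectors such as the Google News model:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [
    ["this", "pasta", "is", "very", "tasty", "and", "affordable"],
    ["this", "pasta", "is", "delicious", "and", "cheap"],
    ["pasta", "is", "tasty", "and", "pasta", "tastes", "good"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

v_tasty = model.wv["tasty"]                 # dense 50-dim vector for 'tasty'
print(v_tasty.shape)
print(model.wv.most_similar("tasty"))       # nearest words by cosine similarity
```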
Explain Avg-W2V and TFIDF-W2V.
Avg-W2V :
r1 : w1 w2 w1 w3 w4 w5 —–> n1 words
r1 ———> v1
v1 by Avg-W2V => (1/n1)[W2V(w1) + W2V(w2) + W2V(w1) + W2V(w3) + W2V(w4) + W2V(w5)]   (sum over every word occurrence in r1)
Avg-W2V ———-> not perfect
———-> works well enough
———-> Simple to leverage W2V to build sentence vectors
TFIDF-W2V :
r1 : w1 w2 w1 w3 w4 w5
tfidf : [ t1, t2, t3, t4, t5, 0, 0 ]
        (corresponding to dictionary words w1, w2, w3, w4, w5, w6, w7)
TFIDF-W2V(r1) = [t1 * W2V(w1) + t2 * W2V(w2) + ………. + t5 * W2V(w5)] / (t1 + t2 + t3 + t4 + t5)
TFIDF-W2V(ri) = Σ(ti * W2V(wi))/Σti
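A minimal sketch of both, assuming we already have a word-to-W2V lookup and a word-to-tf-idf lookup for the review (the toy 3-dim vectors and weights below are made up):

```python
import numpy as np

# Toy lookups: in practice these come from a trained/pre-trained W2V model
# and a fitted TF-IDF vectorizer
w2v = {
    "pasta": np.array([0.1, 0.3, 0.5]),
    "tasty": np.array([0.4, 0.2, 0.1]),
    "cheap": np.array([0.0, 0.6, 0.2]),
}
tfidf = {"pasta": 0.2, "tasty": 0.7, "cheap": 0.5}

review = ["pasta", "tasty", "pasta", "cheap"]          # tokenized review, n1 = 4 words

# Avg-W2V: sum W2V over every word occurrence, divide by the # of words
avg_w2v = sum(w2v[w] for w in review) / len(review)

# TFIDF-W2V: tf-idf weighted average of the word vectors
num = sum(tfidf[w] * w2v[w] for w in review)
den = sum(tfidf[w] for w in review)
tfidf_w2v = num / den

print("Avg-W2V   :", avg_w2v)
print("TFIDF-W2V :", tfidf_w2v)
```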