Data Analysis and Mining Flashcards
Use Cases
Scenarios that motivate the use of data analysis.
Data Technology
Scenarios that motivate the use of data warehouses.
Aggregates in SQL
- COUNT()
- MAX()
- MIN()
- AVG()
- SUM()
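As a minimal sketch, the standard aggregates can be tried against a throwaway in-memory SQLite table from Python (the `sales` table and its values are made up for illustration):

```python
import sqlite3

# Hypothetical one-column table of sale prices.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (price REAL)")
conn.executemany("INSERT INTO sales VALUES (?)", [(10.0,), (25.0,), (5.0,)])

# One query exercising COUNT, MAX, MIN and AVG together.
row = conn.execute(
    "SELECT COUNT(*), MAX(price), MIN(price), AVG(price) FROM sales"
).fetchone()
print(row)  # count, max, min, avg
```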
Complex SQL Queries
Typically use aggregates and examine a large portion of the database.
Data warehouses
A type of database system that supports data analysis.
Typically not updated constantly, but may be refreshed periodically, e.g. once a day.
Extract Transform Load (ETL)
Extracts data from source systems and transforms it into a specific schema/pattern before loading it into the warehouse.
OLAP
Online analytic processing.
Refers to the process of analysing complex data stored in a data warehouse.
OLAP Query
Complicated queries that touch a lot of data, discover trends and patterns in the data, and typically take a long time to compute.
OLTP
Online transaction processing.
Typical DBMS workloads, where queries are fast and touch a small portion of the database.
Unique Fact Tables
Used in OLAP to store the events and objects of interest for the analysis.
May be thought of as representing a data cube, where each axis (length, width, depth, …) represents a different variable, such as product, date, or store.
Star Schemas
One of the more common data warehouse architectures, containing a unique fact table whose rows correspond to points in a data cube.
The fact table is accompanied by dimension tables.
Dimensional Tables
Describe the values along each axis (dimension) of the data cube.
Star Schema Example
Let's say we have the fact table Sales(productNo, date, store, price). We can split this further:
- productNo comes from Products(productNo, type, model)
- date comes from Days(date, day, week, month, year)
- store comes from Stores(name, city, country, phone), where “name” is renamed to store.
- price can be its own dependent attribute.
The attributes productNo, date and store are called dimensions, and each one has its own dimension table, which the fact table references.
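As a sketch, the star schema above can be built with SQLite from Python (all rows below are made-up sample data), ending with a typical OLAP-style query that joins the fact table to a dimension table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (productNo INTEGER PRIMARY KEY, type TEXT, model TEXT);
CREATE TABLE Days     (date TEXT PRIMARY KEY, day TEXT, week INT, month TEXT, year INT);
CREATE TABLE Stores   (name TEXT PRIMARY KEY, city TEXT, country TEXT, phone TEXT);
-- Fact table: each row is a point in the data cube; price is the dependent attribute.
CREATE TABLE Sales (
    productNo INTEGER REFERENCES Products(productNo),
    date      TEXT    REFERENCES Days(date),
    store     TEXT    REFERENCES Stores(name),
    price     REAL
);
""")
conn.execute("INSERT INTO Products VALUES (1, 'laptop', 'X1')")
conn.execute("INSERT INTO Days VALUES ('2024-01-01', 'Mon', 1, 'Jan', 2024)")
conn.execute("INSERT INTO Stores VALUES ('S1', 'London', 'UK', '000')")
conn.execute("INSERT INTO Sales VALUES (1, '2024-01-01', 'S1', 999.0)")

# OLAP-style query: total sales per city, joining fact and dimension tables.
rows = conn.execute("""
    SELECT s.city, SUM(f.price)
    FROM Sales f JOIN Stores s ON f.store = s.name
    GROUP BY s.city
""").fetchall()
print(rows)  # [('London', 999.0)]
```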
Denormalised Schemas
We store some data redundantly because we care more about making complex queries fast.
The main data is in one table, the fact table, and the remaining data can be joined with the fact table very quickly.
All in all, fewer joins are required.
Slice and Dice technique
Another way to explore the data in a data cube.
Slicing
Narrows down our search of the data cube by selecting (slicing) a section of it.
Dicing
Partitions the selected slice further into smaller sections of data.
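One common way to realise slicing and dicing is with SQL `WHERE` and `GROUP BY` clauses. The sketch below uses a made-up, flattened `Sales` table in SQLite purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (product TEXT, year INT, store TEXT, price REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("laptop", 2023, "S1", 900.0),
    ("laptop", 2024, "S1", 950.0),
    ("phone",  2024, "S2", 500.0),
])

# Slice: fix one dimension (year = 2024) to cut out a section of the cube.
slice_rows = conn.execute("SELECT * FROM Sales WHERE year = 2024").fetchall()

# Dice: partition that slice further, here by product and store.
dice_rows = conn.execute("""
    SELECT product, store, SUM(price)
    FROM Sales WHERE year = 2024
    GROUP BY product, store
""").fetchall()
print(slice_rows)
print(dice_rows)
```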
Data mining
An extended form of OLAP.
Data mining objective
Data mining takes a large amount of data and tries to answer questions that you care about.
Typical data mining queries
For example, “Find factors that have had the most influence over sales of product X”, rather than SQL queries.
Essentially, they are plain-English questions where we do not specify exactly what we want; instead, we discover things directly from the data.
Data mining applications
- Deviation detection -> identify anomalies.
- Link analysis -> discover links between attributes.
- Predictive modelling -> predict the future behaviour of certain attributes based upon past behaviour.
- Database segmentation -> group data by similar behaviour.
Types of discovered knowledge
- association rules
- classification hierarchies
- sequential patterns
- clustering
Classification hierarchies
An example could be classifying mutual funds based on performance characteristics such as growth, income, etc. Essentially, it is ordering items into the classes we care about.
Sequential Patterns
A specific sequence of patterns, say A, B and C, leading to an outcome D.
Clustering
Grouping data together in clusters.
Market-basket model
A type of data mining technique, which uses Market-Basket Data.
Market-basket Data
Market-Basket Data can be described by a set of items I, and a set of baskets B, with each basket being a subset of I.
Market-basket data example
Purchase ID ; Items Bought
101 ; milk, bread, cookies, juice
792 ; milk, juice
1130 ; milk, bread, eggs
1735 ; bread, cookies, coffee
Items (I) = {milk, bread, cookies, juice, eggs, coffee}
Baskets (B) = {b1, b2, b3, b4}, where bn is the set of “items bought” in purchase n.
Frequent-itemset mining
Questions like “Which items occur frequently together in a basket?”
Frequent-itemset mining support
support(J) = (number of baskets in B containing all items of J) / (total number of baskets in B)
Frequent-itemset mining example part 1
We first define how frequent counts as “frequent”.
We do this by choosing a support threshold s. For example, if we want to know whether bread appears in at least one of every two baskets, we set s = 0.5.
Frequent-itemset mining example part 2
So let's say we have s = 0.5, meaning the itemset should appear in at least one of every two baskets. Now let's go back to our earlier example:
101 ; milk, bread, cookies, juice
792 ; milk, juice
1130 ; milk, bread, eggs
1735 ; bread, cookies, coffee
If we ask “is buying milk and juice together frequent?”, we compute the support of J = {milk, juice}:
2/4, since baskets 101 and 792 both contain all items of J.
This meets our threshold of 0.5, so we can deem the itemset frequent.
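The support calculation above can be sketched in a few lines of Python, with basket contents copied from the example table:

```python
# The four example baskets (purchase IDs 101, 792, 1130, 1735).
baskets = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "bread", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item of `itemset`."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

print(support({"milk", "juice"}, baskets))  # 0.5 -> frequent at s = 0.5
```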
More complex tables
Data in real tables is often more complex, and sometimes includes information that is irrelevant to the analysis. For example, our previous table could look like this instead:
Purchase ID ; Customer ID ; Items Bought
101 ; A ; milk, bread, cookies, juice
792 ; B ; milk, juice
1130 ; A ; milk, bread, eggs
1735 ; C ; bread, cookies, coffee
where we add the additional Customer ID column.
WRT (with example)
Now we can mine With Respect To (WRT) a chosen attribute, deciding which column defines the baskets.
Using our ongoing example, taking Purchase ID as the basket attribute leaves the table unchanged (except for the removal of the Customer ID column), since no two rows share a Purchase ID.
However, using Customer ID leads to A appearing twice, so A's purchases merge and the table becomes
A ; milk, bread, cookies, juice, eggs
B ; milk, juice
C ; bread, cookies, coffee
Using this table can result in different support values. For example, {milk, juice} now has support 2/3, making it even more frequent.
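The regrouping step can be sketched in Python: merging the purchase rows by Customer ID yields three baskets, and {milk, juice} then appears in 2 of the 3:

```python
from collections import defaultdict

# (Purchase ID, Customer ID, items bought), copied from the example table.
rows = [
    (101,  "A", {"milk", "bread", "cookies", "juice"}),
    (792,  "B", {"milk", "juice"}),
    (1130, "A", {"milk", "bread", "eggs"}),
    (1735, "C", {"bread", "cookies", "coffee"}),
]

by_customer = defaultdict(set)
for _pid, cust, items in rows:
    by_customer[cust] |= items          # A's two purchases merge into one basket

baskets = list(by_customer.values())    # 3 baskets: one each for A, B, C
hits = sum(1 for b in baskets if {"milk", "juice"} <= b)
print(hits, "/", len(baskets))          # 2 / 3
```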
Association Rules
Rules of the form {i1, i2, …} -> j.
In plain English, for example: “customers who buy diapers frequently also buy beer”, or “people who buy Game of Thrones and Harry Potter also buy Twin Peaks”.
Association Rule Properties
- support of {i1, i2, …, j} -> ideally, we want a high support.
- confidence
Confidence
Confidence is the fraction of baskets containing {i1, i2, …, in} that also contain j, and is written as:
support of {i1, …, in, j} / support of {i1, …, in}
This should also be high, and should differ significantly from the overall fraction of baskets containing j. If the two are relatively similar, then buying the other items is close to independent of buying j.
Association Rule Example
The rule in question is {milk} -> juice.
Using:
101 ; A ; milk, bread, cookies, juice
792 ; B ; milk, juice
1130 ; A ; milk, bread, eggs
1735 ; C ; bread, cookies, coffee
Support:
{milk, juice} = 2/4 = 0.5
Confidence:
support of {milk, juice} / support of {milk}
(2/4) / (3/4) = 2/3 ≈ 0.67
Result:
67 percent of all baskets that contain milk also contain juice.
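The same numbers fall out of a short Python sketch, with the baskets copied from the example:

```python
baskets = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "bread", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(itemset):
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

# Confidence of {milk} -> juice: support({milk, juice}) / support({milk}).
confidence = support({"milk", "juice"}) / support({"milk"})
print(round(confidence, 2))  # 0.67
```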
A-Priori Algorithm
Computes all itemsets J with support >= s.
It exploits the key property that if J has support >= s, then all subsets of J also have support >= s; so frequent itemsets can be built level by level, starting from single items and only combining itemsets that are already frequent.
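A minimal Python sketch of the idea (not an optimised implementation): candidates at level k are built only from itemsets already found frequent at level k-1, which is valid because of the subset property above:

```python
from itertools import combinations

def apriori(baskets, s):
    """Return all itemsets with support >= s, as frozensets."""
    n = len(baskets)
    support = lambda J: sum(1 for b in baskets if J <= b) / n
    items = sorted({i for b in baskets for i in b})
    # Level 1: frequent single items.
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= s]
    result = list(frequent)
    k = 2
    while frequent:
        # Candidate k-sets: unions of frequent (k-1)-sets that have size k.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        frequent = [J for J in candidates if support(J) >= s]
        result.extend(frequent)
        k += 1
    return result

# The example baskets from the earlier cards.
baskets = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "bread", "eggs"},
    {"bread", "cookies", "coffee"},
]
frequent_sets = apriori(baskets, 0.5)
for J in frequent_sets:
    print(set(J))
```

With s = 0.5 this finds four frequent single items (milk, bread, cookies, juice) and three frequent pairs, including {milk, juice}.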