Week 4 Flashcards
collaborative filtering
suggesting complex items without understanding the nature of the items, by finding similarities between users (Amy liked it and has similar preferences to Bob, so recommend it to Bob)
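A minimal sketch of the Amy/Bob idea, assuming made-up ratings and cosine similarity over shared items as the similarity measure (the notes do not say which measure to use; the names `ratings`, `cosine_similarity` and `recommend` are hypothetical):

```python
from math import sqrt

# Hypothetical ratings (1-5); users and numbers are made up for illustration.
ratings = {
    "amy": {"Movie A": 5, "Movie B": 4, "Movie C": 5},
    "bob": {"Movie A": 5, "Movie B": 4},
    "cat": {"Movie A": 1, "Movie B": 2, "Movie D": 5},
}

def cosine_similarity(u, v):
    """Cosine similarity computed only over items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Suggest items the most similar other user rated but `user` has not."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda o: cosine_similarity(ratings[user], ratings[o]))
    return sorted(item for item in ratings[nearest] if item not in ratings[user])

print(recommend("bob", ratings))  # ['Movie C']
```

Nothing about the movies themselves is used here, only who rated what. Bob's ratings line up with Amy's, so Amy's extra movie is suggested to him.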
content filtering
find similarities between items, then recommend items similar to ones the user already likes
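A rough sketch of item-to-item similarity, assuming each item is described by a set of made-up genre tags and that Jaccard overlap is the similarity measure (both are assumptions for illustration, not something the notes specify):

```python
# Hypothetical item features (genre tags); made up for illustration.
items = {
    "Movie A": {"action", "sci-fi"},
    "Movie B": {"action", "thriller"},
    "Movie C": {"romance", "comedy"},
}

def jaccard(a, b):
    """Share of tags two items have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)

def similar_items(liked, items):
    """Rank the other items by tag overlap with the liked item."""
    others = [name for name in items if name != liked]
    return sorted(others, key=lambda n: jaccard(items[liked], items[n]), reverse=True)

print(similar_items("Movie A", items))  # ['Movie B', 'Movie C']
```

No other users are needed, only a description of each item.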
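collaborative filtering issues
needs lots of data, as much as possible
millions of items need lots of computing power
requires at least one user who has already seen the movie (or done whatever is being recommended) before it can be suggested to anyone else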
content filtering requires little data to start; what are some issues
can be limited in scope
does not present anything different, just items similar to what the user already likes
clustering helps come up with better
predictive models
2 types of learning algorithm
supervised: gives you some outcome
trying to predict something
unsupervised: doesn't predict anything, just groups data into similar groups, e.g. to build better predictive models or for market segmentation
a cluster is normally represented by what
centroid
how to find distance between clusters using centroids
take the mean of the coordinates of all the points in a cluster; this gives the centroid for that cluster, then measure the distance between the two centroids
or use any of the other distance metrics we discussed
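A small sketch of the centroid calculation, assuming 2-D points and Euclidean distance between centroids (the points are made up; `centroid` is a hypothetical helper name):

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Mean of each coordinate across all points in the cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Two made-up clusters of (x, y) points.
cluster_1 = [(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)]
cluster_2 = [(8.0, 6.0), (10.0, 6.0)]

c1 = centroid(cluster_1)  # (2.0, 2.0)
c2 = centroid(cluster_2)  # (9.0, 6.0)
gap = dist(c1, c2)        # distance between the two clusters

print(c1, c2, round(gap, 3))  # (2.0, 2.0) (9.0, 6.0) 8.062
```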
why normalize data
distance is highly influenced by the scale of the variables, so normalize so that every variable contributes comparably
clusters can change greatly depending on how the data is normalized
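A sketch of why scale matters, assuming made-up customers measured in dollars and years, and min-max scaling as the normalization (one option among several; the notes do not name a specific method):

```python
from math import dist

# Hypothetical customers as (income in dollars, age in years).
people = {
    "A": (30000, 25),
    "B": (30500, 60),
    "C": (32000, 26),
    "D": (40000, 45),
}

def min_max_scale(rows):
    """Rescale every column to the 0-1 range (assumes no column is constant)."""
    columns = list(zip(*rows.values()))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return {
        name: tuple((v - lo) / span for v, lo, span in zip(row, lows, spans))
        for name, row in rows.items()
    }

def nearest(name, rows):
    """Closest other row by Euclidean distance."""
    others = [n for n in rows if n != name]
    return min(others, key=lambda n: dist(rows[name], rows[n]))

raw_neighbor = nearest("A", people)                    # income dominates the distance
scaled_neighbor = nearest("A", min_max_scale(people))  # both variables count

print(raw_neighbor, scaled_neighbor)  # B C
```

On the raw numbers a 35-year age gap barely registers next to a few hundred dollars, so A looks closest to B. After scaling, A is closest to C, who is almost the same age and has a similar income.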
hierarchical clustering starts as
each point is its own cluster, then clusters merge until eventually there is one
vertical lines in dendrogram are smallest at
the bottom; moving up, the height increases (height = distance between the clusters being merged)
how to decide how many clusters we want
draw a horizontal line across the dendrogram; the number of vertical lines it crosses is the number of clusters
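The horizontal-line rule can be sketched as a count, assuming we already know the height of each merge and that heights never decrease going up the dendrogram (the heights below are made up; `clusters_at` is a hypothetical helper):

```python
# Hypothetical merge heights read off a dendrogram, bottom to top.
merge_heights = [1.0, 1.0, 5.0, 12.1]

def clusters_at(cut, merge_heights):
    """Number of vertical lines a horizontal cut at height `cut` crosses.

    Every merge that happens above the cut has not happened yet at that
    height, so each one adds one more cluster to the single final cluster.
    """
    return 1 + sum(1 for h in merge_heights if h > cut)

print(clusters_at(0.5, merge_heights))  # 5 (every point on its own)
print(clusters_at(3.0, merge_heights))  # 3
print(clusters_at(8.0, merge_heights))  # 2
```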
cluster process
1- calculate the distance between all points
2- find the two points with the minimum distance between them and merge them; continue until there is 1 cluster
3- after merging 2 points, calculate their centroid and use that centroid to measure the distance to the next point on the dendrogram
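The steps above can be sketched in plain Python, assuming Euclidean distance and centroid-to-centroid distance between clusters (the points and the function names `agglomerate` and `gap` are made up for illustration; this is not a tuned implementation):

```python
from itertools import combinations
from math import dist

def centroid(points):
    """Mean of each coordinate across the points of a cluster."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def gap(a, b):
    """Distance between two clusters, measured between their centroids."""
    return dist(centroid(a), centroid(b))

def agglomerate(points):
    """Merge the two closest clusters repeatedly until one cluster remains.

    Returns the merge history as (height, merged_cluster) pairs, where
    height is the centroid distance at which the merge happened.
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the minimum distance between them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: gap(clusters[pair[0]], clusters[pair[1]]),
        )
        height = gap(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append((height, merged))
    return history

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 10.0)]
history = agglomerate(points)
print([round(h, 2) for h, _ in history])  # [1.0, 1.0, 5.0, 12.1]
```

The printed heights are what the dendrogram would show on its vertical axis: two tight pairs merge first at height 1, the pairs join at 5, and the far-away point joins last.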
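where should we draw the line to decide clusters
1- depends on the problem itself; if moving the line slightly changes the clusters, the result is very sensitive to the data
2- if no specific number of clusters is wanted, use the dendrogram and choose a cut where the clusters stay the same when the line moves a little; this is more robust to inaccuracy in the data, and it is good to have some robustness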