Exam - Google Search Flashcards
Analogy - Internet as a directed graph + practical importance for modelling the web
sites hyperlink to other sites –> sometimes two-way, sometimes one-way
serves as the basis for web discoverability + page ranking for search results
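Sketch of the directed-graph view, assuming a toy web of four made-up sites (the site names and the helpers `inbound_counts` and `is_two_way` are hypothetical, not from the notes):

```python
# Toy web as a directed graph: each key links to the sites in its list.
# Site names are made up for illustration.
web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example"],   # a <-> b is a two-way link
    "c.example": ["d.example"],   # c -> d is one-way
    "d.example": [],
}

def inbound_counts(graph):
    """Count how many distinct pages link to each page."""
    counts = {page: 0 for page in graph}
    for source, targets in graph.items():
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    return counts

def is_two_way(graph, x, y):
    """True if x links to y and y links back to x."""
    return y in graph.get(x, []) and x in graph.get(y, [])

print(inbound_counts(web))
print(is_two_way(web, "a.example", "b.example"))  # True
print(is_two_way(web, "c.example", "d.example"))  # False
```

The inbound counts are the raw material for link-based ranking: a page with more incoming edges is easier to discover.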
web crawlers
aka bots/spiders
- automated entities that visit websites and collect information on what the content is
- may use site-provided sitemaps to find content that would otherwise be missed
- travel to websites using hyperlinks –> notes which websites link to which + how many hyperlinks
- websites get updated –> need to be re-crawled
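Sketch of the crawl described above, assuming a hypothetical in-memory web (`PAGES`) in place of real HTTP requests. It shows why a sitemap helps: a page that nothing links to is never reached by following hyperlinks alone.

```python
from collections import deque

# Hypothetical in-memory "web": page -> hyperlinks found on that page.
PAGES = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "post1"],
    "post1": ["blog"],
    "orphan": ["home"],  # nothing links here, so link-following misses it
}

def crawl(start, sitemap=()):
    """Breadth-first crawl from `start`, plus any pages listed in a sitemap."""
    queue = deque([start, *sitemap])
    seen = set()
    links = {}  # page -> outgoing links noted by the crawler
    while queue:
        page = queue.popleft()
        if page in seen or page not in PAGES:
            continue
        seen.add(page)
        links[page] = list(PAGES[page])
        queue.extend(PAGES[page])
    return links

without_map = crawl("home")
with_map = crawl("home", sitemap=["orphan"])
print(sorted(without_map))  # ['about', 'blog', 'home', 'post1']
print(sorted(with_map))     # ['about', 'blog', 'home', 'orphan', 'post1']
```

Re-crawling is just running `crawl` again later and comparing the new `links` against the old ones.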
good webcrawler
- used to develop search results –> collects info on web content to determine which websites are most relevant to a given search query
- does not recrawl too frequently –> prevent stressing the web server with excess traffic
- obeys paywalls –> does not break content barriers unless directly permitted by site map code
- is unable to scrape personal info
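Sketch of how a polite crawler might check permissions before fetching. Assumption: the site states its rules in a robots.txt file (the notes only mention "site map code"); the file content, URLs and bot name below are made up. Uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url, agent="ExampleBot"):
    """Return True if the site's rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(may_crawl("https://site.example/articles/1"))    # True
print(may_crawl("https://site.example/members/area"))  # False
print(parser.crawl_delay("ExampleBot"))                # 10
```

The crawl delay is the site asking for a pause between requests, which ties to the "does not recrawl too frequently" point.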
bad webcrawler
- scrapes web content to duplicate it –> content theft
- gather personal data to generate spam/phishing (may involve exploiting vulnerabilities)
- generates spam comments in forums/chat
- ad hosting costs $$$$ –> bot clicks the ad to intentionally waste advertiser money (click fraud)
- excess web crawling –> DDoS
Web indexing explained + factors influencing (TRUQD)
Analysis of web content to classify website –> use data to shape web results
factors:
- website trustworthiness
- content readability
- content uniqueness
- content quality
- duplication of existing content
TRUQD
Website is crawled, analyzed and indexed –> how is the index info stored?
search engine stores keywords + sequence of appearance + frequency of each –> used to gauge relevance to diff topics
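Sketch of one way to store that info: an inverted index mapping each keyword to the pages it appears on and its word positions. The page names, sample text and the helpers `build_index` and `frequency` are hypothetical; real search engines use far more elaborate structures.

```python
from collections import defaultdict

def build_index(pages):
    """Map each keyword to {page: [positions]}; frequency is len(positions)."""
    index = defaultdict(dict)
    for page, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(page, []).append(position)
    return index

def frequency(index, word, page):
    """How many times `word` appears on `page`."""
    return len(index.get(word, {}).get(page, []))

pages = {
    "p1": "graph search ranks pages by graph links",
    "p2": "search engines crawl pages",
}
index = build_index(pages)
print(index["graph"])                   # {'p1': [0, 5]}
print(index["search"])                  # {'p1': [1], 'p2': [0]}
print(frequency(index, "graph", "p1"))  # 2
```

Positions give the sequence of appearance, and the length of each position list gives the frequency.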
assessing content quality (SHERMIUQ)
- Relevance to search query
- quality of writing
- Importance to the problem
- last updated (recent = better)
- mobile friendly (friendly = better)
- HTML structure (organized tags = better)
- Social media presence (more shares = better)
- Engagement (longer visits + more views = better)
SHERMIUQ
(social, HTML, Engag, Rel, Mobile, Import, Upda, Quality)
page ranking - Hyperlinking + factors explained
- more redirects (inbound links) to the page = good
- redirected from trustworthy sites = good
- redirected from popular sites = good
- page is bookmarked more = good
- more web engagement = good
# of redirects + trustworthiness/popularity of redirects + web engagement
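Sketch of link-based ranking. The notes do not name an algorithm; this assumes a PageRank-style iteration as one well-known approach, where a link from a highly ranked page counts for more. The graph, the function name `link_rank` and the damping value 0.85 are illustrative assumptions. Bookmarks and engagement are not modelled.

```python
def link_rank(graph, damping=0.85, iterations=50):
    """PageRank-style scores: a page scores higher when well-ranked pages link to it."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in graph.items():
            if targets:
                # Split this page's score evenly across the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its score over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web: three pages link to "hub", nothing links to "blog".
graph = {
    "hub": ["news", "shop"],
    "news": ["hub"],
    "shop": ["hub"],
    "blog": ["hub"],
}
scores = link_rank(graph)
best = max(scores, key=scores.get)
print(best)  # hub
```

"hub" wins because it has the most inbound links, while "blog" ends up last because nothing links to it.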
SEO - premise
Search engine optimization
- 3rd party company hired to increase web traffic to a website
- SEO reverse engineers the search algo –> determines what factors improve page rank
blackhat/evil SEO
- keyword stuffing + content cloaking - embed keywords into website (boost rank) + embed hidden keywords (shows up in irrelevant search queries)
- embed hidden hyperlinks –> search engine crawlers combat this by analyzing if the links are even seen by the user (unseen = irrelevant)
- paying other websites to link to customer’s site
- spamming comments/chats/forums with hyperlinks –> higher # of redirects (if posted on popular sites, even better)
- content theft - steal higher quality content to improve site quality ranking
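Sketch of how keyword stuffing could be flagged, assuming a simple keyword-density check. The threshold of 0.3 and the sample texts are arbitrary illustrations, not values any search engine is known to use.

```python
from collections import Counter

def keyword_density(text, keyword):
    """Fraction of words in `text` that equal `keyword` (case-insensitive)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return Counter(words)[keyword.lower()] / len(words)

def looks_stuffed(text, keyword, threshold=0.3):
    """Flag text whose keyword density exceeds an arbitrary illustrative cut-off."""
    return keyword_density(text, keyword) > threshold

normal = "our shop sells shoes and boots for hiking"
stuffed = "shoes shoes cheap shoes buy shoes shoes best shoes"

print(looks_stuffed(normal, "shoes"))   # False
print(looks_stuffed(stuffed, "shoes"))  # True
```

Hidden keywords would need a separate check on whether the text is actually visible to the user, which this sketch does not attempt.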
black hat SEO sabotage
send bad traffic to competitors
e.g. redirects from shady/sketchy sites –> degrade page rank