Exam review Module 10 Flashcards

1
Q

Facebook graph

A

• A series of connections between you and your friends, their friends, and so on. Together these make up everyone on Facebook and how they are interconnected.
• A graph is a set of nodes (you, your friends) connected by edges.
• Graphs can represent both physical and virtual systems.
• Examples: cities, airports, computers.
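
The nodes-and-edges idea above can be sketched in a few lines of Python. This is only an illustration: the names in `edges` and the helper functions `build_graph` and `friends_of_friends` are made up for this example and are not from the course material.

```python
from collections import defaultdict

def build_graph(edges):
    """Return an adjacency map (node -> set of neighbours) for an undirected graph."""
    graph = defaultdict(set)
    for a, b in edges:
        # A friendship goes both ways, so store the edge in both directions.
        graph[a].add(b)
        graph[b].add(a)
    return graph

def friends_of_friends(graph, person):
    """People exactly two edges away from `person` who are not already friends."""
    direct = graph[person]
    reachable = set()
    for friend in direct:
        reachable |= graph[friend]
    return reachable - direct - {person}

# Hypothetical friendships, invented for this sketch.
edges = [("you", "sally"), ("you", "omar"), ("sally", "priya"),
         ("omar", "priya"), ("priya", "lee")]
graph = build_graph(edges)

print(sorted(graph["you"]))                     # your direct friends
print(sorted(friends_of_friends(graph, "you"))) # friends of your friends
```

The same structure would work for cities or airports: only the meaning of a node and an edge changes.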

2
Q

Example 2: Directed graph

A

• Sally may follow you, and you may follow Sally.
• However, you and Sally both follow Justin Bieber, and he does not follow you.
• A directed graph shows the direction of each follow with an arrow, unlike an ordinary (undirected) graph.
• This lets it represent more complex, one-way relationships.
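
A minimal sketch of the follow relationships above as a directed graph. The dictionary layout and the helper names `is_mutual` and `followers_of` are assumptions made for illustration.

```python
# Directed graph: each key follows every account in its set.
follows = {
    "you": {"sally", "justin_bieber"},
    "sally": {"you", "justin_bieber"},
    "justin_bieber": set(),  # follows nobody in this tiny example
}

def is_mutual(graph, a, b):
    """True only when there is an arrow a -> b and an arrow b -> a."""
    return b in graph.get(a, set()) and a in graph.get(b, set())

def followers_of(graph, person):
    """Everyone with an arrow pointing at `person`."""
    return {src for src, targets in graph.items() if person in targets}

print(is_mutual(follows, "you", "sally"))
print(is_mutual(follows, "you", "justin_bieber"))
print(sorted(followers_of(follows, "justin_bieber")))
```

In an undirected graph the two `is_mutual` checks could never differ, which is why the directed form is needed here.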

3
Q

Example 3: World Wide Web

A

• It is called the World Wide Web because the graph that represents it looks like a spider’s web.
• Almost every webpage is connected to others through hyperlinks.
• It is represented by a directed graph.

4
Q

Spiders/Crawlers/Robots

A

• A spider is a computer program that starts at one website and follows its links to other websites, visiting page after page until it can’t go any further.
• Spiders constantly crawl the web and collect information into an index so you can search it later.
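
The crawl-and-index loop described above might look roughly like this. To keep the sketch self-contained it crawls a made-up in-memory "web" (the `WEB` dictionary and its `.example` addresses are invented) instead of fetching real pages, and it uses a breadth-first order, which is one common choice rather than the only one.

```python
from collections import deque

# Hypothetical miniature web: page -> (page text, outgoing links).
WEB = {
    "a.example/home": ("potato soup recipes", ["a.example/soup", "b.example/"]),
    "a.example/soup": ("classic potato soup", ["a.example/home"]),
    "b.example/": ("gardening tips", ["c.example/"]),
    "c.example/": ("growing potato plants", []),
    "d.example/": ("nobody links here", []),
}

def crawl(start, web):
    """Follow links from `start` until no unseen pages remain.

    Returns the visit order and an index mapping each word to the pages containing it.
    """
    index = {}
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # broken link: nothing to read
        text, links = web[url]
        order.append(url)
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order, index

order, index = crawl("a.example/home", WEB)
print(order)
print(sorted(index["potato"]))
```

Note that `d.example/` is never visited: a spider can only find pages that some already-known page links to.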

5
Q

indexing the web

in malicious ways

A

DDoS (distributed denial-of-service) attack: throwing millions of spiders at a webpage or network to overwhelm it.

6
Q

focus-spider

A

A focused spider sticks to one subject, such as potatoes, and keeps crawling until it can’t find any more information on it.

7
Q

politeness

A

Co-operative: follows the site’s instructions and doesn’t flood it with requests.
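
One common way a site gives instructions to spiders is a robots.txt file, which Python's standard library can read. The card does not name robots.txt, so treat this as an assumed illustration; the rules in `ROBOTS_TXT`, the agent name `ExampleSpider`, and the `example.com` addresses are all made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, parsed from a string so nothing is downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(url, agent="ExampleSpider"):
    """A polite spider asks before requesting a page."""
    return parser.can_fetch(agent, url)

# Seconds the site asks spiders to wait between requests (None if unspecified).
delay = parser.crawl_delay("ExampleSpider")

print(may_fetch("https://example.com/index.html"))
print(may_fetch("https://example.com/private/data.html"))
print(delay)
```

A polite spider would skip any URL where `may_fetch` is false and sleep for `delay` seconds between requests to the same site.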

8
Q

revisit frequency

A

• Some webpages change a lot and some never change, so spiders need to revisit pages to check whether anything is new.
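
One possible revisit policy, sketched under assumptions: the spider remembers a hash of each page, comes back sooner if the page changed, and waits longer if it did not. The halving/doubling rule and the limits `min_hours` and `max_hours` are invented for this example, not taken from the course.

```python
import hashlib

def fingerprint(content: str) -> str:
    """Short, fixed-size summary of a page, used to detect changes."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def next_interval(old_hash, new_content, interval_hours,
                  min_hours=1, max_hours=24 * 30):
    """Return (new hash, hours to wait before the next visit)."""
    new_hash = fingerprint(new_content)
    if new_hash != old_hash:
        # The page changed: check back twice as often.
        interval_hours = max(min_hours, interval_hours / 2)
    else:
        # Nothing new: back off and visit half as often.
        interval_hours = min(max_hours, interval_hours * 2)
    return new_hash, interval_hours

seen_hash = fingerprint("Today's headlines: version 1")
_, unchanged_wait = next_interval(seen_hash, "Today's headlines: version 1", 24)
_, changed_wait = next_interval(seen_hash, "Today's headlines: version 2", 24)
print(unchanged_wait, changed_wait)
```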

9
Q

Paywalls

A

Subscriptions to websites; also the ability to enter these websites through a back door.

10
Q

Dynamic Content

A

Some webpages adapt according to who is viewing them (a spider might see something different from what you see).

11
Q

Query Strings

A

• Does the spider stop at the base page, or does it also visit the additional pages reached through query strings?
• Example: http://learn.com/class.html?course=cs100&page=3
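
To make the base page versus query string distinction concrete, here is a small sketch that splits the example URL with Python's standard `urllib.parse` module. The variable names are only illustrative.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs

url = "http://learn.com/class.html?course=cs100&page=3"

parts = urlsplit(url)
# Everything after the "?" is the query string, a list of name=value pairs.
params = parse_qs(parts.query)
# The base page is the same URL with the query string removed.
base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(base)
print(params)
```

A spider that indexes only `base` treats every course and page number as one page; a spider that keeps the query string may find many distinct pages, or an endless supply of near-duplicates.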

12
Q

stop words

A

The, it, is

These words are so common that it is neither practical nor useful to index them.
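
A minimal sketch of dropping stop words before indexing. The stop list holds only the three words from the card (real lists are much longer), and `index_terms` is a hypothetical helper name.

```python
# Only the three examples from the card; a real stop list has many more entries.
STOP_WORDS = {"the", "it", "is"}

def index_terms(text):
    """Lower-case the words, strip simple punctuation, and drop stop words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]

print(index_terms("The soup is hot, and it is good."))
```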

13
Q

Word Variants

A

sell, sells, selling, sold, resell, resold, and unsold
When building your index, you may want to treat these words the same way, because if you’re searching for one of them, you’re probably looking for the others as well.
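
One simple way to treat the variants alike is to map each one to a shared base word before indexing. The hand-written `VARIANTS` table below is an assumption for illustration (search engines typically use stemming or lemmatization algorithms instead), and the three sample documents are made up.

```python
# Hand-written table covering the variants listed on the card.
VARIANTS = {
    "sells": "sell", "selling": "sell", "sold": "sell",
    "resell": "sell", "resold": "sell", "unsold": "sell",
}

def normalize(word):
    """Map a word to its base form; unknown words are left unchanged."""
    word = word.lower()
    return VARIANTS.get(word, word)

def build_index(docs):
    """Map each normalized word to the set of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(normalize(word), set()).add(doc_id)
    return index

# Hypothetical documents.
docs = {
    "ad1": "Selling old bikes",
    "ad2": "Bikes sold here",
    "ad3": "We resell phones",
}
index = build_index(docs)
print(sorted(index[normalize("sells")]))
```

A search for any one variant now finds pages that use the others, because both the query and the pages are normalized the same way.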

14
Q

Spelling variants

A

color vs. colour

You want to keep these under the same index entry, since they mean the same thing.

15
Q

Synonymy:

A

big & large

You want these under the same index entry, since they have similar meanings.

16
Q

Polysemy:

A

bank (financial institution)
bank (river bank)
bank (turning a car)
You should keep separate entries for words that are spelled the same but have different meanings.

17
Q

Evil Spiders

A

• Scraping and stealing content: some spiders copy the information from your webpage and automatically republish it on a different website, stealing traffic from you.
• Stealing email addresses: spiders harvest email addresses from webpages so that spam can be sent to them.

18
Q

Page Ranking

A

This made Google what it is today.
Larry Page, co-founder of Google (PageRank is named after him).
The more links point to a webpage, the better the odds that it is an authoritative, legitimate source.
This raises the page’s ranking in search results.
Randomly walking the web: imagine a number of ants wandering the web by following links at random. The more often ants end up on a page, the higher its page rank.
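
The ant picture can be turned into a small calculation. The sketch below uses repeated averaging (power iteration) to estimate what share of the ants sits on each page in the long run. Several details are assumptions not found in the cards: the damping factor of 0.85 (the chance an ant follows a link rather than jumping to a random page), the handling of pages with no links, and the four-page `links` graph itself.

```python
def page_rank(links, damping=0.85, iterations=100):
    """Estimate the long-run share of random walkers on each page.

    `links` maps every page to the list of pages it links to;
    every link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with the ants spread evenly
    for _ in range(iterations):
        # Ants that jump to a random page instead of following a link.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's ants equally among its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dead end: its ants restart on a random page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical site: three pages link to "soup", nobody links to "blog".
links = {
    "home": ["about", "soup"],
    "about": ["soup"],
    "blog": ["soup", "home"],
    "soup": ["home"],
}
ranks = page_rank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this made-up graph the pages with the most incoming links collect the most ants, and "blog", which nothing links to, ends up last.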

19
Q

Modern Page Ranking

A

The words on the page are important, but so are the links referring to that page.

If a page has words in the title or a top-level heading (the title and h1 tags), there’s a good chance the page is really about those words, e.g. potato soup.

More weight is placed on the title, headers, etc. than on areas of the page like the comments section.

Google penalizes pages with excessive ads. (A surplus of ads is an easy way to make money.)

Google penalizes aggregators so they don’t get ranked highly. (They have no new content; instead they mash multiple websites together in the hope of receiving more web traffic.)

Reward:
Content quality (original, interesting, unique, etc.) -> higher rank
Authority sources (Wikipedia, CNN), due to their web traffic and the many people working on their quality.

Semantics and word proximity:
The search engine tries to interpret what you’re really looking for, as opposed to strictly the words you’ve typed in.
It also looks at other words related to those you’ve searched and uses those relationships to deliver accurate content.

Click timing data & stop pages:
Click data is what Google collects to see what users do with their clicks. Are they clicking the first link, going back, checking the second, going back, then checking the third and staying? In that case, the third link will be ranked higher.
The page you finally land on and stay at is called the ‘stop page’.

20
Q

Manipulating Search (+ads, trojan horse to install viruses, etc.)

A

Hidden Text:
White text on a white background (the user can’t see it, but spiders can), used to trick users into clicking through to a webpage that has nothing to do with what they’re looking for.

Aggregators: Combining multiple websites into one. (Stealing of content).

Link farms: When a bunch of useless websites link to one another, or all link to one website, in hopes of a higher page rank.

Website hijacking: ex: if you can get CNN to link to your website (via comments, etc.) it’ll make your page rank higher.

Google Bombs:
Multiple people coming together to influence search results. ex: Miserable failure -> George Bush.

21
Q

10.2 - Effective Google Searching

A

Google.ca -> ‘Advanced Search’
this exact word or phrase: “gluten free” (With quotes).
any of these words: bacon OR delicious (with OR separating terms).
numbers ranging from: 10 to 100 (any unit)
filter by language
filter by region
last updated : provide a time filter
site or domain: search a specific website i.e: only wikipedia.org
terms appearing: where should the terms you’ve typed be looked for?
SafeSearch: Filter explicit results
File type: .PDF, .DOCX, etc.
Usage rights: license? free to share? etc.
Search Operators:
+ - search Google+ pages or blood types like AB+
@ - social tag
$ - price
# - popular hashtags
* - wildcard for words you are unsure of
site: - search within only one site
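
Several of the options above can be combined into one query string. The sketch below assembles such a string and a search URL; the helper `build_query`, the `..` notation for number ranges, and the `https://www.google.com/search?q=` address are assumptions for illustration and may not match how Google currently handles them.

```python
from urllib.parse import urlencode

def build_query(exact=None, any_of=(), number_range=None, site=None):
    """Assemble a search string from a few of the options on the card."""
    parts = []
    if exact:
        parts.append(f'"{exact}"')            # this exact word or phrase
    if any_of:
        parts.append(" OR ".join(any_of))     # any of these words
    if number_range:
        low, high = number_range
        parts.append(f"{low}..{high}")        # numbers ranging from low to high
    if site:
        parts.append(f"site:{site}")          # only search this site
    return " ".join(parts)

query = build_query(exact="gluten free",
                    any_of=["bacon", "delicious"],
                    site="wikipedia.org")
# urlencode escapes the quotes, spaces, and colon so the text is safe in a URL.
url = "https://www.google.com/search?" + urlencode({"q": query})

print(query)
print(url)
```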