Module 10 Flashcards

1
Q

Graph

A

-nodes (e.g. people) and edges (the lines that connect them)

Types:
Normal (undirected) graph:
Edges have no direction, so every connection goes both ways between the two nodes

Directed graph:
Each edge on the graph has an arrow, but the arrows don't need to go back and forth between nodes
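
A minimal sketch of the two graph types, assuming a graph is stored as a dictionary from each node to the set of nodes it points at. The function name build_graph and the example names are made up for illustration.

```python
def build_graph(edges, directed):
    """Return an adjacency list: node -> set of nodes it connects to."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)   # edge a -> b
        graph.setdefault(b, set())          # make sure b is listed as a node
        if not directed:
            graph[b].add(a)                 # undirected: also b -> a
    return graph


edges = [("Ann", "Bob"), ("Bob", "Cat")]
undirected = build_graph(edges, directed=False)
directed = build_graph(edges, directed=True)
```

In the undirected version Bob is connected to both Ann and Cat; in the directed version Bob only points at Cat, and Cat points at nobody.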

2
Q

Why is the “World Wide Web” called a web?

A

-because it looks like a spider's web: its graph is pages connected by arrows (links)

-think of one website linking to another, which links to another
-it's a directed graph

3
Q

Spiders

A

-a computer program that goes across the web looking for information and collecting it
-starts at one website and goes to other websites linked from that website
-once it reaches a dead end (a website that doesn't link to another), it goes back and tries another “route”
-these spiders go around ALL THE TIME, collecting information for the INDEX so that it is already there when someone searches for it later
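
The crawl described above can be sketched on a toy web. This is only an illustration: the "web" here is a hand-made dictionary of made-up page names and their links, and nothing is actually downloaded.

```python
def crawl(web, start):
    """Visit pages depth-first, following links and backing up at dead ends."""
    visited = []        # pages in the order the spider first reaches them
    seen = {start}      # pages already queued, so none is visited twice
    stack = [start]     # routes still to try; popping it is the "going back"
    while stack:
        page = stack.pop()
        visited.append(page)
        # Push links in reverse so the first link on the page is followed first.
        for link in reversed(web.get(page, [])):
            if link not in seen:
                seen.add(link)
                stack.append(link)
    return visited


toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["d.example"],
    "c.example": ["a.example"],
    "d.example": [],            # dead end: links to nothing
}
order = crawl(toy_web, "a.example")
```

Here the spider goes a, then b, then d (a dead end), and only then backs up to try c.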

4
Q

Indexing the web

A

1) Focus Spider:
Targets a certain topic like potatoes; it does the same thing as a normal spider except it only follows websites about that topic
-once it reaches a website that doesn't have much info on that topic, it abandons it and goes looking for different websites on that topic

2) Politeness Spider:
-spider that goes around the web and cooperates with websites
-follows the site's instructions and doesn't bombard it with lots of requests
-impolite spiders can overwhelm a website: in a “denial-of-service attack”, lots of spiders are sent to one webpage and it crashes because they are all requesting it at once
-such attacks have been used for political activism

3) Revisit Frequency
-spiders have to go back to see if a website has changed and to get the new info on it

4) Paywalls
-if a website like the New York Times requires a paid subscription, how can Google's spiders get info on it to promote it?
-these companies work with Google to figure something out, because just leaving a back door open would let people like us see the content too

5) Dynamic Content
-if a webpage is dynamic (changes based on its environment), it may look different to a spider than it does to us
-another issue for spiders: how is a spider supposed to see every possible outcome?

6) Query Strings
-extra info added to a URL so we can specify what we want to see
-does this mean that the spider has to go and look at every single possible page of the website?
-another issue!
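
One common form of the "site's instructions" a polite spider follows is a robots.txt file. The sketch below uses Python's standard urllib.robotparser on a made-up robots.txt; the spider name ExampleSpider and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: keep out of /private/ and wait 5 seconds between requests.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite spider asks before fetching each page.
allowed = parser.can_fetch("ExampleSpider", "https://example.com/recipes/soup")
blocked = parser.can_fetch("ExampleSpider", "https://example.com/private/notes")
delay = parser.crawl_delay("ExampleSpider")   # seconds to wait between requests
```

A polite spider would skip the blocked URL and pause for the requested delay between requests instead of bombarding the site.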

5
Q

What do web searches have to worry about? / Problems with indexing (the list-making done by spiders)

A

Spiders go around collecting info, but they bring it back to one central location that interprets the info

1) List of Occurrences:
-spiders build a list (index) of all the websites that contain the key word
-it also keeps track of where the word appears on the page

2) Punctuation and Hyphens / foreign languages and accents
ie email vs e-mail
-we don't want two different indexes for these words since they are the same thing
-so the indexer must be able to detect that they are the same word, because we don't want to track them separately

3) Stop Words
-words such as It, The, Is, I
-we don't want to make an index of these words because they appear on almost every page, so it would be pointless

4) Word Variants
ie Sell, Sells, Selling, Sold
-they are basically the same word, so we want them under the same index entry, not different ones

5) Spelling Variants
ie color and colour
-we want them indexed as the same thing
-GOOGLE even keeps track of misspelled words so that the correct results come up even if the search is misspelled

Semantics (6 and 7)
6) Synonymy:
ie big and large
-we want them indexed similarly so that when someone searches for one, the other also comes up

7) Polysemy:
-we want separate indexes for the same word with different meanings
ie bank (river), bank (the institution)
-we don't want them indexed the same because we don't want rivers coming up when someone wants a financial place
-this is fixed by having the engine determine which meaning is intended based on the words around it
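
A rough sketch of how items 1 to 5 could fit together when building an index. The stop-word set and the variant table are tiny hand-written assumptions; real search engines rely on far larger resources, and nothing here is meant as how Google actually does it.

```python
import re
import unicodedata

STOP_WORDS = {"it", "the", "is", "i"}
# Toy table mapping word and spelling variants to one shared form (assumption).
VARIANTS = {"sells": "sell", "selling": "sell", "sold": "sell", "colour": "color"}


def normalize(word):
    """Lowercase, strip accents, punctuation and hyphens, then merge variants."""
    word = unicodedata.normalize("NFKD", word)
    word = "".join(ch for ch in word if not unicodedata.combining(ch))
    word = re.sub(r"[^a-z0-9]", "", word.lower())
    return VARIANTS.get(word, word)


def build_index(pages):
    """Return term -> {page: [positions where the term appears]}."""
    index = {}
    for url, text in pages.items():
        for position, raw in enumerate(text.split()):
            term = normalize(raw)
            if not term or term in STOP_WORDS:
                continue                      # stop words are not indexed
            index.setdefault(term, {}).setdefault(url, []).append(position)
    return index


pages = {
    "page1": "The e-mail is sold",
    "page2": "Selling email: colour caf\u00e9",
}
index = build_index(pages)
```

With this toy data, "e-mail" and "email:" share the single entry "email", "sold" and "Selling" both land under "sell", and "The" and "is" never get an entry.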

6
Q

Evil Spiders

A

-can go around collecting data for bad reasons

1) Scraping and Stealing Content:
-copies content and makes its own webpage with it, and may make it more appealing so more people look at theirs than yours

2) Collecting email addresses
-gives them to spammers, who can now spam you

7
Q

What does the search engine do when searching phrases with more than one word, ie potato soup?

A

1) looks up potato in the index to get all the pages that contain it (including its positions on each page)
2) looks up soup in the index to get all the pages that contain it and its positions
3) looks for pages that contain both words and checks whether their positions on the webpage are next to each other

ie
potato: locations 1, 6, 10
soup: locations 2, 7, 11

each soup comes right after a potato, so the engine knows these words are next to each other, that this is a good page for POTATO SOUP, and it will return this page
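
The position check in step 3 can be sketched as follows, assuming the index maps each word to the pages it appears on and the positions on each page. The page names and the garden.example data are made up; the recipes.example positions are the ones from the example above.

```python
def phrase_positions(first_positions, second_positions):
    """Return positions where the first word is immediately followed by the second."""
    second = set(second_positions)
    return [p for p in first_positions if p + 1 in second]


def phrase_search(index, first, second):
    """Return {page: [start positions]} for pages containing 'first second'."""
    hits = {}
    first_pages = index.get(first, {})
    second_pages = index.get(second, {})
    # Only pages that contain both words can contain the phrase.
    for page in first_pages.keys() & second_pages.keys():
        matches = phrase_positions(first_pages[page], second_pages[page])
        if matches:
            hits[page] = matches
    return hits


index = {
    "potato": {"recipes.example": [1, 6, 10], "garden.example": [3]},
    "soup": {"recipes.example": [2, 7, 11], "garden.example": [9]},
}
result = phrase_search(index, "potato", "soup")
```

garden.example contains both words but never side by side, so only recipes.example is returned.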

8
Q

Page Ranking

A

GOOD THINGS:
1) Lots of references to it
-when lots of websites reference a website, it raises its ranking because it makes it seem important and a more authoritative source
ie wikipedia (lots of people reference it)
(look at pic on notes)

NOTE: if something doesn't have a lot of references BUT an important, heavily linked-to page links to it, then it will still be ranked as pretty important!

2) HTML elements
-having a named link to a page where the name of the link is the key words (potato soup)
-<title>Potato Soup</title>
-<h1>Potato Soup</h1>
all these make the page rank as pretty important for those key words

3) Original Content
-if it's something the engine hasn't seen before, it will be ranked higher, because if someone needs this, it will be a good source to give them since it's original and not covered elsewhere

4) Authoritative Sources
-wikipedia, cnn
-automatically get ranked higher because they are trusted and the engine knows people work to make sure the content is good

PENALIZING THINGS:
1) Excessive Ads:
-relative to the amount of content they have
-this also reduces the likelihood of copied websites profiting
2) Aggregators (no “new” content)
-take a bunch of webpages and smash them together into one so that their likelihood of getting viewed increases

google can prevent these
-they also have mechanisms to prevent things from getting popular just because they are getting lots of negative reviews
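
The link-counting idea in item 1 of GOOD THINGS, including the NOTE about being linked to by an important page, can be sketched with a simplified link-based ranking in the spirit of the PageRank idea. The damping value 0.85, the number of rounds, and the page names are assumptions for illustration, and this is not a description of Google's real ranking.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages so that links from important pages count for more."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no links shares its score with every page.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


links = {
    "hub": ["small"],      # an important page links to "small"
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "small": [],
    "lonely": [],          # nobody links here
}
ranks = rank_pages(links)
```

"hub" scores high because three pages reference it, and "small" outranks "lonely" even though it has only one incoming link, because that link comes from the important "hub" page.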

9
Q

Stop Pages

A

-if a person is looking through a bunch of websites, the website they stop at is deemed a better link than the others and is called the stop page

9
Q

Manipulating Search

A

1) Hidden Text
-have text that matches the background colour
-the spider thinks those words are on the page, but they are not visible to the human eye
-so it may rank the page higher because it has the key words, even though those are not what the webpage is actually about

2) Aggregators
-smash a bunch of stuff onto one webpage so it seems like the webpage is about a bunch of good content and gets ranked higher, but it's actually just a mush of nothing

3) Link Farms
-a bunch of fake webpages are made and LINKED to each other to make it seem like these pages are really important
-or a bunch of pages link to one page so that the one page gets a high ranking, since google thinks it's important because there are lots of links to it

4) Website Hijacking
-comments on trusted websites can make it seem like the trusted website linked to another website when it actually didn't
-this would increase that other page's ranking

5) Google Bombs
-a bunch of people get together and start linking certain words to certain topics so that when that word is searched, it gets associated with that unrelated topic
-ie “miserable failure” and George Bush

10
Q

Effective Google Searching (SEARCH OPERATORS)

A

1) using minus (-)
ie potato soup -celery (no space between the minus and the word)
-will show webpages without the word celery

2) “ ”
-to search for exact phrases in SEQUENCE
-will only show results where these words occur in SEQUENCE
ie “gluten free”

3) OR
-shows results for either one, doesn't have to be both
ie bacon OR delicious

4) ..
-used for searching a range of numbers
ie 10..122

5) filetype:pdf
-for PDFs

6) @, $
-@ for social media tags
-$ for prices

7) *
-placeholder for an unknown word

8) site:
-ie site:nbc.com
-will only show results from that site (website)

In the advanced search settings you can also search by
-a certain language
-region
-when it was uploaded
-the site or domain ie .org
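
To make the meaning of minus, quotes and OR concrete, here is a small sketch that applies the same three rules to plain text. The function matches and its parameter names are hypothetical, it expects lowercase search words, and it only mimics what the operators mean, not how Google evaluates them.

```python
def matches(text, required=(), excluded=(), phrase=None, any_of=()):
    """Return True if text satisfies minus, quoted-phrase and OR style rules."""
    words = text.lower().split()
    joined = " ".join(words)
    if any(word not in words for word in required):
        return False                  # every plain word must appear
    if any(word in words for word in excluded):
        return False                  # minus: word must not appear
    if phrase is not None and f" {phrase.lower()} " not in f" {joined} ":
        return False                  # quotes: words must appear in sequence
    if any_of and not any(word in words for word in any_of):
        return False                  # OR: at least one of the words
    return True


# potato soup -celery
with_celery = matches("potato soup with celery", required=("potato", "soup"), excluded=("celery",))
without_celery = matches("creamy potato soup", required=("potato", "soup"), excluded=("celery",))
# "gluten free"
in_sequence = matches("gluten free bread", phrase="gluten free")
out_of_sequence = matches("free of gluten", phrase="gluten free")
# bacon OR delicious
either = matches("delicious bread", any_of=("bacon", "delicious"))
```

Only the page without celery passes the first search, only the text with the words in order passes the quoted search, and one matching word is enough for the OR search.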
