Option C: Web Science Flashcards
Internet
An interconnected set of networks and computers that permits the transfer of data governed by protocols like TCP/IP.
Acts as the physical medium for services such as the World Wide Web.
WWW (World Wide Web)
A set of hypertext-linked resources, identified by URIs, that are transferred between a client and a server via the Internet.
Provides a mechanism to share information.
HTTP (Hypertext Transfer Protocol)
The protocol used to transfer and exchange hypermedia.
It permits the transfer of data over a network.
HTTPS (Hypertext Transfer Protocol Secure)
a protocol for secure communication over a computer network.
consists of communication over HTTP within a connection encrypted by SSL or TLS, which provides authentication of the website using digital certificates, as well as integrity and confidentiality through encryption of the communication.
HTML (Hypertext Markup Language)
a semantic markup language that is the standard language used for web pages
URL (Uniform Resource Locator)
a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it
A URL is a specific type of URI
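A minimal sketch (not from the original cards) of how the standard URL API in browsers and Node.js splits a URL into the parts described above; example.com is just a placeholder domain:

```javascript
// Minimal sketch: parsing a URL with the standard URL API (browser or Node.js).
const url = new URL("https://www.example.com/docs/page.html?q=web+science#intro");

console.log(url.protocol);              // "https:" — the scheme / retrieval mechanism
console.log(url.hostname);              // "www.example.com" — the location on the network
console.log(url.pathname);              // "/docs/page.html" — the resource on that host
console.log(url.searchParams.get("q")); // "web science" — query parameters
console.log(url.hash);                  // "#intro" — fragment within the resource
```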
XML (Extensible Markup Language)
A way of writing data in a tree-structured form by enclosing it in tags. It is human readable AND machine readable and it is used for representation of arbitrary data structures.
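A minimal sketch of reading such a tree programmatically with the browser's DOMParser; the tag names (library, book, title, author) are invented for illustration:

```javascript
// Minimal sketch: parsing a small XML document in the browser with DOMParser.
const xml = `
  <library>
    <book id="1">
      <title>Weaving the Web</title>
      <author>Tim Berners-Lee</author>
    </book>
  </library>`;

const doc = new DOMParser().parseFromString(xml, "application/xml");
const title = doc.querySelector("book > title").textContent;
console.log(title); // "Weaving the Web" — data recovered from the tree structure
```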
XSLT (Extensible Stylesheet Language Transformations)
a stylesheet language for XML. It is used for transforming XML documents into other XML documents or other formats such as HTML for web pages, plain text or XSL Formatting Objects, which may subsequently be converted to other formats such as PDF, PostScript and PNG.
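A minimal browser-only sketch of applying an XSLT transformation with XSLTProcessor; it assumes xmlDoc and xslDoc are already-parsed XML documents (e.g. produced by DOMParser or fetched from a server), which are illustrative here:

```javascript
// Minimal sketch (browser only): transforming XML into HTML with XSLTProcessor.
const processor = new XSLTProcessor();
processor.importStylesheet(xslDoc);      // load the XSLT transformation rules
const htmlFragment = processor.transformToFragment(xmlDoc, document);
document.body.appendChild(htmlFragment); // display the transformed result
```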
JavaScript
An object-oriented computer programming language commonly used to create interactive effects within web browsers.
CSS (cascading style sheet)
contains hierarchical rules describing how the content of a web page will be rendered in a browser.
URI (Uniform Resource Identifier)
a string that identifies a resource on the Internet; more general than a URL, which also specifies how to locate and retrieve the resource.
Describe how a domain name server functions
(steps)
- The user types the domain name into the address bar of the web browser and presses “Enter” on the keyboard.
- The browser sends the domain name to a “Domain Name Server” or DNS.
- The DNS server that the user’s system is configured to use (primary DNS) checks through its own database to see if the domain name is there.
- If it isn’t, the request is passed on to the next DNS server in the hierarchy.
- This continues until the domain name is found or the top level / authoritative DNS server is reached.
- When the IP address is found it is sent back to the original DNS server and on to the browser.
- If the IP address is not found, an error message is returned
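A minimal Node.js sketch of the same idea: asking the system's configured resolver for an IP address; example.com is a placeholder domain:

```javascript
// Minimal sketch: querying the configured DNS resolver from Node.js.
const dns = require("node:dns").promises;

async function lookupDomain(name) {
  try {
    const addresses = await dns.resolve4(name); // IPv4 records for the name
    console.log(`${name} -> ${addresses.join(", ")}`);
  } catch (err) {
    console.error(`No IP address found for ${name}: ${err.code}`); // e.g. ENOTFOUND
  }
}

lookupDomain("example.com");
```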
IP (Internet Protocol)
A part of TCP/IP protocol suite and the main delivery system for information over the Internet. IP also defines the format of a packet.
TCP (Transmission Control Protocol)
A data transport protocol that includes mechanisms for reliably transmitting packets to a destination.
FTP (File Transfer Protocol)
A TCP-based protocol used to pass files from host to host. Files can also be manipulated/modified remotely. Control information (log-ins) is sent separately from the main file data, which distinguishes FTP from HTTP.
< head >
not visible on a page, but contains important information about it in the form of metadata
< title >
inside the head; its content is displayed in the tab of the web page.
< meta > tags
various types of meta tags; they give search engines information about the page, but are also used for other purposes, such as specifying the charset used.
< body >
The main part of the page document. This is where all the visible content goes.
navigation bar
a set of hyperlinks that give users a way to display the different pages in a website
hyperlinks
“Hot spots” or “jumps” to locate another file or page; represented by a graphic or colored and underlined text.
table of contents
An ordered list of the topics in a document, along with the page numbers on which they are found. Usually located at the beginning of a long document; on web pages, normally in a sidebar.
continuation
the area of the web page that prevents the sidebar from extending to the bottom of the web page.
protocols
A set of rules governing the exchange or transmission of data between devices.
standards
a set of technical specifications that should be adhered to.
ISO (International Organization for Standardization)
a non-governmental organization that develops and publishes international standards. these standards ensure safety, reliability and quality for products and services.
personal page
A web page created by an individual that contains valid and useful opinions, links to important resources, and significant facts. usually static; normally created using some form of website builder like Wix.
blog
A Web log, which is a journal or newsletter that is updated frequently and published online.
- Only the owner can post an article / open a thread of discussion / start a theme.
- Registered users may be allowed to comment but the owner may moderate the comments before displaying them.
- Users cannot edit or delete posts.
Search Engine Pages
indexes content from the internet or an intranet and serves related links based on a user’s queries. uses web crawlers. the back-end is programmed in an efficient language, e.g. C++
Forums
An online discussion group, much like a chat room.
- All registered participants can post an article / open a thread.
- All registered users are allowed to comment (without moderation).
- Can have moderators who can edit or delete posts after they have been made.
Wiki
define + evaluate
A collaborative website that can be edited by anyone who can access it
- can be vandalised by users with ill intent.
+ ability to change quickly.
Static websites
sites that only rely on the client-side and don’t have any server-side programming. The website can still be dynamic through use of JavaScript for things like animations.
static websites pros and cons
Pros:
- lower cost to implement
- flexibility
Cons:
- scalability
- hard to update
- higher cost in the long term to update content
dynamic website
A website that generates web pages on the server, usually retrieving content dynamically from a database.
This allows for data processing on the server and enables much more complex applications.
dynamic website pros and cons
pros:
- information can be retrieved in an organised way
- allows for content management systems
- low ongoing cost, unless design changes
Cons:
- sites are usually based on templates, making them less individual
- higher initial cost
- usually larger codebase
explain the function of a browser
- interprets and displays information sent over the internet in different formats
- retrieves information from the internet via hyperlinks
server-side scripting
Also called back-end scripting; scripts are executed on the server before the web page is downloaded by a client (e.g. if you log in to an account, your input is sent to the server to be checked before your account page is downloaded). These are the parts of the web page that must be refreshed whenever there is a change. e.g. CGI
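A minimal sketch of the idea using Node's built-in http module; the port number and generated page are illustrative, not tied to any particular framework:

```javascript
// Minimal sketch of server-side scripting: the code runs on the server and the
// client only receives the generated HTML.
const http = require("node:http");

const server = http.createServer((request, response) => {
  // Work done here (database lookups, log-in checks, etc.) happens on the
  // server before the page is sent to the client.
  const page = `<html><body><p>Generated at ${new Date().toISOString()}</p></body></html>`;
  response.writeHead(200, { "Content-Type": "text/html" });
  response.end(page);
});

server.listen(8080); // visit http://localhost:8080/ to see the generated page
```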
Client-side scripting
Happens in the browser of the client. It is used for animations, form validation and also to retrieve new data without reloading the page. e.g. in a live-chat
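A minimal browser-side sketch showing both uses; the element ids and the /api/messages endpoint are assumptions for illustration:

```javascript
// Minimal sketch of client-side scripting: form validation in the browser and
// fetching new data without reloading the page.
const emailInput = document.getElementById("email");
const form = document.getElementById("signup-form");

form.addEventListener("submit", (event) => {
  if (!emailInput.value.includes("@")) {
    event.preventDefault();              // stop submission in the browser
    alert("Please enter a valid email"); // no round trip to the server needed
  }
});

// Retrieve new data without a full page reload (e.g. a live chat).
async function refreshMessages() {
  const response = await fetch("/api/messages");
  const messages = await response.json();
  document.getElementById("chat").textContent = messages.join("\n");
}
```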
cookies
hold data specific to a website or client and can be accessed by either the server or the client. the data in a cookie can be retrieved and used on a web page. some sites require cookies to function. cookies are used to carry information from one session to another and reduce the need for server machines with huge amounts of data storage –> smaller and more efficient.
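A minimal sketch of reading and writing a cookie from client-side JavaScript; the cookie name "theme" is illustrative, and a server could set the same cookie with a Set-Cookie response header:

```javascript
// Minimal sketch: setting and reading a cookie in the browser.
document.cookie = "theme=dark; max-age=31536000; path=/"; // persists ~1 year

// document.cookie returns all cookies for the page as "name=value; name=value".
function readCookie(name) {
  const match = document.cookie
    .split("; ")
    .find((pair) => pair.startsWith(name + "="));
  return match ? match.split("=")[1] : null;
}

console.log(readCookie("theme")); // "dark" — available again in the next session
```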
why is XML used in server-side scripting?
XML is a flexible way to structure data and can therefore be used to store data in files or to transport data. It allows data to be easily manipulated, exported, or imported. websites can then be designed separately from data content.
CGI (Common Gateway Interface)
+ how it works
Intermediary between client side and server side.
Provides interactivity to web applications / enables forms to be submitted. It uses a standard protocol that acts as an intermediary between the CGI program and the web server. CGI allows the web server to pass a user’s request to an application program and then forward the program’s output back to the user’s browser.
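A minimal sketch of a CGI-style program written in Node.js, assuming the web server passes the query string in the QUERY_STRING environment variable and forwards whatever the program prints to the browser; the "name" parameter is illustrative:

```javascript
#!/usr/bin/env node
// Minimal sketch of a CGI program: request data arrives via environment
// variables, and the response (headers, blank line, body) is written to stdout.
const query = new URLSearchParams(process.env.QUERY_STRING || "");
const name = query.get("name") || "world";

process.stdout.write("Content-Type: text/html\r\n\r\n");
process.stdout.write(`<html><body><p>Hello, ${name}!</p></body></html>\n`);
```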
search engine
Software that interrogates a database of web pages
surface web
(open internet) web sites freely accessible to all users over the internet. web that can be reached by a search engine. static and fixed pages. e.g. Google, Facebook, YouTube
deep web
Web content that is not indexed by search engines, e.g. proprietary web sites accessible over the internet only to authorized users and often at a cost
in-links
links that point to the page
out-links
links from the page that point to other pages
PageRank Algorithm
analyzes links between web pages to rank relevant web pages in terms of importance
Factors:
- quantity, quality (rank), and relevance of inlinks
- number of outlinks on the linking page (fewer outlinks means each link passes more value)
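A minimal sketch of the iterative computation on a tiny made-up link graph; the graph, damping factor and iteration count are illustrative, not Google's actual implementation:

```javascript
// Minimal sketch of iterative PageRank on a small link graph.
const graph = {            // page -> list of pages it links to (outlinks)
  A: ["B", "C"],
  B: ["C"],
  C: ["A"],
};

const pages = Object.keys(graph);
const d = 0.85;            // damping factor from the original PageRank paper
let rank = Object.fromEntries(pages.map((p) => [p, 1 / pages.length]));

for (let i = 0; i < 20; i++) {
  const next = Object.fromEntries(pages.map((p) => [p, (1 - d) / pages.length]));
  for (const page of pages) {
    const share = rank[page] / graph[page].length; // fewer outlinks -> bigger share
    for (const target of graph[page]) {
      next[target] += d * share;                   // inlinks add to the target's rank
    }
  }
  rank = next;
}

console.log(rank); // pages with better inlinks end up with higher scores
```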
HITS algorithm
- identifies a set of web pages relevant to the user’s query
- gives each web page an authority score based on the number and quality of inlinks
- gives each web page a hub score based on the number and quality of outlinks
- combines authority and hub score to generate a combined score for each web page
- ranks web pages based on this score
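A minimal sketch of the hub/authority iteration on a small made-up link graph (not a production implementation):

```javascript
// Minimal sketch of the HITS hub/authority iteration.
const links = {            // page -> pages it links to
  A: ["B", "C"],
  B: ["C"],
  C: ["A", "B"],
};
const pages = Object.keys(links);
let hub = Object.fromEntries(pages.map((p) => [p, 1]));
let auth = Object.fromEntries(pages.map((p) => [p, 1]));

const normalise = (scores) => {
  const total = Math.sqrt(Object.values(scores).reduce((s, v) => s + v * v, 0));
  for (const p of pages) scores[p] /= total;
};

for (let i = 0; i < 20; i++) {
  // authority score: sum of hub scores of pages linking in
  for (const p of pages) auth[p] = 0;
  for (const p of pages) for (const q of links[p]) auth[q] += hub[p];
  // hub score: sum of authority scores of pages linked out to
  for (const p of pages) hub[p] = links[p].reduce((s, q) => s + auth[q], 0);
  normalise(auth);
  normalise(hub);
}

console.log({ auth, hub }); // combine both scores to rank the pages
```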
web-crawler
A softbot responsible for following hyperlinks throughout the Internet to provide information for the creation of a web index.
how does a web-crawler work?
for each page it finds, a copy is downloaded and indexed. in this process it extracts all links from the given page and then repeats the same process for each link found. it tries to find as many pages as possible.
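A minimal sketch of that loop, assuming Node's global fetch and a crude regular expression for link extraction; a real crawler would parse HTML properly and respect robots.txt:

```javascript
// Minimal sketch of a crawl loop: download a page, index it, queue its links.
async function crawl(startUrl, maxPages = 10) {
  const queue = [startUrl];
  const visited = new Set();

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    const html = await (await fetch(url)).text();   // download a copy
    indexPage(url, html);                           // hand it to the indexer

    // Extract absolute links and add unseen ones to the queue.
    for (const [, link] of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
}

function indexPage(url, html) {
  console.log(`indexed ${url} (${html.length} bytes)`);
}
```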
web-crawler limitations
- whether they look at meta-data contained in the head of web pages depends on the crawler.
- a crawler might not be able to read dynamic content, as crawlers are simple programs.
robots.txt
A file written and stored in the root directory of a website that restricts web-crawlers from indexing certain pages of the website. not all bots follow the standard, and malware can ignore robots.txt. saves time by focusing crawlers on important sections of the site.
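A minimal sketch of how a well-behaved crawler might honour Disallow rules; the file content and paths are made up, and real parsers also handle user-agent groups, wildcards and Allow rules:

```javascript
// Minimal sketch of checking a path against robots.txt Disallow rules.
const robotsTxt = `
User-agent: *
Disallow: /private/
Disallow: /tmp/
`;

const disallowed = robotsTxt
  .split("\n")
  .filter((line) => line.startsWith("Disallow:"))
  .map((line) => line.slice("Disallow:".length).trim());

function mayCrawl(path) {
  return !disallowed.some((prefix) => prefix && path.startsWith(prefix));
}

console.log(mayCrawl("/index.html"));          // true
console.log(mayCrawl("/private/report.html")); // false — the crawler skips it
```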
relationship between meta-tags and web-crawlers
the description meta-tag provides the indexer with a short description of the page, while the keywords meta-tag provides keywords about the page. meta-tags used to play a role in ranking, but they have been overused and are therefore no longer considered as important. crawlers still compare the keywords in meta-tags to the content of the page when weighting it, so they remain somewhat important.
parallel web-crawling pros and cons
pros:
- the size of the web grows, increasing the time a single crawler would take to download pages.
- scalability; a single process cannot handle the growing web.
- network load dispersion; as the web is geographically dispersed, dispersing crawlers disperses the load.
- network load reduction
cons:
- overlapping; might index pages multiple times
- quality; if a crawler wants to download important pages first, this may not work well, as each parallel crawler only sees part of the web.
- communication bandwidth; parallel crawlers need to communicate, which takes significant bandwidth.
- if parallel crawlers request the same page frequently over a short time it will overload the servers.
outline the purpose of web-indexing in search engines
search engines index websites in order to respond to search queries with relevant information as quickly as possible. for this reason the search engine stores information about indexed pages in its database, so it can quickly identify pages relevant to a search query. indexing also has the purpose of giving a page a certain weight to allow for ranking later.
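A minimal sketch of the core data structure behind such an index, an inverted index mapping each word to the pages containing it; the sample pages are made up:

```javascript
// Minimal sketch of an inverted index: word -> set of pages containing it.
const pages = {
  "page1.html": "web science studies the world wide web",
  "page2.html": "search engines index the web",
};

const index = new Map();
for (const [url, text] of Object.entries(pages)) {
  for (const word of text.toLowerCase().split(/\s+/)) {
    if (!index.has(word)) index.set(word, new Set());
    index.get(word).add(url);
  }
}

// Answering a query becomes a fast lookup instead of re-reading every page.
console.log(index.get("web"));   // Set { "page1.html", "page2.html" }
console.log(index.get("index")); // Set { "page2.html" }
```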
suggest how developers can create pages that appear more prominently in search engine results.
called SEO (search engine optimization). there are different techniques. SEO is a big part of web marketing, as search engines do not disclose exactly how they work, making it hard for developers to perfectly optimise pages.
parameters search engines use to compare
relevance: determined by programs like PageRank. the bigger the index the more pages the search engine can return that have relevance.
user experience: search engines look to find the ‘best’ results for the searcher and part is the user experience, includes ease of use, navigation, direct and relevant information, etc.
White Hat SEO
also known as ‘ethical’ SEO; tactics focus on the human audience rather than on search engines
challenges to search engines as the web grows
- The sheer scope of the web (it’s getting bigger)
- Mobile devices accessing / using search
- Making search relevant locally
- Error management
- Lack of quality assurance of information uploaded
mobile computing
Technology that allows transmission of data, voice, and video via any wireless-enabled computing device, without a fixed physical connection.
types of mobile computing devices
- wearables
- smartphones
- tablets
- laptops
- transmitters
- other hardware involved in cellular networks
characteristics of mobile computing
- portability
- social interactivity
- context sensitivity
- connectivity
- individual customisation
mobile computing pros and cons
pros:
- increase in productivity
- entertainment
- cloud computing
- portability
cons:
- quality connectivity
- security concerns
- power consumption
ubiquitous computing
The condition in which computing is so woven into the fabric of everyday life that it becomes indistinguishable from it. e.g. smart watches.
type of ubiquitous computing devices
- embedded devices
- mobile computing devices
- networking devices
peer-to-peer computing (P2P)
A process in which people share the resources of their computer by connecting directly and communicating as equals.
types of P2P devices
usually PCs
characteristics of P2P
- decentralised
- each peer acts as client and server
- resources and content are shared amongst all peers, and can be shared faster than from client to server
- malware can be distributed faster
grid computing
a computer network where each computer shares its resources with all other computers in the system.
types of grid computing devices
PCs and servers
characteristics of grid computing
- all computers are spread out but connected.
- grid computing develops a ‘virtual supercomputer’ in a system.
grid computing pros and cons
pros:
- solves larger more complex problems in less time.
- easier collaboration.
- makes efficient use of existing hardware.
- less chance of failure.
cons:
- software and standards still developing
- non-interactive job submission –> unreliable
interoperability
the capability of two or more computer systems to work cooperatively (share data and resources), even though they are made by different manufacturers. the computers need to agree on how to exchange information, which brings in standards.
open standards
- These are standards that are publicly available and (normally) free to use.
- These standards are one factor aiding interoperability.
characteristics of open standards
- public availability
- collaborative development
- royalty free
- voluntary adoption
explain why distributed systems may act as a catalyst to a greater decentralisation of the web
Distributed systems are designed to distribute tasks and data across multiple computers or devices, rather than relying on a central server or location. They are decentralised by design.
lossy compression
- Reduces file-size by removing some of the data in the file.
- Once the data is removed, the data cannot be recovered.
- Generally results in a loss of quality.
- Files compressed using this form of compression are used in their compressed form.
general features of lossy compression
- looks for common patterns in data to compress
- part of original data is lost
- compresses to low file size
- usually includes settings for compression quality (gives users options)
- as data becomes compressed the quality deteriorates
lossless compression
- Reduces file size by looking for repeated patterns of data / redundant data and replacing those with a single shorter “token”.
- Tokens are associated with the data they represent by using a dictionary added to the file.
- Files must be decompressed before they can be used.
- Decompression software reads the dictionary and replaces all the tokens with the original data.
- Files do not lose any of the data they contain when compressed / decompressed.
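A minimal sketch of the dictionary/token idea on plain text; this is a toy scheme for illustration, not a real compression format such as ZIP:

```javascript
// Minimal sketch of dictionary/token lossless compression: repeated words are
// replaced by token numbers, and the dictionary is stored alongside the tokens
// so the original can be rebuilt exactly.
function compress(text) {
  const words = text.split(" ");
  const dictionary = [...new Set(words)];                 // each distinct word once
  const tokens = words.map((w) => dictionary.indexOf(w)); // word -> token number
  return { dictionary, tokens };
}

function decompress({ dictionary, tokens }) {
  return tokens.map((t) => dictionary[t]).join(" ");      // tokens -> original words
}

const original = "the web and the web and the web";
const packed = compress(original);
console.log(packed.tokens);                   // [0, 1, 2, 0, 1, 2, 0, 1]
console.log(decompress(packed) === original); // true — no data was lost
```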
general features of lossless compression
- can typically only compress files to around 50% of their original size
- important where compression must not cause any loss of information
evaluation of lossy compression
- significant reduction in file size.
- most important use is streaming multimedia files and VoIP
- does not work with all file types
- consider compression ratio (file size to quality)
evaluation of lossless compression
- same data as the initial file
- required where files, e.g. installation files, must be exactly the same after compression and decompression
- no loss in quality with lossless compression
client-server architecture
an application is split into the client side and the server side. a client-server application does not necessarily need to work over the internet; it could be limited to a local network.
cloud computing
relies on client-server architecture, but places focus on sharing resources over the internet. often offered as a service to individuals and companies.
cloud computing advantages
- elasticity (scale up or down depending on demand)
- pay per use (elasticity allows user to pay for the resources that they actually use)
- self-provisioning (can create account on your own)
private cloud
a cloud computing environment dedicated to a single organization
(a company owns the data centres that deliver the services to internal users only)
private cloud pros and cons
pros:
- scalability.
- self-provisioning
- direct control
- changing computer resources on demand
- limited access through firewalls improves security
cons:
- same high costs for maintenance, staffing, management
- additional costs for cloud software
public cloud
service provided by a third party and are usually available to the general public.
public cloud pros and cons
pros:
- easy and inexpensive because the provider covers hardware, application and bandwidth costs
- scalability to meet needs
- no wasted resources
- costs calculated by resource consumption only
cons:
- no control over sensitive data
- security risks
hybrid cloud
best of both private and public clouds. sensitive and critical applications run in a private cloud, while the public cloud is used for apps that require high scalability.
copyright
the exclusive legal right, given to an originator or an assignee to print, publish, perform, film, or record literary, artistic, or musical material, and to authorize others to do the same.
intellectual property
A product of the intellect, such as an expressed idea or concept, that has commercial value.
privacy
the seclusion of information from others. this can relate to health care records, sensitive financial institutions, residential records. essential to prevent unauthorised access.
identification
defined as the process of claiming an identity. this process is important for privacy and is required for authentication.
authentication
proving someone’s claimed identity. usually done through a username and password or two-factor authentication.
explain why the web may be creating unregulated monopolies
the world wide web should be a free place where anybody can have a website. having one normally comes at the cost of a domain name, and to reach an audience, further marketing through SEO is usually necessary. Therefore, it is normally better to publish content on an existing platform, e.g. Twitter or Blogspot.
Leads to unregulated monopolies.
positive and negative effects of a decentralised and democratic web
positive:
- more control over data
- making surveillance harder
- avoid censorship
- possibly faster speed
negative:
- barrier to usability –> harder for novices
- less practical sometimes
- DNS alternatives necessary for legible domain names
- higher maintenance costs
white hat optimization techniques
- Include high quality website content.
- Use appropriate meta tags(keywords/descriptions).
- Separate content from formatting.
- Use site maps.
- Include a robots.txt file in the site’s root directory.
black hat optimization techniques
- link farming
- keyword stuffing
- cloaking
- hidden text or links
search engine metrics
- Number of hits.
- Popularity of linking pages.
- Amount of links to the same page departing from the source page.
- Trustworthiness of the linking domain.
- Relevance of content between source and target page.
- Time to download.
features/characteristics of a dynamic website
- Allows user interaction.
- Allows parts of the content to be changed without uploading the complete page.
- Can connect to server-side databases.
- Can provide different views for different users.
- Includes the use of scripts / server side scripting.
characteristics of a static website
- Content can only be changed by an administrator.
- Provides the same view to all users.
- Displays exactly the information that is stored in the html file.
creative commons
- A way to manage copyrighted material especially on the web.
- Allows copyrighted materials to be shared and used in a friendly manner.
- Usage is limited to the purposes and conditions specified in the license.
- The author of the material maintains the rights on the material through the copyright.
SSL / TLS
- Protocol that allows data to be securely transmitted across a network.
- Provides authentication of server.
- Uses PKI to securely exchange keys.
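A minimal Node.js sketch of using HTTPS (HTTP carried over TLS); the handshake, certificate check and encryption are handled inside the https module, and example.com is a placeholder:

```javascript
// Minimal sketch: making an HTTPS request from Node.js.
const https = require("node:https");

https.get("https://example.com/", (response) => {
  // The server was authenticated against its digital certificate before any
  // application data was exchanged over the encrypted connection.
  const certificate = response.socket.getPeerCertificate();
  console.log("status:", response.statusCode);
  console.log("issued to:", certificate.subject && certificate.subject.CN);
  response.resume(); // drain the response body
});
```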
black hat SEO
definition
Techniques aimed to improve the ranking of a web page that are considered misleading / unethical / designed to gain an unfair advantage.
net neutrality
Refers to the principle that all traffic, providers and users should be treated equally by internet service providers.
characteristics of HTTP
- is an application-layer protocol
- functions as a request–response protocol
- the protocol to exchange or transfer hypertext
- stateless - Each transaction between the client and server is independent and no state is set based on a previous transaction or condition.
- Uses requests from the client to the server and responses from the server to the client for sending and receiving data.
- utilizes headers at the start of each message.
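A minimal sketch of the request–response pattern using the fetch API; example.com is a placeholder and the headers shown are illustrative. Because HTTP is stateless, each request is independent and anything the server needs to remember must be resent (e.g. in headers or cookies):

```javascript
// Minimal sketch of an HTTP request and response with fetch.
async function show() {
  const response = await fetch("https://example.com/", {
    method: "GET",
    headers: { "Accept": "text/html" },             // request headers
  });

  console.log(response.status);                     // e.g. 200 — response status
  console.log(response.headers.get("content-type")); // response headers
  console.log((await response.text()).length);       // the message body
}

show();
```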
characteristics of HTTPS
- HTTPS encrypts the request and response. If you were to snoop (or spy) on the network data, you would only (theoretically) see the origin and destination IP and port numbers
- HTTPS piggybacks (or rides) on top of HTTP
characteristics of HTML
- is a markup language. A markup language is a system for annotating a document in a way that is syntactically distinguishable from the text
- HTML markup consists of several key components, including those called tags (and their attributes), character-based data types, character references and entity references.
- HTML tags most commonly come in pairs although some represent empty elements and so are unpaired.
- The first tag in such a pair is the start tag, and the second is the end tag (they are also called opening tags and closing tags)
characteristics of CSS
- CSS has selectors. Selectors declare which part of the markup a style applies to by matching tags and attributes in the markup itself.
- CSS has declaration blocks. A declaration block consists of a list of declarations in braces. Each declaration consists of a property, a colon (:), and a value.
- CSS has specificity. Specificity refers to the relative weights of various rules. It determines which styles apply to an element when more than one rule could apply.
- CSS has inheritance. Inheritance is a key feature in CSS; it relies on the ancestor-descendant relationship to operate. Inheritance is the mechanism by which properties are applied not only to a specified element, but also to its descendants
purpose of a URL
easily retrieve network resources and facilitate linking to resources
purposes of protocols
- Ensure data integrity (overall completeness, accuracy and consistency of data)
- Regulate flow control (In data communications, flow control is the process of managing the rate of data transmission between two nodes to prevent a fast sender from overwhelming a slow receiver)
- Manage deadlock (A condition that occurs when two processes are each waiting for the other to complete before proceeding)
- Manage congestion (Network congestion in data networking is the reduced quality of service that occurs when a network node or link is carrying more data than it can handle)
- Manage error checking (techniques that enable reliable delivery of digital data over unreliable communication channels)
difference between standard and protocol
a standard is a set of guidelines or requirements that define how something should be done, while a protocol is a set of rules that define how devices should communicate with each other
characteristics of IP
- Connectionless - No connection with the destination is established before sending data packets.
- Best Effort (unreliable) - Packet delivery is not guaranteed.
- Media Independent - Operation is independent of the medium carrying the data.
common GUI components of a web browser
- Address bar for inserting a URI
- Back and forward buttons
- Bookmarking options
- Refresh and stop buttons for refreshing or stopping the loading of current documents
- Home button that takes you to your home page
software components of a web browser
- User interface: this includes the address bar, back/forward button, bookmarking menu, etc.
- Browser engine: marshals actions between the UI and the rendering engine.
- Rendering engine: responsible for displaying requested content.
- Networking: for network calls such as HTTP requests
- UI backend: used for drawing basic widgets like combo boxes and windows. This backend exposes a generic interface that is not platform specific. Underneath it uses operating system user interface methods.
- JavaScript interpreter: Used to parse and execute JavaScript code.
- Data storage: This is a persistence layer. The browser may need to save all sorts of data locally, such as cookies. Browsers also support storage mechanisms such as localStorage, IndexedDB, WebSQL and FileSystem.
parallel web crawler
define
A parallel crawler is a crawler that runs multiple processes in parallel.
why is compression of data necessary?
- Reducing bandwidth requirements
- Improving transmission speeds
- Reducing storage requirements
- Reducing costs
benefits of decompression software
+ mention drawbacks
- Improved transmission speeds: because data can be sent in compressed form and decompressed at the destination, less data needs to be transmitted.
- Reduced storage requirements: data can be kept in compressed form and decompressed only when needed, reducing the storage space required by both the sender and the receiver.
- Reduced costs: reducing the amount of data that needs to be transmitted or stored also helps to reduce costs, such as the costs of bandwidth or storage.
- Improved compatibility: Decompressing data can also improve compatibility with different systems or applications, as it allows the data to be accessed and used in its original form.
However, it is important to note that decompression software may also have some limitations or drawbacks, such as the need for additional processing power and the potential for security vulnerabilities if the decompression software is not properly secured.
how has the web has supported new methods of online interaction
- Connecting people
- Facilitating communication
- Providing access to information
- Enabling new forms of expression
characteristics of cloud computing
- hosted on the internet
- often location independent
- SaaS, PaaS, etc. can be used to replace local storage and processing with relevant implications for the organization
- Scalability and flexibility
- Pay as per the use
cloud computing vs client-server model
- Cloud computing is often offered as a service to individuals and companies whereas local client server architecture is based at an organizational level
- Cloud computing can scale up or down depending on current demands more easily than local client server networks
- Client / server networks are more secure than the cloud as the data transmission is carried out locally
- Client / server networks have lower levels of latency than the cloud as the data transmission is carried out locally;