Module 1 Flashcards

Question 1

Q

What is Data Mining?

Answer

A

the process of extracting (aka mining) knowledge from data

Question 2

Q

What is Machine Learning?

Answer

A

a technique or method in which knowledge is extracted from data. The process of applying a machine learning technique on the data to extract knowledge is referred to as data mining.

Question 3

Q

What are 3 purposes of data?

Answer

A

1. Describing or diagnosing a phenomenon.
2. Predicting events or changes based on the available data.
3. Creating a system that uses data objects and mimics a human cognitive capability, e.g., finding a cat in a picture, understanding a handwritten text, chatting with you about something, etc.

Question 4

Q

How can we describe or diagnose a phenomenon using data?

Answer

A

we use classification, regression, or clustering, i.e., categorizing data with similar properties together

Question 5

Q

How can we predict events or changes based on available data?

Answer

A

prediction, i.e., use of the existing data to describe what will happen in the future.

Question 6

Q

What is Artificial Intelligence?

Answer

A

Creating a system that uses data objects and mimics a human cognitive capability, e.g., finding a cat in a picture, understanding a handwritten text, chatting with you about something, etc.

Question 7

Q

What is the standard format for web documents?

Answer

A

HTML (Hyper Text Markup Language)

Question 8

Q

What are the two main components of a web page? Describe them.

Answer

A

The header part of the page presents an introduction to, and meta information about, the information that exists on that page.

The body contains the actual text of the page
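As a rough illustration of the header/body split, the sketch below separates the text found in the `<head>` and `<body>` elements of a small page using Python's standard `html.parser` module. The sample page and the `HeadBodySplitter` and `split_head_body` names are made up for this example.

```python
from html.parser import HTMLParser


class HeadBodySplitter(HTMLParser):
    """Collect text that appears inside <head> and inside <body> separately."""

    def __init__(self):
        super().__init__()
        self.section = None  # "head", "body", or None when outside both
        self.text = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag

    def handle_endtag(self, tag):
        if tag == self.section:
            self.section = None

    def handle_data(self, data):
        # Keep only non-empty text that falls inside one of the two sections.
        if self.section and data.strip():
            self.text[self.section].append(data.strip())


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """<html>
<head><title>Module 1</title><meta name="author" content="Example"></head>
<body><h1>Flashcards</h1><p>Data mining extracts knowledge from data.</p></body>
</html>"""


def split_head_body(html):
    parser = HeadBodySplitter()
    parser.feed(html)
    return parser.text


parts = split_head_body(SAMPLE_PAGE)
print("head:", parts["head"])
print("body:", parts["body"])
```

Here the title ends up in the head list, while the heading and paragraph end up in the body list.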

Question 9

Q

What is an API?

Answer

A

An application programming interface: an interface that allows consumers to collect data from websites

Question 10

Q

How can a company analyze its own system?

Answer

A

They can use their own server log files
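A minimal sketch of what log-file analysis can look like, assuming the server writes entries in the widely used Common Log Format. The sample log lines, the regular expression, and the `count_page_hits` name are assumptions for illustration, not something the course material specifies.

```python
import re
from collections import Counter

# Assumed layout: Common Log Format (IP, identity, user, [time], "request", status, size).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

# Made-up log entries using documentation-reserved IP addresses.
SAMPLE_LOG = """\
203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /about.html HTTP/1.1" 200 1045
203.0.113.5 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 304 0
"""


def count_page_hits(log_text):
    """Count how many requests each path received."""
    hits = Counter()
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:  # silently skip lines that do not fit the assumed format
            hits[match.group("path")] += 1
    return hits


hits = count_page_hits(SAMPLE_LOG)
print(hits.most_common())
```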

Question 11

Q

What is page tagging?

Answer

A

The collection of users’ data via cookies installed by the web page. They can collect data on browser version, operating system, screen size, etc.

Question 12

Q

What is web scraping?

Answer

A

The process of automatically collecting data from web pages or web resources. It focuses on a single source of information. Another name for it is Web Knowledge Extraction
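To make the idea concrete, here is a small sketch that scrapes one specific kind of information (review text) from a single page. It uses only Python's standard `html.parser` so it runs without extra packages. The page content, the `review` class name, and the `ReviewScraper` and `scrape_reviews` names are hypothetical.

```python
from html.parser import HTMLParser


class ReviewScraper(HTMLParser):
    """Collect the text of every <p class="review"> element."""

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "p" and ("class", "review") in attrs:
            self.in_review = True
            self.reviews.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_review = False

    def handle_data(self, data):
        if self.in_review:
            self.reviews[-1] += data


# Hypothetical product page; a real scraper would download this text first.
SAMPLE_PAGE = """<html><body>
<h1>Product</h1>
<p class="review">Great battery life.</p>
<p class="ad">Buy now!</p>
<p class="review">Stopped working after a week.</p>
</body></html>"""


def scrape_reviews(html):
    scraper = ReviewScraper()
    scraper.feed(html)
    return scraper.reviews


reviews = scrape_reviews(SAMPLE_PAGE)
print(reviews)
```

The advertisement paragraph is ignored because the scraper targets only the elements that hold the information of interest.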

Question 13

Q

What is web crawling?

Answer

A

The process of reading and storing all web pages of a site or number of sites. It is related to gathering pages from the web and indexing them to support a search engine. This downloads the entire website (which is comprised of many web pages)
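The read-and-store loop of a crawler can be sketched as a breadth-first traversal of links. To keep the example runnable offline, the "site" below is an in-memory dictionary rather than a real website. `FAKE_SITE`, `LinkExtractor`, and `crawl` are illustrative names, and a real crawler would also need URL normalization, politeness delays, and robots.txt checks.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Stand-in for a website: path -> HTML. Purely hypothetical content.
FAKE_SITE = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/">Home</a>',
    "/b": '<a href="/c">C</a>',
    "/c": "No links here.",
}


def crawl(start, fetch):
    """Visit every page reachable from start and store its HTML."""
    seen = {start}
    queue = deque([start])
    stored = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        stored[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:  # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return stored


pages = crawl("/", FAKE_SITE.get)
print(list(pages))
```

Passing the fetch function as a parameter keeps the traversal logic separate from how pages are downloaded.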

Question 14

Q

Who or what primarily uses web crawlers?

Answer

A

Web crawling is heavily used by search engines, which download the documents of a website and then store them in their local database.

Question 15

Q

What is an inverted index?

Answer

A

An inverted index is a map of keywords and their location used to access the database of web documents
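A minimal sketch of an inverted index, assuming "location" means the identifier of the document that contains the keyword. The document IDs, sample texts, and the `build_inverted_index` and `search` names are invented for illustration.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical stored web documents.
DOCS = {
    "page1": "Web crawling gathers pages for a search engine",
    "page2": "Web scraping collects data from a single source",
    "page3": "A search engine uses an inverted index",
}

index = build_inverted_index(DOCS)
print(sorted(search(index, "search engine")))
print(sorted(search(index, "web")))
```

Looking up a keyword is a single dictionary access, so the engine does not have to scan every stored document for each query.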

Question 16

Q

What are some other phrases for web crawling?

Answer

A

Web spider, web robots

Question 17

Q

What are some examples of web crawling?

Answer

A

Collecting email addresses (spammer), indexing web pages for fast access in search engines, extracting the best financial offer (airfare ticket purchase application)

Question 18

Q

What is WebSPHINX?

Answer

A

A free tool that is a web crawler

Question 19

Q

What is Robots.txt?

Answer

A

This is a text file that is created by the owner of a web site. It defines which pages or resources can and cannot be crawled. The main directives are Allow and Disallow.
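The sketch below shows how a crawler might check such rules before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the example.com URLs are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="example-bot"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/public/page.html"))
print(is_allowed("https://example.com/private/data.html"))
```

The public page is permitted and the page under `/private/` is not, so a well-behaved crawler would skip the second URL.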

Question 20

Q

What are some examples of applications that benefit from web scraping?

Answer

A

Market forecasting and market studies (scraping online product reviews from Amazon, identifying public opinions), machine language translation (using web text as a template to reconstruct a sentence correctly), Medical diagnostics (retrieve and analyze data from news sites, translated texts, health forums), Opinion mining from social/new media, epidemic propagation (influenza based on geolocation or amount of hate speech (tweets) during different times)

Question 21

Q

What are examples of web scraping libraries in R and Python?

Answer

A

rvest in R and BeautifulSoup in Python

Question 22

Q

What are the 3 types of intellectual properties?

Answer

A

1. Trademarks 2. Copyrights 3. Patents

Question 23

Q

What is a patent?

Answer

A

A patent is a property right granted by the Government of the United States of America to an inventor “to exclude others from making, using, offering for sale, or selling the invention throughout the United States or importing the invention into the United States” for a limited time in exchange for public disclosure of the invention when the patent is granted. Patents are used to declare ownership over inventions only, not digital properties. You cannot patent images, text, or any information itself. Software can be patented because it is the technique that is patented, not the information.

Question 24

Q

What is a trademark?

Answer

A

A trademark is a word, phrase, symbol, and/or design that identifies and distinguishes the source of the goods of one party from those of others. A service mark is a word, phrase, symbol, and/or design that identifies and distinguishes the source of a service rather than goods. The term “trademark” is often used to refer to both trademarks and service marks.

Question 25

Q

What is copyright?

Answer

A

Copyright is a type of intellectual property that protects original works of authorship as soon as an author fixes the work in a tangible form of expression. In copyright law, there are a lot of different types of works, including paintings, photographs, illustrations, musical compositions, sound recordings, computer programs, books, poems, blog posts, movies, architectural works, plays, and so much more!

Question 26

Q

What is the difference between original work and fixed work for copyright?

Answer

A

Original work refers to work that is independently created by a human author and has a minimal level of creativity. In this context, independent creation refers to creation by a human author, without copying from other resources.
Fixed work refers to work that is captured in a permanent medium such that it can be perceived, reproduced, or communicated for more than a short time. For example, a work is fixed when we write it down or record it. This extends to creative works only.

Question 27

Q

What are copyright owners entitled to?

Answer

A

Reproduce the work in copies or phonorecords.
Prepare derivative works based on the work.
Distribute copies or phonorecords of the work to the public by sale or other transfer of ownership or by rental, lease, or lending.
Perform the work publicly if it is a literary, musical, dramatic, or choreographic work; a pantomime; or a motion picture or other audiovisual work.
Display the work publicly if it is a literary, musical, dramatic, or choreographic work; a pantomime; or a pictorial, graphic, or sculptural work. This right also applies to the individual images of a motion picture or other audiovisual work.
Perform the work publicly by means of a digital audio transmission if the work is a sound recording.

Question 28

Q

What is Trespass to Chattel?

Answer

A

This refers to intentional interference with another individual’s property.

Question 29

Q

What criteria need to be met for a web scraper to violate trespass to chattel?

Answer

A

Three criteria need to be met (all together) for a web scraper to violate trespass to chattel:

Lack of consent: Because web servers are open to everyone, they are generally “giving consent” to web scrapers as well. However, many websites specifically prohibit the use of scrapers in their terms of service agreements or restrict them in a file called robots.txt. In addition, any explicit notices delivered to you from the web server revoke this consent.
Actual harm: Servers are expensive properties. If a scraper takes a website down, or limits its ability to serve other users, this adds to the “harm” you cause.
Intentionality: The code we write is meant to perform some harm, such as a DDoS attack.

Question 30

Q

What is ytdl-org used for?

Answer

A

This is a very robust command line web scraping library especially for video files

Question 31

Q

What is Selenium?

Answer

A

It is a multiplatform scraping library that can simulate human behavior for downloading from a web page

Question 32

Q

What is Scrapy?

Answer

A

Another multiplatform scraping library
