Week 1 - Web Data Collection I Flashcards

1
Q

What is Data Science?

A

Any field of research that involves the processing of large amounts of data in order to provide insights into real-world problems

1
Q

Why do we need Data Science?

A

Because we address a diverse range of crime problems using data. DS provides powerful computational techniques to model data and uncover hidden patterns. We are critical consumers of data science.

2
Q

Why is DS useful?

A
  1. Develop a deeper understanding of issues
  2. Make evidence-based, informed decisions
  3. Build effective applications to prevent crime

3
Q

Why is DS in popular demand within crime and security?

A

Skill shortage: the availability of people with specialist data skills in the UK is not sufficient to meet demand

4
Q

What are examples of natural language (text) data?

A

news, websites, social media, repositories of electronic papers, blog posts, underground marketplaces, e-commerce sites

5
Q

What are the possible threats posed by evolving data science tools?

A

Creating malware, generating fake news, phishing scams, automating fraudulent activities, social engineering, identity theft

6
Q

How can you apply web data?

A

In whatever way suits your purpose. People are connected by technology on the web and use it in their own ways.

7
Q

What are common and heightened security threats?

A
  1. Human trafficking and child exploitation
  2. Deepfakes (and the need to detect them)
  3. Fraud (and the need to detect it)

8
Q

How do you obtain data from the web?

A
  1. Direct download - the data is available in a downloadable format and open to the public
  2. Manual approaches - time consuming, prone to errors, and not practical
  3. Automated approaches - application programming interface (API) and web scraping

9
Q

What is an API?

A

API stands for Application programming interface.

It is a collection of programming code that lets software programs communicate with each other and retrieve data. APIs enable data transmission in an automated and efficient manner.

10
Q

How do APIs work?

A
  1. The client makes a request
  2. The client sends the request to the API server
  3. The API server processes the request and retrieves the data from the database
  4. The API server sends the data back to the client
  5. The response message can be plain text, HTML, XML or JSON

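
The request/response cycle above can be sketched in base R without any network access. This is only a toy simulation: articles_db, api_server and the fields of the request list are hypothetical names made up for illustration, and the response body is plain text (one of the formats listed above).

```r
# Toy "database" held by a hypothetical API server
articles_db <- data.frame(
  id = c(1, 2, 3),
  section = c("uk-news", "world", "uk-news"),
  headline = c("Headline A", "Headline B", "Headline C"),
  stringsAsFactors = FALSE
)

# Steps 3-4: the server processes the request, looks up the data and responds
api_server <- function(request) {
  rows <- articles_db[articles_db$section == request$section, ]
  # Step 5: the message body here is plain text, one headline per line
  list(status = 200, body = paste(rows$headline, collapse = "\n"))
}

# Steps 1-2: the client builds a request and sends it to the server
request <- list(endpoint = "/articles", section = "uk-news")
response <- api_server(request)

# The client unpacks the response body into an R character vector
headlines <- strsplit(response$body, "\n", fixed = TRUE)[[1]]
print(headlines)
```

A real API would sit behind a URL and usually return JSON or XML rather than newline-separated text, but the order of the steps is the same.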
11
Q

What are the different types of APIs?

A

Public APIs, partner APIs, private APIs

12
Q

What are API protocols?

A

They define how programs interact with and understand each other: the rules, restrictions and limitations of communication

13
Q

What are some tools for communicating with APIs from R?

A

Some platforms provide an R package for their API. These packages wrap the service's API calls in a set of easy-to-use functions.

14
Q

What are the restrictions of APIs?

A
  1. Most APIs have some form of access control
  2. APIs have rate limits that control usage and manage traffic
  3. Rate limits are often calculated in requests per second
  4. There are restrictions on what you can access, how much data you can access, and what you can do with the data (copyright)
  5. The provider might charge a fee

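
One way to stay under a requests-per-second limit is to pause between calls. The sketch below is an assumption-laden illustration in base R: throttled_requests and fake_fetch are hypothetical helpers, and the limit of 10 requests per second is made up. A real API documents its own limit.

```r
# Call `fetch` once per item, pausing so that no more than
# `max_per_second` requests are started in any one second.
throttled_requests <- function(items, fetch, max_per_second = 2) {
  min_gap <- 1 / max_per_second          # minimum seconds between requests
  results <- vector("list", length(items))
  for (i in seq_along(items)) {
    started <- Sys.time()
    results[[i]] <- fetch(items[[i]])
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    # Sleep only if the call returned faster than the allowed gap
    if (i < length(items) && elapsed < min_gap) Sys.sleep(min_gap - elapsed)
  }
  results
}

# Stand-in for a real API call (hypothetical)
fake_fetch <- function(query) paste("results for", query)

out <- throttled_requests(c("crime", "security", "fraud"), fake_fetch,
                          max_per_second = 10)
print(out[[1]])
```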
15
Q

How would you install the package and request data?

A

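install.packages("guardianapi")  # install once
library(guardianapi)
gu_api_key()  # set your Guardian API key

# xyz are placeholders for your search term and date range
request.crimesec <- gu_content(query = xyz, from_date = xyz, to_date = xyz)

dim(request.crimesec)  # number of rows (articles) and columns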

16
Q

How do you get the names of the object?

A

names(variable_name)

17
Q

How do you view the object?

A

View(variable_name)

18
Q

How do you get the article data? And from a specific section?

A

variable_name$headline[value]

variable_name$body[value]

Specific section:
variable_name1 <- variable_name[variable_name$section_id == "name of section", ]

dim(variable_name1)

variable_name1$headline
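
A small worked version of the steps above, using a made-up data frame in place of a real API result. The column names headline, body and section_id come from the card; the rows and the object names articles and uk_articles are invented for illustration.

```r
# Made-up stand-in for the data frame returned by an API request
articles <- data.frame(
  section_id = c("uk-news", "world", "uk-news", "technology"),
  headline = c("Headline A", "Headline B", "Headline C", "Headline D"),
  body = c("Body A", "Body B", "Body C", "Body D"),
  stringsAsFactors = FALSE
)

# Headline and body of the second article
articles$headline[2]
articles$body[2]

# Keep only the rows from one section (note the trailing comma: all columns)
uk_articles <- articles[articles$section_id == "uk-news", ]

dim(uk_articles)       # rows and columns in the subset
uk_articles$headline   # headlines in that section
```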

19
Q

How do you get a table of article counts per section?

A

variable <- table(variable_name$section_id)
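
As a quick illustration with made-up values (section_ids, section_counts and sorted_counts are hypothetical names), table() counts how often each section appears, and sort() puts the most frequent first:

```r
# Made-up section labels, one per article
section_ids <- c("uk-news", "world", "uk-news", "technology", "uk-news")

# Count how many articles fall in each section
section_counts <- table(section_ids)
print(section_counts)

# Sort so the most frequent section comes first
sorted_counts <- sort(section_counts, decreasing = TRUE)
names(sorted_counts)[1]
```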

20
Q

What are the advantages and disadvantages of APIs?

A

Advantages:
1. direct, easy, effective and clean access to data
2. availability of programming packages, documentation and online tutorials
3. an API keeps working when the structure of the website changes

Disadvantages:
1. you are at the mercy of the API providers
2. rate limits
3. not all platforms provide APIs (e.g., local newspapers)

21
Q

What is Web Scraping?

A

The process of automatically extracting specific content from a website and transforming it into structured data

22
Q

What are the challenges of web scraping?

A

It is time consuming, it has a learning curve, and some websites may prohibit scraping

23
Q

What is HTML?

A

HTML - Hyper Text Markup Language; the standard markup language that describes the structure of a web page

24
Q

What is CSS?

A

CSS - Cascading Style Sheets; a style sheet language used to describe the presentation of HTML-based documents

25
Q

What is JavaScript?

A

A programming language that adds behaviour to web pages, making them more interactive and engaging for users

26
Q

Describe each of these HTML tags:
1. html
2. head
3. title
4. body
5. p
6. b
7. del
8. ins
9. mark
10. img src
11. ul
12. li

A
  1. root of HTML document
  2. head node of HTML document
  3. page title
  4. body stores content of the HTML document
  5. paragraph
  6. make some text bold
  7. marks text that has been deleted
  8. text that has been inserted
  9. highlights some text
  10. inserts image from source
  11. creates an unordered list
  12. item in a list
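
To see how these tags relate to scraping, here is a rough base-R sketch that pulls the title and the list items out of a tiny made-up page and stores them as structured data. extract_tag is a hypothetical helper based on a simple regular expression. It ignores attributes and nesting, so for real pages a proper HTML parser (for example the rvest package) is the usual choice.

```r
# A tiny made-up page using some of the tags above
page <- paste0(
  "<html><head><title>Crime news</title></head>",
  "<body><p>Latest <b>headlines</b>:</p>",
  "<ul><li>Headline A</li><li>Headline B</li></ul>",
  "</body></html>"
)

# Return the text inside every <tag>...</tag> pair
# (toy approach: no attributes, no nested tags of the same name)
extract_tag <- function(html, tag) {
  pattern <- sprintf("<%s>(.*?)</%s>", tag, tag)
  matches <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
  gsub("<[^>]+>", "", matches)  # strip the tags, keep the text
}

page_title <- extract_tag(page, "title")
items <- extract_tag(page, "li")

# Structured result: one row per list item
scraped <- data.frame(title = page_title, headline = items,
                      stringsAsFactors = FALSE)
print(scraped)
```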
27
Q

What are the 3 ways to apply CSS to HTML documents?

A
  1. inline - inside the HTML element using the style attribute
  2. internally - part of the HTML document within the head section
  3. externally - in a separate CSS file that is linked from the HTML document
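
The three approaches can be seen side by side in one small page. The sketch below uses R only to write the files to a temporary folder. The file names index.html and style.css, the folder name and the colours are all arbitrary choices for illustration.

```r
# Write the demo files to a fresh temporary folder
out_dir <- tempfile("css_demo")
dir.create(out_dir)

# 3. External: the rules live in a separate file that the page links to
writeLines("li { color: green; }", file.path(out_dir, "style.css"))

page <- c(
  "<html>",
  "<head>",
  "<title>CSS demo</title>",
  # 3. External: link to the separate CSS file
  '<link rel="stylesheet" href="style.css">',
  # 2. Internal: rules inside a style element in the head section
  "<style> p { color: blue; } </style>",
  "</head>",
  "<body>",
  "<p>Styled by the internal rule</p>",
  # 1. Inline: the style attribute on the element itself
  '<p style="color: red;">Styled inline</p>',
  "<ul><li>Styled by the external file</li></ul>",
  "</body>",
  "</html>"
)
writeLines(page, file.path(out_dir, "index.html"))

list.files(out_dir)
```

Opening index.html from that folder in a browser should show one paragraph in each colour and a green list item.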