Week 1 - Web Data Collection I Flashcards
What is Data Science?
Any field of research that involves the processing of large amounts of data in order to provide insights into real-world problems
Why do we need Data Science?
Because we address a diverse range of crime problems using data. DS provides powerful computational techniques to model data and uncover hidden patterns. We are critical consumers of data science.
Why is DS useful?
- Develop deeper understanding of issues
- Evidence-based informed decisions
- Build effective applications to prevent crime
Why is DS in high demand within crime and security?
Skill shortage: the number of people with specialist data skills in the UK is not sufficient to meet demand
What are examples of natural language (text) data?
news, websites, social media, repositories of electronic papers, blog posts, underground marketplaces, e-commerce sites
What are the possible threats posed by evolving data science tools?
Creating malware, generating fake news, phishing scams, automating fraudulent activities, social engineering, identity theft
How can you apply web data?
However suits your purpose: people are connected through technology on the web, and the data they leave there can be put to many different uses
What are common and heightened security threats?
- Human trafficking and child exploitation
- Deepfake detection
- Fraud detection
How do you obtain data from the web?
- Direct download - the data is available in a downloadable format and open to the public
- Manual approaches - time consuming, prone to errors, and not practical
- Automated approaches - application programming interface (API) and web scraping
What is an API?
API stands for Application Programming Interface.
It is a collection of programming code to help software programs to communicate with each other and retrieve data. APIs enable data transmission in an automatable and efficient manner.
How do APIs work?
- Make a request
- The client will send the request to the API server
- API server processes the request, retrieves the data from the database
- Sends the data to the client
- The response message can be plain text, HTML, XML, or JSON
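The request and response steps above can be sketched in R as a toy simulation. Nothing here touches a real API: `database`, `api_server` and the article names are hypothetical stand-ins, and a real client would send the request over the network instead of calling a local function.

```r
# Toy "database" held by the hypothetical API server
database <- list(
  crime = c("Article A", "Article B"),
  security = c("Article C")
)

# Server side: process the request, retrieve the data, build a response
api_server <- function(request) {
  records <- database[[request$query]]
  if (is.null(records)) {
    # Nothing stored under that query
    return(list(status = 404, body = character(0)))
  }
  list(status = 200, body = records)
}

# Client side: make a request, send it to the server, read the response
request <- list(query = "crime")
response <- api_server(request)
print(response$status)
print(response$body)
```

In a real exchange the body would usually arrive as JSON or XML text and need parsing before use.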
What are the different types of APIs?
Public APIs, partner APIs, private APIs
What are API protocols?
They define how programs interact with and understand each other; they set the rules, restrictions and limitations of communication
What are some tools for communicating with APIs from R?
Some platforms provide their own R package for their API. These packages wrap the API calls for the service into a set of easy-to-use functions.
What are the restrictions of APIs?
- Most APIs will have some form of access control.
- APIs have rate limits that control usage and manage traffic
- Rate limits are often calculated in Requests Per Second
- Restrictions on what you can access and how much data you can access, and what you can do with the data (copyright)
- Might charge a fee
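One way to stay under a rate limit expressed in requests per second is to pause between calls. This is a minimal sketch: the limit of 2 requests per second is an assumed value, and `fake_request` and `fetch_pages` are hypothetical helpers standing in for real API calls.

```r
# Assumed limit for illustration only; check the provider's documentation
requests_per_second <- 2
min_gap <- 1 / requests_per_second  # seconds to wait between requests

# Hypothetical stand-in for a real API call
fake_request <- function(page) paste("result for page", page)

fetch_pages <- function(pages, gap) {
  results <- character(length(pages))
  for (i in seq_along(pages)) {
    results[i] <- fake_request(pages[i])
    # Pause before the next request so we stay under the limit
    if (i < length(pages)) Sys.sleep(gap)
  }
  results
}

results <- fetch_pages(1:3, min_gap)
print(results)
```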
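How would you install the package and request data?
install.packages("guardianapi")  # install once
library(guardianapi)             # load the package
gu_api_key()                     # enter your Guardian API key when prompted
# replace each xyz with your own search query and dates
request.crimesec <- gu_content(query = xyz, from_date = xyz, to_date = xyz)
dim(request.crimesec)            # number of rows (articles) and columns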
How do you get the column names of the object?
names(variable_name)
How do you view the data?
View(variable_name)
How do you get the article data? And from a specific section?
variable_name$headline[value]
variable_name$body[value]
Specific section:
section_data <- variable_name[variable_name$section_id == "name of section", ]
dim(section_data)
section_data$headline
How do you get a table of article counts per section?
section_table <- table(variable_name$section_id)
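The inspection, subsetting and tabulating steps from the last few cards can be tried without an API key on a toy data frame. The data below is made up; only the column names `headline`, `body` and `section_id` follow the cards above.

```r
# Toy data standing in for an API result; all values are made up
articles <- data.frame(
  headline = c("Headline one", "Headline two", "Headline three"),
  body = c("Body one", "Body two", "Body three"),
  section_id = c("uk-news", "technology", "uk-news"),
  stringsAsFactors = FALSE
)

names(articles)          # column names
dim(articles)            # number of rows and columns
articles$headline[1]     # headline of the first article

# Keep only the rows from one section (note the trailing comma: all columns)
uk_news <- articles[articles$section_id == "uk-news", ]
dim(uk_news)
uk_news$headline

# Count articles per section
section_counts <- table(articles$section_id)
print(section_counts)
```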
What are the advantages and disadvantages of APIs?
Advantages:
1. direct, easy, effective and clean access to data
2. availability of programming packages, documentation and online tutorials
3. you can keep using the API when the structure of the website changes
Disadvantages:
1. you are at the mercy of the API providers
2. rate limits
3. not all platforms provide APIs (e.g., local newspapers)
What is Web Scraping?
The process of automatically extracting specific content from a website and transforming it into structured data
Challenges of web scraping?
It is time consuming, has a learning curve, and some websites may prohibit scraping
What is HTML?
HTML - Hyper Text Markup Language; the standard markup language that describes the structure of a web page
What is CSS?
CSS - Cascading Style Sheets; a style sheet language used to describe the presentation of HTML-based documents
What is JavaScript?
A programming language that adds behaviour to web pages, making them more interactive and engaging for users
Give the description of each HTML tag:
1. html - root of the HTML document
2. head - head node of the HTML document
3. title - page title
4. body - stores the content of the HTML document
5. p - paragraph
6. b - makes some text bold
7. del - marks text that has been deleted
8. ins - marks text that has been inserted
9. mark - highlights some text
10. img - inserts an image from the source given in its src attribute
11. ul - creates an unordered list
12. li - an item in a list
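To see how the tags listed above fit together, and how scraping turns them into structured data, here is a minimal base R sketch. The page content is made up and `extract_tag` is a hypothetical helper; a regular expression like this only copes with simple, attribute-free tags on a single line, so real scraping would normally use a proper HTML parser such as the rvest package.

```r
# A small HTML page built from the tags above; the content is made up
page <- paste(
  "<html>",
  "<head><title>Example page</title></head>",
  "<body>",
  "<p>Some <b>bold</b> and <mark>highlighted</mark> text.</p>",
  "<ul><li>First item</li><li>Second item</li></ul>",
  "</body>",
  "</html>",
  sep = "\n"
)

# Pull out the text between an opening and closing tag
extract_tag <- function(html, tag) {
  pattern <- paste0("<", tag, ">(.*?)</", tag, ">")
  matches <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
  # Strip the opening and closing tags, keeping only the content
  gsub(paste0("</?", tag, ">"), "", matches)
}

title <- extract_tag(page, "title")
items <- extract_tag(page, "li")

# Structured result: one row per list item
scraped <- data.frame(page_title = title, item = items)
print(scraped)
```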
What are the 3 ways to apply CSS to HTML documents?
- inline - inside the HTML element using the style attribute
- internally - part of the HTML document within the head section
- externally - in a separate CSS file that is linked from the HTML document