12 Web Scraping Flashcards

1
Q

Name 4 modules that are useful for web scraping

A
  1. webbrowser: Comes with Python and opens a browser to a specific page.
  2. requests: Downloads files and web pages from the internet.
  3. bs4: Parses HTML, the format that web pages are written in.
  4. selenium: Launches and controls a web browser. The selenium module is able to fill in forms and simulate mouse clicks in this browser.
2
Q

How do you open a new browser tab from a program?

A

>>> import webbrowser
>>> webbrowser.open('link')

3
Q

How can you

  1. download a webpage's content?
  2. check whether the download worked?
  3. check this more easily and stop the program if it failed?
  4. When should you check whether the download worked?
A

1.
>>> import requests
>>> res = requests.get('link')

2.
Check whether
>>> res.status_code == 200

3.
>>> res.raise_for_status()
You want to fail fast, e.g. when the download didn't work.

4.
Always after calling requests.get()! ALWAYS!
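A minimal sketch of this fail-fast pattern, assuming a placeholder URL; raise_for_status() raises requests.exceptions.HTTPError on a bad status code:

>>> import requests
>>> res = requests.get('https://example.com/does-not-exist')
>>> try:
...     res.raise_for_status()
... except requests.exceptions.HTTPError as error:
...     print(f'There was a problem: {error}')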
4
Q

How can you write a website's content to a file?

A

Using the Response object's iter_content() method!

  1. Call requests.get() to download the file.
  2. Call open() with ‘wb’ to create a new file in write binary mode.
  3. Loop over the Response object´s iter_content() method.
  4. Call write() on each iteration to write the content to the file.
  5. Call close() on the file.
>>> res = requests.get('link')
>>> res.raise_for_status()
>>> text = open('newFile.txt', 'wb')
>>> for chunk in res.iter_content(100000):
...     text.write(chunk)
>>> text.close()
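As a sketch, the same steps can be written with a with statement so the file is closed automatically (URL and filename are placeholders):

import requests

res = requests.get('https://example.com/page.html')
res.raise_for_status()
with open('newFile.txt', 'wb') as text:
    for chunk in res.iter_content(100000):
        text.write(chunk)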
5
Q

What is parsing?

What module is good for it?

A

Analyzing and identifying the parts of an HTML document

bs4 (BeautifulSoup)

6
Q

How do you create a bs4 object from HTML?

A
>>> import bs4, requests
>>> res = requests.get('link.com')  # also possible: an HTML file from disk
>>> res.raise_for_status()
>>> soupObj = bs4.BeautifulSoup(res.text, 'html.parser')
>>> type(soupObj)
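A minimal sketch of the second option mentioned in the comment above, parsing a local HTML file (the filename is a placeholder):

>>> import bs4
>>> exampleFile = open('example.html')
>>> soupObj = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
>>> exampleFile.close()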
7
Q

How can you find an element with bs4? What does it return?

What are selectors similar to?

A

Use the select method on the bs4 object! It will return a list of ‘Tag’ objects (one tag for every match!)

>>> elem = soupObj.select('selector')

These selectors work like regex objects, identifying certain patterns. Different selectors can also be combined, e.g.:

>>> soupObj.select('p #author')

Matches every element with the id 'author' that is inside a <p> element.
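A self-contained sketch of such a combined selector; the HTML string is invented for illustration:

>>> import bs4
>>> html = '<p>Written by <span id="author">Jane Doe</span></p>'
>>> soupObj = bs4.BeautifulSoup(html, 'html.parser')
>>> soupObj.select('p #author')
[<span id="author">Jane Doe</span>]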

8
Q

Name example CSS selectors. How do you select:

  1. an element
  2. an element with an id attribute
  3. all elements that use a css class attr
  4. an element within another element
  5. an element that is directly within another element (without any other elems in between)
  6. all elements that have an attribute with any value
  7. all elements that have an attribute with a specified value
A

1. an element
>>> soupObj.select('element')

2. an element with an id attribute
>>> soupObj.select('#idAttribute')  # e.g. '#author'

3. all elements that use a CSS class attribute
>>> soupObj.select('.cssClassAttr')

4. an element within another element
>>> soupObj.select('element1 element2')

5. an element that is directly within another element (without any other elements in between)
>>> soupObj.select('element1 > element2')

6. all elements that have an attribute with any value
>>> soupObj.select('element[attribute]')

7. all elements that have an attribute with a specified value
>>> soupObj.select('element[attribute="valueXY"]')
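A short, self-contained sketch of a few of these selectors in action; the HTML string is invented for illustration:

>>> import bs4
>>> html = '<div class="note"><p id="intro">Hello <span class="note">world</span></p></div>'
>>> soupObj = bs4.BeautifulSoup(html, 'html.parser')
>>> soupObj.select('#intro')   # id selector
>>> soupObj.select('.note')    # class selector
>>> soupObj.select('div > p')  # direct child
>>> soupObj.select('p[id]')    # any id attribute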
9
Q

How can you select Tag objects and store them in a variable 'elem'? Once you did that, how do you:

  1. Find out how many matches you have?
  2. Create a string of the Tag object?
  3. Get the element's inner HTML?
  4. Get a dictionary of the element's attributes?
  5. Get the value of a certain attribute, e.g. a link's href?
A

Using the select method:
>>> elem = soup.select('whatever')

  1. Find out how many matches you have
    >>> len(elem)
  2. Create a string of the Tag object
    >>> str(elem[i])
  3. Get the element's inner HTML
    >>> elem[i].text
  4. Get a dictionary of the element's attributes
    >>> elem[i].attrs
  5. Get the value of a certain attribute, e.g. a link's href
    >>> elem[i].get('href')
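A minimal sketch tying these together; the HTML string is invented for illustration:

>>> import bs4
>>> soup = bs4.BeautifulSoup('<a href="https://example.com">Example</a>', 'html.parser')
>>> elem = soup.select('a')
>>> len(elem)
1
>>> str(elem[0])
'<a href="https://example.com">Example</a>'
>>> elem[0].text
'Example'
>>> elem[0].attrs
{'href': 'https://example.com'}
>>> elem[0].get('href')
'https://example.com'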
10
Q
  1. How can you open a browser page with selenium?
  2. Name a (dis)advantage of selenium
  3. What is the structure for finding
    a. The first element
    b. All elements
    c. What are the return values?
A

1.
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('websiteLink')

2. Pro: selenium's 'user-agent' string passes as more 'human'
   Con: slower

3. a. find_element_* --> a single WebElement object
   b. find_elements_* --> a list of WebElement objects
   c. the return values are as listed under a. and b.
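A minimal sketch of both method families, assuming a placeholder URL and the older find_element_by_* naming used throughout this deck (newer Selenium versions use find_element(By.TAG_NAME, ...) instead):

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('https://example.com')
>>> first = browser.find_element_by_tag_name('a')        # single WebElement object
>>> all_links = browser.find_elements_by_tag_name('a')   # list of WebElement objects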
11
Q

Using selenium, which methods help you find:

  1. Elements that use the CSS class name
  2. Elements that match the CSS selector
  3. Elements with a matching id attribute value
  4. <a> elements that completely match the text provided
  5. <a> elements that contain the text provided
  6. Elements with a matching name attribute value
  7. Elements with a matching tag name (case-insensitive; an <a> element is matched by 'a' and 'A')

What do you need to be careful about?

A

  1. Elements that use the CSS class name
    >>> browser.find_element_by_class_name(name)
  2. Elements that match the CSS selector
    >>> browser.find_element_by_css_selector(selector)
  3. Elements with a matching id attribute value
    >>> browser.find_element_by_id(id)
  4. <a> elements that completely match the text provided
    >>> browser.find_element_by_link_text(text)
  5. <a> elements that contain the text provided
    >>> browser.find_element_by_partial_link_text(text)
  6. Elements with a matching name attribute value
    >>> browser.find_element_by_name(name)
  7. Elements with a matching tag name (case-insensitive)
    >>> browser.find_element_by_tag_name(name)

Every method is case sensitive except 7.
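A brief sketch using two of these methods (URL and link text are placeholders; newer Selenium versions replace find_element_by_* with find_element(By.ID, ...) and friends):

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('https://example.com')
>>> heading = browser.find_element_by_tag_name('h1')             # matching tag name
>>> links = browser.find_elements_by_partial_link_text('More')   # links containing 'More'
>>> heading.text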

12
Q

Name 8 selenium WebElement attributes and methods (the less important ones)

A
tag_name
get_attribute(name)
text
clear()
is_displayed()
is_enabled()
is_selected()
location
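A minimal sketch reading a few of these from a found element (URL is a placeholder; uses the older find_element_by_* style used in the rest of this deck):

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('https://example.com')
>>> elem = browser.find_element_by_tag_name('a')
>>> elem.tag_name               # 'a'
>>> elem.get_attribute('href')  # value of the href attribute
>>> elem.text                   # the element's inner text
>>> elem.is_displayed()         # True if visible
>>> elem.location               # dict with 'x' and 'y' keys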
13
Q

Give an example of logging in with selenium

A

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('https://login.metafilter.com')
>>> userElem = browser.find_element_by_id('user_name')
>>> userElem.send_keys('your_real_username_here')
>>> passwordElem = browser.find_element_by_id('user_pass')
>>> passwordElem.send_keys('your_real_password_here')
>>> passwordElem.submit()

14
Q

Using selenium, how can you:

  1. type in keys
  2. click
  3. submit
A

First step: always find the element using the selenium methods and save it to a variable, e.g.:

>>> userElem = browser.find_element_by_XY('XY')

  1. type in keys
    Usually, you have to find the <input> or <textarea> element first
    >>> userElem.send_keys('username')
  2. click
    >>> userElem.click()
  3. submit
    >>> userElem.submit()
15
Q

Using selenium, which methods allow you to:

  1. Go back
  2. Go forward
  3. Refresh
  4. Quit
A
browser = the WebDriver object saved to a variable

  1. Go back
    >>> browser.back()
  2. Go forward
    >>> browser.forward()
  3. Refresh
    >>> browser.refresh()
  4. Quit
    >>> browser.quit()
16
Q

Using selenium, how can you send keys that cannot be put into a string value?

E.g.:

  1. Scroll to the top
  2. Scroll to the bottom
A

“Special keys” can be used with an additional selenium module!

>>> from selenium import webdriver
>>> from selenium.webdriver.common.keys import Keys

(elem below is the element the keys are sent to)

  1. Scroll to the top
    >>> elem.send_keys(Keys.HOME)
  2. Scroll to the bottom
    >>> elem.send_keys(Keys.END)
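A minimal sketch, following the common pattern of sending these keys to the page's <html> element (URL is a placeholder):

>>> from selenium import webdriver
>>> from selenium.webdriver.common.keys import Keys
>>> browser = webdriver.Firefox()
>>> browser.get('https://example.com')
>>> htmlElem = browser.find_element_by_tag_name('html')
>>> htmlElem.send_keys(Keys.END)   # scroll to the bottom
>>> htmlElem.send_keys(Keys.HOME)  # scroll to the top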
17
Q

How could you use parsing to print out every link on a website, combined with the element's text, in a format like this:

“Text description: link”

(Youtube example)

A

import requests, bs4

res = requests.get('https://www.youtube.com/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elem = soup.select('a')
for i in range(len(elem)):
    # str() guards against <a> tags without an href (get() returns None)
    print(elem[i].text + ': ' + str(elem[i].get('href')))

18
Q

When downloading something with requests, how can you change (or suppress) messages like "InvalidSchema" or "MissingSchema"?

A

Catch the exception with try/except, e.g.:

>>> try:
...     res = requests.get('/wir/wir-stellen-uns-vor/nachrichten/')
... except (requests.exceptions.MissingSchema, requests.exceptions.InvalidSchema) as error:
...     print(f'Error: {error}.')

Error: Invalid URL '/wir/wir-stellen-uns-vor/nachrichten/': No schema supplied.