12 Web Scraping Flashcards

1
Q

Name 4 modules that are useful for web scraping

A
  1. webbrowser: Comes with Python and opens a browser to a specific page.
  2. requests: Downloads files and web pages from the internet.
  3. bs4: Parses HTML, the format that web pages are written in.
  4. selenium: Launches and controls a web browser. The selenium module is able to fill in forms and simulate mouse clicks in this browser.
2
Q

How do you open a new browser tab from a program?

A

>>> import webbrowser
>>> webbrowser.open('link')

3
Q

How can you

  1. download a webpage's content?
  2. check whether the download worked?
  3. check this more easily and stop the program if it failed?
  4. When should you check whether the download worked?
A

1.
>>> import requests
>>> res = requests.get('link')

2.
Check whether
>>> res.status_code == 200

3.
>>> res.raise_for_status()
You want to fail fast, e.g. when the download didn't work.

4.
Always after calling requests.get()! ALWAYS!
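A minimal sketch of this fail-fast pattern, assuming a placeholder URL; raise_for_status() raises requests.exceptions.HTTPError on a bad status code:

>>> import requests
>>> res = requests.get('https://example.com/does-not-exist')
>>> try:
...     res.raise_for_status()
... except requests.exceptions.HTTPError as error:
...     print(f'There was a problem: {error}')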
4
Q

How can you write a website's content to a file?

A

Using the Response object's iter_content() method!

  1. Call requests.get() to download the file.
  2. Call open() with ‘wb’ to create a new file in write binary mode.
  3. Loop over the Response object´s iter_content() method.
  4. Call write() on each iteration to write the content to the file.
  5. Call close() on the file.
>>> res = requests.get('link')
>>> res.raise_for_status()
>>> text = open('newFile.txt', 'wb')
>>> for chunk in res.iter_content(100000):
...     text.write(chunk)
>>> text.close()
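As a sketch, the same steps can be written with a with statement so the file is closed automatically (URL and filename are placeholders):

import requests

res = requests.get('https://example.com/page.html')
res.raise_for_status()
with open('newFile.txt', 'wb') as text:
    for chunk in res.iter_content(100000):
        text.write(chunk)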
5
Q

What is parsing?

What module is good for it?

A

Analyzing and identifying the parts of an HTML document

bs4 (BeautifulSoup)

6
Q

How do you create a bs4 object from HTML?

A
>>> import bs4, requests
>>> res = requests.get('link.com')  # also possible: an HTML file from disk
>>> res.raise_for_status()
>>> soupObj = bs4.BeautifulSoup(res.text, 'html.parser')
>>> type(soupObj)
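A minimal sketch of the second option mentioned in the comment above, parsing a local HTML file (the filename is a placeholder):

>>> import bs4
>>> exampleFile = open('example.html')
>>> soupObj = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
>>> exampleFile.close()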
7
Q

How can you find an element with bs4? What does it return?

What are selectors similar to?

A

Use the select method on the bs4 object! It will return a list of ‘Tag’ objects (one tag for every match!)

>>> elem = soupObj.select('selector')

These selectors work like regex objects, identifying certain patterns. Different selectors can also be combined, e.g.:

>>> soupObj.select('p #author')

Matches every element with the id 'author' that is inside a <p> element.
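A self-contained sketch of such a combined selector; the HTML string is invented for illustration:

>>> import bs4
>>> html = '<p>Written by <span id="author">Jane Doe</span></p>'
>>> soupObj = bs4.BeautifulSoup(html, 'html.parser')
>>> soupObj.select('p #author')
[<span id="author">Jane Doe</span>]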

8
Q

Name example CSS selectors. How do you select:

  1. an element
  2. an element with an id attribute
  3. all elements that use a css class attr
  4. an element within another element
  5. an element that is directly within another element (without any other elems in between)
  6. all elements that have an attribute with any value
  7. all elements that have an attribute with a specified value
A

1. an element
>>> soupObj.select('element')

2. an element with an id attribute
>>> soupObj.select('#idAttribute')  # e.g. '#author'

3. all elements that use a CSS class attribute
>>> soupObj.select('.cssClassAttr')

4. an element within another element
>>> soupObj.select('element1 element2')

5. an element that is directly within another element (without any other elements in between)
>>> soupObj.select('element1 > element2')

6. all elements that have an attribute with any value
>>> soupObj.select('element[attribute]')

7. all elements that have an attribute with a specified value
>>> soupObj.select('element[attribute="valueXY"]')
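A short, self-contained sketch of a few of these selectors in action; the HTML string is invented for illustration:

>>> import bs4
>>> html = '<div class="note"><p id="intro">Hello <span class="note">world</span></p></div>'
>>> soupObj = bs4.BeautifulSoup(html, 'html.parser')
>>> soupObj.select('#intro')   # id selector
>>> soupObj.select('.note')    # class selector
>>> soupObj.select('div > p')  # direct child
>>> soupObj.select('p[id]')    # any id attribute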
9
Q

How can you select Tag objects and store them in a variable 'elem'? Once you did that, how do you:

  1. Find out how many matches you have?
  2. Create a string of the Tag object?
  3. Get the element's inner HTML?
  4. Get a dictionary of the element's attributes?
  5. Get the value of a certain attribute, e.g. a link's href?
A

Using the select method:
>>> elem = soup.select('whatever')

  1. Find out how many matches you have
    >>> len(elem)
  2. Create a string of the Tag object
    >>> str(elem[i])
  3. Get the element's inner HTML
    >>> elem[i].text
  4. Get a dictionary of the element's attributes
    >>> elem[i].attrs
  5. Get the value of a certain attribute, e.g. a link's href
    >>> elem[i].get('href')
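A minimal sketch tying these together; the HTML string is invented for illustration:

>>> import bs4
>>> soup = bs4.BeautifulSoup('<a href="https://example.com">Example</a>', 'html.parser')
>>> elem = soup.select('a')
>>> len(elem)
1
>>> str(elem[0])
'<a href="https://example.com">Example</a>'
>>> elem[0].text
'Example'
>>> elem[0].attrs
{'href': 'https://example.com'}
>>> elem[0].get('href')
'https://example.com'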
10
Q
  1. How can you open a browser page with selenium?
  2. Name a (dis)advantage of selenium
  3. What is the structure for finding
    a. The first element
    b. All elements
    c. What are the return values?
A

1.
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('websiteLink')

2. Pro: selenium's 'user-agent' string passes as more 'human'
   Con: slower

3. a. find_element_* --> a single WebElement object
   b. find_elements_* --> a list of WebElement objects
   c. the return values are as listed under a. and b.
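A minimal sketch of both method families, assuming a placeholder URL and the older find_element_by_* naming used throughout this deck (newer Selenium versions use find_element(By.TAG_NAME, ...) instead):

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('https://example.com')
>>> first = browser.find_element_by_tag_name('a')        # single WebElement object
>>> all_links = browser.find_elements_by_tag_name('a')   # list of WebElement objects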
11
Q

Using selenium, which methods help you find:

  1. Elements that use the CSS class name
  2. Elements that match the CSS selector
  3. Elements with a matching id attribute value
  4. <a> elements that completely match the text provided
  5. <a> elements that contain the text provided
  6. Elements with a matching name attribute value
  7. Elements with a matching tag name (case-insensitive; an <a> element is matched by 'a' and 'A')

What do you need to be careful about?

A

  1. Elements that use the CSS class name
    >>> browser.find_element_by_class_name(name)
  2. Elements that match the CSS selector
    >>> browser.find_element_by_css_selector(selector)
  3. Elements with a matching id attribute value
    >>> browser.find_element_by_id(id)
  4. <a> elements that completely match the text provided
    >>> browser.find_element_by_link_text(text)
  5. <a> elements that contain the text provided
    >>> browser.find_element_by_partial_link_text(text)
  6. Elements with a matching name attribute value
    >>> browser.find_element_by_name(name)
  7. Elements with a matching tag name (case-insensitive)
    >>> browser.find_element_by_tag_name(name)

Every method is case sensitive except 7.
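A brief sketch using two of these methods (URL and link text are placeholders; newer Selenium versions replace find_element_by_* with find_element(By.ID, ...) and friends):

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('https://example.com')
>>> heading = browser.find_element_by_tag_name('h1')             # matching tag name
>>> links = browser.find_elements_by_partial_link_text('More')   # links containing 'More'
>>> heading.text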

12
Q

Name 8 selenium WebElement attributes and methods (the less important ones)

A
tag_name
get_attribute(name)
text
clear()
is_displayed()
is_enabled()
is_selected()
location
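A minimal sketch reading a few of these from a found element (URL is a placeholder; uses the older find_element_by_* style used in the rest of this deck):

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('https://example.com')
>>> elem = browser.find_element_by_tag_name('a')
>>> elem.tag_name               # 'a'
>>> elem.get_attribute('href')  # value of the href attribute
>>> elem.text                   # the element's inner text
>>> elem.is_displayed()         # True if visible
>>> elem.location               # dict with 'x' and 'y' keys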
13
Q

Give an example of logging in with selenium

A

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('https://login.metafilter.com')
>>> userElem = browser.find_element_by_id('user_name')
>>> userElem.send_keys('your_real_username_here')
>>> passwordElem = browser.find_element_by_id('user_pass')
>>> passwordElem.send_keys('your_real_password_here')
>>> passwordElem.submit()

14
Q

Using selenium, how can you:

  1. type in keys
  2. click
  3. submit
A

First step: always find the element using the selenium methods and save it to a variable, e.g.:

>>> userElem = browser.find_element_by_XY('XY')

  1. type in keys
    Usually, you have to find the <input> or <textarea> element first
    >>> userElem.send_keys('username')
  2. click
    >>> userElem.click()
  3. submit
    >>> userElem.submit()
15
Q

Using selenium, which methods allow you to:

  1. Go back
  2. Go forward
  3. Refresh
  4. Quit
A
browser = the WebDriver object saved to a variable

  1. Go back
    >>> browser.back()
  2. Go forward
    >>> browser.forward()
  3. Refresh
    >>> browser.refresh()
  4. Quit
    >>> browser.quit()
16
Q

Using selenium, how can you send keys that cannot be put into a string value?

E.g.:

  1. Scroll to the top
  2. Scroll to the bottom
A

“Special keys” can be used with an additional selenium module!

>>> from selenium import webdriver
>>> from selenium.webdriver.common.keys import Keys

(elem below is the element the keys are sent to)

  1. Scroll to the top
    >>> elem.send_keys(Keys.HOME)
  2. Scroll to the bottom
    >>> elem.send_keys(Keys.END)
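A minimal sketch, following the common pattern of sending these keys to the page's <html> element (URL is a placeholder):

>>> from selenium import webdriver
>>> from selenium.webdriver.common.keys import Keys
>>> browser = webdriver.Firefox()
>>> browser.get('https://example.com')
>>> htmlElem = browser.find_element_by_tag_name('html')
>>> htmlElem.send_keys(Keys.END)   # scroll to the bottom
>>> htmlElem.send_keys(Keys.HOME)  # scroll to the top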
17
Q

How could you use parsing to print out every link on a website, combined with the element's text, in a format like this:

“Text description: link”

(Youtube example)

A

import requests, bs4

res = requests.get('https://www.youtube.com/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elem = soup.select('a')
for i in range(len(elem)):
    # str() guards against <a> tags without an href (get() returns None)
    print(elem[i].text + ': ' + str(elem[i].get('href')))

18
Q

When downloading something with requests, how can you change (or suppress) messages like "InvalidSchema" or "MissingSchema"?

A

Catch the exception with try/except, e.g.:

>>> try:
...     res = requests.get('/wir/wir-stellen-uns-vor/nachrichten/')
... except (requests.exceptions.MissingSchema, requests.exceptions.InvalidSchema) as error:
...     print(f'Error: {error}.')

Error: Invalid URL '/wir/wir-stellen-uns-vor/nachrichten/': No schema supplied.