12 Web Scraping Flashcards
Name 4 modules that are useful for web scraping
- webbrowser: Comes with Python and opens a browser to a specific page.
- requests: Downloads files and web pages from the internet.
- bs4: Parses HTML, the format that web pages are written in.
- selenium: Launches and controls a web browser. The selenium module is able to fill in forms and simulate mouse clicks in this browser.
How do you open a new browser tab from a program?
>>> import webbrowser
>>> webbrowser.open('link')
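How can you
- download a web page's content?
- check whether the download worked?
- check it more easily and stop the program on failure?
- When should you check whether the download worked?
1. Download the page:
>>> import requests
>>> res = requests.get('link')
2. Check whether the status code is 200 (OK):
>>> res.status_code == 200
3. Call raise_for_status(), which raises an exception if the download failed and does nothing if it succeeded:
>>> res.raise_for_status()
4. Always, right after every call to requests.get(). You want the program to fail fast as soon as a download did not work.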
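The fail-fast check above can be sketched as a pair of small helpers. This is a minimal sketch that assumes the requests package is installed; the names download and safe_download and the 10-second timeout are illustrative choices, not part of the cards.

```python
import requests


def download(url, timeout=10):
    """Download a page and fail fast if anything went wrong."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    return res


def safe_download(url):
    """Return the Response, or None if the download failed."""
    try:
        return download(url)
    except requests.exceptions.RequestException as exc:
        # RequestException is the base class of all requests errors,
        # including the HTTPError raised by raise_for_status().
        print(f"There was a problem: {exc}")
        return None
```

Use download() when the program cannot continue without the page, and safe_download() when a failed download should be reported and skipped.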
How can you write a website's content to a file?
Use the Response object's iter_content() method:
- Call requests.get() to download the file.
- Call open() with 'wb' to create a new file in write binary mode.
- Loop over the Response object's iter_content() method.
- Call write() on each iteration to write the content to the file.
- Call close() to close the file.
>>> import requests
>>> res = requests.get('link')
>>> res.raise_for_status()
>>> text = open('newFile.txt', 'wb')
>>> for chunk in res.iter_content(100000):
...     text.write(chunk)
...
>>> text.close()
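The same steps can be written with a with statement, so the file is closed automatically. This is a minimal sketch, assuming the requests package is installed; the function names save_response and download_to_file and the timeout value are hypothetical.

```python
import requests


def save_response(res, path, chunk_size=100_000):
    """Write a Response body to `path` in chunks; return the bytes written."""
    written = 0
    # The with statement closes the file even if a write fails.
    with open(path, "wb") as out_file:
        for chunk in res.iter_content(chunk_size):
            written += out_file.write(chunk)
    return written


def download_to_file(url, path):
    """Download `url` and save it to `path`, failing fast on errors."""
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    return save_response(res, path)
```

Writing in chunks keeps memory use low even for large downloads, because only one chunk is held at a time.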
What is parsing?
What module is good for it?
Parsing means analyzing HTML and identifying its parts.
bs4 (Beautiful Soup)
How do you create a bs4 object from HTML?
>>> import bs4, requests
>>> res = requests.get('link.com')
>>> res.raise_for_status()
>>> # also possible: pass an open HTML file from the hard drive instead of res.text
>>> soupObj = bs4.BeautifulSoup(res.text, 'html.parser')
>>> type(soupObj)
<class 'bs4.BeautifulSoup'>
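To try this without a network connection, the HTML can also come from a plain string. This is a minimal sketch, assuming the beautifulsoup4 package is installed; the HTML snippet is made up for illustration.

```python
import bs4

# Hypothetical HTML standing in for res.text or the contents of a local file.
html = """
<html>
  <head><title>Example page</title></head>
  <body><p>Hello, <strong>world</strong>!</p></body>
</html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
page_title = soup.title.string       # text of the <title> element
first_paragraph = soup.p.getText()   # text of the first <p>, tags stripped
print(type(soup).__name__, page_title, first_paragraph)
```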
How can you find an element with bs4? What does it return?
What are selectors similar to?
Use the select() method on the bs4 object. It returns a list of Tag objects, one for every match.
>>> elems = soupObj.select('selector')
Selectors are similar to regular expressions: both specify a pattern to look for. Selectors can also be combined, e.g.:
>>> soupObj.select('p #author')
Matches every element with an id attribute of author that is inside a <p> element.
Name example CSS selectors. How do you select:
- an element
- an element with an id attribute
- all elements that use a CSS class attribute
- an element within another element
- an element that is directly within another element (without any other elements in between)
- all elements that have an attribute with any value
- all elements that have an attribute with a specified value
1. an element
>>> soupObj.select('element')
2. an element with an id attribute
>>> soupObj.select('#idAttribute')  # e.g. '#author'
3. all elements that use a CSS class attribute
>>> soupObj.select('.cssClassAttr')
4. an element within another element
>>> soupObj.select('element1 element2')
5. an element that is directly within another element (without any other elements in between)
>>> soupObj.select('element1 > element2')
6. all elements that have an attribute with any value
>>> soupObj.select('element[attribute]')
7. all elements that have an attribute with a specified value
>>> soupObj.select('element[attribute="valueXY"]')
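The selector patterns above can be tried on a small document. This is a minimal sketch, assuming beautifulsoup4 is installed; the HTML and the names selectors and counts are invented for illustration.

```python
import bs4

# Hypothetical HTML with one example for each selector pattern.
html = """
<div id="main">
  <p class="intro">Written by <span id="author">Al</span></p>
  <p>See the <a href="https://example.com">site</a></p>
  <div><span>nested</span></div>
  <input name="q" type="text"/>
</div>
<span id="outside">not in a div</span>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

selectors = [
    "p",                   # every <p> element
    "#author",             # the element whose id is "author"
    ".intro",              # elements using the CSS class "intro"
    "div span",            # <span> anywhere inside a <div>
    "div > span",          # <span> directly inside a <div>
    "input[name]",         # <input> with a name attribute of any value
    'input[type="text"]',  # <input> whose type attribute is "text"
    "p #author",           # combined: id "author" inside a <p>
]

# Count how many Tag objects each selector returns.
counts = {selector: len(soup.select(selector)) for selector in selectors}
for selector, count in counts.items():
    print(f"{selector!r}: {count} match(es)")
```

Comparing the counts for "div span" and "div > span" shows the difference between "anywhere inside" and "directly inside".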
How can you find tags and store them in a variable 'elem'? Once you have done that, how do you:
- Find out how many matches you have?
- Get a tag object as a string?
- Get the element's inner text?
- Get a dictionary of the element's attributes (such as its id)?
- Get the value of a single attribute, e.g. the href of a link?
Using the select method:
>>> elem = soupObj.select('whatever')
1. Find out how many matches you have
>>> len(elem)
2. Get a tag object as a string
>>> str(elem[i])
3. Get the element's inner text
>>> elem[i].text  # same as elem[i].getText()
4. Get a dictionary of the element's attributes
>>> elem[i].attrs
5. Get the value of a single attribute
>>> elem[i].get('attributeName')
e.g.:
>>> elem[i].get('href')
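A short worked example of these calls on a single tag. This is a minimal sketch, assuming beautifulsoup4 is installed; the HTML string and variable names are made up.

```python
import bs4

html = '<p>Written by <span id="author">Al Sweigart</span></p>'
soup = bs4.BeautifulSoup(html, "html.parser")

elems = soup.select("#author")
match_count = len(elems)         # number of matches
as_string = str(elems[0])        # the tag, including its opening and closing tags
inner_text = elems[0].getText()  # the text between the tags
attributes = elems[0].attrs      # dictionary of all attributes
author_id = elems[0].get("id")   # value of one attribute
missing = elems[0].get("href")   # None, because the attribute does not exist
```

Because get() returns None for a missing attribute instead of raising an error, it is the safer choice when not every matched tag carries the attribute.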
- How can you open a browser page with selenium?
- Name a (dis)advantage of selenium
- What is the structure for finding
a. The first element
b. All elements
c. What are the return values?
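1.
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('websiteLink')
2. Pro: The user-agent string of a selenium-controlled browser passes as more 'human'.
Con: Slower than requests, because a full browser is launched.
3. a. find_element_* --> a single WebElement object (the first match)
b. find_elements_* --> a list of WebElement objects (one for every match)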
Using selenium, which methods help you find:
1. Elements that use the CSS class name
2. Elements that match the CSS selector
3. Elements with a matching id attribute value
4. <a> elements that completely match the text provided
5. <a> elements that contain the text provided
6. Elements with a matching name attribute value
7. Elements with a matching tag name (case-insensitive; an <a> element is matched by 'a' and 'A')
What do you need to be careful about?
Using selenium, how do you find:
1. Elements that use the CSS class name
>>> browser.find_element_by_class_name(name)
2. Elements that match the CSS selector
>>> browser.find_element_by_css_selector(selector)
3. Elements with a matching id attribute value
>>> browser.find_element_by_id(id)
4. <a> elements that completely match the text provided
>>> browser.find_element_by_link_text(text)
5. <a> elements that contain the text provided
>>> browser.find_element_by_partial_link_text(text)
6. Elements with a matching name attribute value
>>> browser.find_element_by_name(name)
7. Elements with a matching tag name (case-insensitive; an <a> element is matched by 'a' and 'A')
>>> browser.find_element_by_tag_name(name)
Be careful: every method is case-sensitive except 7.
Note: Selenium 4.3 and later removed the find_element_by_* methods. The equivalent call is browser.find_element(By.ID, id), with By imported from selenium.webdriver.common.by.
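8 less important selenium WebElement attributes and methods
- tag_name: the tag name, such as 'a' for an <a> element
- get_attribute(name): the value of the element's name attribute
- text: the text within the element
- clear(): clears the text typed into a text field or text area
- is_displayed(): True if the element is visible
- is_enabled(): True if an input element is enabled
- is_selected(): True if a checkbox or radio button is selected
- location: a dictionary with the keys 'x' and 'y' for the element's position on the page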
Give an example of logging in with selenium
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('https://login.metafilter.com')
>>> userElem = browser.find_element_by_id('user_name')
>>> userElem.send_keys('your_real_username_here')
>>> passwordElem = browser.find_element_by_id('user_pass')
>>> passwordElem.send_keys('your_real_password_here')
>>> passwordElem.submit()
Using selenium, how can you:
- type in keys
- click
- submit
First, always find the element using the selenium methods and save it to a variable, e.g.:
>>> userElem = browser.find_element_by_XY('XY')
1. Type in keys (usually, you have to find the <input> or <textarea> element of the text field)
>>> userElem.send_keys('username')
2. Click
>>> userElem.click()
3. Submit the form the element belongs to
>>> userElem.submit()
Using selenium, which methods allow you to:
- Go back
- Go forward
- Refresh
- Quit
browser = the WebDriver object saved to a variable
1. Go back
>>> browser.back()
2. Go forward
>>> browser.forward()
3. Refresh
>>> browser.refresh()
4. Quit
>>> browser.quit()
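The four navigation calls can be combined into one short routine. This is a minimal sketch: the function name visit_and_go_back and its parameters are hypothetical, and browser is assumed to be a selenium WebDriver object such as the one returned by webdriver.Firefox().

```python
def visit_and_go_back(browser, first_url, second_url):
    """Open two pages, step through the history, then quit the browser."""
    try:
        browser.get(first_url)
        browser.get(second_url)
        browser.back()      # back to first_url
        browser.forward()   # forward to second_url again
        browser.refresh()   # reload the current page
        return browser.current_url
    finally:
        browser.quit()      # always close the browser, even after an error
```

Putting quit() in a finally block avoids leaving stray browser windows open when one of the earlier calls raises an exception.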