Web Scraper Project Flashcards
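BeautifulSoup constructor from bs4
syntax and function
BeautifulSoup(markup, parser): takes a string of HTML code and the name of a parser (e.g., 'html.parser') and returns a BeautifulSoup object representing the parsed document.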
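A minimal sketch of the constructor call. The HTML string is made up for illustration, and 'html.parser' (Python's built-in parser) is assumed so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# First argument: the HTML string. Second argument: the parser to use.
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.p.get_text())    # Hello
```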
.li
returns the FIRST OCCURRENCE of a list item tag (<li>) in the html code, or None if there is no <li>
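A short sketch of tag-name access, assuming a made-up two-item list. Any tag name can be used this way, not only li.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example.
html = "<ul><li>First</li><li>Second</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# soup.li gives only the first <li>, even though there are two.
first_item = soup.li
print(first_item.get_text())  # First

# A tag that does not occur in the document gives None.
missing = soup.table
print(missing)  # None
```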
.head attribute from bs4
access it on a variable containing a BeautifulSoup object (e.g., soup.head)
returns the <head> element of the html code (title, meta tags, etc.), not the headings <h1> to <h6>
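A small sketch showing the difference between the <head> element and heading tags. The page below is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this example.
html = """<html><head><title>My Page</title><meta charset="utf-8"></head>
<body><h1>Main Heading</h1></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# soup.head is the <head> element, which holds metadata such as the title.
head = soup.head
print(head.title.string)  # My Page

# Headings live in <body>, so they are not found inside <head>.
heading_in_head = head.find("h1")
print(heading_in_head)  # None

# To get headings, search for the heading tags themselves.
headings = [h.get_text() for h in soup.find_all(["h1", "h2", "h3"])]
print(headings)  # ['Main Heading']
```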
what are html tags (2)
the building blocks of HTML documents, defining the structure and content of the webpage.
Tags are enclosed in angle brackets (< >) and usually come in pairs: an opening tag and a closing tag.
what is the basic structure of an html tag? (3)
opening tag, content, and closing tag, in that order.
what about self-closing tags?
Self-closing tags have no content and no separate closing tag. They can end with a forward slash before the closing angle bracket.
Example: <img src="image.jpg" alt="Image description" />
what is a div tag?
<div>: Defines a division or section in an HTML document.
what is a span tag?
<span>: Defines an inline section in a document, mainly for styling purposes.
what is the body tag?
<body>: Contains the content of the HTML document that is visible to users.
what is the title tag?
<title>: Sets the title of the webpage (displayed in the browser tab).
what is the head tag?
<head>: Contains meta-information about the HTML document (e.g., title, meta tags, links to stylesheets).
what are h1 and h2, etc tags?
<h1> to <h6>: Define headings, with <h1> being the highest level and <h6> the lowest.
what is a p tag?
<p>: Defines a paragraph.
what is an a tag?
<a>: Defines a hyperlink.
what is an img tag?
<img>: Embeds an image. It is a self-closing tag, so it has no closing </img>.
what is an html class? (4)
1 A class is an attribute used to define a group of elements with similar properties. Classes are primarily used for styling and scripting purposes.
2 Reusability: Classes allow you to apply the same styles or behaviors to multiple elements.
3 Multiple Classes: An element can have multiple classes, enabling the combination of different styles and behaviors.
4 CSS and JavaScript Integration: Classes are extensively used in CSS for styling and in JavaScript for dynamic behavior.
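For scraping, classes are a handy way to pick out elements. A rough sketch with bs4, assuming invented class names; class_ is written with a trailing underscore because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are invented for this example.
html = """
<p class="note highlight">One</p>
<p class="note">Two</p>
<p class="other">Three</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches every <p> that has "note" among its classes.
notes = soup.find_all("p", class_="note")
texts = [p.get_text() for p in notes]
print(texts)  # ['One', 'Two']

# An element with multiple classes stores them as a list.
first_classes = notes[0]["class"]
print(first_classes)  # ['note', 'highlight']
```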
what do find and find_all do?
allow you to locate elements based on tag names, attributes, and more.
find method
searches for the first occurrence of a specified tag or element that matches the given criteria.
Common Parameters:
name: The name of the tag to search for (e.g., 'div', 'p', 'a').
attrs: A dictionary of attributes to match (e.g., {'class': 'example'}).
recursive: If True (default), it searches within all descendants. If False, it only searches within direct children.
string: A NavigableString or regular expression to search for text content.
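A sketch of find with each of these parameters. The markup, ids, and class names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<div id="main">
  <p class="intro">Intro text</p>
  <section><p class="example">Nested text</p></section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# name + attrs: the first <div> whose id is "main".
main = soup.find("div", attrs={"id": "main"})

# name only: the first <p> anywhere inside main.
first_p = main.find("p")
print(first_p.get_text())  # Intro text

# attrs: the first <p> with class "example" (nested inside <section>).
example = main.find("p", attrs={"class": "example"})
print(example.get_text())  # Nested text

# recursive=False: only direct children of main are checked,
# so the nested <p class="example"> is not found.
direct_only = main.find("p", attrs={"class": "example"}, recursive=False)
print(direct_only)  # None

# string: search by text content; this returns the text node itself.
by_text = soup.find(string="Intro text")
print(by_text.parent.name)  # p
```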
find_all method (bs4)
searches for all occurrences of a specified tag or element that match the given criteria and returns them as a list.
what is the <ol> tag?
Defines an ordered (numbered) list. Used together with the <li> (list item) tag, which defines each item within the list. Lists can also be nested.
syntax:
<ol>
<li>First item</li>
<li>Second item</li>
<li>Third item</li>
</ol>
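A sketch of pulling the items out of a list like the one above with bs4. The variable names are just for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ol>
  <li>First item</li>
  <li>Second item</li>
  <li>Third item</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the list, then collect the text of every <li> inside it.
ordered = soup.find("ol")
items = [li.get_text(strip=True) for li in ordered.find_all("li")]

# The numbering is not stored in the HTML, so rebuild it with enumerate.
for position, text in enumerate(items, start=1):
    print(position, text)
```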
find_all method syntax (bs4)
soup.find_all(name, attrs, recursive, string, limit, **kwargs)
Common Parameters:
name: The name of the tag to search for.
attrs: A dictionary of attributes to match.
recursive: If True (default), it searches within all descendants. If False, it only searches within direct children.
string: A NavigableString or regular expression to search for text content.
limit: Limits the number of results returned.
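A sketch of find_all with limit, recursive, and string. The markup and link targets are invented for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<div id="box">
  <a href="/one">One</a>
  <p><a href="/two">Two</a></p>
  <a href="/three">Three</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", id="box")

# No extra arguments: every <a> at any depth inside box.
all_links = [a.get_text() for a in box.find_all("a")]
print(all_links)  # ['One', 'Two', 'Three']

# limit=2: stop after the first two matches.
first_two = [a.get_text() for a in box.find_all("a", limit=2)]
print(first_two)  # ['One', 'Two']

# recursive=False: only <a> tags that are direct children of box.
direct = [a.get_text() for a in box.find_all("a", recursive=False)]
print(direct)  # ['One', 'Three']

# string with a regular expression: <a> tags whose text starts with "T".
starts_with_t = [a.get_text() for a in box.find_all("a", string=re.compile("^T"))]
print(starts_with_t)  # ['Two', 'Three']
```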
html attribute
a modifier of an HTML element that provides additional information about the element. Attributes are used to configure elements and can affect their behavior or appearance. They are written inside the opening tag as name="value" pairs, with the value usually in quotes.
common html attributes
id: A unique identifier for the element within the HTML document.
class: Specifies one or more class names for the element, which can be used by CSS and JavaScript.
src: Specifies the source URL of an embedded content like an image or a script.
href: Specifies the URL of a link.
alt: Provides alternative text for an image, which is displayed if the image cannot be loaded.
title: Provides additional information about the element, often displayed as a tooltip when the mouse hovers over the element.
style: Specifies inline CSS styles for an element.
type: Specifies the type of an input element in forms.
what is the generalised format of selecting information using attributes?
soup.find_all(attrs={"attribute_name": "value of attribute"})
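A sketch of the attrs pattern on some invented markup. The data-kind attribute is a made-up example of a custom attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<img src="cat.jpg" alt="A cat">
<img src="dog.jpg" alt="A dog">
<a href="https://example.com" data-kind="external">Example</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Any tag whose alt attribute is exactly "A cat".
cats = soup.find_all(attrs={"alt": "A cat"})
print(cats[0]["src"])  # cat.jpg

# The attrs dictionary also handles names with hyphens, such as data-kind.
external = soup.find_all(attrs={"data-kind": "external"})
print(external[0].get("href"))  # https://example.com

# Once an element is found, read an attribute with tag["name"] or tag.get("name").
srcs = [img["src"] for img in soup.find_all("img")]
print(srcs)  # ['cat.jpg', 'dog.jpg']
```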
soup.select method
allows you to use CSS selectors to locate elements within the parsed HTML document, which can be more flexible and powerful than find or find_all. Returns a list of all matching elements.
what is a css selector?
a pattern used to select and style elements within an HTML document. CSS selectors define which HTML elements a set of CSS rules apply to.
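A sketch of a few common CSS selectors used with soup.select and soup.select_one. The markup, id, and class names are invented for illustration, and this assumes a recent bs4 install where CSS selector support is included.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<div id="content">
  <p class="intro">Welcome</p>
  <ul>
    <li class="item">Alpha</li>
    <li class="item special">Beta</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# tag.class: every <li> with class "item".
items = [li.get_text() for li in soup.select("li.item")]
print(items)  # ['Alpha', 'Beta']

# #id followed by a descendant: <li class="special"> anywhere inside #content.
special = [li.get_text() for li in soup.select("#content li.special")]
print(special)  # ['Beta']

# parent > child: a <p class="intro"> that is a direct child of a <div>.
# select_one returns only the first match (or None).
intro = soup.select_one("div > p.intro")
print(intro.get_text())  # Welcome
```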
what is the .head method?