Detailed docs: the Beautiful Soup 4 Docs.
Throughout, assume t is a Tag object.
Core concepts (classes)
- Tag: a Tag object corresponds to an XML or HTML tag.
- BeautifulSoup: the BeautifulSoup object represents the parsed document as a whole. You can treat it like a special Tag. It needs a parser to parse the document; Python's built-in parser is "html.parser", e.g. soup = BeautifulSoup("<html>a web page</html>", "html.parser")
- NavigableString: a string that corresponds to a bit of text (as you see it in the browser) within a tag. A NavigableString is just like a Python Unicode string, except that it also supports some of the features for navigating and searching the tree.
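To make the three classes concrete, here is a minimal sketch. The sample markup is made up for illustration, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical sample document, used only for illustration.
html = "<html><body><p class='intro'>a web page</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p        # the first <p> tag in the document
text = p.string   # the text inside that tag

print(type(soup).__name__)  # BeautifulSoup
print(type(p).__name__)     # Tag
print(type(text).__name__)  # NavigableString

# BeautifulSoup is a subclass of Tag, and NavigableString is a subclass of str,
# which is why the first behaves like a special Tag and the second like a string.
print(isinstance(soup, Tag))  # True
print(isinstance(text, str))  # True
```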
The Tag class
Object attributes:
- t.name: the tag's name, i.e. the text inside the angle brackets, for example "a" for <a>
- t.attrs: all attributes of the tag as a dict
- t["foo"]: gets the value of the HTML/XML attribute "foo"; set it with t["foo"] = "bar"
- Beautiful Soup presents the value(s) of a multi-valued attribute as a list, e.g. t["class"]
- t.string: the text within a tag, if the tag contains a single string (directly, or through a single nested tag); otherwise None
- t.get_text(): all the human-readable text inside a document or tag, as a single string
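The attributes above can be tried out on a small tag. This is only a sketch: the markup is invented for the example, and t.get() is an extra accessor not covered in the list above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
markup = '<a id="home" class="nav main" href="/">Home <b>page</b></a>'
soup = BeautifulSoup(markup, "html.parser")
t = soup.a

print(t.name)      # a
print(t.attrs)     # a dict holding id, class and href
print(t["href"])   # /
print(t["class"])  # ['nav', 'main']  (multi-valued attribute, so a list)

t["foo"] = "bar"   # add or overwrite an attribute
print(t["foo"])    # bar

# t["missing"] would raise KeyError; t.get() returns None instead.
print(t.get("missing"))  # None

print(t.string)      # None, because <a> has more than one child
print(t.b.string)    # page
print(t.get_text())  # Home page
```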
Navigating the tree
- get the first tag by its name, e.g. t.head, t.title
- get all direct children as a list with t.contents, or iterate over them with the .children generator, e.g. for tag in t.children: pass
- t.parents to iterate over all of an element's parents
- t.previous_sibling, t.next_sibling to go sideways, or t.previous_siblings, t.next_siblings to iterate over siblings
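A short sketch of these navigation attributes, using an invented document. The markup deliberately has no whitespace between tags; with indented HTML, whitespace-only strings also show up as children and siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical document, written without whitespace between tags.
html = ("<html><head><title>Demo</title></head>"
        "<body><p>one</p><p>two</p><p>three</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

print(soup.title)  # <title>Demo</title>

body = soup.body
print(len(body.contents))  # 3
child_names = [child.name for child in body.children]
print(child_names)  # ['p', 'p', 'p']

second = body.contents[1]
print(second.previous_sibling)  # <p>one</p>
print(second.next_sibling)      # <p>three</p>

# .parents walks upward, ending at the BeautifulSoup object itself.
parent_names = [parent.name for parent in second.parents]
print(parent_names)  # ['body', 'html', '[document]']

first = body.contents[0]
following = [sib.get_text() for sib in first.next_siblings]
print(following)  # ['two', 'three']
```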
Searching the tree
Filter types:
- A string
- A regex
- A list
- A function
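Each filter type can be passed as the first argument of find_all(). The sketch below uses an invented document and a hypothetical helper, has_id, to show one filter of each kind.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
html = ("<body><b>bold</b><p id='x'>para</p>"
        "<blockquote>quote</blockquote><a href='/'>link</a></body>")
soup = BeautifulSoup(html, "html.parser")

# A string: matches tags with exactly this name.
by_string = soup.find_all("b")

# A regex: matches tags whose name the pattern finds a match in.
by_regex = soup.find_all(re.compile("^b"))

# A list: matches tags whose name equals any item.
by_list = soup.find_all(["a", "p"])

# A function: called with each tag, keeps those for which it returns True.
def has_id(tag):
    return tag.has_attr("id")

by_func = soup.find_all(has_id)

print([tag.name for tag in by_string])  # ['b']
print([tag.name for tag in by_regex])   # ['body', 'b', 'blockquote']
print([tag.name for tag in by_list])    # ['p', 'a']
print([tag.name for tag in by_func])    # ['p']
```

Results come back in document order, whichever filter type is used.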
Search methods:
- t.find_all() looks through a tag’s descendants and retrieves all descendants that match your filters
- t.select() matches elements by using CSS selectors
- t.find() is like find_all(), but it returns only the first result (or None if nothing matches)
- t.find_parents() and t.find_parent()
- t.find_next_siblings() and t.find_next_sibling()
- t.find_previous_siblings() and t.find_previous_sibling()
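The search methods can be compared side by side on a small, made-up document. This sketch assumes the soupsieve package is available, which select() relies on; it is normally installed together with bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
html = ('<div id="outer"><div id="main"><ul>'
        '<li class="item">one</li>'
        '<li class="item active">two</li>'
        '<li class="item">three</li>'
        '</ul></div></div>')
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")              # every matching descendant
first = soup.find("li")                  # only the first match
missing = soup.find("table")             # None when nothing matches
active = soup.select("li.active")        # CSS selector, returns a list
nested = soup.select("#main > ul > li")  # child combinators work too

second = items[1]
nearest_div = second.find_parent("div")   # closest matching ancestor
all_divs = second.find_parents("div")     # all matching ancestors, nearest first
next_li = second.find_next_sibling("li")
prev_li = second.find_previous_sibling("li")
later = items[0].find_next_siblings("li")

print(len(items), first.get_text(), missing)     # 3 one None
print([tag.get_text() for tag in active])        # ['two']
print(nearest_div["id"])                         # main
print([tag["id"] for tag in all_divs])           # ['main', 'outer']
print(next_li.get_text(), prev_li.get_text())    # three one
print([tag.get_text() for tag in later])         # ['two', 'three']
```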
Debugging
- print(t.prettify()) pretty-prints the tag
- print(t) prints the HTML without beautification
- type(t) shows the type of an object
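A quick sketch of the three debugging helpers on an invented snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
soup = BeautifulSoup("<div><p>hi</p></div>", "html.parser")
t = soup.div

compact = str(t)       # what print(t) shows
pretty = t.prettify()  # one tag or string per line, indented by depth

print(compact)  # <div><p>hi</p></div>
print(pretty)
print(type(t))  # the class of the object, here a Tag
```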