BeautifulSoup

BeautifulSoup is a Python module for extracting data from HTML and XML files.

Installation

Bash

pip install beautifulsoup4

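Internally, bs4 delegates the actual parsing to another parser, and it supports several of them. By default it uses the HTML parser from the Python standard library. The built-in parser is reasonably tolerant of malformed markup but not fast, so lxml is the recommended choice: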

Bash

pip install lxml
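
A script that has to run whether or not lxml is installed can pick the parser at runtime. The following is a minimal sketch of that idea; the helper name make_soup is hypothetical and not part of bs4.

```python
import importlib.util

from bs4 import BeautifulSoup


def make_soup(markup: str) -> BeautifulSoup:
    """Parse markup with lxml when it is installed, else fall back to html.parser."""
    # find_spec returns None when the module cannot be located.
    parser = "lxml" if importlib.util.find_spec("lxml") else "html.parser"
    return BeautifulSoup(markup, parser)


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Different parsers can build different trees from invalid markup, so it is worth pinning one parser explicitly when results must be reproducible.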

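Kinds of objects

BeautifulSoup converts a complex HTML document into a tree of Python objects that are related to one another as siblings, parents and children, ancestors and descendants. In practice you only deal with four kinds of objects, and every operation is performed on one of them:

BeautifulSoup: the root object, the starting point of every operation
Tag: a tag object, corresponding to a tag in the original XML or HTML document
NavigableString: a string object, a piece of text inside a Tag
Comment: a comment

The overall workflow is to build a BeautifulSoup object, obtain the Tag objects you need, and then operate on them.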

Tips

BeautifulSoup is a special kind of Tag object: it represents the root of the whole HTML tree.
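
The four kinds of objects can be seen side by side on a tiny document. This is only an illustrative sketch; the markup is made up for the example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")
text, comment = soup.p.contents  # one text node and one comment node

print(type(soup).__name__)     # BeautifulSoup
print(type(soup.p).__name__)   # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment

# The root object is itself a Tag, and a Comment is a kind of NavigableString.
print(isinstance(soup, Tag))                 # True
print(isinstance(comment, NavigableString))  # True
```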

Tag

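Tag is the most important object. BeautifulSoup is a subclass of Tag; NavigableString and Comment are not (Comment is a subclass of NavigableString). Tag carries a large number of attributes and methods, and almost every operation goes through it.

Among the properties of a Tag, the most important are its name and its attributes, which are also the core building blocks of HTML/XML:

Tag.name: every tag has a name (the BeautifulSoup object counts as a special tag whose name is '[document]')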

Python

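from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name
# 'b'

# Renaming a tag is reflected in the tree built by BeautifulSoup
tag.name = 'blockquote'
tag
# <blockquote class="boldest">Extremely bold</blockquote>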

Tag.attrs: all attributes of the Tag; the value of a single attribute can also be read with Tag[attr_name]

Python

tag = BeautifulSoup('<b id="boldest"></b>', 'html.parser').b
tag['id']
# 'boldest'

tag.attrs
# {'id': 'boldest'}

# Attributes can be modified, and the change is likewise reflected in the tree
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# <b another-attribute="1" id="verybold"></b>

del tag['id']
del tag['another-attribute']
tag
# <b></b>

# Reading a missing attribute raises KeyError; use get() to avoid the exception
tag['id']
# KeyError: 'id'
tag.get('id', None)

# Multi-valued attributes (class, rel, rev, accept-charset, headers, accesskey) are returned as lists
# class is by far the most common; note that a single value still comes back as a list
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']
# ['body']

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ['body', 'strikeout']
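
The attribute access patterns above can be combined into one small, self-contained example. It is a sketch on made-up markup; get_attribute_list is used here as a way to always receive a list, whether or not the attribute is multi-valued.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="home" class="btn primary" href="/">Home</a>', "html.parser")
link = soup.a

print(link.get("title"))              # None: missing attribute, no exception
print(link.get("title", "untitled"))  # untitled: explicit fallback value
print(link["class"])                  # ['btn', 'primary']
print(link.get_attribute_list("id"))  # ['home']: always a list
print(link.has_attr("href"))          # True
```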

NavigableString

A NavigableString corresponds to a piece of text inside a tag (including whitespace such as the line breaks between tags). Tag.string returns the text of a tag as a NavigableString object:

Python

tag = BeautifulSoup('<blockquote>Extremely bold</blockquote>', 'html.parser').blockquote
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

A NavigableString cannot be edited in place, and you cannot assign to it directly; call its replace_with method to swap it for new text:

Python

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

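NavigableString does not inherit from Tag; it is a subclass of Python's str. It cannot have child nodes, so the attributes and methods that deal with children and descendants, such as .contents and find_all(), are not available on it (calling find() on a NavigableString invokes str.find, not a tree search).

Tips

The official documentation gives .string as another example: .string refers to a child string node, and a string cannot contain one.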
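
The following sketch illustrates these points on a one-tag document: the text node is a str rather than a Tag, it knows its parent, and replace_with swaps it inside the tree.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
text = soup.b.string

# A NavigableString is a str that also knows where it sits in the tree.
print(isinstance(text, NavigableString))  # True
print(isinstance(text, str))              # True
print(isinstance(text, Tag))              # False
print(text.parent.name)                   # b

# Replace the node in the tree with new text.
text.replace_with("No longer bold")
print(soup.b)                             # <b>No longer bold</b>

# str() gives a plain string with no reference back to the tree.
plain = str(soup.b.string)
print(type(plain).__name__)               # str
```

Converting to a plain str is useful when the text should outlive the parsed tree, since a NavigableString keeps a reference to its parent.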

Workflow

First, build a BeautifulSoup object:

Python

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

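# The second argument selects the parser
# 'html.parser': Python standard library
# 'lxml': the lxml module
# 'lxml-xml': the lxml XML parser, the only supported option that parses XML
# 'html5lib': the html5lib module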
soup = BeautifulSoup(html_doc, 'html.parser')

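Next, use it to obtain the Tag objects you want. There are several ways to do this:

soup.head returns the first head tag found
soup.find_all(name, attrs, recursive, string) returns every matching Tag object; this is the heart of BeautifulSoup, and its rich set of filter arguments lets you pick out specific Tag objects
From there, follow node relationships to move around the BeautifulSoup tree and reach the Tag you need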

Python

soup.head
# <head><title>The Dormouse's story</title></head>

soup.find_all('b')
# [<b>The Dormouse's story</b>]

Tips

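Tag also has a find_all method for searching its own descendant Tag objects, which makes find_all arguably the most important method in bs4.

Once you have a Tag, read its strings or attributes:

.string -> NavigableString | None: the Tag's own string. Its behaviour is slightly unusual: if the tag has exactly one child and that child is a NavigableString, it is returned; if the tag has exactly one child that is a Tag with a .string of its own, that string is returned, and so on through nested tags. If the tag has more than one child, .string is None. To get only the direct child strings, iterate over the children, e.g. [child for child in outer_div.children if isinstance(child, NavigableString)]
.strings/.stripped_strings -> Iterator: the strings of the Tag and all of its descendants; stripped_strings strips leading and trailing whitespace and skips whitespace-only strings
.text/get_text() -> str: roughly equivalent to ''.join(tag.strings); returns a plain string
.name -> str: the name of the Tag
.attrs -> dict[key, value]: the attributes of the Tag; for multi-valued attributes the value is a list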

Python

soup.head.name
# 'head'

# With stripped_strings, blank lines and leading/trailing whitespace would be removed
for string in soup.body.strings:
    print(repr(string))

'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'

soup.body.a.attrs
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
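
The different ways of reading text are easiest to compare on a small nested document. This is an illustrative sketch with made-up markup, covering the single-child, nested, and multi-child cases of .string.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(
    "<div><p><b>one</b></p><p>two <i>three</i></p></div>", "html.parser"
)
first, second = soup.div.find_all("p", recursive=False)

print(first.string)           # one   (single nested child, so .string passes through)
print(second.string)          # None  (two children: a string and an <i> tag)
print(second.get_text())      # two three
print(list(second.strings))   # ['two ', 'three']

# Only the strings that are direct children of the second <p>.
direct = [c for c in second.children if isinstance(c, NavigableString)]
print(direct)                 # ['two ']

print(list(soup.div.stripped_strings))  # ['one', 'two', 'three']
```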

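Node relationships

A node can loosely be thought of as a Tag object (not strictly accurate, since NavigableString nodes such as line breaks and whitespace are nodes too). The relationships between nodes are:

.parent: the parent node; a node has at most one parent
.parents: iterator: the ancestor nodes; a node can have several ancestors
.next_sibling, .previous_sibling, .next_siblings: iterator, .previous_siblings: iterator: sibling nodes; a node can have several siblings, ordered as previous and next
.contents: list, .children: iterator: child nodes; a Tag can have several children, while a NavigableString has none; note that these contain only the direct children
.descendants: iterator: all descendant nodes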

Python

soup.body.a.parent.name
# 'p'

[tag.name for tag in soup.body.a.parents]
# ['p', 'body', 'html', '[document]']

list(soup.body.a.next_siblings)
# [',\n', # note: this is a NavigableString object
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, # Tag 对象
# ' and\n',
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
# ';\nand they lived at the bottom of a well.']

Tips

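A common blind spot when reasoning about nodes is whitespace such as line breaks. Because it is not wrapped in a tag it is easy to overlook, but bs4 treats it as NavigableString objects. They have no children, but they do have siblings and a parent.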
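
A short sketch of this pitfall, on made-up markup: the node right after a tag is often a newline string, so either filter the children by type or use find_next_sibling to skip to the next tag.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n</ul>", "html.parser")
first = soup.li

# The immediate sibling is the newline between the two <li> tags.
print(repr(first.next_sibling))       # '\n'

# find_next_sibling skips string nodes and returns the next matching tag.
print(first.find_next_sibling("li"))  # <li>b</li>

# Keep only Tag children when whitespace nodes are not wanted.
items = [child for child in soup.ul.children if isinstance(child, Tag)]
print(len(soup.ul.contents), len(items))  # 5 2
```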

next_element/previous_element

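bs4 also tracks the order in which nodes were parsed. The .next_element attribute points to whatever was parsed immediately after the current node (a string or a tag), and .previous_element to whatever was parsed immediately before it. The result may be the same as .next_sibling, but usually it is not.

This is useful when you want to follow document order without caring about nesting.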
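
The difference from .next_sibling shows up as soon as a tag has content. A minimal sketch, using made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>one</a><b>two</b></p>", "html.parser")
a = soup.a

# The sibling is the next node at the same level of the tree.
print(a.next_sibling)               # <b>two</b>

# The next element is whatever was parsed next: the text inside <a>.
print(repr(a.next_element))         # 'one'

# After that text, parsing moved on to the <b> tag.
print(a.string.next_element)        # <b>two</b>
```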

Searching the tree: find_all

The core of bs4, and arguably of any HTML parsing library, is searching the document tree for the nodes you need:

Python

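import re
from typing import Callable

# Parameter summary only; this is a sketch, not the library's source
def find_all(
    name: str | list[str] | Callable | bool | re.Pattern = None,  # filter on the tag name
    attrs: dict[str, str | re.Pattern] = None,  # filter on attributes; needed when an attribute name is not a valid Python identifier
    recursive: bool = True,  # search all descendants (default); pass False to search only direct children
    string: str | list[str] | Callable | bool | re.Pattern = None,  # filter on the tag's text
    limit: int = None,  # maximum number of results to return
    **kwargs,  # also used to filter on attributes
) -> list:  # an empty list if nothing matches
    pass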

Note

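find_all() is the most frequently used method. There is also a family of methods that take the same arguments and search in a particular direction: find() returns the first matching node, find_parents()/find_parent() search the ancestors, and find_next_siblings()/find_next_sibling()/find_previous_siblings()/find_previous_sibling() search the siblings.

Tips

Methods that return multiple results, such as find_all and find_parents, return an empty list when nothing matches. Methods that return a single result, such as find and find_parent, return None.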
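
Because single-result searches return None, chaining an attribute access onto a failed find() raises an error. The sketch below, on made-up markup, shows both return conventions and a simple guard.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="box"><p>Hi <a href="/x">link</a></p></div>', "html.parser"
)

print(soup.find("table"))      # None
print(soup.find_all("table"))  # []

# Search upwards from a node.
link = soup.find("a")
print(link.find_parent("div")["id"])  # box

# Guard against None before using the result of a single-result search.
missing = soup.find("img")
src = missing.get("src") if missing is not None else None
print(src)                     # None
```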

name: a filter that selects tags by name

It accepts several kinds of argument:

Python

# 1. str: the simplest filter is a string, which matches tags with exactly that name
soup.find_all('b')
# [<b>The Dormouse's story</b>]

# 2. list[str]: an extension of str that searches for any of several tag names
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# 3. Regular expression: matched against the tag name
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# 4. bool: only True is meaningful; it matches every Tag, but note that it does not return string nodes
for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

# 5. callable: when no other filter fits, pass a function that takes a Tag and returns True for the tags to select
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

attrs: search for tags with specific attributes

Filters tags by attribute value; strings, regular expressions, and True are supported:

Python

# 1. str
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# 2. Regular expression
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# 3. True matches every tag that has the attribute
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Several filters can be combined; a tag must match all of them
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attribute names are not valid Python identifiers. In that case, pass them through the attrs keyword argument:

Python

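data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

# A special case: bs4 uses name for the tag's own name, so it cannot be used as an attribute filter; use attrs instead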
name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
name_soup.find_all(name="email")
# []
name_soup.find_all(attrs={"name": "email"})
# [<input name="email"/>]

# Likewise, class is a Python keyword and cannot be used as a keyword argument, so attrs is needed
# bs4 also provides class_ for this purpose
soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Tips

To be safe, you can always use attrs for attribute searches.
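
The class_ shortcut and the attrs dictionary can be compared directly. The sketch below uses made-up markup; note that a class filter matches a tag if any one of its classes matches.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="card big" data-id="7">A</div><div class="card">B</div>',
    "html.parser",
)

# class_ and attrs={"class": ...} select the same tags.
by_keyword = soup.find_all("div", class_="card")
by_attrs = soup.find_all("div", attrs={"class": "card"})
print([d.get_text() for d in by_keyword])  # ['A', 'B']
print(by_keyword == by_attrs)              # True

# Matching on one class of a multi-class tag.
print([d.get_text() for d in soup.find_all("div", class_="big")])  # ['A']

# data-id is not a valid keyword argument name, so it goes through attrs.
print([d.get_text() for d in soup.find_all(attrs={"data-id": "7"})])  # ['A']
```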

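string: search by the text content of tags

It likewise accepts a string, a list, a regular expression, a function, or True. Strings and lists are fairly limited: they only match text exactly, that is, text equal to the given value (for a Tag, its Tag.string must equal it). Regular expressions are the most commonly used: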

Python

soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

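This argument is a special case. name and attrs describe a Tag, so the results are naturally Tag objects, but if only string is given, the results are NavigableString objects. To get Tag objects back, also pass name=True: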

Python

soup.find_all(True, string=re.compile("Dormouse"))
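
The two result types can be compared on a small made-up document. In this sketch the enclosing p tag has two children, so its .string is None and only the b tag matches the combined search.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Total: <b>42 items</b></p>", "html.parser")
pattern = re.compile("items")

# string alone returns the matching text nodes.
strings = soup.find_all(string=pattern)
print(strings)  # ['42 items']

# Adding name=True returns the tags whose .string matches.
tags = soup.find_all(True, string=pattern)
print(tags)     # [<b>42 items</b>]

# A text node still leads back to its tag through .parent.
print(strings[0].parent is tags[0])  # True
```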

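CSS selectors

BeautifulSoup and Tag objects support CSS selectors through the .css property (select() and select_one() are also available directly on both objects):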

Python

soup.css.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.css.select("html head title")
# [<title>The Dormouse's story</title>]

Tips

In most cases you do not need CSS selectors in bs4; if you do, parsel is the better choice.
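
For occasional use inside bs4, the select() and select_one() methods cover the common cases. A minimal sketch on made-up markup, assuming the soupsieve dependency that bs4 normally installs alongside itself:

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">'
    '<a id="link1" href="/elsie">Elsie</a>'
    '<a id="link2" href="/lacie">Lacie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

# select returns a list of every match.
print([a["href"] for a in soup.select("p.story > a")])  # ['/elsie', '/lacie']

# select_one returns the first match, or None when nothing matches.
print(soup.select_one("#link2").get_text())             # Lacie
print(soup.select_one("a[href$='elsie']")["id"])        # link1
print(soup.select_one("table"))                         # None
```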