Modifying the Document Tree with BeautifulSoup: A Complete Guide to Adding, Removing, and Replacing Tags

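BeautifulSoup can do more than parse and search HTML documents: it can also modify the document tree. You can add, remove, and replace tags and text, and change tag attributes. These features are useful for HTML cleaning, content restructuring, and template generation. Every modification acts directly on the parse tree, and the modified tree can then be serialized as new HTML.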

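This article walks through BeautifulSoup's modification methods, from basic attribute changes to wrapping tags, so that you can edit HTML documents programmatically.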

1. Overview of Modifying the Document Tree

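BeautifulSoup provides a rich API for modifying a parsed HTML document. You can add new tags, delete unwanted elements, replace existing content, or wrap existing elements in new tags. All modifications act directly on the parse tree held in memory; when you are done, output the modified HTML with str(soup) or soup.prettify().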

Tip

Modifications happen in place and change the BeautifulSoup object directly. If you need to keep the original document, create a copy with copy.copy() first.
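
As a minimal sketch of this tip, the snippet below copies a parsed document with copy.copy() before editing it. The sample markup and the variable names original and working are made up for illustration.

```python
import copy
from bs4 import BeautifulSoup

# Illustrative markup; any parsed document behaves the same way.
original = BeautifulSoup('<div><p>draft</p><p>final</p></div>', 'html.parser')

# Copy first, then edit only the copy.
working = copy.copy(original)
working.p.decompose()

print(original)  # <div><p>draft</p><p>final</p></div>
print(working)   # <div><p>final</p></div>
```

The original tree is untouched because the copy is an independent tree, not a second reference to the same nodes.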

2. Syntax of the Modification Methods

Code example

# Modify attributes
tag['attr'] = 'value'
del tag['attr']

# Modify text
tag.string = 'new text'
tag.string.replace_with('new text')

# Add nodes
tag.append(new_tag_or_string)
tag.insert(position, new_tag_or_string)
tag.insert_before(new_tag_or_string)
tag.insert_after(new_tag_or_string)

# Remove nodes
tag.clear()          # remove the tag's contents
tag.decompose()      # remove the tag and its contents
tag.extract()        # remove the tag and return it

# Replace a tag
tag.replace_with(new_tag_or_string)

# Wrap and unwrap
tag.wrap(wrapper_tag)
tag.unwrap()

# Create new nodes
new_tag = soup.new_tag('tag_name', attrs={})
new_string = soup.new_string('text')
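
The attribute syntax listed first can be tried out with a short sketch like the one below. The link markup and the attribute values are invented for the example; only the dictionary-style access on the tag comes from the syntax list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/old" id="link1">Docs</a>', 'html.parser')
a = soup.a

a['href'] = '/new'               # overwrite an existing attribute
a['target'] = '_blank'           # add a new attribute
a['class'] = ['btn', 'primary']  # multi-valued attributes can be given as a list
del a['id']                      # remove an attribute

print(a.get('id'))  # None
print(a['class'])   # ['btn', 'primary']
print(a)
```

Using a.get('id') rather than a['id'] avoids a KeyError when the attribute is missing.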

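3. Modification Methods at a Glance

Method	Description	Return value
append(tag/string)	Adds a child node at the end	Version-dependent (None in older releases)
insert(pos, tag/string)	Inserts a child at the given position	Version-dependent (None in older releases)
insert_before(tag/string)	Inserts a sibling before the current tag	Version-dependent (None in older releases)
insert_after(tag/string)	Inserts a sibling after the current tag	Version-dependent (None in older releases)
clear()	Removes the tag's contents	None
decompose()	Destroys the tag and its contents	None
extract()	Removes the tag from the tree and returns it	The removed tag
replace_with(tag/string)	Replaces the current tag	The replaced tag
wrap(tag)	Wraps the current tag in another tag	The wrapper tag
unwrap()	Removes the tag but keeps its contents	The removed tag

Because the return values of the insertion methods differ between Beautiful Soup releases, it is safest not to rely on them.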
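
The return values of the removal, replacement, and wrapping methods can be checked with a small sketch such as the following. The markup is illustrative, and the assertions only cover the methods whose return values are stable across releases.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>one</p><b>two</b><i>three</i></div>', 'html.parser')

# extract() returns the node it removed.
p = soup.p
assert p.extract() is p

# replace_with() returns the node that was replaced.
b = soup.b
em = soup.new_tag('em')
em.string = 'TWO'
assert b.replace_with(em) is b

# wrap() returns the wrapper; unwrap() returns the tag that was stripped away.
wrapper = soup.new_tag('span')
assert soup.i.wrap(wrapper) is wrapper
assert wrapper.unwrap() is wrapper

# clear() returns nothing.
assert soup.div.clear() is None

print(soup)  # <div></div>
```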

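4. Adding and Inserting Tags

append() adds a node after the last child, insert() inserts a child at a given position, and insert_before() and insert_after() insert sibling nodes before or after the current tag.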

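Code example

from bs4 import BeautifulSoup

html = '<ul></ul>'
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

# append: add as the last child
li1 = soup.new_tag('li')
li1.string = 'Item 1'
ul.append(li1)

# insert: insert at a given position (index 0 means the very front)
li2 = soup.new_tag('li')
li2.string = 'Item 2'
ul.insert(0, li2)

print(f"After adding: {ul}")

# insert_after: add a sibling after the <li> that holds 'Item 1'
target = soup.find(string='Item 1')
new_li = soup.new_tag('li')
new_li.string = 'Item 1.5'
target.parent.insert_after(new_li)

print(f"After inserting: {ul}")

Output:

After adding: <ul><li>Item 2</li><li>Item 1</li></ul>
After inserting: <ul><li>Item 2</li><li>Item 1</li><li>Item 1.5</li></ul>

Note that insert(0, li2) places "Item 2" at the front of the list. insert_after() is called on target.parent, which is the <li> containing the matched text (not the <ul>), so the new <li> becomes that element's next sibling inside the <ul>.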
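
insert_before() is described in this section but not exercised, so here is a minimal sketch of it next to insert_after(). It also shows that both methods accept plain strings as well as tags. The markup is made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>world</b></p>', 'html.parser')
b = soup.b

# A plain string becomes a text node placed just before <b>.
b.insert_before('Hello, ')

# A new tag becomes the sibling placed just after <b>.
mark = soup.new_tag('i')
mark.string = '!'
b.insert_after(mark)

print(soup.p)  # <p>Hello, <b>world</b><i>!</i></p>
```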

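5. Removing and Replacing Tags

BeautifulSoup offers three removal methods: clear() empties a tag but keeps the tag itself, decompose() destroys the tag completely, and extract() removes the tag from the tree and returns a reference to it.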

Code example

from bs4 import BeautifulSoup

html = """
<div>
    <p class="keep">Paragraph to keep</p>
    <p class="remove">Paragraph to remove</p>
    <p class="replace">Paragraph to replace</p>
    <p class="ad">Ad content</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# extract: remove the tag and return it
removed = soup.find(class_='remove').extract()
print(f"Removed tag: {removed}")

# decompose: destroy the tag (nothing is returned)
ad = soup.find(class_='ad')
ad.decompose()
print(f"After destroying the ad: {soup.div}")

# replace_with: swap one tag for another
old = soup.find(class_='replace')
new_tag = soup.new_tag('h3')
new_tag.string = 'Replacement heading'
old.replace_with(new_tag)
print(f"After replacing: {soup.div}")

Output:

Removed tag: <p class="remove">Paragraph to remove</p>
After destroying the ad: <div>
    <p class="keep">Paragraph to keep</p>
    
    <p class="replace">Paragraph to replace</p>
    
</div>
After replacing: <div>
    <p class="keep">Paragraph to keep</p>
    
    <h3>Replacement heading</h3>
    
</div>

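extract() returned the removed tag object, which can be inserted back into the document later. decompose(), by contrast, destroyed the ad tag for good; it cannot be recovered. The blank lines in the output are the whitespace text nodes that surrounded the removed tags and were left in place.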
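
The example in this section does not show clear() or the reinsertion of an extracted tag, so the following sketch covers both. The ids a and b and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="a"><p>one</p><p>two</p></div><div id="b"></div>',
    'html.parser',
)
source = soup.find(id='a')
target = soup.find(id='b')

# extract() hands back the first <p>, which can be attached somewhere else.
moved = source.p.extract()
target.append(moved)

# clear() drops whatever is left inside the source <div> but keeps the <div>.
source.clear()

print(soup)  # <div id="a"></div><div id="b"><p>one</p></div>
```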

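6. Wrapping Tags and Modifying Text

wrap() wraps the current element in a new tag; unwrap() does the opposite and removes a wrapping tag while keeping its contents. Both are handy when processing rich-text content.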

Code example

from bs4 import BeautifulSoup

html = '<p>This is text with an <em>important</em> part</p>'
soup = BeautifulSoup(html, 'html.parser')

# wrap: enclose the element in a new tag
em = soup.em
em.wrap(soup.new_tag('strong'))
print(f"After wrap: {soup.p}")

# unwrap: remove the wrapping tag
strong = soup.strong
strong.unwrap()
print(f"After unwrap: {soup.p}")

# Modify text: direct assignment removes all child nodes
p = soup.p
p.string = 'Brand-new text'
print(f"After setting the text: {p}")

# replace_with: replace the text node
p2 = BeautifulSoup('<p>Original text</p>', 'html.parser').p
p2.string.replace_with('Replaced text')
print(f"After replace_with: {p2}")

Output:

After wrap: <p>This is text with an <strong><em>important</em></strong> part</p>
After unwrap: <p>This is text with an <em>important</em> part</p>
After setting the text: <p>Brand-new text</p>
After replace_with: <p>Replaced text</p>

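Tip: assigning tag.string = 'new text' removes every child node of the tag and replaces them with a single text node. tag.string.replace_with() only works when the tag has a single child; when the tag contains child tags alongside text, tag.string is None and the call fails. To keep the child tags, locate the specific text node (for example through tag.contents or find(string=...)) and call replace_with() on that node.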
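
A minimal sketch of replacing one text node while keeping the child tags follows. The markup is illustrative; the text node is picked by position through contents, which is one of several ways to reach it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Price: <b>10</b> USD</p>', 'html.parser')
p = soup.p

# With several children, .string is None, so p.string.replace_with() would fail.
print(p.string)  # None

# Replace only the first text node; <b> and the trailing text stay in place.
label = p.contents[0]
label.replace_with('Cost: ')

print(p)  # <p>Cost: <b>10</b> USD</p>
```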

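7. Practical Use Cases

Use case 1: HTML cleaning, removing ads, scripts, styles, and other irrelevant content

After scraping a page you often need to strip out script, style, and ad elements and keep only the main content:

Code example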

from bs4 import BeautifulSoup

html = """
<html>
<head>
    <style>.ad { display: none; }</style>
    <script>alert('ad');</script>
</head>
<body>
    <article>
        <h1>Article title</h1>
        <p>This is the body text...</p>
        <div class="ad">This is an ad</div>
    </article>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Remove every script and style tag
for tag in soup.find_all(['script', 'style']):
    tag.decompose()

# Remove ad elements
for ad in soup.find_all(class_='ad'):
    ad.decompose()

print(soup.prettify())
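
The same cleaning steps can be packaged as a reusable helper. The sketch below is one possible shape, not a definitive implementation: the function name clean_html and its ad_class parameter are hypothetical, and the extra step that strips HTML comments is an assumption about what "irrelevant content" might include.

```python
from bs4 import BeautifulSoup, Comment


def clean_html(html, ad_class="ad"):
    """Return a soup with scripts, styles, ad blocks, and comments removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Same removals as in the example: script/style tags and ad elements.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    for ad in soup.find_all(class_=ad_class):
        ad.decompose()

    # Assumed extra step: HTML comments are text nodes of type Comment.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    return soup


cleaned = clean_html(
    '<div><script>x()</script><!-- note --><p>Body</p><div class="ad">Ad</div></div>'
)
print(cleaned)  # <div><p>Body</p></div>
```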

Use case 2: content restructuring, reorganizing extracted data into a new HTML structure

Code example

from bs4 import BeautifulSoup

# Original table data
html = """
<table>
    <tr><td>Zhang San</td><td>25</td><td>Beijing</td></tr>
    <tr><td>Li Si</td><td>30</td><td>Shanghai</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
new_div = soup.new_tag('div', attrs={'class': 'user-cards'})

# Turn each table row into one card <div>
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    card = soup.new_tag('div', attrs={'class': 'card'})
    card.append(f'{cells[0].string}, age {cells[1].string}, from {cells[2].string}')
    new_div.append(card)

print(f"Restructured:\n{new_div}")

Use case 3: HTML template generation, filling content into a template dynamically

Code example

from bs4 import BeautifulSoup

# Base template
template = '''
<div class="product">
    <h2 class="name">{{name}}</h2>
    <p class="price">¥{{price}}</p>
    <p class="desc">{{description}}</p>
</div>
'''

soup = BeautifulSoup(template, 'html.parser')
# Assigning .string replaces the whole placeholder text of each element
soup.find(class_='name').string = 'Introduction to Python Programming'
soup.find(class_='price').string = '¥59.90'
soup.find(class_='desc').string = 'A Python tutorial for beginners'

print(soup.prettify())

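8. Notes

Note: decompose() destroys a tag for good and it cannot be recovered; extract() removes the tag but keeps a reference to it, so it can be reinserted.

Note: modifications act directly on the parse tree. To keep the original document, make a copy with copy.copy() first.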

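Note: a node can be attached to only one place at a time. Inserting a tag that already sits in a tree (including a tree that belongs to another soup) moves it rather than copying it; use copy.copy() if both places need the tag.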

Note: assigning tag.string removes all child nodes of the tag and replaces them with a single text node.

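9. Comparison of the Removal Methods

Method	Description	Return value	Recoverable?
clear()	Removes the contents (keeps the tag)	None	No
decompose()	Destroys the tag and its contents	None	No
extract()	Removes the tag from the tree	The removed Tag	Yes (can be reinserted)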

10. FAQ

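Should I choose decompose() or extract()?

If you are sure you no longer need the removed element, decompose() is the more thorough choice and lets its memory be released. If you may want to insert the element somewhere else later, use extract(), which returns the removed Tag object.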

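What is the difference between wrap() and insert_after()?

wrap() makes the current tag a child of the new tag (a containment relationship); for example, <em>text</em> becomes <strong><em>text</em></strong>. insert_after() inserts a sibling node after the current tag, so the two nodes sit at the same level.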

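Can a tag created with new_tag() be used in another soup?

Not in two places at once. A tag belongs to at most one tree at a time, so it cannot be shared between soups. To move a tag from one soup into another, take it out with extract() and then insert it into the target soup; to keep it in both, insert a copy made with copy.copy().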
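
Moving a tag between two soups with extract() can be sketched as follows. The two documents and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup

source = BeautifulSoup('<div><p>moved</p></div>', 'html.parser')
target = BeautifulSoup('<section></section>', 'html.parser')

# Detach the paragraph from the first document, then attach it to the second.
paragraph = source.p.extract()
target.section.append(paragraph)

print(source)  # <div></div>
print(target)  # <section><p>moved</p></section>
```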

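How do I change an attribute on many tags at once?

Find all the target tags with find_all() and loop over them, for example: for img in soup.find_all('img'): img['loading'] = 'lazy'. This adds the lazy-loading attribute to every image in one pass.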
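
Written out as a runnable sketch, the loop looks like this. The image markup is illustrative; note that an existing loading value is overwritten.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><img src="a.png"/><img src="b.png" loading="eager"/></p>',
    'html.parser',
)

# Set (or overwrite) the attribute on every matching tag.
for img in soup.find_all('img'):
    img['loading'] = 'lazy'

print([img['loading'] for img in soup.find_all('img')])  # ['lazy', 'lazy']
```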

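What is a typical use of unwrap()?

unwrap() is commonly used to strip unnecessary wrapper tags. For example, removing the span from <span>text</span> leaves only "text". It comes up often when cleaning HTML formatting.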
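
A minimal sketch of that cleanup, assuming every span in the fragment is disposable:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p>Hello <span>big</span> <span class="x">world</span></p>',
    'html.parser',
)

# Remove each <span> but keep the text it contained.
for span in soup.find_all('span'):
    span.unwrap()

print(soup.p)  # <p>Hello big world</p>
```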

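Summary

append() and insert() add child nodes; insert_before() and insert_after() add siblings next to the current tag
extract() removes a tag and returns a reference to it; decompose() destroys the tag completely
replace_with() replaces a tag; wrap() and unwrap() add and remove wrapper tags
Modifications act directly on the parse tree, so keep a copy of the original document when you need it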

Exercises

Exercise 1

Write a program that parses a piece of HTML and uses decompose() to remove all