Python Web Scraping: BeautifulSoup (Navigating the Document Tree)

不怕猫的耗子A · Published 2020-01-16 22:34:03

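Background and review

1. Lately my work has revolved around XML messages, mainly parsing them and loading them into a database. Automating this means parsing XML in scripts. I had studied BeautifulSoup before, but while writing the script I found that what I had learned was incomplete (at the time I did not expect to need this much, so I only skimmed it). This post reviews it and fills in the gaps.

2. In practice, BeautifulSoup parses XML almost the same way it parses HTML. The difference is the parser argument, which must be "xml" in lowercase (this parser is provided by the lxml package). An uppercase "XML" is not a recognized parser name, so BeautifulSoup raises an error instead of parsing.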
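
The point about the parser name can be checked with a small sketch. It assumes lxml is installed; the sample XML and the variable names are made up for illustration, and the uppercase case is expected to fail with bs4.FeatureNotFound because parser names are looked up as given.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample message, not taken from the article.
xml = "<order><item id='1'>pen</item><item id='2'>ink</item></order>"

# Lowercase "xml" selects the lxml-based XML parser.
soup = BeautifulSoup(xml, "xml")
items = [(item["id"], item.string) for item in soup.find_all("item")]
print(items)

# "XML" is not a registered parser name, so construction should fail.
try:
    BeautifulSoup(xml, "XML")
    uppercase_accepted = True
except FeatureNotFound:
    uppercase_accepted = False
print(uppercase_accepted)
```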

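3. Tag objects: a tag is one element (a start/end tag pair) in an HTML or XML document. When parsing, the goal is usually to find every tag, or one specific tag (a Tag object), together with the strings inside it (NavigableString and Comment objects), its name (the .name attribute), and its attributes (the .attrs attribute).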
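
For XML messages this often comes down to mapping each leaf tag's name to its text. The following is a minimal sketch of that idea, assuming lxml is installed; the shortened message and the name leaf_text are illustrative only. A tag counts as a leaf here when it contains no child tags.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical message in the same style as the XML used later on.
xml = """<nitf>
  <body>
    <headline><hl1>143 Dead in Colombia Earthquake</hl1></headline>
    <dateline>
      <location>Bogota, Colombia</location>
      <date>Monday January 25 1999 7:28 ET</date>
    </dateline>
  </body>
</nitf>"""

soup = BeautifulSoup(xml, "xml")

leaf_text = {}
for tag in soup.body.find_all(True):   # True matches every tag
    if tag.find(True) is None:         # no child tags, so this is a leaf
        leaf_text[tag.name] = tag.get_text(strip=True)
print(leaf_text)
```

Checking for child tags, instead of testing whether .string is None, avoids picking up wrapper tags such as headline, whose .string would otherwise resolve to the text of its only child.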

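Navigating the document tree

Child nodes

1. A Tag may contain several strings or other Tags; all of these are its children. Beautiful Soup provides many attributes for working with and iterating over children.

2. Getting a specific Tag object was covered earlier: soup.tagname returns the first matching Tag, and find_all("tagname") returns all matching Tags.

3. A handy trick: attribute access can be chained on tags in the tree. For example, soup.p.a returns the first <a> inside the first <p>.

Note: strings in Beautiful Soup do not support these attributes, because a string has no children.

Example 1: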

from bs4 import BeautifulSoup
html = """
  <p class="story">Once upon a time there were three little sisters; and their names were 
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
  </p> 
"""

soup = BeautifulSoup(html, "lxml")

tag_p = soup.p
print(tag_p)
print("Dot access only returns the first tag with that name")
print(soup.find_all("a"))
print("find_all() returns every matching tag")
tag_p_a = tag_p.a
print(tag_p_a)

"""
<p class="story">Once upon a time there were three little sisters; and their names were 
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and 
  </p>
Dot access only returns the first tag with that name
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
find_all() returns every matching tag
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
"""

.contents

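1. A tag's .contents attribute returns the tag's children as a list. The list can contain NavigableString objects as well as Tag objects; each string is its own list element.

2. Note that even a lone newline \n takes up one position in .contents.

3. .contents only lists direct children. If a child has children of its own, they are not listed separately; they appear nested inside that child.

4. Strings have no .contents attribute, because a string has no children.

Example 2: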

from bs4 import BeautifulSoup

html = """
<html>
 <head>
  <title>The Dormouse's story</title>
 </head> 
 <body> 
  <p class="title"><b>The Dormouse's story</b></p> 
  <p class="story">Once upon a time there were three little sisters; and their names were 
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
  </p> 
  <p class="story">...</p>
 </body>
</html>
"""

soup = BeautifulSoup(html,"lxml")

tag_head = soup.head            
tag_title = tag_head.contents    # direct children of <head>, as a list
print(tag_head)
print(tag_title)

for i in soup.find_all("p"):    # direct children of every <p> tag
    print(i.contents)

"""
# Alternative: tell Tag objects apart from strings (needs: from bs4.element import Tag)
for i in soup.find_all("p"):
    for n in i.contents:
        if isinstance(n, Tag):
            print("Tag object:", n)
        else:
            print("String object:", n)
"""

"""
<head>
<title>The Dormouse's story</title>
</head>
['\n', <title>The Dormouse's story</title>, '\n']
[<b>The Dormouse's story</b>]
['Once upon a time there were three little sisters; and their names were \n    ', <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, ', \n    ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and \n    ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well.\n  ']
['...']
"""

Note:
If an element of the list in the example above is a Tag object, its .name and .attrs attributes are available, and so is .string, which returns its NavigableString.
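
The note above can be sketched as follows. The markup and the names tags and texts are illustrative, and html.parser is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<p class="story">Once <a href="elsie" id="link1">Elsie</a> and <b>more</b></p>'
soup = BeautifulSoup(html, "html.parser")

tags, texts = [], []
for node in soup.p.contents:
    if isinstance(node, Tag):
        # Tag objects expose .name, .attrs and .string
        tags.append((node.name, node.attrs, node.string))
    elif isinstance(node, NavigableString):
        # Plain text between the tags
        texts.append(str(node))

print(tags)
print(texts)
```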

.children

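1. A tag's .children attribute lets you loop over the tag's direct children.

2. .children returns an iterator, not a list. In the run shown below it is a list_iterator over the same nodes as .contents, and like any iterator it can be consumed only once.

3. Like .contents, .children only reaches direct children; the children of a child are not yielded separately.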
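
A short sketch of the single-pass behaviour, with made-up markup and html.parser chosen only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")
ul = soup.ul

children = ul.children
first_pass = [li.string for li in children]
second_pass = [li.string for li in children]   # iterator is already exhausted

# A fresh .children yields the same nodes that .contents holds.
same_as_contents = list(ul.children) == ul.contents

print(first_pass, second_pass, same_as_contents)
```

When the children are needed more than once, .contents (or list(tag.children)) is the simpler choice.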

Example 3:

from bs4 import BeautifulSoup

html = """
<html>
 <head>
  <title>The Dormouse's story</title>
 </head> 
 <body> 
  <p class="title"><b>The Dormouse's story</b></p> 
  <p class="story">Once upon a time there were three little sisters; and their names were 
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
  </p> 
  <p class="story">...</p>
 </body>
</html>
"""

soup = BeautifulSoup(html,"lxml")

tag_head = soup.head
tag_title = tag_head.children    # iterator over the direct children of <head>
print(tag_head)
print(tag_title)

for i in soup.find_all("p"):    # direct children of every <p> tag
    tag = i.children  # returned as an iterator
    print(tag)
    print("--------")
    for n in tag:
        print(n)

"""
<head>
<title>The Dormouse's story</title>
</head>
<list_iterator object at 0x0000028AD1BB7470>
<list_iterator object at 0x0000028AD1BB74E0>
--------
<b>The Dormouse's story</b>
<list_iterator object at 0x0000028AD1BB7550>
--------
Once upon a time there were three little sisters; and their names were 
    
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
, 
    
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and 
    
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
; and they lived at the bottom of a well.
  
<list_iterator object at 0x0000028AD1BB74E0>
--------
...
"""

.descendants

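1. .contents and .children cover only a tag's direct children. If a child has children of its own (grandchildren), they are returned nested inside that child, not separately.

2. .descendants iterates recursively over all of a tag's descendants: its children, their children, and so on.

3. .descendants returns a generator (the output below shows a generator object, not a list iterator). It is used in a loop the same way as .children.

4. The output contains NavigableString objects as well as Tag objects.

5. <title>The</title> also has a child: the string "The". That string is a descendant of <title>, and the .contents of <title> is ["The"].

Example 4: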

from bs4 import BeautifulSoup

html = """
<html>
 <head>
  <title>The Dormouse's story</title>
 </head> 
 <body> 
  <p class="title"><b>The Dormouse's story</b></p> 
  <p class="story">Once
    <a href="elsie" class="sister" id="link1">Elsie</a>
    <a href="lacie" class="sister" id="link2">Lacie</a>
    <a href="tillie" class="sister" id="link3">Tillie</a>well
  </p> 
  <p class="story">...</p>
 </body>
</html>
"""

soup = BeautifulSoup(html, "lxml")

tag_body = soup.body
tag_title = tag_body.descendants
print(tag_body)
print(tag_title)

for i in tag_title:
    print(i)
"""
# Alternative: print only the Tag objects (needs: from bs4.element import Tag)
for i in tag_title:
    if isinstance(i, Tag):
        print("Tag object:", i)
"""

"""
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once
    <a class="sister" href="elsie" id="link1">Elsie</a>
<a class="sister" href="lacie" id="link2">Lacie</a>
<a class="sister" href="tillie" id="link3">Tillie</a>well
  </p>
<p class="story">...</p>
</body>
<generator object descendants at 0x0000025567E36D58>

<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story

<p class="story">Once
    <a class="sister" href="elsie" id="link1">Elsie</a>
<a class="sister" href="lacie" id="link2">Lacie</a>
<a class="sister" href="tillie" id="link3">Tillie</a>well
  </p>
Once
    
<a class="sister" href="elsie" id="link1">Elsie</a>
Elsie

<a class="sister" href="lacie" id="link2">Lacie</a>
Lacie

<a class="sister" href="tillie" id="link3">Tillie</a>
Tillie
well
  
<p class="story">...</p>
...
"""

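Note: the output above shows that
1. .descendants walks the tree depth first. It yields a child (printed with everything nested inside it), then that child's own descendants, and only then moves on to the next child. Tags with nested content, the nested tags themselves, and NavigableString objects are all yielded.

2. Because the output mixes Tag and NavigableString objects, an if test can separate them. The code here uses if str(type(i)) == "<class 'bs4.element.Tag'>"; isinstance(i, Tag), with Tag imported from bs4.element, is the more idiomatic check.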
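
A compact sketch contrasting .children with .descendants and separating tags from strings with isinstance. The markup and variable names are illustrative; html.parser is used so that no wrapper tags are added.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>Hello <b>world</b></p><i>!</i></div>", "html.parser")
div = soup.div

# Direct children only
direct = [node.name for node in div.children]

# Every level, in depth-first document order
nested_tags = [node.name for node in div.descendants if isinstance(node, Tag)]
nested_text = [str(node) for node in div.descendants if not isinstance(node, Tag)]

print(direct)
print(nested_tags)
print(nested_text)
```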

Example 5:

from bs4 import BeautifulSoup

xml = """
<?xml version="1.0" encoding="utf-8"?>
<nitf> 
  <head> 
    <title>Colombia Earthquake</title> 
  </head>  
  <body> 
    <headline> 
      <hl1>143 Dead in Colombia Earthquake</hl1> 
    </headline>  
    <byline> 
      <bytag>By Jared Kotler, Associated Press Writer</bytag> 
    </byline>  
    <dateline> 
      <location>Bogota, Colombia</location>  
      <date>Monday January 25 1999 7:28 ET</date> 
    </dateline> 
  </body> 
</nitf>
"""

soup = BeautifulSoup(xml,"xml")

tag_name_list = []
tag_value_list = []
for i in soup.body.descendants:
    if str(type(i)) == "<class 'bs4.element.Tag'>":
        tag_name_list.append(i.name)
    elif i != "\n":
        tag_value_list.append(i)
print(tag_value_list)
print(tag_name_list)
# The two lists cannot simply be zipped into a dict: tags that only wrap child tags have no NavigableString of their own (.string returns None)

dict1 ={}
for n in tag_name_list:
    print(soup.find(n).name)
    print(str(soup.find(n).string))
    #print({soup.find(n).name:str(soup.find(n).string)})
    if str(soup.find(n).string) != "None":
        dict1[soup.find(n).name] = str(soup.find(n).string)
print(dict1)

"""
['143 Dead in Colombia Earthquake', 'By Jared Kotler, Associated Press Writer', 'Bogota, Colombia', 'Monday January 25 1999 7:28 ET']
['headline', 'hl1', 'byline', 'bytag', 'dateline', 'location', 'date']
headline
None
hl1
143 Dead in Colombia Earthquake
byline
None
bytag
By Jared Kotler, Associated Press Writer
dateline
None
location
Bogota, Colombia
date
Monday January 25 1999 7:28 ET
{'hl1': '143 Dead in Colombia Earthquake', 'location': 'Bogota, Colombia', 'bytag': 'By Jared Kotler, Associated Press Writer', 'date': 'Monday January 25 1999 7:28 ET'}
"""

.strings

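1. As noted earlier, tag.string returns a tag's NavigableString, but .string has limits:
   (1) If a tag has exactly one child and it is a NavigableString, .string returns that string.
   (2) If a tag has exactly one child and it is another tag, .string returns the same result as that child's .string.
   (3) If a tag has more than one child, it is unclear which one .string should refer to, so .string is None.

2. If a tag contains several strings (several children), loop over tag.strings to get every NavigableString beneath it, including those inside nested tags.
   (1) The strings may contain many spaces and blank lines; .stripped_strings strips the surrounding whitespace and skips whitespace-only strings.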
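
The difference between .string, .strings and .stripped_strings can be sketched like this. The markup is a shortened variant of the sample page, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = """<p class="story">Once
    <a href="elsie">Elsie</a>ss
    <a href="lacie">Lacie</a>
    <a href="tillie">Tillie</a>well
</p>"""
soup = BeautifulSoup(html, "html.parser")

single = soup.p.string                   # None: <p> has several children
raw = list(soup.p.strings)               # includes whitespace-only strings
clean = list(soup.p.stripped_strings)    # stripped, whitespace-only ones skipped

print(single)
print(clean)
```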

Example 6:

from bs4 import BeautifulSoup

html = """
<html>
 <head>
  <title>The Dormouse's story</title>
 </head> 
 <body> 
  <p class="story">Once
    <a href="elsie" class="sister" id="link1">Elsie</a>ss
    <a href="lacie" class="sister" id="link2">Lacie</a>
    <a href="tillie" class="sister" id="link3">Tillie</a>well
  </p> 
  <p class="story">...</p>
 </body>
</html>

"""

soup = BeautifulSoup(html, "lxml")

title_string = soup.title.string
print(title_string)
print("-------")
p_string = soup.p.string    # the first <p> has several children, so this is None
print(p_string)
print("-------")
ps_string = soup.p.strings    # .stripped_strings can be used here as well
print(ps_string)
print("-------")
for i in ps_string:    # str() converts a NavigableString into a plain string
    print(i)

"""
The Dormouse's story
-------
None
-------
<generator object _all_strings at 0x00000289D26A6D58>
-------
Once
    
Elsie
ss
   
Lacie

Tillie
well
"""

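CSS selectors

1. Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

2. The return value is a list of all matching Tag objects. With a bare tag name the result is the same as find_all(), but select() also accepts class, id, attribute, and combinator selectors.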
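
A few selector forms beyond a bare tag name, as a sketch. The markup and variable names are made up for illustration, and html.parser is used only so the snippet runs without lxml; select_one() returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

html = """<div><p class="story">Once
<a href="elsie" class="sister" id="link1">Elsie</a>
<a href="lacie" class="sister" id="link2">Lacie</a></p>
<a href="tillie" class="brother" id="link3">Tillie</a></div>"""
soup = BeautifulSoup(html, "html.parser")

by_class = [a["id"] for a in soup.select("a.sister")]       # class selector
by_id = soup.select_one("#link2").string                    # id selector, first match
children = [a["id"] for a in soup.select("p > a")]          # direct children of <p>
by_attr = [a["id"] for a in soup.select('a[href^="t"]')]    # href starts with "t"

print(by_class, by_id, children, by_attr)
```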

Example 7:

from bs4 import BeautifulSoup

html = """
<html>
 <head>
  <title>The Dormouse's story</title>
 </head> 
 <body> 
  <p class="story">Once
    <a href="elsie" class="sister" id="link1">Elsie</a>ss
    <a href="lacie" class="sister" id="link2">Lacie</a>
    <a href="tillie" class="sister" id="link3">Tillie</a>well
  </p> 
  <p class="story">...</p>
 </body>
</html>
"""

soup = BeautifulSoup(html, "lxml")

tag_a = soup.select("a")
print(tag_a)

"""
[<a class="sister" href="elsie" id="link1">Elsie</a>,
 <a class="sister" href="lacie" id="link2">Lacie</a>,
 <a class="sister" href="tillie" id="link3">Tillie</a>]
"""

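Modifying the document tree

Changing a tag's name and attributes

1. A tag's .name is a string and can be reassigned directly. Its .attrs is a dictionary, so attributes can be read, set, and deleted the same way as dictionary entries, for example tag["href"] = "rrrr".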
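
The dictionary-style operations can be sketched as follows. The markup and variable names are illustrative and html.parser is used only to avoid the lxml dependency. Note that with an HTML parser the class attribute is treated as multi-valued and comes back as a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="elsie" class="sister big" id="link1">Elsie</a>', "html.parser")
tag = soup.a

name_is_str = isinstance(tag.name, str)   # .name is a plain string
classes = tag["class"]                    # multi-valued attribute: a list
tag["title"] = "first sister"             # add an attribute
del tag["id"]                             # remove an attribute
missing = tag.get("id")                   # .get() returns None when absent

print(tag.attrs)
```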

Example 8:

from bs4 import BeautifulSoup

html = """
<a href="elsie" class="sister" id="link1">Elsie</a>ss
"""

soup = BeautifulSoup(html, "lxml")

tag_a = soup.find("a")
tag_a.name = "aa"
tag_a["href"] = "rrrr"
print(tag_a)

"""
<aa class="sister" href="rrrr" id="link1">Elsie</aa>
"""

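Modifying .string

1. Assigning to a tag's .string replaces the tag's existing content with the new string.

2. Note: if the tag contains other tags, assigning to .string overwrites everything inside it, including the child tags, which no longer exist afterwards.

3. A string inside a tag cannot be edited in place, but it can be swapped for another string with replace_with().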
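
A small sketch of points 2 and 3, using made-up markup and html.parser to keep it free of the lxml dependency. Assigning to .string on the outer tag removes the nested tag, while replace_with() on the inner string leaves the surrounding structure alone.

```python
from bs4 import BeautifulSoup

# Assigning to .string wipes out child tags.
soup = BeautifulSoup("<p>Hi <b>there</b></p>", "html.parser")
had_b_before = soup.p.find("b") is not None
soup.p.string = "replaced"
has_b_after = soup.p.find("b") is not None
print(soup.p)

# replace_with() swaps only the one string.
soup2 = BeautifulSoup("<p>Hi <b>there</b></p>", "html.parser")
soup2.b.string.replace_with("everyone")
print(soup2.p)
```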

Example 9:

from bs4 import BeautifulSoup
html = """
  <title>The Dormouse's story</title>
  <p class="story">NavigableString under the p tag
    <a href="elsie" class="sister" id="link1">Elsie</a>ss
    <a href="lacie" class="sister" id="link2">Lacie</a>
    <a href="tillie" class="sister" id="link3">Tillie</a>well
  <blockquote>No longer bold</blockquote>
  </p> 
"""

soup = BeautifulSoup(html, "lxml")

tag_title = soup.find("title")
tag_title.string = "aa"
print(tag_title)

tag_p = soup.find("p")
tag_p.string = "qqqq"
print(tag_p)

tag_blockquote = soup.find("blockquote")
tag_blockquote.string.replace_with("wwww")
print(tag_blockquote)

"""
<title>aa</title>
<p class="story">qqqq</p>
<blockquote>wwww</blockquote>
"""

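append()

1. Tag.append() adds content to a tag, much like .append() on a Python list.

2. To add text, either pass a plain Python string to append(), or create the string first with the factory method BeautifulSoup.new_string() and append the result.

3. To create a comment, or any other subclass of NavigableString, pass the subclass as the second argument of new_string() (this needs from bs4 import Comment).

Example 10: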

from bs4 import BeautifulSoup
from bs4 import Comment
html = """
  <title>The Dormouse's story</title>
  <p class="story">
    <a href="elsie" class="sister" id="link1">Elsie</a>ss
    <a href="lacie" class="sister" id="link2">Lacie</a>
    <a href="tillie" class="sister" id="link3">Tillie</a>well
  </p> 
"""

soup = BeautifulSoup(html, "lxml")
tag_title = soup.find("title")

tag_title.append("aaaa")

new_string_1 = soup.new_string("cccc")
tag_title.append(new_string_1)

new_string_2 = soup.new_string("bbbb", Comment)  # needs: from bs4 import Comment

tag_title.append(new_string_2)
print(tag_title)

"""
<title>The Dormouse's storyaaaacccc<!--bbbb--></title>
"""

clear()

Tag.clear() removes everything inside a tag: its NavigableString objects, its child tags, and the strings inside those child tags. The tag itself and its attributes remain.

Example 11:

from bs4 import BeautifulSoup
html = """
  <title>The Dormouse's story</title>
  <p class="story">NavigableString under the p tag
    <a href="elsie" class="sister" id="link1">Elsie</a>ss
    <a href="lacie" class="sister" id="link2">Lacie</a>
    <a href="tillie" class="sister" id="link3">Tillie</a>well
  </p> 
"""

soup = BeautifulSoup(html, "lxml")
tag_p = soup.find("p")
tag_p.clear()

print(tag_p)

"""
<p class="story"></p>
"""

new_tag()

1. The best way to create a tag is the factory method BeautifulSoup.new_tag().

2. Create the tag with new_tag(), then add it to the tree with append(), insert(), insert_after(), or insert_before().

Example 12:

from bs4 import BeautifulSoup
html = """
<title>The Dormouse's story</title>
<p>
</p> 
"""
soup = BeautifulSoup(html, "lxml")
tag_p = soup.find("p")

new_tag_1 = soup.new_tag("a", href="http://www.example.com")  # create a new tag
tag_p.append(new_tag_1)  # append it inside <p>

new_tag_2 = soup.new_tag("c")  # create another new tag
soup.find("p").append(new_tag_2)  # append it inside <p> as well

print(tag_p)

"""
<p>
<a href="http://www.example.com"></a><c></c></p>
"""

Example 12_1:

from bs4 import BeautifulSoup
html = """
    <headline> 
    </headline>
"""
soup = BeautifulSoup(html, "xml")  # use the XML parser
new_tag = soup.new_tag("a", href="http://www.example.com")  # create a new tag
soup.find("headline").append(new_tag)
#soup.a.string = "222"  # to get a separate end tag, give the new tag a string; an empty string also works
print(soup.headline)

"""
<headline>
<a href="http://www.example.com"/></headline>
"""

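Note:
The two examples above show that
1. tag.append() can add a new Tag object as well as a string (NavigableString); either one becomes the last child of the target tag.

2. Output differs slightly between the "xml" and "lxml" parsers. With "xml", an empty new tag is written in self-closing form (<a/>), which in XML is equivalent to <a></a>; there are other differences as well. My impression is still that BeautifulSoup is more comfortable for HTML than for XML.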
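
Besides append(), the insert() method mentioned earlier places a node at a given index among a tag's children. A minimal sketch, with made-up markup and html.parser used only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>two</b></p>", "html.parser")

first = soup.new_tag("i")
first.string = "one"

soup.p.insert(0, first)      # becomes the first child of <p>
soup.p.insert(1, " and ")    # a plain string is inserted as a text node

print(soup.p)
```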

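insert_before() and insert_after()

1. insert_before() inserts content immediately before the current tag or text node.

2. insert_after() inserts content immediately after the current tag or text node.

Example 13: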

from bs4 import BeautifulSoup
html = """
<b>stop</b>
"""
soup = BeautifulSoup(html, "lxml")

new_tag = soup.new_tag("a")  # create a new tag
new_tag.string = "Don't"  # set the new tag's string
soup.find("b").string.insert_before(new_tag)

print(soup.b)

"""
<b><a>Don't</a>stop</b>
"""

Example 13_1:

from bs4 import BeautifulSoup
html = """
<b>stop</b>
"""
soup = BeautifulSoup(html, "lxml")

soup.b.string.insert_after(soup.new_string("ever"))

print(soup.b)

"""
<b>stopever</b>
"""

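Note:
In both examples the method is called on a NavigableString (tag.string.method()), so the new node lands inside the tag, next to its text. Called on the Tag object itself, the same methods insert next to the tag, outside it, so printing only that tag shows no change.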
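
Calling the two methods on the Tag itself can be sketched like this. The <p> wrapper and the inserted "!" are made up for illustration, and html.parser is used only to avoid the lxml dependency. The inserted nodes become siblings of <b>, so they show up when the parent is printed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>stop</b></p>", "html.parser")

new_tag = soup.new_tag("a")
new_tag.string = "Don't"

soup.b.insert_before(new_tag)                # lands outside <b>, before it
soup.b.insert_after(soup.new_string("!"))    # lands outside <b>, after it

print(soup.b)   # unchanged
print(soup.p)   # shows the new siblings
```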

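Supplement

1. The BeautifulSoup library has further methods and attributes beyond those covered above.

2. Any Tag object supports the tag attributes and methods listed below.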

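Purpose	Method	Return value	Description
Get a soup object	BeautifulSoup("XML document", "parser")	soup object	Parses an HTML or XML document and returns its soup object
Get a Tag object	soup.tagname	Tag object	Returns the first matching Tag object
	soup.find("tagname")	Tag object	Returns the first matching Tag object
	soup.find_all("tagname")	list	Returns a list of all matching Tag objects in the document
	soup.select("CSS selector")	list	Returns a list of all Tag objects in the document that match the CSS selector
Get a tag's name	tag.name (assignable)	string	Returns the tag's name. Use find_all() first to collect all matching tags
Get a tag's attrs	tag.attrs (assignable)	dict	Returns a dictionary of the tag's attributes. Use find_all() first to collect all matching tags
Get a NavigableString	tag.string (assignable)	NavigableString object	Returns the tag's NavigableString, or None if the tag has several children. Use find_all() first to collect all matching tags
Get NavigableStrings	tag.strings	generator	Yields every NavigableString under a node, including those of its descendants
Get a tag's children	tag.contents	list	Returns all direct children of the node as a list, including NavigableString objects
Get a tag's children	tag.children	iterator	Iterates over all direct children of the node, including NavigableString objects
Get a tag's descendants	tag.descendants	generator	Iterates over all descendants of the node, including NavigableString objects
Other	.parent		The parent of an element
	.parents		Iterates over all ancestors of an element
	.next_sibling and .previous_sibling		The next and previous sibling of a node
	.next_siblings and .previous_siblings		Iterate over the siblings after or before the current node
	.next_elements and .previous_elements		Iterate over the parsed content after or before the current node
	find_parents() and find_parent()		Search the ancestors of the current node
	find_next_siblings()		Returns all matching siblings after the current node
	find_previous_siblings()		Returns all matching siblings before the current node
	find_all_next()		Returns all matching nodes after the current node
	find_all_previous()		Returns all matching nodes before the current node
	new_string()		Creates a text node that can then be added to the document
	insert()		Inserts an element at the given position
	insert_before()		Inserts content before the current tag or text node
	insert_after()		Inserts content after the current tag or text node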
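
Several of the navigation entries in the table can be tried with a short sketch. The markup, ids, and variable names are made up for illustration; the markup contains no whitespace between tags so that siblings are tags and not newline strings, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><a id="one">1</a><a id="two">2</a><a id="three">3</a></div>'
soup = BeautifulSoup(html, "html.parser")
two = soup.find(id="two")

parent_id = two.parent["id"]                  # .parent
previous_id = two.previous_sibling["id"]      # .previous_sibling
next_id = two.next_sibling["id"]              # .next_sibling
later = [a["id"] for a in two.find_next_siblings("a")]
earlier = [a["id"] for a in two.find_previous_siblings("a")]
ancestor_id = two.find_parent("div")["id"]    # find_parent()

print(parent_id, previous_id, next_id, later, earlier, ancestor_id)
```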