The BeautifulSoup Library

Basic Usage

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>python</p>","html.parser")

The professor calls it "beautiful soup": hand it some text, add a parser, and the soup is ready to simmer.

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

The prettify() method pretty-prints the markup, putting each tag and string on its own indented line.

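Understanding the BeautifulSoup Library

The BeautifulSoup Class

The BeautifulSoup library is also known as beautifulsoup4 (its package name on PyPI) and bs4 (its import name).

Importing the BeautifulSoup class is usually all you need:

from bs4 import BeautifulSoup

A BeautifulSoup object effectively stands for the whole HTML document. It provides methods for walking the tag tree and extracting the information we need.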

Parsers

BeautifulSoup supports four parsers.

Parser	Usage	Requirement
Python's built-in HTML parser	BeautifulSoup(mk,'html.parser')	install bs4
lxml HTML parser	BeautifulSoup(mk,'lxml')	pip install lxml
lxml XML parser	BeautifulSoup(mk,'xml')	pip install lxml
html5lib parser	BeautifulSoup(mk,'html5lib')	pip install html5lib
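
Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea; make_soup is a hypothetical helper name, not part of bs4, and the choice of lxml as the preferred parser is an assumption.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, else fall back to html.parser.

    Hypothetical helper for illustration only.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>python</p>")
print(soup.p.string)
```

Note that different parsers may build slightly different trees from the same broken markup, so a fallback like this is best suited to well-formed pages.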

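Basic Elements

Element	Description
Tag	A tag, the most basic unit for organizing information; opened with <> and closed with </>
Name	The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
Comment	The comment part of a string inside a tag; a special subclass of NavigableString

Let's experiment with the demo page from above.
1. Use soup.<tag> to get a tag. When the document contains several tags with that name, only the first one is returned.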

>>> soup.title
<title>This is a python demo page</title>

2. Use Tag.name to get the tag's name and Tag.parent to get its enclosing tag.

>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'

3. Use Tag.attrs to get the tag's attributes.

>>> soup.a.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

This returns a dictionary; if the tag has no attributes, the dictionary is empty.
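
Individual attributes can also be read straight from the tag. The sketch below reuses the first link of the demo page as an inline string (so it runs without a network request) and shows the difference between dictionary-style lookup and .get().

```python
from bs4 import BeautifulSoup

html = ('<a href="http://www.icourse163.org/course/BIT-268001" '
        'class="py1" id="link1">Basic Python</a>')
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]          # dictionary-style lookup; raises KeyError if absent
title = tag.get("title")    # .get() returns None for a missing attribute
classes = tag["class"]      # class is multi-valued, so bs4 returns a list
has_id = tag.has_attr("id")

print(href, title, classes, has_id)
```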

4. Use Tag.string to get the text inside a tag.

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
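
One caveat worth illustrating: .string only works when a tag has a single child (or a single chain of children, as with <p><b>…</b></p> above). The snippet below is a small sketch on a shortened version of the demo paragraph; when a tag holds several children, .string is None and get_text() is the usual alternative.

```python
from bs4 import BeautifulSoup

html = ('<p class="course">Learn <a href="#">Basic Python</a> and '
        '<a href="#">Advanced Python</a>.</p>')
p = BeautifulSoup(html, "html.parser").p

single = p.a.string   # the <a> tag has exactly one child, so this is its text
mixed = p.string      # <p> has several children, so .string is None
text = p.get_text()   # joins every string underneath the tag

print(repr(single), repr(mixed), repr(text))
```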

5. NavigableString and Comment

>>> newsoup = BeautifulSoup("<p><!--a--></p><b>a</b>","html.parser")
>>> newsoup.p.string
'a'
>>> type(newsoup.p.string)
<class 'bs4.element.Comment'>
>>> newsoup.b.string
'a'
>>> type(newsoup.b.string)
<class 'bs4.element.NavigableString'>

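It is enough to be aware of this: .string drops the <!-- --> markers, so the type is the only thing that tells a comment apart from ordinary text.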
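
If comments must not be mistaken for page text, the type check can be wrapped in a small function. This is a sketch under that assumption; visible_text is a hypothetical helper name, not a bs4 API.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<p><!--a--></p><b>a</b>", "html.parser")


def visible_text(tag):
    """Return tag.string as a plain str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


p_text = visible_text(soup.p)   # the only content of <p> is a comment
b_text = visible_text(soup.b)   # ordinary text

# Comment is itself a NavigableString subclass, so test for Comment first.
comment_is_string = isinstance(soup.p.string, NavigableString)
print(p_text, b_text, comment_is_string)
```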

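Traversing HTML Content with bs4

Downward Traversal of the Tag Tree

Attribute	Description
.contents	A list of child nodes; every direct child of the tag is stored in the list
.children	An iterator over child nodes; like .contents, used to loop over direct children
.descendants	An iterator over all descendant nodes (children, grandchildren, and so on), used in loops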

>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>>

for child in soup.body.children:
	print(child)
    
for child in soup.body.descendants:
	print(child)
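
To make the difference between the three attributes concrete, here is a small sketch on a made-up fragment (the <div> markup is an assumption chosen to keep the counts easy to follow).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

direct = div.contents                                 # list of direct children
child_names = [child.name for child in div.children]  # same nodes, as an iterator
all_nodes = list(div.descendants)                     # every node below <div>
tag_names = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(direct), child_names, len(all_nodes), tag_names)
```

Here .descendants yields both tags and strings in document order: the first <p>, 'one ', <b>, 'two', the second <p>, and 'three'.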

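Upward Traversal of the Tag Tree

Attribute	Description
.parent	The node's parent tag
.parents	An iterator over the node's ancestor tags, used to loop over ancestors

.parents walks through the parent, the parent's parent, and so on.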

>>> for parent in soup.a.parents:
	if parent is None:
		print(parent)
	else:
		print(parent.name)

		
p
body
html
[document]
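
The same walk can be collected into a list. The sketch below uses a made-up one-link document; it also shows that the BeautifulSoup object itself, whose name is '[document]', has no parent.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a href='#'>x</a></p></body></html>", "html.parser"
)

# Names of every ancestor of <a>, innermost first.
ancestor_names = [parent.name for parent in soup.a.parents]

# The root of the tree has no parent.
top_parent = soup.parent

print(ancestor_names, top_parent)
```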

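Sibling Traversal of the Tag Tree

Attribute	Description
.next_sibling	Returns the next sibling node in HTML document order
.previous_sibling	Returns the previous sibling node in HTML document order
.next_siblings	Iterator; returns all following sibling nodes in HTML document order
.previous_siblings	Iterator; returns all preceding sibling nodes in HTML document order

.next_sibling and .previous_sibling work the same way, in opposite directions.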

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

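.next_siblings and .previous_siblings are iterators, so they are used in loops:

for sibling in soup.a.next_siblings:
	print(sibling)

for sibling in soup.a.previous_siblings:
	print(sibling)

Note that only nodes under the same parent tag are siblings; for example, the nodes under <head> and the nodes under <body> are not siblings of each other.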
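
As a quick check of the sibling iterators, the following sketch runs them on a shortened copy of the demo paragraph (attributes trimmed for brevity, which is an assumption of this example).

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

# Everything after the first link, in document order.
after_first = [str(node) for node in first.next_siblings]

# Everything before the second link, nearest sibling first.
before_second = [str(node) for node in second.previous_siblings]

print(after_first)
print(before_second)
```

Note that .previous_siblings moves backwards through the document, so the nearest node comes first.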

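Nodes

The iterables produced by these attributes contain more than angle-bracket tags. They also yield plain strings, such as '\n' and text that sits directly inside a tag without a tag of its own; all of these count as nodes.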
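
When only tags are of interest, a type check filters the string nodes out. A minimal sketch on a made-up <body> fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<body>\n<p>a</p>\n<p>b</p>\n</body>", "html.parser")
children = list(soup.body.children)

# Keep only real tags.
tags_only = [n for n in children if isinstance(n, Tag)]

# The remaining nodes here are whitespace-only strings between the tags.
blank_strings = [
    n for n in children if isinstance(n, NavigableString) and not n.strip()
]

print(len(children), len(tags_only), len(blank_strings))
```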

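Formatting and Encoding

Formatting

As shown earlier, prettify() pretty-prints markup. It works not only on a BeautifulSoup object but also on an individual tag. (In fact, the BeautifulSoup object is simply the root node of the tag tree.)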

>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> print(soup.p.prettify())
<p class="title">
 <b>
  The demo python introduces several python courses.
 </b>
</p>

Encoding

BeautifulSoup converts every incoming HTML document or string to Unicode, and writes documents out as UTF-8 by default.
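
The round trip can be sketched as follows. The GBK-encoded input is an assumption standing in for bytes fetched from a non-UTF-8 page, and from_encoding is passed explicitly so the example does not depend on encoding detection.

```python
from bs4 import BeautifulSoup

# Bytes as they might arrive from a GBK-encoded page (made-up sample).
raw = "<p>中文</p>".encode("gbk")
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string             # decoded to a Unicode string
utf8_bytes = soup.encode()       # serialized as UTF-8 by default
gbk_bytes = soup.encode("gbk")   # or as another encoding on request

print(text, utf8_bytes, gbk_bytes)
```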

Information Extraction

Information Markup

XML, JSON, YAML

1. XML

<name>...</name> // when the tag has content
<name/> // when the tag has no content
<!--  --> // comment

2. JSON

{
"key" : "value",
"key1" : ["value1","value2"],
"key2" : {"subkey" : "subvalue"}
}

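3. YAML

# untyped key-value pairs
key : value
key1 : # comment
- value1  # parallel items
- value2
key2 :
    subkey : subvalue  # nesting

XML: the earliest general-purpose markup language; highly extensible, but verbose.

JSON: values are typed, which suits processing by programs (e.g. JavaScript); more concise than XML.

YAML: values are untyped, the share of actual text content is the highest of the three, and it is easy to read.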
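
To see the same record in two of these formats, the sketch below parses a JSON and an XML version with the Python standard library only. The sample record (a course name, two tags, a school code) is made up for illustration; YAML is left out because it needs a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up record, marked up two ways.
json_text = '{"name": "Basic Python", "tags": ["mooc", "python"], "school": {"short": "BIT"}}'
xml_text = ('<course><name>Basic Python</name><tag>mooc</tag><tag>python</tag>'
            '<school short="BIT"/></course>')

record = json.loads(json_text)   # JSON maps straight onto dicts and lists
root = ET.fromstring(xml_text)   # XML becomes a tree of elements

json_tags = record["tags"]
xml_tags = [t.text for t in root.findall("tag")]
json_school = record["school"]["short"]
xml_school = root.find("school").get("short")

print(json_tags, xml_tags, json_school, xml_school)
```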

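Information Extraction Methods

1. Parse the markup in full, then extract the key information.
Requires a markup parser.
Pros: the information is parsed accurately.
Cons: extraction is tedious and slow.

2. Ignore the markup and search for the key information directly.
Requires a search function.
Pros: the process is simple and fast.
Cons: the accuracy of the result depends on the content.

Combine the two and you're done.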
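
The two approaches, and their combination, can be sketched on the demo page's markup, embedded here as a string so the example runs offline. The regular expressions are assumptions chosen for this page; a plain text search like this would also match an href that appears in a comment or script.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<p class="course">Python courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)

# Method 1: parse the markup, then read the attribute from each <a> tag.
soup = BeautifulSoup(html, "html.parser")
parsed_links = [a.get("href") for a in soup.find_all("a")]

# Method 2: ignore the markup and search the raw text.
searched_links = re.findall(r'href="([^"]+)"', html)

# Combined: parse first, then search within the parsed attributes.
combined_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"icourse163"))]

print(parsed_links == searched_links == combined_links)
```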

Example: extracting every link on the page

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> soup = BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
	print(link.get('href'))

	
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
>>>
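
Real pages often contain relative links and anchors without an href. The function below is one possible way to handle both; extract_links is a hypothetical helper, and the sample markup and base URL are made up for the example.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a"):
        href = a.get("href")
        if href:  # skip anchors such as <a name="top">
            links.append(urljoin(base_url, href))
    return links


sample = (
    '<a href="/ws/demo.html">demo</a>'
    '<a name="top">top</a>'
    '<a href="http://www.icourse163.org/course/BIT-268001">course</a>'
)
links = extract_links(sample, "http://python123.io/")
print(links)
```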

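Search Functions

<>.find_all(name,attrs,recursive, string, **kwargs) # returns a list of the search results

name: a search string for tag names
attrs: a search string for tag attribute values; a specific attribute can be named
recursive: whether to search all descendants; defaults to True
string: a search string for the text between <> and </>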
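
Each parameter can be tried on the demo page. The sketch below embeds a slightly shortened copy of the demo markup so it runs without a network request; the particular search values are just illustrations.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                                # name: one tag name
by_names = [t.name for t in soup.find_all(["a", "b"])]      # name: several tag names
by_class = soup.find_all("p", "course")                     # attrs: a bare string matches class
by_id = soup.find_all(id="link1")                           # keyword: a named attribute
by_id_regex = soup.find_all(id=re.compile("link"))          # a regex matches part of the value
by_string = soup.find_all(string=re.compile("python"))      # string: search text nodes
top_level = soup.find_all("a", recursive=False)             # only direct children of the root

print(len(by_name), by_names, len(by_class), len(by_id),
      len(by_id_regex), len(by_string), top_level)
```

The string search is case-sensitive here, so it matches the title and the bold sentence but not the text nodes that only contain "Python" with a capital P.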

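Extension Methods

Method	Description
<>.find()	Searches and returns only one result; same parameters as .find_all()
<>.find_parents()	Searches among ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent()	Returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings()	Searches among following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling()	Returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings()	Searches among preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling()	Returns one result from the preceding sibling nodes; same parameters as .find()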
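
A short sketch of these methods on a trimmed copy of the demo page (the markup is embedded and shortened so the example is self-contained):

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")           # first match only
missing = soup.find("table")     # no match: find() returns None, not an empty list
second = soup.find(id="link2")

parent_class = first.find_parent("p")["class"]
ancestors = [t.name for t in first.find_parents(["p", "body"])]
next_id = first.find_next_sibling("a")["id"]
prev_id = second.find_previous_sibling("a")["id"]
later_links = first.find_next_siblings("a")
earlier_links = second.find_previous_siblings("a")

print(first["id"], missing, parent_class, ancestors, next_id, prev_id)
```

Because find() returns None when nothing matches, it is worth checking the result before indexing into it.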