Web Scraping with Python: Basic Usage of the BeautifulSoup Library (Part 1)

baijin 2024-09-27 06:44:29

The BeautifulSoup library

Installation

pip install beautifulsoup4
or, if the install needs elevated privileges: sudo pip install beautifulsoup4

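Common installation problems

SyntaxError: Missing parentheses in call to 'print'
This error appears when BeautifulSoup 3 is installed on Python 3. BeautifulSoup 3 supports only Python 2, so on Python 3 you must install beautifulsoup4 explicitly.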
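
A quick way to confirm which major version ended up installed is sketched below. It assumes the package exposes its version string as bs4.__version__, which current releases do; the variable name major_version is only illustrative.

```python
import bs4

# bs4.__version__ is a plain version string; take the part before the first dot.
major_version = int(bs4.__version__.split(".")[0])

print(bs4.__version__)
print("BeautifulSoup 4 is installed:", major_version == 4)
```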

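Basic elements of the BeautifulSoup library

HTML is a tree of nested tags; a tag usually looks like <p>..</p>

BeautifulSoup is a library for parsing, traversing, and maintaining this "tag tree"

The BeautifulSoup library is also known as beautifulsoup4 or bs4

# Two ways to import. Mind the capitalization: Python is case-sensitive
from bs4 import BeautifulSoup
import bs4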

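Basic elements

Tag: a tag, the basic unit of information; its start and end are marked with <> and </>
Name: the tag's name; the name of <p>..</p> is p. Access: <tag>.name
Attributes: the tag's attributes, organized as a dictionary. Access: <tag>.attrs
NavigableString: the non-attribute string inside a tag, that is, the text between <> and </>. Access: <tag>.string
Comment: a comment inside a tag's string; a special kind of NavigableString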
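
The five elements listed above can be seen together in one small sketch. The HTML string here is made up for illustration and is not taken from the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document, made up for illustration.
html = '<p class="intro" id="first">Hello<!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                 # a Tag
text, note = tag.contents    # a NavigableString, then a Comment

print(type(tag).__name__, tag.name)      # the tag and its Name
print(tag.attrs)                         # Attributes as a dict
print(type(text).__name__, repr(text))   # the plain string
print(type(note).__name__, repr(note))   # the comment, without its <!-- --> markers
```

Because Comment is a subclass of NavigableString, an isinstance check against NavigableString is true for both kinds of string.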

A simple example

Target URL: http://python123.io/ws/demo.html
Goal: learn the basics of using BeautifulSoup

import requests
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text,'html.parser')
tag = soup.a

Page source (screenshot not preserved)

View the title tag in the HTML

print(soup.title)
<title>This is a python demo page</title>

View the a tag in the HTML. If there are several a tags, only the first one is returned

print(tag)
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

View the name of the a tag's parent node

print(tag.parent.name)
p

Get the tag's attributes

print(tag.attrs)
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

Get the value of one attribute

print(tag.attrs['href'])
http://www.icourse163.org/course/BIT-268001
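
Besides going through .attrs, a Tag can be indexed like a dictionary directly. The sketch below uses an inline copy of the first link rather than fetching the page, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the first link on the demo page.
html = '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]               # shorthand for tag.attrs["href"]
missing = tag.get("title")       # None instead of a KeyError when the attribute is absent
classes = tag.get("class", [])   # class is multi-valued, so it comes back as a list

print(href)
print(missing)
print(classes)
```

Using .get() is the safer choice when an attribute may not be present on every tag.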

View the type of the tag's attributes

print(type(tag.attrs))
<class 'dict'>

View the type of the tag itself

print(type(tag))
<class 'bs4.element.Tag'>

View the string between the tags: <a>string</a>

print(tag.string)
Basic Python

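View the content of the p tag. The p tag contains a b tag, yet its text can still be retrieved, and the string has type NavigableString. This shows that .string can reach through nested tag levels (it works here because each level has exactly one child)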

print(soup.p.string)
print(type(soup.p.string))
The demo python introduces several python courses.
<class 'bs4.element.NavigableString'>
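
The limit of .string is worth seeing side by side: it only descends while a tag has a single child, and returns None otherwise. The HTML below is a made-up example, and get_text() is shown as one way to collect all nested text.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one tag with a single child, one with mixed children.
html = "<p><b>Only child</b></p><div>Intro <b>bold</b> outro</div>"
soup = BeautifulSoup(html, "html.parser")

single = soup.p.string           # descends through the lone <b> child
mixed = soup.div.string          # several children, so there is no single string
all_text = soup.div.get_text()   # concatenates every string under <div>

print(single)
print(mixed)
print(all_text)
```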

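The Comment type: the string of the b tag is a Comment, while the string of the p tag is a NavigableString. Both print the same text, so the type of .string is what tells you whether it is a comment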

newsoup = BeautifulSoup("<b><!--This is a Comment--></b><p>This is a Comment</p>",'html.parser')
print(newsoup.b.string)
print(type(newsoup.b.string))
print(newsoup.p.string)
print(type(newsoup.p.string))
This is a Comment
<class 'bs4.element.Comment'>
This is a Comment
<class 'bs4.element.NavigableString'>
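
Since both strings print identically, an isinstance check is one straightforward way to tell them apart. The helper name is_comment is hypothetical and only used for this sketch.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<b><!--This is a Comment--></b><p>This is a Comment</p>"
soup = BeautifulSoup(html, "html.parser")

def is_comment(node):
    """Return True when node is an HTML comment rather than visible text."""
    return isinstance(node, Comment)

for tag in (soup.b, soup.p):
    kind = "comment" if is_comment(tag.string) else "text"
    print(tag.name, kind, tag.string)
```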

Traversing HTML content with the bs4 library

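HTML content is structured as a tree, and there are three ways to traverse it: downward traversal from a parent to its children (.contents, .children, .descendants), upward traversal from a child to its ancestors (.parent, .parents), and sibling traversal between nodes at the same level (.next_sibling, .previous_sibling and their plural forms)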

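.contents: a list of child nodes; every direct child of the tag is stored in the list
.children: an iterator over the child nodes; like .contents, used to loop over direct children
.descendants: an iterator over all descendant nodes (children, grandchildren, and so on), used in loops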
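
The difference between the three downward attributes is easiest to see on a small tree. The snippet below is a made-up example, not the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical tree: a <div> with two <p> children, one of which nests a <b>.
html = "<div><p>one <b>two</b></p><p>three</p></div>"
div = BeautifulSoup(html, "html.parser").div

direct = div.contents                # list of direct children only
same_via_iter = list(div.children)   # the same nodes, produced by an iterator
everything = list(div.descendants)   # depth-first walk, strings included

print(len(direct))
print(len(everything))
print([n.name if isinstance(n, Tag) else str(n) for n in everything])
```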

Downward traversal: .contents

View the child nodes of head

print(soup.head.contents)
[<title>This is a python demo page</title>]

View the child nodes of body. Note that a tag's children include not only tag nodes but also string nodes, such as '\n'

print(soup.body.contents)
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']

You can use for .. in to iterate over the child nodes

for child in soup.body.children:
    print(child)
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
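
When only the tag children matter, the newline strings can be filtered out with an isinstance check. This is a minimal sketch on an inline body that imitates the demo page's layout; the shortened paragraph text is an assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline stand-in with the same newline-separated layout as the demo page body.
html = '<body>\n<p class="title">Title</p>\n<p class="course">Course</p>\n</body>'
body = BeautifulSoup(html, "html.parser").body

# Keep only Tag children; whitespace-only string children are skipped.
tag_children = [child for child in body.children if isinstance(child, Tag)]

print(len(body.contents))
print([child["class"] for child in tag_children])
```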

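Upward traversal: .parent and .parents

.parent: the node's parent tag
.parents: an iterator over the node's ancestor tags, used to loop over ancestors

View the parent node of title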

print(soup.title.parent)
<head><title>This is a python demo page</title></head>

View the names of all ancestors of the a tag

for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
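
To see what the loop above yields, the ancestor names can be collected into a list. The sketch uses a small inline document instead of the demo page.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document with one link nested three levels deep.
html = '<html><body><p>See <a href="#">link</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parents walks from the nearest ancestor up to the BeautifulSoup object,
# whose name is the special string "[document]".
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```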

Sibling traversal: .next_sibling and .previous_sibling

Restriction: sibling traversal only happens between nodes under the same parent

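.next_sibling: returns the next sibling node in HTML document order
.previous_sibling: returns the previous sibling node in HTML document order
.next_siblings: an iterator over all following sibling nodes in HTML document order
.previous_siblings: an iterator over all preceding sibling nodes in HTML document order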

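View the node after the a tag. Note that although the tree is organized around tags, the NavigableString objects between tags are also nodes of the tree. In other words, a tag's siblings and children may be NavigableString objects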

print(soup.a.next_sibling)
and

View the node two steps after the a tag

print(soup.a.next_sibling.next_sibling)
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

View the node before the a tag

print(soup.a.previous_sibling)
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

Iterate over the following and preceding siblings

for sibling in soup.a.next_siblings:
    print(sibling)
for sibling in soup.a.previous_siblings:
    print(sibling)
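
Because strings count as siblings, reaching the next tag often takes an extra step. The sketch below contrasts .next_sibling with find_next_sibling(), which skips over strings; the inline paragraph is a shortened stand-in for the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the course paragraph on the demo page.
html = '<p>Courses: <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
first = BeautifulSoup(html, "html.parser").a

raw_next = first.next_sibling             # the string between the two links
next_tag = first.find_next_sibling("a")   # skips strings and returns a Tag
later_tags = [s for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next))
print(next_tag["id"])
print(len(later_tags))
```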

The prettify() method (works on Python 3)

The prettify() method renders HTML content as an indented tag tree. It can also be called on a single tag to render just that tag

print(soup.prettify())
# Tree view of a single tag
print(soup.a.prettify())
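
A small self-contained sketch of the same call, on a made-up snippet, shows that prettify() returns a string with one node per line rather than printing anything itself.

```python
from bs4 import BeautifulSoup

# Hypothetical compact markup with no whitespace between tags.
html = "<div><p>hi</p></div>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()   # returns a str; printing is up to the caller
print(pretty)
```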

Summary

For more, see the official BeautifulSoup documentation
