Exploring Web Scraping: Fetching Data with Requests and BeautifulSoup

baijin 2024-12-18 14:40:48 Blog post

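In Python, Requests and BeautifulSoup are the basic tools for building web scrapers: Requests sends the HTTP requests, and BeautifulSoup parses the page content that comes back. Their features and usage examples follow.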

1. Requests: simplifying HTTP requests

Features

Provides a friendly API for sending GET, POST, and other HTTP requests.
Handles cookies, sessions, and authentication for you.

Basic usage

import requests

# Send a GET request
response = requests.get("https://example.com")
print(response.status_code)  # prints the HTTP status code
print(response.text)  # prints the page content

Requests with parameters and headers

python

import requests

url = "https://httpbin.org/get"
params = {"key": "value"}
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, params=params, headers=headers)
print(response.json())  # parse the JSON response body
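When several requests share the same headers, cookies, or credentials, a `requests.Session` can hold them in one place, which is the session support mentioned in the feature list. The sketch below prepares a request without sending it, so the merged URL and headers can be inspected offline. The credentials are placeholders chosen for illustration, not values from the article.

```python
import requests

# A Session keeps headers, cookies, and credentials across requests.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})
session.auth = ("user", "pass")  # placeholder credentials (assumption)

# Build the request without sending it, to inspect what would go out.
request = requests.Request("GET", "https://httpbin.org/get", params={"key": "value"})
prepared = session.prepare_request(request)

print(prepared.url)                       # https://httpbin.org/get?key=value
print(prepared.headers["User-Agent"])     # Mozilla/5.0
print(prepared.headers["Authorization"])  # Basic dXNlcjpwYXNz
```

Calling `session.get(...)` instead of `requests.get(...)` sends requests with these same settings and also reuses the underlying connection.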

Exception handling

python

import requests

try:
    response = requests.get("https://example.com", timeout=5)
    response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
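It can help to tell the common failure modes apart instead of catching only the base class. The following is a minimal sketch; `fetch_text` is a hypothetical helper name, not something Requests provides. The exception classes are real and all derive from `requests.exceptions.RequestException`, so the most specific handlers come first.

```python
import requests


def fetch_text(url, timeout=5):
    """Return the response body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout}s: {url}")
    except requests.exceptions.HTTPError as e:
        print(f"Bad status: {e}")
    except requests.exceptions.RequestException as e:
        # Covers connection errors, invalid URLs, and anything else Requests raises.
        print(f"Request failed: {e}")
    else:
        return response.text
    return None


# A URL without a scheme is rejected before any network traffic happens.
result = fetch_text("not-a-url")
print(result)  # None
```

Returning `None` keeps the calling code simple, at the cost of hiding which error occurred; re-raising or logging may suit a larger crawler better.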

2. BeautifulSoup: parsing page content

Features

Parses HTML and XML documents.
Provides a concise API for picking out specific elements on a page.

Installation

bash

pip install beautifulsoup4

Basic usage

python

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Example</title></head>
<body>
<p class="content">Hello, World!</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)  # prints 'Example'
print(soup.find("p", class_="content").text)  # prints 'Hello, World!'
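Beyond `find`, BeautifulSoup also accepts CSS selectors through `select` and `select_one`. The sketch below uses a small made-up HTML fragment (an assumption for illustration) to show selecting several elements, checking for a missing element, and reading attributes safely.

```python
from bs4 import BeautifulSoup

html = """
<ul id="posts">
  <li class="post"><a href="/a">First</a></li>
  <li class="post"><a href="/b">Second</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns every matching tag.
titles = [a.get_text(strip=True) for a in soup.select("li.post > a")]
print(titles)  # ['First', 'Second']

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("p", class_="content")
print(missing is None)  # True

# Tag.get() reads an attribute and returns None if it is absent.
first_link = soup.select_one("li.post > a")
print(first_link.get("href"), first_link.get("title"))  # /a None
```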

Extracting data from a web page

python

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract every link that has an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
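The `href` values collected this way are often relative paths. One way to turn them into absolute URLs is `urllib.parse.urljoin` from the standard library. The sketch below uses an inline HTML fragment and a made-up base URL, both assumptions for illustration, so it runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/blog/"  # the page the HTML was fetched from (assumed)
html = """
<a href="/about">About</a>
<a href="post-1.html">Post 1</a>
<a href="https://other.example/x">External</a>
<a>No href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True skips anchors without an href; urljoin resolves each against the base.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```

Absolute URLs are left unchanged by `urljoin`, so the same line handles internal and external links.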

3. Combining Requests and BeautifulSoup

The following complete example fetches a page's title and all of its links:

python

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers, timeout=5)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    # soup.title is None when the page has no <title> tag
    title = soup.title.string if soup.title else None
    links = [a["href"] for a in soup.find_all("a", href=True)]  # collect all links
    print(f"Page title: {title}")
    print(f"Page links: {links}")
else:
    print(f"Request failed, status code: {response.status_code}")

4. Advanced features

Pagination

Use the URL's query parameters to crawl multiple pages:

python

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/?page="

for i in range(1, 6):  # crawl the first 5 pages
    response = requests.get(base_url + str(i), timeout=5)
    if response.ok:
        soup = BeautifulSoup(response.text, "html.parser")
        print(f"Page {i} fetched")
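Instead of concatenating the page number onto the URL by hand, the `params` argument lets Requests build and encode the query string. The sketch below assumes the site paginates with a `page` query parameter; the `/list` path and the helper names `page_url` and `crawl_pages` are hypothetical. Only the URL-building part runs here, since `crawl_pages` needs network access.

```python
import time

import requests
from bs4 import BeautifulSoup


def page_url(base_url, page):
    """Build the URL for one page; Requests encodes the query string."""
    return requests.Request("GET", base_url, params={"page": page}).prepare().url


def crawl_pages(base_url, pages, delay=1.0):
    """Fetch each page in turn and return {page number: parsed soup}."""
    results = {}
    for page in pages:
        response = requests.get(base_url, params={"page": page}, timeout=5)
        if response.ok:
            results[page] = BeautifulSoup(response.text, "html.parser")
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return results


print(page_url("https://example.com/list", 2))  # https://example.com/list?page=2
```

A call such as `crawl_pages("https://example.com/list", range(1, 6))` would cover the first five pages, with a one-second pause after each request.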

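Handling dynamic pages

Requests only retrieves the raw HTML and does not execute scripts, so for pages that rely on JavaScript rendering, use Selenium or Playwright to load the page first.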

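5. Common problems and solutions

Request rejected: try adding a User-Agent header so the request looks like it comes from a real browser.
Anti-scraping measures: limit the request rate and use proxy IPs.
Parsing failures: check whether the page structure has changed and update your selectors.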
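The first two remedies can be sketched together: a session that carries a User-Agent header and a proxy setting, plus a small throttle that enforces a minimum gap between requests. `Throttle` is a hypothetical helper written for this example, and the proxy address is a placeholder, not a working proxy. No request is actually sent here.

```python
import time

import requests


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})
# Hypothetical proxy address; replace it with a proxy you are allowed to use.
session.proxies.update({"https": "http://127.0.0.1:8080"})

throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # in a real crawler, session.get(url, timeout=5) would follow
elapsed = time.monotonic() - start
print(f"3 throttled calls took {elapsed:.2f}s")
```

The first call passes straight through and each later call waits out the remainder of the interval, so the loop spaces requests evenly regardless of how long each response takes.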

Together, Requests and BeautifulSoup make it quick to build a capable web scraper for fetching and processing data.
