Python Web Scraping Core Functions and Practical Techniques: From Data Requests to Smart Parsing

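In a data-driven era, web crawlers have become an important tool for collecting information at scale. With its concise syntax and rich ecosystem of third-party libraries, Python holds a leading position in this field. Knowing the functions that crawlers use most often improves development efficiency and makes your programs more robust. Written from the perspective of an experienced programmer, this article walks through the core functions of Python web scraping and pairs them with practical techniques.

We cover data requests, page parsing, data extraction, storage, and error handling, introduce the key functions at each stage, and reinforce them with code examples. Whether you are new to crawlers or looking to sharpen existing skills, you should find something useful here.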

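Chapter 1: Data Requests and Network Interaction with the requests Module

The first step of any crawler is to send a request to the target site and receive its response. Python's requests library is the de facto standard for HTTP requests, and it hides the low-level details behind a very compact interface.

1.1 requests.get(url, **kwargs)

This is the most frequently used function. It issues a GET request to fetch page content. GET requests are normally used to retrieve data and do not modify anything on the server.
url: the target URL.
params: dict or bytes, sent as the URL query string.
headers: dict of request headers, used to mimic a browser (for example User-Agent and Referer) so the site is less likely to flag the request as coming from a crawler.
cookies: dict or CookieJar object, the cookies sent to the server.
timeout: float or tuple, the request timeout in seconds, which keeps the program from blocking indefinitely. A tuple sets the connect and read timeouts separately.
proxies: dict, proxy servers to route the request through, used to hide the real IP or work around IP restrictions.
verify: bool, whether to verify the server's SSL certificate. Setting it to False skips verification, which avoids certificate errors but is less secure.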

import requests

def fetch_page(url, headers=None, params=None, timeout=10, proxies=None):
    try:
        response = requests.get(
            url,
            headers=headers,
            params=params,
            timeout=timeout,
            proxies=proxies
        )
        response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

# Example
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
html_content = fetch_page('', headers=headers)  # pass the target URL as the first argument
if html_content:
    print("Page content fetched, length:", len(html_content))
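To make the params argument more concrete, the following minimal sketch shows how a dictionary becomes a query string. It uses the standard library's urllib.parse.urlencode as a stand-in rather than requests itself, and the helper build_url, the base URL, and the parameter names are all hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_url(base_url, params):
    """Append a dict of parameters to base_url as a query string."""
    query = urlencode(params)
    return f"{base_url}?{query}" if query else base_url

# Hypothetical endpoint and parameters, for illustration only
url = build_url("https://example.com/search", {"q": "python crawler", "page": 2})
print(url)

# Parsing the query back recovers the values (always as lists of strings)
print(parse_qs(urlsplit(url).query))
```

Passing the same dictionary as params to requests.get should produce an equivalent URL, which can be checked afterwards through response.url.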

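1.2 requests.post(url, **kwargs)

Issues a POST request, typically used to submit form data or upload files. A POST request carries its data in the request body.
data: dict, list of tuples, bytes, or file object, sent as form data (application/x-www-form-urlencoded).
json: dict, sent as JSON (application/json).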

import requests

def submit_form(url, data):
    try:
        response = requests.post(url, data=data)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Form submission failed: {e}")
        return None

# Example
post_data = {'username': 'testuser', 'password': 'testpassword'}
# response_text = submit_form('/post', post_data)
# if response_text:
#     print("POST response:", response_text)
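The difference between data and json is easier to see by looking at the request body each one produces. The sketch below approximates both encodings with the standard library only. It is an illustration of the two formats, not a reproduction of what requests does internally.

```python
import json
from urllib.parse import urlencode

payload = {"username": "testuser", "password": "testpassword"}

# Roughly what a form-encoded body (data=payload) looks like
form_body = urlencode(payload)

# Roughly what a JSON body (json=payload) looks like
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

Which one to use depends on what the server expects: classic HTML forms usually take the form-encoded variant, while most modern APIs expect JSON.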

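1.3 Common attributes and methods of the Response object

requests.get() and requests.post() return a Response object that holds the server's reply.
.status_code: the HTTP status code (such as 200, 404, 500).
.text: the response body as text, usually HTML, XML, or plain text.
.content: the response body as bytes, suitable for binary data such as images and video.
.json(): parses a JSON response body into a Python dict or list.
.headers: the response headers, as a dict-like object.
.cookies: the cookies returned by the server, as a CookieJar object.
.url: the final URL of the response, which may differ from the requested URL (for example after a redirect).
.encoding: the encoding used to decode .text. requests infers it automatically, and it can also be set by hand.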
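Why .encoding matters is easiest to see with raw bytes. The sketch below simulates a response body with plain Python bytes (the GBK-encoded sample text is a made-up example) and decodes it with a wrong and a right encoding.

```python
# Bytes as a server might send them for a GBK-encoded page (hypothetical content)
raw = "价格: 123.45".encode("gbk")

# Decoding with the wrong encoding yields replacement characters
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the correct encoding recovers the original text
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

With requests, the analogous fix is to assign the correct value to response.encoding before reading response.text, while response.content always gives the undecoded bytes.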

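Chapter 2: Page Parsing and Data Extraction with BeautifulSoup and lxml

Once the HTML has been fetched, the next step is to pull the required data out of it. BeautifulSoup and lxml are the parsing libraries used most often in Python crawlers.

2.1 BeautifulSoup (bs4 module)

BeautifulSoup is a capable, easy-to-use library for extracting data from HTML or XML. It turns a document into a tree in which every node is a Python object, so the tree can be traversed and searched conveniently.

2.1.1 BeautifulSoup(markup, parser)

The BeautifulSoup constructor, which creates the parsed document object.
markup: the HTML or XML string to parse.
parser: the parser to use. 'lxml' is recommended (it requires installing the lxml library separately), followed by the built-in 'html.parser'.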

from bs4 import BeautifulSoup

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'lxml')  # the lxml parser is recommended
    return soup

# Sample HTML
sample_html = """
<html>
<head><title>Test Page</title></head>
<body>
<h1 id="title" class="main-title">Welcome</h1>
<div class="container">
<p class="intro">This is an introductory paragraph.</p>
<ul>
<li><a href="/item1">Product 1</a></li>
<li class="special"><a href="/item2">Product 2</a></li>
<li><a href="/item3">Product 3</a></li>
</ul>
<span>Price: <b>¥123.45</b></span>
</div>
</body>
</html>
"""
soup = parse_html(sample_html)

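2.1.2 find(name, attrs, recursive, text, **kwargs)

Returns the first matching tag, or None if nothing matches.
name: the tag name (such as 'div' or 'a'). It can be a string, a regular expression, a list, or a function.
attrs: dict of tag attributes (such as {'class': 'intro', 'id': 'title'}).
recursive: bool, whether to search all descendants or only direct children.
text: string or regular expression to match against the text inside the tag. Newer BeautifulSoup versions call this argument string.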

# Find the first h1 tag
h1_tag = soup.find('h1')
print(f"H1 title: {h1_tag.get_text()}")
# Find the p tag whose class is "intro"
intro_p = soup.find('p', class_='intro')  # class_ is used because class is a Python keyword
print(f"Intro text: {intro_p.get_text()}")

2.1.3 find_all(name, attrs, recursive, text, limit, **kwargs)

Returns all matching tags as a list.
limit: integer, the maximum number of results to return.

# Find all li tags
all_li_tags = soup.find_all('li')
for li in all_li_tags:
    print(f"List item: {li.get_text()}")
# Find li tags whose class is "special"
special_li = soup.find_all('li', class_='special')
print(f"Special list items: {[li.get_text() for li in special_li]}")

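2.1.4 select(selector)

Finds tags with a CSS selector and returns a list. This is a very flexible way to search.
#id: match by id.
.class: match by class.
tag: match by tag name.
tag#id, tag.class: combined matches.
tag1 tag2: all tag2 elements anywhere inside tag1.
tag1 > tag2: tag2 elements that are direct children of tag1.
[attr]: tags that have the given attribute.
[attr="value"]: tags whose attribute has the given value.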

# Find the H1 tag with a CSS selector
h1_tag_css = soup.select('#title')[0]
print(f"H1 title (CSS): {h1_tag_css.get_text()}")
# Find all product links
product_links = soup.select('.container ul li a')
for link in product_links:
    print(f"Product link: {link.get_text()} - {link['href']}")
# Find the price
price_tag = soup.select_one('.container span b')  # select_one returns the first match, like find
if price_tag:
    print(f"Price: {price_tag.get_text()}")

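2.1.5 Common attributes and methods of tag objects

The tag objects returned by find() or select() offer the following attributes and methods:
.name: the tag name.
.attrs: all attributes, as a dict.
['attribute_name']: access an attribute value by key (such as tag['href']). Raises KeyError if the attribute is missing.
.get('attribute_name', default_value): a safer way to read an attribute, returning the default when it is missing.
.text or .get_text(separator='', strip=False): all text inside the tag. Pass strip=True to trim surrounding whitespace.
.string: the text if the tag has exactly one child and that child is a text node, otherwise None.
.children: an iterator over the direct children.
.contents: a list of the direct children (both strings and tags).
.parent: the parent tag.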
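The attributes listed above can be tried out on a small fragment. The sketch below assumes bs4 is installed and uses the built-in 'html.parser' so that lxml is not required. The HTML snippet is a made-up example.

```python
from bs4 import BeautifulSoup

html = '<ul id="menu"><li class="special"><a href="/item2">Product 2</a></li><li>Plain item</li></ul>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
menu = soup.find("ul")

print(link.name)                                # tag name
print(link.attrs)                               # all attributes as a dict
print(link["href"])                             # direct attribute access
print(link.get("title", "N/A"))                 # missing attribute falls back to the default
print(link.string)                              # single text child
print(menu.string)                              # None, because ul has several children
print([child.name for child in menu.children])  # names of the direct children
print(link.parent.name)                         # parent tag name
```

Note that class is a multi-valued attribute, so link.parent['class'] comes back as a list rather than a string.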

2.2 lxml and XPath

lxml is a high-performance XML/HTML parsing library with support for XPath and CSS selectors. On large or complex HTML documents, lxml is usually faster than BeautifulSoup.

2.2.1 etree.HTML(text)

Parses HTML text into an Element object.

from lxml import etree

def parse_html_lxml(html_content):
    html_element = etree.HTML(html_content)
    return html_element

lxml_tree = parse_html_lxml(sample_html)

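2.2.2 element.xpath(xpath_expression)

Finds elements with an XPath expression. XPath is a language for locating nodes in an XML document, and it is very expressive.
//tag: all elements named tag.
//tag[@attr="value"]: all tag elements with the given attribute value.
//tag/child: child elements that are direct children of tag.
//tag//grandchild: grandchild elements anywhere below tag.
/html/body/div/p: an absolute path.
text(): the text nodes of an element.
@attr: the value of an attribute.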

# Example XPath expressions
# Find the H1 title
h1_text_lxml = lxml_tree.xpath('//h1[@id="title"]/text()')
print(f"H1 title (XPath): {h1_text_lxml[0] if h1_text_lxml else 'N/A'}")
# Find the href attribute of every product link
product_hrefs_lxml = lxml_tree.xpath('//div[@class="container"]//li/a/@href')
print(f"Product links (XPath - href): {product_hrefs_lxml}")
# Find the text of every product link
product_names_lxml = lxml_tree.xpath('//div[@class="container"]//li/a/text()')
print(f"Product names (XPath - text): {product_names_lxml}")
# Find the price
price_lxml = lxml_tree.xpath('//div[@class="container"]/span/b/text()')
print(f"Price (XPath): {price_lxml[0] if price_lxml else 'N/A'}")

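Chapter 3: Advanced Matching with Regular Expressions and the re Module

BeautifulSoup and lxml handle most structured data, but regular expressions (the re module) are indispensable for non-standard formats and for specific patterns embedded in text, such as phone numbers, email addresses, and dates.

3.1 re.compile(pattern, flags=0)

Compiles a regular expression pattern into a pattern object. Compiling a pattern that is used many times can improve efficiency.

3.2 re.search(pattern, string, flags=0)

Finds the first match in the string. Returns a Match object if one is found, otherwise None.

3.3 re.findall(pattern, string, flags=0)

Finds all non-overlapping matches in the string and returns them as a list. The items are strings, or tuples of strings when the pattern contains more than one capture group.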
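The shape of the list returned by re.findall depends on how many capture groups the pattern has, which is a common source of confusion. The following sketch illustrates the three cases and contrasts them with re.search. The sample sentence is made up.

```python
import re

text = "Order A-100 shipped on 2025-01-15; order B-200 shipped on 2025-02-01."

# No capture groups: each result is the whole match
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# One capture group: each result is that group's text
years = re.findall(r"(\d{4})-\d{2}-\d{2}", text)

# Several capture groups: each result is a tuple of group texts
parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)

# re.search stops at the first match and returns a Match object
first = re.search(r"\d{4}-\d{2}-\d{2}", text)

print(dates)
print(years)
print(parts)
print(first.group(0))
```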

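3.4 Common Match object methods
.group(index): the string matched by the given capture group. group(0) returns the whole match.
.groups(): the strings matched by all capture groups, as a tuple.
.groupdict(): the strings matched by all named capture groups, as a dict.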

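import re

def extract_with_regex(text, pattern):
    compiled_pattern = re.compile(pattern)
    matches = compiled_pattern.findall(text)
    return matches

def extract_single_match(text, pattern):
    compiled_pattern = re.compile(pattern)
    match = compiled_pattern.search(text)
    if match:
        return match.group(1)  # content of the first capture group
    return None

text_with_contacts = "Contact us: phone 138-0000-1234, email test@. Another phone: 010-87654321."
# Example: extract all phone numbers (the first alternative covers the 3-4-4 hyphenated mobile format)
phone_numbers = extract_with_regex(text_with_contacts, r'\d{3}-\d{4}-\d{4}|\d{3,4}-\d{7,8}|\d{11}')
print(f"Phone numbers: {phone_numbers}")
# Example: extract the first email address (None if the text contains no complete address)
email_address = extract_single_match(text_with_contacts, r'(\w+@\w+\.\w+)')
print(f"Email address: {email_address}")
# Extract the price (the text includes the currency symbol)
price_text = "Price: <b>¥123.45</b>"
price_value = extract_single_match(price_text, r'¥(\d+\.\d{2})')
print(f"Extracted price value: {price_value}")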
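Named capture groups make .groupdict() useful and are not exercised by the helpers above, so here is a small sketch of them. The group names area and number are chosen purely for illustration.

```python
import re

# (?P<name>...) defines a named capture group
pattern = re.compile(r"(?P<area>\d{3,4})-(?P<number>\d{7,8})")
match = pattern.search("Another phone: 010-87654321.")

if match:
    print(match.group(0))         # the whole match
    print(match.group(1))         # first group, by position
    print(match.group("number"))  # a group, by name
    print(match.groups())         # all groups as a tuple
    print(match.groupdict())      # named groups as a dict
```

Named groups tend to keep extraction code readable when a pattern grows beyond two or three groups, since the calling code no longer depends on group positions.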

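Chapter 4: Data Storage and Persistence with the csv and json Modules

Scraped data eventually has to be stored for later analysis and use. CSV and JSON are two common formats that are easy to work with.

4.1 csv module

Reads and writes CSV (Comma Separated Values) files, which suit tabular data.

4.1.1 csv.writer(csvfile, dialect='excel', **fmtparams)

Creates a writer object for writing data to a CSV file.
csvfile: a file object opened for writing in text mode (mode 'w'), ideally with newline='' so that line endings are not translated twice.

4.1.2 writerow(row)

Writes one row to the CSV file. row should be an iterable such as a list or tuple.

4.1.3 writerows(rows)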

Writes several rows to the CSV file. rows should be an iterable of row iterables, such as a list of lists.

import csv

def save_to_csv(data_list, filename, headers=None):
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if headers:
            writer.writerow(headers)
        writer.writerows(data_list)
    print(f"Data saved to {filename}")

# Example
product_data = [
    ['Product 1', '/item1', '123.45'],
    ['Product 2', '/item2', '245.99'],
    ['Product 3', '/item3', '89.00']
]
headers = ['Product name', 'Link', 'Price']
save_to_csv(product_data, '', headers)  # pass the output file path as the second argument
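When rows are dictionaries rather than lists, csv.DictWriter and csv.DictReader are a natural fit. The sketch below does a write-then-read round trip in memory with io.StringIO so that no file is needed. The sample rows are made up, and one field deliberately contains a comma to show the automatic quoting.

```python
import csv
import io

rows = [
    {"name": "Product 1", "link": "/item1", "price": "123.45"},
    {"name": "Product, 2", "link": "/item2", "price": "245.99"},  # comma inside a field
]

# An in-memory text buffer stands in for a file opened with newline=''
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "link", "price"])
writer.writeheader()
writer.writerows(rows)

# Rewind and read the rows back as dictionaries
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

print(buffer.getvalue())
print(loaded == rows)
```

Every value read back from CSV is a string, so numeric fields such as price need to be converted explicitly after loading.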

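4.2 json module

Handles data in JSON (JavaScript Object Notation) format, which suits structured, nested data.

4.2.1 json.dumps(obj, **kwargs)

Encodes a Python object as a JSON string.
indent: integer, the indentation level used to pretty-print the output.
ensure_ascii: bool, whether to escape all non-ASCII characters. Set it to False when the data contains Chinese or other non-ASCII text that should stay readable.

4.2.2 json.loads(s, **kwargs)

Decodes a JSON string into a Python object.

4.2.3 json.dump(obj, fp, **kwargs)

Encodes a Python object as JSON and writes it to the file object fp.

4.2.4 json.load(fp, **kwargs)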

Reads JSON data from the file object fp and decodes it into a Python object.

import json

def save_to_json(data, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    print(f"Data saved to {filename}")

def load_from_json(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"Data loaded from {filename}")
    return data

# Example
product_data_json = [
    {'name': 'Product 1', 'link': '/item1', 'price': 123.45},
    {'name': 'Product 2', 'link': '/item2', 'price': 245.99},
    {'name': 'Product 3', 'link': '/item3', 'price': 89.00}
]
save_to_json(product_data_json, '')  # pass the output file path as the second argument
loaded_data = load_from_json('')     # use the same path here
print(loaded_data)
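The effect of ensure_ascii is easiest to judge side by side. The sketch below encodes the same record both ways. The record itself is a made-up example with a Chinese product name.

```python
import json

record = {"name": "商品1", "price": 123.45}

# Default behavior: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters readable in the output
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same Python object
print(json.loads(escaped) == json.loads(readable) == record)
```

Both outputs are valid JSON. The choice only affects how readable the stored file is, which is why writing with ensure_ascii=False should be paired with an explicit encoding='utf-8' on the file.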

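Chapter 5: Error Handling and Crawler Robustness

A good crawler does more than fetch data. It also has to handle the many things that can go wrong gracefully, so that the program stays robust.

5.1 try...except...finally blocks

This is Python's core mechanism for exception handling. It catches and handles errors raised while the code runs.
requests.exceptions.RequestException: the base class of all requests exceptions.
requests.exceptions.HTTPError: HTTP errors (4xx and 5xx status codes), raised by raise_for_status().
requests.exceptions.ConnectionError: network connection problems.
requests.exceptions.Timeout: the request timed out.
IndexError: a list or tuple index is out of range.
AttributeError: access to an attribute that does not exist.
TypeError: an operation applied to a value of the wrong type.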

import time
import requests

def robust_fetch_page(url, retries=3, delay=5):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    for i in range(retries):
        try:
            print(f"Requesting {url} (attempt {i+1})")
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.exceptions.HTTPError as e:
            print(f"HTTP error ({e.response.status_code}): {url}")
            if e.response.status_code == 404:  # skip 404 pages without retrying
                return None
        except requests.exceptions.ConnectionError:
            print(f"Connection error, retrying in {delay} seconds...")
        except requests.exceptions.Timeout:
            print(f"Request timed out, retrying in {delay} seconds...")
        except requests.exceptions.RequestException as e:
            print(f"Unknown request error: {e}")
        time.sleep(delay)  # wait before the next attempt
    print(f"All retries failed, could not fetch {url}")
    return None

# Example
# non_existent_page = robust_fetch_page('/non_existent_page_123')
# if non_existent_page is None:
#     print("The page does not exist or cannot be reached.")
# else:
#     print("Page content fetched.")
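The retry loop above waits a fixed delay between attempts. A common variation is exponential backoff, where the wait doubles after each failure. The sketch below is one possible generic helper, not part of requests. The names retry_with_backoff and flaky are hypothetical, and the sleep function is injectable so the example runs instantly.

```python
import time

def retry_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then try again."""
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except Exception as e:  # deliberately broad in this sketch
            last_error = e
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error

# Hypothetical flaky operation: fails twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

# Record the waits in a list instead of actually sleeping
waits = []
result = retry_with_backoff(flaky, retries=3, base_delay=1.0, sleep=waits.append)
print(result, waits)
```

In real use the default sleep=time.sleep would apply, and the except clause would normally be narrowed to the exceptions worth retrying, such as connection errors and timeouts.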

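5.2 time.sleep(seconds)

Pauses the program for the given number of seconds. In crawling this is a matter of courtesy: it mimics human browsing and avoids putting excessive load on the target site, which in turn lowers the risk of having your IP blocked.

import time

for i in range(5):
    print(f"Running task {i+1}...")
    time.sleep(1)  # pause one second between tasks
print("All tasks done!")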
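A fixed interval is easy to recognize as automated traffic, so crawlers often randomize the pause. The sketch below wraps time.sleep with random.uniform. The helper name polite_sleep and the bounds are illustrative choices, and the demo uses very short delays so it finishes quickly.

```python
import random
import time

def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

# Short bounds so the demo finishes quickly; real crawls would use seconds
for i in range(3):
    waited = polite_sleep(0.01, 0.05)
    print(f"Task {i + 1} done after waiting {waited:.3f}s")
```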

Chapter 6: Building Reusable Functions and Modular Practice

Wrapping the functionality above in separate functions makes the code much more readable, maintainable, and reusable, and it is the key to building an efficient crawler project.

import requests
from bs4 import BeautifulSoup
import csv
import time
import re

# 1. Data request function
def get_html_content(url, headers=None, proxies=None, timeout=10, retries=3, delay=5):
    """
    Robustly fetch the HTML content of the given URL
    """
    default_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6'
    }
    if headers:
        default_headers.update(headers)
    for i in range(retries):
        try:
            response = requests.get(url, headers=default_headers, proxies=proxies, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 404:
                print(f"URL Not Found (404): {url}")
                return None
            print(f"HTTP Error {e.response.status_code} for {url}. Retrying...")
        except requests.exceptions.RequestException as e:
            print(f"Request Error for {url}: {e}. Retrying...")
        time.sleep(delay * (i + 1))  # linear backoff: the wait grows with each attempt
    print(f"Failed to fetch {url} after {retries} retries.")
    return None
# 2. Page parsing function
def parse_product_list(html_content):
    """
    Parse a product list page and extract product information
    """
    if not html_content:
        return []
    soup = BeautifulSoup(html_content, 'lxml')
    products = []
    # Assumes each product sits in a div with class 'product-item'
    product_items = soup.select('.product-item')  # adjust the selector to the real page structure
    for item in product_items:
        try:
            name_tag = item.select_one('.product-name a')
            name = name_tag.get_text(strip=True) if name_tag else 'N/A'
            link = name_tag['href'] if name_tag and 'href' in name_tag.attrs else 'N/A'

            price_tag = item.select_one('.product-price b')
            price_text = price_tag.get_text(strip=True) if price_tag else 'N/A'
            # Use a regular expression to pull the number out of the price string
            price_match = re.search(r'\d+\.?\d*', price_text)
            price = float(price_match.group(0)) if price_match else None

            products.append({
                'name': name,
                'link': link,
                'price': price
            })
        except Exception as e:
            print(f"Failed to parse a product: {e}")
            continue
    return products
# 3. Data storage function
def save_data_to_csv(data, filename, fieldnames):
    """
    Write the data to a CSV file
    """
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)
    print(f"Data saved to {filename}")
# Example crawl workflow
def run_spider(start_url, output_csv=''):  # set output_csv to the desired file path
    print(f"Starting crawl: {start_url}")
    all_products = []

    # Simulate crawling several pages
    for page_num in range(1, 3):  # crawl 2 pages
        current_url = f"{start_url}?page={page_num}"  # assumes the pagination parameter is ?page=
        print(f"Fetching page {page_num}: {current_url}")
        html = get_html_content(current_url)

        if html:
            products_on_page = parse_product_list(html)
            all_products.extend(products_on_page)
            print(f"Page {page_num} yielded {len(products_on_page)} products.")

        time.sleep(2)  # polite pause between pages
    if all_products:
        fieldnames = ['name', 'link', 'price']
        save_data_to_csv(all_products, output_csv, fieldnames)
        print(f"Collected {len(all_products)} products in total.")
    else:
        print("No product information was collected.")
# To run the example we need mock HTML, because '' does not provide a real product list
# The mock below extends the idea of sample_html above
# It is only a simulation; in real use, substitute a real URL and more thorough parsing logic
if __name__ == '__main__':
    # Simulate get_html_content returning the HTML of different pages
    def get_mock_html(url):
        page = 1
        if "?page=2" in url:
            page = 2

        mock_html_template = """
<html><body>
<div class="product-list">
<div class="product-item">
<p class="product-name"><a href="/item{p}1">Mock product {p}-1</a></p>
<p class="product-price">Price: <b>¥{price1:.2f}</b></p>
</div>
<div class="product-item">
<p class="product-name"><a href="/item{p}2">Mock product {p}-2</a></p>
<p class="product-price">Price: <b>¥{price2:.2f}</b></p>
</div>
</div>
</body></html>
"""
        if page == 1:
            return mock_html_template.format(p=page, price1=100.50, price2=200.75)
        elif page == 2:
            return mock_html_template.format(p=page, price1=300.25, price2=400.99)
        return None

    # Swap the real get_html_content for the mock
    original_get_html_content = get_html_content
    get_html_content = get_mock_html  # run_spider looks up this module-level name when it runs
    run_spider('/products')
    # Restore the original function in case it is used elsewhere
    get_html_content = original_get_html_content
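Rebinding a module-level function works, but it is easy to forget the restore step. One alternative, sketched here under simplified assumptions, is to pass the fetch function in as a parameter. The names crawl_pages and fake_fetch are hypothetical and the parsing step is left out to keep the example short.

```python
def crawl_pages(start_url, fetch, pages=2):
    """Fetch start_url?page=N for each page using the supplied fetch function."""
    results = []
    for page_num in range(1, pages + 1):
        html = fetch(f"{start_url}?page={page_num}")
        if html:  # skip pages the fetcher could not retrieve
            results.append(html)
    return results

# Hypothetical stand-in for a real fetcher: canned HTML, and None for page 2
def fake_fetch(url):
    return None if url.endswith("page=2") else f"<html>{url}</html>"

print(crawl_pages("/products", fake_fetch, pages=3))
```

With this shape, production code would pass the real fetcher and tests would pass a fake one, with no global state to restore afterwards.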

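Summary and Best Practices

This article covered the core functions for data requests, page parsing, data extraction, and storage in Python crawlers, and showed how error handling and modular design lead to robust, maintainable programs. Beyond these technical details, professional crawler development also calls for the following best practices and ethical standards:
Respect robots.txt: before crawling a site, check the robots.txt file in its root directory to learn which content may be crawled and which may not.
Set a reasonable User-Agent: mimic a mainstream browser and avoid the default requests or Python identifier.
Control the request rate: use time.sleep() to add a random or fixed delay between requests, so that a burst of requests does not strain the server and get you blocked.
Handle exceptions and retry: catch and handle the exceptions that network requests and page parsing can raise, and implement sensible retry logic to improve fault tolerance.
Use proxy IPs: for large-scale crawls or sites with strict anti-crawling measures, a proxy IP pool can lower the risk of being blocked.
Clean and validate data: scraped data may be dirty, so clean it, deduplicate it, and validate its format to ensure quality.
Crawl incrementally and deduplicate: for long-running crawlers, fetch only new or updated data and deduplicate the results.
Stay legal and compliant: make sure your crawling complies with the law and the site's terms of use, do not collect protected personal data, and do not engage in aggressive behavior.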
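The robots.txt check from the first item above can be automated with the standard library's urllib.robotparser. The sketch below parses a made-up set of rules from a list of lines so that it runs offline. The rules and the user agent name MyCrawler are hypothetical, and a real crawler would first download the file from the site root.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
robots_lines = [
    "User-agent: *",
    "Disallow: /admin/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a given user agent may fetch a given path
allowed_products = parser.can_fetch("MyCrawler", "/products?page=1")
allowed_admin = parser.can_fetch("MyCrawler", "/admin/users")

# Crawl-delay, if the site declares one, suggests a pause between requests
delay = parser.crawl_delay("MyCrawler")

print(allowed_products, allowed_admin, delay)
```

A crawler could call can_fetch before each request and feed the declared crawl delay into its time.sleep() calls.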

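The world of Python crawlers is broad and deep. Beyond the basic functions covered here, there are dedicated crawling frameworks such as Scrapy, and tools such as Selenium and Playwright for handling dynamic pages. With these core functions and practices in hand, you have a solid foundation for exploring more complex crawling techniques.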

2025-11-12
