Efficient File System Traversal in Python: Mastering the Art of os and pathlib

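File system operations are part of everyday programming. Whether the task is data analysis, log processing, file management, or an automation script, we often need to walk through folders to find, read, or process the files inside them. Python's standard library offers several efficient and flexible tools for this. This article looks closely at the core modules for traversing folders and files, the classic os module and the modern pathlib module, and uses code examples throughout to show how each one is used.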

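1. Why File System Traversal Matters and Where It Is Used

File system traversal is more than listing files; it is the foundation of many more complex tasks. Typical use cases include:

Data processing and analysis: batch-reading every CSV, JSON, or text file in a directory for data cleaning, aggregation, or model training.

Log management: periodically scanning a log directory for log files with a particular date, size, or content, then archiving, analyzing, or deleting them.

File organization and management: automatically sorting scattered files into folders by type, creation date, or modification date.

Backup and synchronization: comparing two directory trees, finding the files that differ, and syncing or backing them up.

Code review and static analysis: walking a project directory to find files of a given type (such as .py or .js) and analyze code structure or look for potential problems.

File search tools: building a custom search tool that locates files quickly by name, extension, or content.

Understanding and mastering the traversal methods Python provides will noticeably improve both your development speed and the quality of your code.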

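2. The Classic Choice: Basics and Advanced Use of the os Module

The os module is Python's standard interface to the operating system and provides many functions for working with files and directories. For traversal, its two core functions are os.listdir() and os.walk().

2.1 os.listdir(): A First Look at a Directory

os.listdir(path) returns a list of the names of all files and folders in the given directory, in arbitrary order. It lists only the directory's direct children and does not recurse into subdirectories.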

Basic usage:

import os

# Assume the current directory contains a folder named 'my_directory'
# my_directory/
# ├──
# ├── subdir1/
# │   └──
# └──
target_path = 'my_directory'
if os.path.exists(target_path) and os.path.isdir(target_path):
    print(f"Listing contents of: {target_path}")
    contents = os.listdir(target_path)
    for item in contents:
        print(item)
else:
    print(f"Directory '{target_path}' does not exist or is not a directory.")
# Possible output:
# Listing contents of: my_directory
#
# subdir1
#

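Notes:

os.listdir() returns bare file and folder names, not full paths. To get a full path, join each name with the parent directory path, usually with os.path.join().

It does not distinguish between files and folders; use os.path.isfile() and os.path.isdir() to tell them apart.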
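
The two notes above can be combined into one small helper. The sketch below is illustrative only: the function name split_entries and the demo file names are assumptions, not part of the original article. It joins each name with its parent directory and then sorts the entries into files and directories.

```python
import os
import tempfile

def split_entries(directory):
    """Return (files, dirs) as sorted lists of full paths for one directory level."""
    files, dirs = [], []
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)  # listdir() yields bare names
        if os.path.isdir(full_path):
            dirs.append(full_path)
        elif os.path.isfile(full_path):
            files.append(full_path)
    return sorted(files), sorted(dirs)

# Demo on a throwaway directory so the sketch does not depend on real files.
# 'example.txt' and 'subdir1' are hypothetical names.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "subdir1"))
    with open(os.path.join(tmp, "example.txt"), "w") as f:
        f.write("hello")
    files, dirs = split_entries(tmp)
    print("files:", [os.path.basename(p) for p in files])
    print("dirs:", [os.path.basename(p) for p in dirs])
```

Returning full paths keeps later steps, such as opening or statting an entry, independent of the current working directory.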

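Manual recursive traversal (using os.listdir()):

Although os.listdir() is not recursive, combining it with a recursive function reproduces a full directory-tree traversal. This is useful for understanding recursion and file system structure, but real projects usually rely on the higher-level os.walk().

import os

def list_files_recursively(start_path):
    for item in os.listdir(start_path):
        item_path = os.path.join(start_path, item)  # Build the full path
        if os.path.isfile(item_path):
            print(f"File: {item_path}")
        elif os.path.isdir(item_path):
            print(f"Directory: {item_path}")
            list_files_recursively(item_path)  # Recursive call

# Example usage
# list_files_recursively('my_directory')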
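
A variation on the same idea is to yield paths instead of printing them, so the caller decides what to do with each file. This is a minimal sketch, assuming a hypothetical helper name iter_files; it also skips symlinked directories, which is a design choice added here to avoid cycles rather than something the original function does.

```python
import os
import tempfile

def iter_files(start_path):
    """Yield the full path of every regular file under start_path."""
    for name in sorted(os.listdir(start_path)):
        path = os.path.join(start_path, name)
        if os.path.isdir(path) and not os.path.islink(path):
            yield from iter_files(path)  # Descend, but not through symlinked directories
        elif os.path.isfile(path):
            yield path

# Demo on a temporary tree; the file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))
    for parts in [("top.txt",), ("a", "b", "deep.txt")]:
        with open(os.path.join(tmp, *parts), "w") as f:
            f.write("x")
    for path in iter_files(tmp):
        print(os.path.relpath(path, tmp))
```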

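2.2 os.walk(): The Workhorse for Deep Traversal

os.walk(top, topdown=True, onerror=None, followlinks=False) is the most powerful and widely used traversal tool in the standard library. It generates a sequence of 3-tuples, visiting every directory in the tree either top-down or bottom-up.

On each iteration, os.walk() yields a 3-tuple (dirpath, dirnames, filenames):

dirpath: the path of the directory currently being visited, as a string.

dirnames: a list of the names of the subdirectories in dirpath (names only, no path).

filenames: a list of the names of the non-directory files in dirpath (names only, no path).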

Basic usage:

import os

# Create a sample directory structure.
# NOTE: the file names in the original listing were lost; 'a.txt', 'b.txt',
# 'c.txt', and 'd.txt' are placeholders chosen for illustration.
# root_folder/
# ├── a.txt
# ├── sub_dir1/
# │   ├── b.txt
# │   └── sub_sub_dir/
# │       └── c.txt
# └── sub_dir2/
#     └── d.txt
def create_dummy_structure(base_path):
    os.makedirs(os.path.join(base_path, 'sub_dir1', 'sub_sub_dir'), exist_ok=True)
    os.makedirs(os.path.join(base_path, 'sub_dir2'), exist_ok=True)
    with open(os.path.join(base_path, 'a.txt'), 'w') as f: f.write('a')
    with open(os.path.join(base_path, 'sub_dir1', 'b.txt'), 'w') as f: f.write('b')
    with open(os.path.join(base_path, 'sub_dir1', 'sub_sub_dir', 'c.txt'), 'w') as f: f.write('c')
    with open(os.path.join(base_path, 'sub_dir2', 'd.txt'), 'w') as f: f.write('d')

dummy_folder = 'root_folder'
if not os.path.exists(dummy_folder):
    create_dummy_structure(dummy_folder)

print(f"Traversing directory: {dummy_folder}")
for root, dirs, files in os.walk(dummy_folder):
    print(f"Current Directory: {root}")
    print(f" Subdirectories: {dirs}")
    print(f" Files: {files}")
    for file in files:
        full_file_path = os.path.join(root, file)
        print(f" Found File: {full_file_path}")

# Sample output on a POSIX system (the order may differ between operating systems):
# Traversing directory: root_folder
# Current Directory: root_folder
#  Subdirectories: ['sub_dir1', 'sub_dir2']
#  Files: ['a.txt']
#  Found File: root_folder/a.txt
# Current Directory: root_folder/sub_dir1
#  Subdirectories: ['sub_sub_dir']
#  Files: ['b.txt']
#  Found File: root_folder/sub_dir1/b.txt
# Current Directory: root_folder/sub_dir1/sub_sub_dir
#  Subdirectories: []
#  Files: ['c.txt']
#  Found File: root_folder/sub_dir1/sub_sub_dir/c.txt
# Current Directory: root_folder/sub_dir2
#  Subdirectories: []
#  Files: ['d.txt']
#  Found File: root_folder/sub_dir2/d.txt

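Parameters of os.walk():

topdown (default True):

True (top-down): a directory is visited before its subdirectories. In this mode you can modify the dirnames list in place to control the visiting order, or remove entries from it to skip those subdirectories entirely.

False (bottom-up): subdirectories are visited before their parent. This mode suits tasks where the subdirectory contents must be processed first, such as summing file sizes before reporting a parent directory's total.

onerror: an optional callback. When a directory cannot be listed, the callback is called with the OSError instance as its argument. By default such errors are ignored.

followlinks (default False): if True, os.walk() also descends into directories reached through symbolic links. Use it with care, because a link that points back to a parent directory causes infinite recursion.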
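
To make the bottom-up mode and the onerror callback concrete, here is a minimal sketch that totals the size of each directory's subtree. The function name directory_sizes and the demo file names are assumptions made for illustration; errors are collected in a list by passing its append method as the callback.

```python
import os
import tempfile

def directory_sizes(top):
    """Map each directory under top to the total size in bytes of its subtree."""
    errors = []  # OSError instances reported through the onerror callback
    sizes = {}
    # topdown=False yields children before their parents, so every child's
    # total is already in `sizes` when the parent is processed.
    for dirpath, dirnames, filenames in os.walk(top, topdown=False, onerror=errors.append):
        total = 0
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
        for name in dirnames:
            # Symlinked directories are not walked (followlinks=False), so they count as 0.
            total += sizes.get(os.path.join(dirpath, name), 0)
        sizes[dirpath] = total
    return sizes, errors

# Demo on a temporary tree with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "a.txt"), "w") as f:
        f.write("abc")
    with open(os.path.join(tmp, "sub", "b.txt"), "w") as f:
        f.write("hello")
    sizes, errors = directory_sizes(tmp)
    for path, size in sizes.items():
        print(os.path.relpath(path, tmp), size)
```

With topdown=True the same calculation would need a second pass, because a parent would be visited before the sizes of its children were known.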

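Controlling the traversal:

When topdown is True, modifying the dirs list changes or prunes the traversal path:

import os

# Skip every directory whose name starts with 'temp_' during the walk.
# create_dummy_structure() is the helper defined in the previous example.
base_dir = 'root_folder'
if not os.path.exists(base_dir):
    create_dummy_structure(base_dir)  # Make sure the sample directory exists
os.makedirs(os.path.join(base_dir, 'temp_data'), exist_ok=True)
# 'tmp.txt' is a placeholder; the original file name was lost.
with open(os.path.join(base_dir, 'temp_data', 'tmp.txt'), 'w') as f: f.write('tmp')

print(f"Traversing directory '{base_dir}' while skipping 'temp_' folders:")
for root, dirs, files in os.walk(base_dir, topdown=True):
    # Prune by assigning to dirs[:] so the list that os.walk() holds is changed in place.
    # Rebinding the name (dirs = [...]) would have no effect on the traversal.
    dirs[:] = [d for d in dirs if not d.startswith('temp_')]
    print(f"Current Directory: {root}")
    print(f" Subdirectories (to be visited): {dirs}")
    print(f" Files: {files}")
    for file in files:
        full_file_path = os.path.join(root, file)
        print(f" Found File: {full_file_path}")
# 'temp_data' and its contents are never visited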
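
The same pruning technique works with a set of exact directory names. The sketch below is an illustration under assumptions: the helper name walk_pruned and the names in SKIP_DIRS are choices made here, not taken from the article.

```python
import os
import tempfile

SKIP_DIRS = {".git", "__pycache__", "node_modules"}  # Illustrative choice of names

def walk_pruned(top, skip=SKIP_DIRS):
    """Yield file paths under top, never descending into directories named in skip."""
    for root, dirs, files in os.walk(top, topdown=True):
        # Slice assignment mutates the list that os.walk() holds; sorting also
        # makes the visiting order deterministic.
        dirs[:] = sorted(d for d in dirs if d not in skip)
        for name in sorted(files):
            yield os.path.join(root, name)

# Demo on a temporary tree with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "keep"))
    os.mkdir(os.path.join(tmp, "__pycache__"))
    for parts in [("top.txt",), ("keep", "a.py"), ("__pycache__", "a.pyc")]:
        with open(os.path.join(tmp, *parts), "w") as f:
            f.write("x")
    for path in walk_pruned(tmp):
        print(os.path.relpath(path, tmp))
```

Because pruned directories are never opened, this is cheaper than walking everything and filtering the results afterwards.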

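3. The Modern Choice: The Elegant, Object-Oriented pathlib Module

The pathlib module, introduced in Python 3.4, offers an object-oriented way to work with paths that is more intuitive, safer, and portable across platforms. It wraps file system paths in objects with methods and attributes, which makes code noticeably easier to read and maintain.

3.1 Path Objects and Basic Operations

The first step with pathlib is to create a Path object, which can represent either a file or a directory.

from pathlib import Path

# Create Path objects.
# NOTE: 'example.txt' and 'script.py' are placeholders; the original file names were lost.
p = Path('/usr/local/bin')
q = Path('my_directory/example.txt')
r = Path('.')  # Current directory
print(f"Absolute path: {q.resolve()}")
print(f"Parent directory: {q.parent}")
print(f"File name: {q.name}")
print(f"File stem (without suffix): {q.stem}")
print(f"File suffix: {q.suffix}")
print(f"Is directory? {r.is_dir()}")
print(f"Does file exist? {q.exists()}")
# Join paths with the / operator, which reads more naturally
new_path = p / 'python' / 'script.py'
print(f"Joined path: {new_path}")  # Output on POSIX: /usr/local/bin/python/script.py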
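
The path attributes are easiest to compare side by side on a single path. The sketch below uses PurePosixPath so that the results are the same on every operating system and nothing on disk is touched; the path itself is hypothetical.

```python
from pathlib import PurePosixPath

p = PurePosixPath("/srv/project/reports/summary.tar.gz")  # Hypothetical path

print(p.parent)                       # /srv/project/reports
print(p.name)                         # summary.tar.gz
print(p.stem)                         # summary.tar
print(p.suffix)                       # .gz
print(p.suffixes)                     # ['.tar', '.gz']
print(p.parts)                        # ('/', 'srv', 'project', 'reports', 'summary.tar.gz')
print(p.with_suffix(".zip"))          # /srv/project/reports/summary.tar.zip
print(p.parent / "archive" / p.name)  # /srv/project/reports/archive/summary.tar.gz
```

Note that stem and with_suffix() act only on the last suffix, which matters for names with several extensions such as .tar.gz.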

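3.2 iterdir(): Object-Oriented Directory Listing

Path.iterdir() is similar to os.listdir(), but it returns an iterator of Path objects for every file and subdirectory in the path. You no longer need to join paths by hand, because each Path it yields already includes the directory you iterated over.

from pathlib import Path
import os

# Make sure the sample directory exists.
# NOTE: 'notes.txt' and 'table.csv' are placeholders; the original file names were lost.
dummy_folder = 'my_directory'
if not Path(dummy_folder).exists():
    os.makedirs(dummy_folder)
    with open(Path(dummy_folder) / 'notes.txt', 'w') as f: f.write('data')
    os.makedirs(Path(dummy_folder) / 'data_folder')
    with open(Path(dummy_folder) / 'data_folder' / 'table.csv', 'w') as f: f.write('csv')

target_path = Path(dummy_folder)  # Convert the string path to a Path object
print(f"Listing contents of: {target_path}")
for item_path in target_path.iterdir():
    print(f"Item: {item_path}")
    if item_path.is_file():
        print(f" This is a file: {item_path.name}")
    elif item_path.is_dir():
        print(f" This is a directory: {item_path.name}")

# Sample output (the order may vary):
# Listing contents of: my_directory
# Item: my_directory/notes.txt
#  This is a file: notes.txt
# Item: my_directory/data_folder
#  This is a directory: data_folder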
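
Since iterdir() yields entries in arbitrary order, a listing meant for people is usually sorted first. The following is a minimal sketch, assuming a hypothetical helper name list_directory, that puts directories ahead of files and orders each group by name.

```python
from pathlib import Path
import tempfile

def list_directory(path):
    """Return the entries of path as Path objects, directories first, then by name."""
    # The sort key is a tuple: False (directory) sorts before True (anything else).
    return sorted(Path(path).iterdir(), key=lambda p: (not p.is_dir(), p.name))

# Demo on a temporary directory with hypothetical names.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "z_dir").mkdir()
    (base / "b_dir").mkdir()
    (base / "a.txt").write_text("x")
    for entry in list_directory(base):
        kind = "dir " if entry.is_dir() else "file"
        print(kind, entry.name)
```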

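3.3 glob() and rglob(): Pattern-Matched Recursive Traversal

One of pathlib's most powerful features is the pair of methods glob() and rglob(), which match files and directories using shell-style wildcards (glob patterns).

Path.glob(pattern): yields the paths that match pattern relative to this path. A simple pattern such as '*.md' matches only direct children; a pattern that contains '/' or '**' reaches deeper.

Path.rglob(pattern): searches this path and all of its subdirectories recursively. It behaves like glob() with '**/' added in front of the pattern.

Wildcards:

*: matches zero or more characters within a single path component.

?: matches exactly one character.

[abc]: matches any one of the characters inside the brackets.

**: matches this directory and all of its subdirectories, recursively. It can be written explicitly in a glob() pattern; rglob() adds it implicitly.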
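
Each wildcard can be tried on a small temporary tree. This sketch is illustrative: the helper names build_sample_tree and matches, and all file names, are assumptions introduced here. Results are sorted and shown as POSIX-style relative paths so the output does not depend on the platform or on glob ordering.

```python
from pathlib import Path
import tempfile

def build_sample_tree(base):
    """Create a few hypothetical files under base for the pattern demo."""
    (base / "logs" / "old").mkdir(parents=True)
    for rel in ["a1.txt", "a2.txt", "a10.txt", "b1.txt", "logs/app.log", "logs/old/app.log"]:
        (base / rel).write_text("x")

def matches(base, pattern):
    """Return sorted POSIX-style relative paths that match pattern under base."""
    return sorted(p.relative_to(base).as_posix() for p in base.glob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    build_sample_tree(base)
    print(matches(base, "*.txt"))      # ['a1.txt', 'a10.txt', 'a2.txt', 'b1.txt']
    print(matches(base, "a?.txt"))     # ['a1.txt', 'a2.txt']
    print(matches(base, "[ab]1.txt"))  # ['a1.txt', 'b1.txt']
    print(matches(base, "**/*.log"))   # ['logs/app.log', 'logs/old/app.log']
    # rglob('*.log') finds the same files as glob('**/*.log')
    print(sorted(p.relative_to(base).as_posix() for p in base.rglob("*.log")))
```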

Example:

from pathlib import Path
import os
import shutil

# Clean up and recreate the sample directory structure.
# NOTE: the original file names were lost; every file name below is a
# placeholder chosen so that the patterns in this example have something to match.
dummy_folder = 'pathlib_example'
if Path(dummy_folder).exists():
    shutil.rmtree(dummy_folder)  # Remove the old directory
os.makedirs(Path(dummy_folder) / 'docs')
os.makedirs(Path(dummy_folder) / 'src' / 'models')
os.makedirs(Path(dummy_folder) / 'data')
with open(Path(dummy_folder) / 'README.md', 'w') as f: f.write('')
with open(Path(dummy_folder) / 'docs' / 'guide.txt', 'w') as f: f.write('')
with open(Path(dummy_folder) / 'docs' / 'api.txt', 'w') as f: f.write('')
with open(Path(dummy_folder) / 'src' / 'main.py', 'w') as f: f.write('')
with open(Path(dummy_folder) / 'src' / 'utils.py', 'w') as f: f.write('')
with open(Path(dummy_folder) / 'src' / 'models' / 'model_a.py', 'w') as f: f.write('')
with open(Path(dummy_folder) / 'data' / 'input.csv', 'w') as f: f.write('')
with open(Path(dummy_folder) / 'data' / 'output.json', 'w') as f: f.write('')

base_path = Path(dummy_folder)

print(f"--- Using glob() in {base_path} (non-recursive) ---")
# Find all .md files directly inside base_path
for f_path in base_path.glob('*.md'):
    print(f"Found .md file: {f_path}")

print(f"--- Using rglob() in {base_path} (recursive) ---")
# Find all .py files at any depth
for f_path in base_path.rglob('*.py'):
    print(f"Found .py file: {f_path}")

# Find all files inside folders whose names start with 'data'
print("--- Finding files in 'data*' directories ---")
for f_path in base_path.glob('data*/*'):  # Matches everything inside directories starting with 'data'
    if f_path.is_file():
        print(f"Found file in 'data*' dir: {f_path}")

# Find all Python files whose names start with 'model_'
print("--- Finding 'model_*.py' files recursively ---")
for f_path in base_path.rglob('model_*.py'):
    print(f"Found model file: {f_path}")

# Find all directories whose names start with 's'
print("--- Finding directories starting with 's' recursively ---")
for d_path in base_path.rglob('s*'):  # This can also match files, so filter manually
    if d_path.is_dir():
        print(f"Found directory starting with 's': {d_path}")

# Walk all files and folders (the pathlib counterpart of os.walk())
print("--- All files and directories (recursive) ---")
for item in base_path.rglob('*'):  # glob('**/*') also works; rglob('*') already includes directories
    if item.is_file():
        print(f" File: {item}")
    elif item.is_dir():
        print(f" Dir: {item}")

# Sample output on a POSIX system (the order within each group may vary):
# --- Using glob() in pathlib_example (non-recursive) ---
# Found .md file: pathlib_example/README.md
# --- Using rglob() in pathlib_example (recursive) ---
# Found .py file: pathlib_example/src/main.py
# Found .py file: pathlib_example/src/utils.py
# Found .py file: pathlib_example/src/models/model_a.py
# --- Finding files in 'data*' directories ---
# Found file in 'data*' dir: pathlib_example/data/input.csv
# Found file in 'data*' dir: pathlib_example/data/output.json
# ... (further output in the same style)

4. Practical Tips and Best Practices

4.1 Error Handling: Permissions and Missing Paths

File system operations often run into permission problems (PermissionError) or missing paths (FileNotFoundError). Handling them gracefully with try-except blocks is a good habit.

import os
from pathlib import Path

# Error handling with the os module
try:
    os.listdir('/root/secret_folder')  # Assume we have no permission to read it
except PermissionError as e:
    print(f"Permission denied: {e}")
except FileNotFoundError as e:
    print(f"Directory not found: {e}")

# Error handling with pathlib (Path.iterdir() can also raise PermissionError/FileNotFoundError)
try:
    p = Path('/non_existent_path')
    for item in p.iterdir():
        pass  # Never reached
except FileNotFoundError as e:
    print(f"Pathlib: Directory not found: {e}")

# Path.exists() can be used as a pre-check
test_path = Path('/root/secret_folder')
if not test_path.exists():
    print(f"Pathlib: {test_path} does not exist.")
elif not test_path.is_dir():
    print(f"Pathlib: {test_path} is not a directory.")
else:
    try:
        for item in test_path.iterdir():
            pass
    except PermissionError as e:
        print(f"Pathlib: Permission denied for {test_path}: {e}")

4.2 Filtering Files and Directories

During traversal we often need to filter files by some condition, such as extension, size, or name pattern.

import os
from pathlib import Path

target_dir = Path('root_folder')  # Assume this directory exists and contains files

# 1. Filter .py files with os.walk()
print("--- Filtering .py files using os.walk() ---")
python_files_os = []
for root, _, files in os.walk(target_dir):
    for file in files:
        if file.endswith('.py'):
            python_files_os.append(os.path.join(root, file))
print(f"Python files (os.walk): {python_files_os}")

# 2. Filter .py files with Path.rglob() (more concise)
print("--- Filtering .py files using Path.rglob() ---")
python_files_pathlib = list(target_dir.rglob('*.py'))
print(f"Python files (Path.rglob): {python_files_pathlib}")

# 3. Filter files larger than a given size
print("--- Filtering files larger than 10 bytes ---")
large_files = []
for file_path in target_dir.rglob('*'):
    if file_path.is_file() and file_path.stat().st_size > 10:
        large_files.append(file_path)
print(f"Large files: {large_files}")

# 4. Filter files whose names contain a keyword
print("--- Filtering files containing 'main' in name ---")
main_files = [f for f in target_dir.rglob('*') if f.is_file() and 'main' in f.name]
print(f"Files with 'main': {main_files}")
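
Filtering on several extensions at once is a common extension of the examples above. The sketch below is illustrative and assumes a hypothetical helper name find_by_suffix; it compares Path.suffix against a set, ignoring case, instead of running one glob per extension.

```python
from pathlib import Path
import tempfile

def find_by_suffix(root, suffixes):
    """Return sorted files under root whose extension is in suffixes (case-insensitive)."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )

# Demo on a temporary tree with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    for rel in ["a.CSV", "sub/b.json", "c.txt"]:
        (base / rel).write_text("x")
    for path in find_by_suffix(base, {".csv", ".json"}):
        print(path.relative_to(base).as_posix())
```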

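4.3 Performance Considerations

For structures with many files and deep nesting, performance can become an issue. In general:

os.walk() and Path.rglob() are both optimized internally and are the first choice for large directory trees.

Prefer generator expressions to intermediate lists, especially with large numbers of files, to save memory.

If the traversal only looks for files or paths that match a pattern, Path.rglob() is usually more concise than filtering the output of os.walk() by hand. When speed is critical, measure both on your own data.

Avoid repeating expensive I/O inside the loop, such as reading the same file's contents several times.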
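
The advice about generators can be illustrated with two small helpers. This is a sketch under assumptions: the names first_matches and count_files are hypothetical. Because rglob() is lazy, islice() stops the scan as soon as enough matches are found, and sum() over a generator expression counts files without storing their paths.

```python
from itertools import islice
from pathlib import Path
import tempfile

def first_matches(root, pattern, limit=1):
    """Return at most `limit` paths matching pattern, stopping the scan early."""
    return list(islice(Path(root).rglob(pattern), limit))

def count_files(root):
    """Count regular files without building a list of paths."""
    return sum(1 for p in Path(root).rglob("*") if p.is_file())

# Demo on a temporary tree with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "pkg").mkdir()
    for rel in ["a.py", "pkg/b.py", "pkg/c.py", "notes.txt"]:
        (base / rel).write_text("")
    print(len(first_matches(base, "*.py", limit=2)))
    print(count_files(base))
```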

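4.4 Choosing the Right Tool

os.listdir(): when you only need the direct children of a directory and do not care about the entry type.

os.walk(): when you need to traverse a whole directory tree top-down or bottom-up, possibly changing the traversal as you go (for example, skipping certain directories). File and directory names are returned as strings.

Path.iterdir(): the object-oriented counterpart of os.listdir(). It yields Path objects, which are convenient for chaining further operations.

Path.glob() / Path.rglob(): when you need shell-style wildcard matching to find files or directories, these are the most concise and powerful choice.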

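5. Summary

Python provides powerful and flexible tools for file system traversal. In the os module, os.walk() is the classic choice for complex directory-tree traversal and offers fine-grained control. The pathlib module, with its object-oriented design, greatly simplifies path handling and pattern matching through methods such as iterdir(), glob(), and rglob(). Choose between them based on the task at hand, the readability of the resulting code, and your performance needs.

Mastering the "art" of file system traversal lets you handle file-related programming tasks more efficiently and more cleanly, and gives your automation scripts and data pipelines a solid foundation.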

2025-10-11
