
@fangchenli
Member

@fangchenli fangchenli commented Apr 12, 2025

@mroeschke
Member

FWIW I recall the team being negative in the past about supporting reading directories of files, and we document just concatting DataFrames read from a directory: https://pandas.pydata.org/docs/user_guide/cookbook.html#reading-multiple-files-to-create-a-single-dataframe. Are we sure we want to include this?
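For context, the cookbook recipe linked in this comment amounts to globbing the directory and concatenating the per-file DataFrames. A minimal sketch, assuming every file shares the same header; the temporary directory and the `part_*.csv` file names are made up for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    # Two small files with the same schema (illustrative data only).
    (directory / "part_0.csv").write_text("a,b\n1,2\n")
    (directory / "part_1.csv").write_text("a,b\n3,4\n")

    # Sort the paths so the row order of the result is deterministic.
    files = sorted(directory.glob("*.csv"))
    combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

print(combined)
```

With this approach the caller controls file discovery and ordering, which is the part the PR proposes to move into the reader functions.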

@datapythonista
Member

> FWIW I recall the team being negative in the past about supporting reading directories of files

Do you remember the reason? This seems like a useful thing, as I think it's common for some datasets to be split across different files with the same schema. There is some added complexity to this, but it seems consistent with other syntactic sugar we have in IO operations, such as decompressing and downloading.

@datapythonista
Member

Note that you've got the image from Will's book in this PR; this happened when we had to hard revert it from git history.

@fangchenli fangchenli changed the title from "[WIP] ENH: support reading directory in read_csv" to "ENH: support reading directory in read_csv" Jul 21, 2025
@fangchenli fangchenli requested a review from jbrockmendel August 8, 2025 19:11
@jbrockmendel
Member

i think an unrelated file got added?

@fangchenli
Member Author

> i think an unrelated file got added?

Removed.

@fangchenli fangchenli requested a review from Copilot November 5, 2025 20:34

Copilot AI left a comment


Pull Request Overview

This PR adds support for reading from directories in pandas.read_csv, read_table, and read_fwf, enabling users to process multiple CSV files from both local folders and remote locations via fsspec. The feature returns a generator that yields DataFrames (or TextFileReaders when using chunking/iterator mode) for each file in the directory.

  • Introduces iterdir() function to handle directory traversal for both local and remote paths
  • Extends read_csv, read_table, and read_fwf to accept directories and return generators
  • Updates error messages and exception types for better consistency and clarity
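The generator behaviour described in the overview can be approximated in user code today. The sketch below is not the PR's implementation; `iter_csv_frames` and its `pattern` parameter are hypothetical names, and only local directories are handled:

```python
import tempfile
from pathlib import Path

import pandas as pd


def iter_csv_frames(directory, pattern="*.csv", **read_csv_kwargs):
    """Yield one DataFrame per matching file (hypothetical helper, local paths only)."""
    for path in sorted(Path(directory).glob(pattern)):
        # Each file is parsed lazily, only when the caller asks for the next frame.
        yield pd.read_csv(path, **read_csv_kwargs)


with tempfile.TemporaryDirectory() as tmp:
    # Illustrative files; the names and contents are made up.
    Path(tmp, "a.csv").write_text("x\n1\n")
    Path(tmp, "b.csv").write_text("x\n2\n3\n")
    Path(tmp, "notes.txt").write_text("not a csv\n")

    row_counts = [len(df) for df in iter_csv_frames(tmp)]

print(row_counts)
```

Yielding frames one at a time keeps memory bounded by the largest single file, whereas concatenating everything up front requires all files to fit in memory together.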

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| pandas/io/common.py | Adds iterdir() function and helper functions to support directory iteration for local and remote filesystems |
| pandas/io/parsers/readers.py | Modifies _read() to handle directories and introduces _multi_file_generator() for yielding DataFrames from multiple files |
| pandas/tests/io/conftest.py | Adds fixtures for testing local/remote CSV directories and files |
| pandas/tests/io/test_common.py | Adds tests for the new iterdir() functionality and updates error message patterns |
| pandas/tests/io/test_fsspec.py | Updates test to use fixture instead of hardcoded filename |
| pandas/tests/io/parser/test_directory.py | New test file for directory reading functionality |
| pandas/tests/io/parser/test_compression.py | Adds fixture for empty zip file and updates test to use it |
| pandas/tests/io/parser/test_unsupported.py | Updates test to expect TypeError instead of ValueError for invalid file inputs |
| pandas/tests/io/parser/common/test_file_buffer_url.py | Updates error messages and exception types for consistency |
| doc/source/whatsnew/v3.0.0.rst | Documents the new directory reading feature |

Comment on lines +1477 to +1483
path_obj = PurePosixPath(file["name"])
if _match_file(
    path_obj,
    extensions,
    glob,
):
    result.append(f"{scheme}://{path_obj}")  # type: ignore[arg-type]

Copilot AI Nov 5, 2025


[nitpick] Inconsistent naming: the variable path_obj is created from file["name"], which is a string path, but the same name is already defined above (line 1466) for the single-file case. Consider using a different variable name, such as file_path_obj, to distinguish this from the earlier usage and improve readability.

Suggested change
- path_obj = PurePosixPath(file["name"])
- if _match_file(
-     path_obj,
-     extensions,
-     glob,
- ):
-     result.append(f"{scheme}://{path_obj}")  # type: ignore[arg-type]
+ file_path_obj = PurePosixPath(file["name"])
+ if _match_file(
+     file_path_obj,
+     extensions,
+     glob,
+ ):
+     result.append(f"{scheme}://{file_path_obj}")  # type: ignore[arg-type]
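To make the loop under review easier to follow, here is a self-contained sketch of the same pattern: filter the entries of a remote listing and rebuild `scheme://path` URLs. It assumes the entries are dicts with `"name"` and `"type"` keys, as returned by fsspec's `ls(..., detail=True)`. The `match_file` function is a hypothetical stand-in for the PR's `_match_file`, whose actual semantics are not shown in this thread.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def match_file(path, extensions=None, glob=None):
    """Hypothetical stand-in for `_match_file`; the PR's real rules may differ."""
    if extensions is not None and path.suffix.lower() not in extensions:
        return False
    if glob is not None and not fnmatch(path.name, glob):
        return False
    return True


def list_remote_files(entries, scheme, extensions=None, glob=None):
    """Return sorted `scheme://path` URLs for the file entries that match."""
    result = []
    for entry in entries:
        if entry.get("type") != "file":
            continue  # skip sub-directories and other non-file entries
        file_path_obj = PurePosixPath(entry["name"])
        if match_file(file_path_obj, extensions, glob):
            result.append(f"{scheme}://{file_path_obj}")
    return sorted(result)


# Made-up listing in the shape of an fsspec detail listing.
entries = [
    {"name": "bucket/data/part-1.csv", "type": "file"},
    {"name": "bucket/data/part-0.csv", "type": "file"},
    {"name": "bucket/data/notes.txt", "type": "file"},
    {"name": "bucket/data/archive", "type": "directory"},
]
urls = list_remote_files(entries, "s3", extensions={".csv"}, glob="part-*")
print(urls)
```

Remote object-store keys always use forward slashes, which is why `PurePosixPath` is used here rather than the platform-dependent `Path`.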

- Support reading Stata 102-format (Stata 1) dta files (:issue:`58978`)
- Support reading Stata 110-format (Stata 7) dta files (:issue:`47176`)
- Switched wheel upload to **PyPI Trusted Publishing** (OIDC) for release-tag pushes in ``wheels.yml``. (:issue:`61718`)
- Added support for reading from directories in :func:`pandas.read_csv`, including local folders and remote locations via ``fsspec``

Copilot AI Nov 5, 2025


The whatsnew entry should mention all affected functions (read_csv, read_table, and read_fwf) or use 'CSV reading functions' for accuracy, as the changes apply to multiple reader functions.

Suggested change
- - Added support for reading from directories in :func:`pandas.read_csv`, including local folders and remote locations via ``fsspec``
+ - Added support for reading from directories in CSV reading functions (:func:`pandas.read_csv`, :func:`pandas.read_table`, and :func:`pandas.read_fwf`), including local folders and remote locations via ``fsspec``

Labels

IO CSV read_csv, to_csv

5 participants