ENH: support reading directory in read_csv #61275
FWIW I recall the team being negative in the past about supporting reading directories of files, and we document just concatenating DataFrames read from a directory: https://pandas.pydata.org/docs/user_guide/cookbook.html#reading-multiple-files-to-create-a-single-dataframe. Are we sure we want to include this?
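For context, the documented approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not code from the PR or the cookbook: the helper name `read_csv_directory` and its `pattern` parameter are hypothetical, and it only relies on `pandas.read_csv` and `pandas.concat`.

```python
from pathlib import Path

import pandas as pd


def read_csv_directory(directory, pattern="*.csv", **read_csv_kwargs):
    """Hypothetical helper: read every matching file into one DataFrame."""
    # Sort so the row order of the result is deterministic.
    files = sorted(Path(directory).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory}")
    # Read each file separately, then concatenate with a fresh index.
    frames = [pd.read_csv(path, **read_csv_kwargs) for path in files]
    return pd.concat(frames, ignore_index=True)
```

This assumes every file shares the same schema; files with differing columns would still concatenate, but with missing values filled in.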
Do you remember the reason? This seems like a useful thing, as I think it's common for some datasets to be split across different files with the same schema. There is some added complexity to this, but it seems consistent with other syntactic sugar we have in IO operations, such as decompressing, downloading, etc.

Note that you've got the image from Will's book in this PR; this happened when we had to hard revert it from git history.
…ad-csv-from-directory
I think an unrelated file got added?

Removed.
Pull Request Overview
This PR adds support for reading from directories in pandas.read_csv, read_table, and read_fwf, enabling users to process multiple CSV files from both local folders and remote locations via fsspec. The feature returns a generator that yields DataFrames (or TextFileReaders when using chunking/iterator mode) for each file in the directory.
- Introduces `iterdir()` function to handle directory traversal for both local and remote paths
- Extends `read_csv`, `read_table`, and `read_fwf` to accept directories and return generators
- Updates error messages and exception types for better consistency and clarity
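The generator behaviour described in this overview can be approximated with existing pandas calls. The sketch below is an assumption-laden illustration, not the PR's implementation: `iter_csv_frames` and its `extensions` parameter are hypothetical names, and it only handles local directories.

```python
from pathlib import Path

import pandas as pd


def iter_csv_frames(directory, extensions=(".csv",), **read_csv_kwargs):
    """Hypothetical sketch: lazily yield one parsed object per file."""
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories and files with other extensions.
        if path.is_file() and path.suffix in extensions:
            # With chunksize= or iterator=True, read_csv returns a
            # TextFileReader instead of a DataFrame.
            yield pd.read_csv(path, **read_csv_kwargs)
```

Because nothing is read until the generator is consumed, a caller can process one file at a time instead of holding the whole directory in memory.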
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| pandas/io/common.py | Adds iterdir() function and helper functions to support directory iteration for local and remote filesystems |
| pandas/io/parsers/readers.py | Modifies _read() to handle directories and introduces _multi_file_generator() for yielding DataFrames from multiple files |
| pandas/tests/io/conftest.py | Adds fixtures for testing local/remote CSV directories and files |
| pandas/tests/io/test_common.py | Adds tests for the new iterdir() functionality and updates error message patterns |
| pandas/tests/io/test_fsspec.py | Updates test to use fixture instead of hardcoded filename |
| pandas/tests/io/parser/test_directory.py | New test file for directory reading functionality |
| pandas/tests/io/parser/test_compression.py | Adds fixture for empty zip file and updates test to use it |
| pandas/tests/io/parser/test_unsupported.py | Updates test to expect TypeError instead of ValueError for invalid file inputs |
| pandas/tests/io/parser/common/test_file_buffer_url.py | Updates error messages and exception types for consistency |
| doc/source/whatsnew/v3.0.0.rst | Documents the new directory reading feature |
    path_obj = PurePosixPath(file["name"])
    if _match_file(
        path_obj,
        extensions,
        glob,
    ):
        result.append(f"{scheme}://{path_obj}")  # type: ignore[arg-type]
Copilot
AI
Nov 5, 2025
[nitpick] Inconsistent naming: the variable `path_obj` is created from `file["name"]`, which is a string path, but the variable is already defined above (line 1466) for the single-file case. Consider using a different variable name like `file_path_obj` to distinguish this from the earlier usage and improve readability.
    - path_obj = PurePosixPath(file["name"])
    - if _match_file(
    -     path_obj,
    -     extensions,
    -     glob,
    - ):
    -     result.append(f"{scheme}://{path_obj}")  # type: ignore[arg-type]
    + file_path_obj = PurePosixPath(file["name"])
    + if _match_file(
    +     file_path_obj,
    +     extensions,
    +     glob,
    + ):
    +     result.append(f"{scheme}://{file_path_obj}")  # type: ignore[arg-type]
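To make the filtering step in this snippet easier to follow, here is a self-contained sketch of how such a loop might work. The body of `_match_file` is not shown in the review, so `match_file` below is a hypothetical stand-in whose rules (suffix membership plus an optional glob) are assumptions; the listing format mimics the list of dicts with a `"name"` key that fsspec returns from `ls(..., detail=True)`.

```python
from pathlib import PurePosixPath


def match_file(path, extensions=None, glob=None):
    """Hypothetical stand-in for the PR's _match_file helper."""
    # Assumed rule: a file must carry one of the allowed suffixes.
    if extensions is not None and path.suffix.lower() not in extensions:
        return False
    # Assumed rule: an optional glob is matched from the right of the path.
    if glob is not None and not path.match(glob):
        return False
    return True


def filter_listing(listing, scheme, extensions=None, glob=None):
    """Return scheme-qualified URLs for the listing entries that match."""
    result = []
    for entry in listing:
        file_path_obj = PurePosixPath(entry["name"])
        if match_file(file_path_obj, extensions, glob):
            result.append(f"{scheme}://{file_path_obj}")
    return result
```

For example, `filter_listing([{"name": "bucket/data/a.csv"}], "s3", extensions={".csv"})` would keep the single entry and prefix it with `s3://`.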
    - Support reading Stata 102-format (Stata 1) dta files (:issue:`58978`)
    - Support reading Stata 110-format (Stata 7) dta files (:issue:`47176`)
    - Switched wheel upload to **PyPI Trusted Publishing** (OIDC) for release-tag pushes in ``wheels.yml``. (:issue:`61718`)
    - Added support for reading from directories in :func:`pandas.read_csv`, including local folders and remote locations via ``fsspec``
Copilot
AI
Nov 5, 2025
The whatsnew entry should mention all affected functions (`read_csv`, `read_table`, and `read_fwf`) or use "CSV reading functions" for accuracy, because the changes apply to multiple reader functions.
    - - Added support for reading from directories in :func:`pandas.read_csv`, including local folders and remote locations via ``fsspec``
    + - Added support for reading from directories in CSV reading functions (:func:`pandas.read_csv`, :func:`pandas.read_table`, and :func:`pandas.read_fwf`), including local folders and remote locations via ``fsspec``
…li/pandas into read-csv-from-directory
read_csv bodo-ai/Bodo-Pandas-Collaboration#2
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.