-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
BUG: Validate path type in read_parquet, reject non-path/file-like #62979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| def test_read_parquet_valid_path_types(tmp_path, engine): | ||
| # GH #62922 | ||
| df = pd.DataFrame({"a": [1]}) | ||
| path = tmp_path / "test_read_parquet.parquet" | ||
| df.to_parquet(path, engine=engine) | ||
| # str | ||
| read_parquet(str(path), engine=engine) | ||
| # os.PathLike | ||
| read_parquet(pathlib.Path(path), engine=engine) | ||
| # file-like object | ||
| buf = BytesIO() | ||
| df.to_parquet(buf, engine=engine) | ||
| buf.seek(0) | ||
| read_parquet(buf, engine=engine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is not needed
| bad_path_types = [ | ||
| [str(path)], # list | ||
| (str(path),), # tuple | ||
| b"raw-bytes", # bytes | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing the list case is sufficient
| raise TypeError( | ||
| f"read_parquet expected str/os.PathLike or file-like object, " | ||
| f"got {type(path).__name__} type" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be done using get_handle in the _get_path_or_handle function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for the review!
I looked into _get_path_or_handle more closely.
get_handle is only invoked when we already know path_or_handle is a string and not a directory. For invalid inputs like a list, we never reach that branch as they get passed through stringify_path unchanged.
Also, _get_path_or_handle is only used in PyArrowImpl.read, not in FastParquetImpl.read. So if we only rely on _get_path_or_handle to validate input, validation coverage would be asymmetric across engines.
So, I propose to validate the path type in read_parquet before engine dispatch. Alternatively, I can factor a small _validate_parquet_path_arg(path) helper and call it at the top of both PyArrowImpl.read and FastParquetImpl.read
Let me know which placement you prefer.
Description
read_parquetdocumentation specifies that onlystr/os.PathLikeorfile-likeobjects are supported.Previously, passing unexpected container types such as list would reach backends and fail inconsistently.
This PR adds an early
TypeErrorcheck at the public API boundary to provide clear, consistent behaviour across pyarrow / fastparquet engines.Tests are added for both valid and invalid path types.