BUG: Validate path type in read_parquet, reject non-path/file-like #62979
Open
YukunR wants to merge 1 commit into pandas-dev:main from YukunR:bug/read_parquet
+45
−0
@@ -248,6 +248,42 @@ def check_partition_names(path, expected):
     assert dataset.partitioning.schema.names == expected
 
 
+def test_read_parquet_invalid_path_types(tmp_path, engine):
+    # GH #62922
+    df = pd.DataFrame({"a": [1]})
+    path = tmp_path / "test_read_parquet.parquet"
+    df.to_parquet(path, engine=engine)
+
+    bad_path_types = [
+        [str(path)],  # list
+        (str(path),),  # tuple
+        b"raw-bytes",  # bytes
+    ]
Comment on lines +257 to +261: Testing the list case is sufficient.
+    for bad in bad_path_types:
+        match = (
+            f"read_parquet expected str/os.PathLike or file-like object, "
+            f"got {type(bad).__name__} type"
+        )
+        with pytest.raises(TypeError, match=match):
+            read_parquet(bad, engine=engine)
+
+
+def test_read_parquet_valid_path_types(tmp_path, engine):
+    # GH #62922
+    df = pd.DataFrame({"a": [1]})
+    path = tmp_path / "test_read_parquet.parquet"
+    df.to_parquet(path, engine=engine)
+    # str
+    read_parquet(str(path), engine=engine)
+    # os.PathLike
+    read_parquet(pathlib.Path(path), engine=engine)
+    # file-like object
+    buf = BytesIO()
+    df.to_parquet(buf, engine=engine)
+    buf.seek(0)
+    read_parquet(buf, engine=engine)
Comment on lines +271 to +284: This test is not needed.
+
+
 def test_invalid_engine(df_compat, temp_file):
     msg = "engine must be one of 'pyarrow', 'fastparquet'"
     with pytest.raises(ValueError, match=msg):
This should probably be done using `get_handle` in the `_get_path_or_handle` function.
Hi, thanks for the review!
I looked into `_get_path_or_handle` more closely. `get_handle` is only invoked when we already know `path_or_handle` is a string and not a directory. For invalid inputs like a `list`, we never reach that branch, because they are passed through `stringify_path` unchanged.
Also, `_get_path_or_handle` is only used in `PyArrowImpl.read`, not in `FastParquetImpl.read`. So if we rely only on `_get_path_or_handle` to validate input, validation coverage would be asymmetric across engines.
So I propose validating the path type in `read_parquet` before engine dispatch. Alternatively, I can factor out a small `_validate_parquet_path_arg(path)` helper and call it at the top of both `PyArrowImpl.read` and `FastParquetImpl.read`.
Let me know which placement you prefer.