Error: thread 'polars' panicked when reading a dataset with polars

Hello, I hope you’re well.

I’m writing to you because I’ve noticed a strange error when reading with polars from a dataset hosted on HF.

The code and dataset I use are as follows:

import polars as pl

dataframe = pl.read_parquet(
    "hf://datasets/louisbrulenaudet/code-voirie-routiere/data/train-00000-of-00001.parquet"
)

Whereas for another dataset, this code works very well:

import polars as pl

dataframe = pl.read_parquet(
    "hf://datasets/louisbrulenaudet/bofip/data/train-00000-of-00001.parquet"
)

Here’s the error message I’m getting. I don’t know whether it’s related to the automatic conversion to Parquet or not:

at Function.wrapKernelMethodImpl (/Users/~/.vscode/extensions/ms-toolsai.jupyter-2024.8.2024080201-darwin-x64/dist/extension.node.js:304:82402)
09:26:01.466 [info] Process Execution: ~/.pyenv/versions/3.11.7/bin/python -c "import ipykernel; print(ipykernel.__version__); print("5dc3a68c-e34e-4080-9c3e-2a532b2ccb4d"); print(ipykernel.__file__)"
09:26:01.480 [info] Process Execution: ~/.pyenv/versions/3.11.7/bin/python -m ipykernel_launcher --f=/Users/~/Library/Jupyter/runtime/kernel-v2-7904DltItC7jYZG5.json
> cwd: ~/Desktop
09:26:02.722 [info] Restarted 002ef1e7-54fc-423e-9ef9-1d94beba5809
09:26:06.790 [error] Disposing session as kernel process died ExitCode: undefined, Reason: thread 'polars-7' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:
Found compressed page in the middle of the pages
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'polars-2' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:
Found compressed page in the middle of the pages
thread 'polars-6' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:

Thank you in advance, I remain available to test potential solutions.

Best regards, Louis

hi Louis,

I found two other relevant issues:
"Polars: panic, found compressed page in the middle of the pages" (Issue #4218, posit-dev/positron, GitHub) and
"Reading multiple dictionary pages in Parquet file might lead to an exception" (Issue #18061, pola-rs/polars, GitHub).

The first one claims that the issue comes from polars 1.4.0. In fact, I couldn’t reproduce the issue with polars==1.5.0. Which version do you have? Could you please try updating it?
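In case it helps, here is a minimal sketch for checking the installed version before upgrading. The threshold is only an assumption drawn from this thread (panic reported on 1.4.0, not reproduced on 1.5.0), not from the polars changelog, and the helper names are made up for illustration:

```python
from __future__ import annotations

from importlib import metadata

# Assumption from this thread only: the panic was reported on polars 1.4.0
# and could not be reproduced on 1.5.0.
FIRST_VERSION_REPORTED_OK = (1, 5, 0)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.4.0' into (1, 4, 0); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '1.5' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def needs_upgrade(version: str) -> bool:
    """True if the version is older than the one reported to work here."""
    return parse_version(version) < FIRST_VERSION_REPORTED_OK


def installed_polars_version() -> str | None:
    """Return the installed polars version, or None if polars is missing."""
    try:
        return metadata.version("polars")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_polars_version()
    if version is None:
        print("polars is not installed in this environment")
    elif needs_upgrade(version):
        print(f"polars {version}: consider running `pip install -U polars`")
    else:
        print(f"polars {version}: at or above the version reported to work")
```

If it reports an older version, upgrading with `pip install -U polars` and restarting the Jupyter kernel should be enough to retest.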

The second one (from polars) suggests that the Parquet file format might be the issue, but I don’t think so. Here is the file metadata I see:

############ file meta data ############
created_by: parquet-cpp-arrow version 15.0.2
num_columns: 9
num_rows: 8583
num_row_groups: 9
format_version: 2.6
serialized_size: 69360
############ file meta data ############
created_by: parquet-cpp-arrow version 15.0.2
num_columns: 42
num_rows: 446
num_row_groups: 1
format_version: 2.6
serialized_size: 17922

I hope you can add a comment on "Reading multiple dictionary pages in Parquet file might lead to an exception" (Issue #18061, pola-rs/polars, GitHub) to help them find the issue.

Sorry, I just realized that the fix was already merged: "fix: Reading Parquet with Null dictionary page" by coastalwhite (Pull Request #18112, pola-rs/polars, GitHub).


Everything is working perfectly, thank you very much for your prompt reply :hugs:
