A minimalist Apache Parquet file reader implemented in Python that converts Parquet files to JSONL (JSON Lines) format.
- Reads Apache Parquet files
- Supports common data types:
  - INT32
  - INT64
  - FLOAT
  - DOUBLE
  - BYTE_ARRAY
- Handles dictionary encoding
- Supports compression:
  - Uncompressed
  - GZIP
  - Snappy
- Outputs to JSONL format
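To illustrate how the listed data types are laid out on disk, here is a minimal sketch of PLAIN decoding using only the standard library. It is not taken from main.py: the function name `decode_plain` and the use of type-name strings are assumptions made for this example. In the Parquet format, PLAIN fixed-width values are stored little-endian, and each BYTE_ARRAY value is a 4-byte little-endian length followed by that many bytes.

```python
import struct

# struct format and byte width for each fixed-width physical type (little-endian).
_FIXED = {
    "INT32": ("<i", 4),
    "INT64": ("<q", 8),
    "FLOAT": ("<f", 4),
    "DOUBLE": ("<d", 8),
}


def decode_plain(buf: bytes, physical_type: str, count: int) -> list:
    """Decode `count` PLAIN-encoded values of one physical type from `buf`.

    Hypothetical helper for illustration; main.py may organize this differently.
    """
    values = []
    pos = 0
    if physical_type in _FIXED:
        fmt, width = _FIXED[physical_type]
        for _ in range(count):
            values.append(struct.unpack_from(fmt, buf, pos)[0])
            pos += width
    elif physical_type == "BYTE_ARRAY":
        for _ in range(count):
            # Each value is prefixed with its length as an unsigned 32-bit integer.
            (length,) = struct.unpack_from("<I", buf, pos)
            pos += 4
            values.append(buf[pos:pos + length])
            pos += length
    else:
        raise ValueError(f"unsupported physical type: {physical_type}")
    return values
```

BYTE_ARRAY values come back as raw `bytes`; whether to decode them as UTF-8 text is left to the caller.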
- Python 3.x
- Required packages:
  - thrift
  - snappy (installed as python-snappy)
- Clone the repository
- Install dependencies:
pip install thrift
pip install python-snappy

python main.py input.parquet output.jsonl

- input.parquet: Path to the input Parquet file
- output.jsonl: Path where the output JSONL file will be written
python main.py iris.parquet iris.jsonl

This will convert the iris.parquet file to JSONL format, where each line is a JSON object representing one row of data.
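The row-per-line output can be sketched as follows. This is an illustration, not the code in main.py: `write_jsonl` and `to_jsonable` are hypothetical helpers, the column names are made up, and decoding BYTE_ARRAY values as UTF-8 is an assumption about how strings are stored.

```python
import io
import json


def to_jsonable(value):
    """Convert raw bytes to text so json.dumps can serialize them (assumes UTF-8)."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


def write_jsonl(column_names, columns, out):
    """Write column-oriented data as one JSON object per line."""
    # zip(*columns) turns a list of columns into an iterator of rows.
    for row in zip(*columns):
        record = {name: to_jsonable(v) for name, v in zip(column_names, row)}
        out.write(json.dumps(record) + "\n")


# Example with two hypothetical columns and two rows.
buf = io.StringIO()
write_jsonl(
    ["sepal_length", "species"],
    [[5.1, 4.9], [b"setosa", b"setosa"]],
    buf,
)
```

Because Parquet stores data by column while JSONL is row-oriented, the columns have to be decoded first and then transposed into rows before writing.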
- Currently supports only DataPageV1
- Supports only basic data types and encodings
- Assumes a flat schema structure (no nested fields)
- Limited to PLAIN and RLE_DICTIONARY encodings
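For context on the RLE_DICTIONARY encoding mentioned above, the following is a rough sketch of how dictionary indices in a DataPageV1 can be decoded. It is not the implementation in main.py; all function names are hypothetical. It assumes the page bytes passed in start at the values section (any definition or repetition level data has already been skipped), where the Parquet format stores a one-byte bit width followed by runs in the RLE/bit-packed hybrid encoding.

```python
def read_uleb128(buf, pos):
    """Read an unsigned LEB128 varint; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def decode_rle_hybrid(buf, bit_width, count):
    """Decode `count` values from RLE/bit-packed hybrid runs."""
    values = []
    pos = 0
    value_bytes = (bit_width + 7) // 8  # width of an RLE run's repeated value
    mask = (1 << bit_width) - 1
    while len(values) < count:
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values, packed LSB first.
            groups = header >> 1
            num_bytes = groups * bit_width
            bits = int.from_bytes(buf[pos:pos + num_bytes], "little")
            pos += num_bytes
            for i in range(groups * 8):
                values.append((bits >> (i * bit_width)) & mask)
        else:
            # RLE run: one value repeated (header >> 1) times.
            run_value = int.from_bytes(buf[pos:pos + value_bytes], "little")
            pos += value_bytes
            values.extend([run_value] * (header >> 1))
    # A bit-packed run may be padded past `count`; drop the padding.
    return values[:count]


def decode_dictionary_values(page, dictionary, count):
    """Map dictionary indices in a data page's values section to actual values."""
    bit_width = page[0]
    indices = decode_rle_hybrid(page[1:], bit_width, count)
    return [dictionary[i] for i in indices]
```

The `dictionary` argument here stands for the list of values already decoded from the column chunk's dictionary page, which is typically PLAIN-encoded.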
pyparquet/
├── main.py          # Main implementation
├── parquet.thrift   # Thrift definition file
└── src/
    └── thrift/      # Generated Thrift code
This project is open source and available under the MIT License.