mrobee/pyparquet

Python Parquet Reader

A minimalist Apache Parquet file reader implemented in Python that converts Parquet files to JSONL (JSON Lines) format.
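Reading a Parquet file starts with locating its footer. As a minimal sketch, assuming an unencrypted file and following the published Parquet layout (a `PAR1` marker at both ends, with a 4-byte little-endian footer length just before the trailing marker), the raw footer bytes could be extracted like this. The function name `read_footer_bytes` is illustrative and is not taken from main.py.

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker at both ends of an unencrypted Parquet file


def read_footer_bytes(path):
    """Return the raw Thrift-encoded footer (FileMetaData) bytes of a Parquet file."""
    with open(path, "rb") as f:
        f.seek(0, 2)  # jump to the end of the file to learn its size
        size = f.tell()
        if size < 12:  # two magic markers plus the footer length field
            raise ValueError("file too small to be a Parquet file")
        f.seek(0)
        if f.read(4) != MAGIC:
            raise ValueError("missing leading PAR1 magic")
        f.seek(size - 8)
        tail = f.read(8)  # 4-byte footer length followed by trailing magic
        if tail[4:] != MAGIC:
            raise ValueError("missing trailing PAR1 magic")
        (footer_len,) = struct.unpack("<I", tail[:4])
        if footer_len > size - 12:
            raise ValueError("footer length exceeds file size")
        f.seek(size - 8 - footer_len)
        return f.read(footer_len)
```

The returned bytes would then be handed to the Thrift-generated classes to decode the file metadata; that step depends on the generated code and is not shown here.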

Features

  • Reads Apache Parquet files
  • Supports common data types:
    • INT32
    • INT64
    • FLOAT
    • DOUBLE
    • BYTE_ARRAY
  • Handles dictionary encoding
  • Supports compression:
    • Uncompressed
    • GZIP
    • Snappy
  • Outputs to JSONL format
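To make the data types listed above concrete, here is a hedged sketch of decoding PLAIN-encoded values as the Parquet format describes them: fixed-width types are little-endian, and each BYTE_ARRAY value carries a 4-byte little-endian length prefix. The function `decode_plain` and its string type names are assumptions made for illustration, not the implementation in main.py.

```python
import struct

# struct format and byte width for each fixed-width physical type
_FIXED = {
    "INT32": ("<i", 4),
    "INT64": ("<q", 8),
    "FLOAT": ("<f", 4),
    "DOUBLE": ("<d", 8),
}


def decode_plain(buf, physical_type, count):
    """Decode `count` PLAIN-encoded values of `physical_type` from bytes `buf`."""
    values = []
    pos = 0
    if physical_type == "BYTE_ARRAY":
        for _ in range(count):
            # each value is a 4-byte length followed by that many bytes
            (length,) = struct.unpack_from("<I", buf, pos)
            pos += 4
            values.append(bytes(buf[pos:pos + length]))
            pos += length
        return values
    fmt, width = _FIXED[physical_type]  # KeyError for unsupported types
    for _ in range(count):
        (value,) = struct.unpack_from(fmt, buf, pos)
        values.append(value)
        pos += width
    return values
```

For example, `decode_plain(struct.pack("<ii", 1, -2), "INT32", 2)` returns `[1, -2]`.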

Prerequisites

  • Python 3.x
  • Required packages:
    • thrift
    • python-snappy (imported as snappy)

Installation

  1. Clone the repository
  2. Install dependencies:
pip install thrift
pip install python-snappy
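As a rough illustration of where the python-snappy dependency comes in, page decompression could be dispatched on the codec name as sketched below. GZIP is handled by the standard library, so only Snappy needs the extra package. The function `decompress_page` and the use of codec names as strings are assumptions for this example; main.py may be organized differently.

```python
import gzip


def decompress_page(data, codec):
    """Decompress a page body according to its column chunk's codec name."""
    if codec == "UNCOMPRESSED":
        return data
    if codec == "GZIP":
        return gzip.decompress(data)
    if codec == "SNAPPY":
        import snappy  # provided by the python-snappy package
        return snappy.decompress(data)
    raise ValueError(f"unsupported compression codec: {codec}")
```

Importing `snappy` lazily means files that do not use Snappy can still be read when the package is missing.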

Usage

python main.py input.parquet output.jsonl

Arguments

  • input.parquet: Path to the input Parquet file
  • output.jsonl: Path where the output JSONL file will be written
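For readers adapting the script, two positional arguments like these can be parsed with `argparse`. This is only a sketch of one possible approach; how main.py actually reads its arguments is not documented here, and the names `parse_args`, `input`, and `output` are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output paths from the command line."""
    parser = argparse.ArgumentParser(description="Convert a Parquet file to JSONL.")
    parser.add_argument("input", help="path to the input Parquet file")
    parser.add_argument("output", help="path where the JSONL output is written")
    return parser.parse_args(argv)
```

Calling `parse_args(["iris.parquet", "iris.jsonl"])` yields an object whose `input` and `output` attributes hold the two paths.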

Example

python main.py iris.parquet iris.jsonl

This converts iris.parquet to JSONL and writes the result to iris.jsonl, with each line holding a JSON object that represents one row of data.
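The row-per-line output can be sketched as follows, assuming the decoded data is held as a mapping from column name to a list of values. Decoding BYTE_ARRAY values as UTF-8 text is an assumption made here so that `json` can serialize them; the helper names are illustrative, and main.py may treat bytes differently.

```python
import json


def _jsonable(value):
    """Convert raw bytes to text so json can serialize them."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


def write_jsonl(columns, out):
    """Write column-oriented data to `out` as one JSON object per row."""
    names = list(columns)
    # zip the columns together so each iteration yields one row
    for row in zip(*(columns[name] for name in names)):
        record = {name: _jsonable(value) for name, value in zip(names, row)}
        out.write(json.dumps(record) + "\n")
```

With `columns = {"id": [1, 2], "name": [b"a", b"b"]}`, the output consists of the two lines `{"id": 1, "name": "a"}` and `{"id": 2, "name": "b"}`.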

Limitations

  • Reads only DataPageV1 pages; DataPageV2 is not supported
  • Supports only the data types listed under Features
  • Assumes a flat schema; nested and repeated fields are not handled
  • Supports only the PLAIN and RLE_DICTIONARY encodings
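For context on the RLE_DICTIONARY encoding, the sketch below decodes dictionary indices using the RLE/bit-packing hybrid described in the Parquet format: one byte giving the bit width, followed by runs whose varint header marks them as either bit-packed (low bit 1) or run-length encoded (low bit 0). It assumes the buffer starts at the encoded values, after any definition levels, and the function names are illustrative rather than taken from main.py.

```python
def _read_varint(buf, pos):
    """Read an unsigned LEB128 varint; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_dictionary_indices(buf, count):
    """Decode `count` dictionary indices from an RLE_DICTIONARY value section."""
    bit_width = buf[0]  # first byte holds the index bit width
    pos = 1
    byte_width = (bit_width + 7) // 8
    mask = (1 << bit_width) - 1
    out = []
    while len(out) < count:
        header, pos = _read_varint(buf, pos)
        if header & 1:  # bit-packed run: (header >> 1) groups of 8 values
            n_bytes = (header >> 1) * bit_width
            bits = int.from_bytes(buf[pos:pos + n_bytes], "little")
            pos += n_bytes
            for i in range((header >> 1) * 8):
                out.append((bits >> (i * bit_width)) & mask)
        else:  # RLE run: one value repeated (header >> 1) times
            value = int.from_bytes(buf[pos:pos + byte_width], "little")
            pos += byte_width
            out.extend([value] * (header >> 1))
    return out[:count]  # a bit-packed run may carry padding values
```

The decoded indices are then looked up in the values of the dictionary page, for example `[dictionary[i] for i in decode_dictionary_indices(page_body, num_values)]`, where `dictionary`, `page_body`, and `num_values` are placeholders.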

Project Structure

pyparquet/
├── main.py          # Main implementation
├── parquet.thrift   # Thrift definition file
└── src/
    └── thrift/      # Generated Thrift code

License

This project is open source and available under the MIT License.
