This is a monorepo containing both the frontend and backend of the Ivory project.
- Docker
- Docker Compose
- Node.js 18+ (for local development)
- Python 3.11+ (for local development)
- Clone the repository:
git clone https://github.com/yourusername/ivory.git
cd ivory
- Start the services:
docker-compose up -d
- Access the applications:
- Frontend: http://localhost:3000
- Backend: http://localhost:8000/api/v1
The application uses DuckDB for data storage, which is automatically initialized when the container starts.
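Once the containers are up, a quick way to confirm the backend is reachable is to hit the version endpoint documented further below. A minimal sketch, assuming the `requests` package is installed; the exact shape of the response body is not documented here, so it is simply printed:

```python
import requests

BASE_URL = "http://localhost:8000/api/v1"

# Ask the backend for its version; a 200 response means the API is up.
resp = requests.get(f"{BASE_URL}/meta/version", timeout=5)
resp.raise_for_status()
print("Backend is up:", resp.json())
```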
- Navigate to the web directory:
cd web
- Install dependencies:
yarn install
- Start the development server:
yarn dev
- Navigate to the backend directory:
cd src
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Initialize the database:
python init_db.py
- Start the server:
python main.py
web/
- Next.js frontend application
src/
- Python backend application
- Uses DuckDB for data storage
- Prepared for future PostgreSQL integration for authentication
docker-compose.yml
- Docker Compose configuration
The project currently uses DuckDB for data storage, with the database file located at src/datasets.db. The database is persisted using Docker volumes.
The project is designed to support PostgreSQL integration in the future, particularly for:
- User authentication
- Session management
- Additional data storage needs
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This is the canonical way to query datasets. The backend reads Parquet files directly via DuckDB using a safe JSON query spec — no SQL from the frontend, no row-copying into relational tables.
- Enable persistent DuckDB cache (optional): set IVORY_USE_TABLE_INDEX=1
- API version endpoint: GET /api/v1/meta/version
- Preview dataset schema: GET /api/v1/query/preview/{dataset}
- Run a query: POST /api/v1/query/run
Example payload:
{
"dataset": "my_dataset",
"select": ["text", "_hf_split"],
"where": [{"column": "text", "op": "contains", "value": "example"}],
"order_by": {"column": "_hf_split", "direction": "asc"},
"limit": 50
}
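For reference, here is a minimal Python sketch that previews a dataset's schema and then runs the payload above against the query endpoint. It assumes the backend from the quick start is running on localhost and that a dataset named `my_dataset` has already been ingested:

```python
import requests

BASE_URL = "http://localhost:8000/api/v1"
DATASET = "my_dataset"  # assumed to already be ingested

# Inspect the dataset's schema before querying.
preview = requests.get(f"{BASE_URL}/query/preview/{DATASET}", timeout=10)
preview.raise_for_status()
print("schema preview:", preview.json())

# Run the JSON query spec shown above: filter, sort, and limit rows.
payload = {
    "dataset": DATASET,
    "select": ["text", "_hf_split"],
    "where": [{"column": "text", "op": "contains", "value": "example"}],
    "order_by": {"column": "_hf_split", "direction": "asc"},
    "limit": 50,
}
result = requests.post(f"{BASE_URL}/query/run", json=payload, timeout=30)
result.raise_for_status()
print("query result:", result.json())
```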
Labels are managed per dataset/label name and are stored in SQLite files under datasets/<dataset>/labels/.
- Upsert by text: POST /api/v1/query/label/upsert
- Upsert by row id (preferred): POST /api/v1/query/label/upsert_row
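A rough sketch of upserting a label by row id is shown below. The endpoint path comes from this README, but the request fields (`dataset`, `label_name`, `row_id`, `value`) are assumptions for illustration only; check the request models in src/ for the actual schema:

```python
import requests

BASE_URL = "http://localhost:8000/api/v1"

# Hypothetical payload: field names are illustrative, not the confirmed schema.
label_payload = {
    "dataset": "my_dataset",
    "label_name": "quality",
    "row_id": 12345,   # stable __row_id of the row being labeled
    "value": "good",
}
resp = requests.post(f"{BASE_URL}/query/label/upsert_row", json=label_payload, timeout=10)
resp.raise_for_status()
```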
Notes:
- New ingests include a stable __row_id column in Parquet for consistent joins and label/embedding alignment (see the sketch after these notes).
- Legacy ORM-backed endpoints will be deprecated; use the JSON query API for reads.
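To illustrate why the stable __row_id matters: it lets labels stored in SQLite be aligned back to the original Parquet rows without copying data into relational tables. The following is only a sketch; the Parquet location, label file name, and the `labels(row_id, value)` table are assumptions, and the real layout lives in the backend code:

```python
import sqlite3
import duckdb

DATASET_DIR = "datasets/my_dataset"            # assumed on-disk layout
LABEL_DB = f"{DATASET_DIR}/labels/quality.db"  # hypothetical label file name

# Load labels keyed by row id from the SQLite label store.
# Table and column names here are illustrative, not the real schema.
with sqlite3.connect(LABEL_DB) as db:
    labels = dict(db.execute("SELECT row_id, value FROM labels"))

# Read the stable __row_id column straight from the Parquet files with DuckDB
# and attach the matching label (if any) to each row.
rows = duckdb.sql(
    f"SELECT __row_id, text FROM read_parquet('{DATASET_DIR}/*.parquet')"
).fetchall()
labeled = [(row_id, text, labels.get(row_id)) for row_id, text in rows]
```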
If you have existing Parquet files without __row_id, run:
python tools/backfill_row_ids.py --root datasets
Set IVORY_DISABLE_ORM_READS=1 on the backend to return 410 for ORM-backed read endpoints.
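With that flag set, any client still calling a legacy ORM read path should treat a 410 response as the signal to switch to the JSON query API. A small sketch; the legacy path shown is hypothetical, so substitute whichever ORM-backed endpoint the client was actually using:

```python
import requests

BASE_URL = "http://localhost:8000/api/v1"

# Hypothetical legacy ORM-backed read path, shown only to illustrate the fallback.
legacy = requests.get(f"{BASE_URL}/datasets/my_dataset/rows", timeout=10)
if legacy.status_code == 410:
    # ORM reads are disabled on this backend: use the JSON query API instead.
    result = requests.post(
        f"{BASE_URL}/query/run",
        json={"dataset": "my_dataset", "select": ["text"], "limit": 50},
        timeout=30,
    )
    result.raise_for_status()
```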