An open-source Document Parser application that leverages:
- FastAPI for the backend API
- llama-index-core, Kotaemon, and Markitdown for document understanding and parsing
- Celery for asynchronous task processing
- A fully Dockerized environment with development and production modes
Built for intelligent document parsing at scale with modular design and API-first architecture.
- Installation
- Environment Setup
- Running the Application
- Run All Services with Docker
- Makefile Commands
- Contributing
- License
git clone https://github.com/maowragvn-ai/document-task-parser.git
cd document-task-parser
- Unix/macOS:
python3 -m venv venv
source venv/bin/activate
- Windows:
python -m venv venv
.\venv\Scripts\activate
pip install -r requirements.txt
- Copy the example environment file:
cp .env.example .env
- Update .env with your API keys and database credentials:
GOOGLE_GENAI_USE_VERTEXAI=FALSE
GOOGLE_API_KEY=
TAVILY_API_KEY=
## Local dev DB
DB_USER=myuser
DB_PASSWORD=1
DB_HOST=postgres
DB_PORT=5432
DB_NAME=dvp_database
## Redis broker URL
CELERY_BROKER_URL=redis://redis:6379/0
Start the Celery worker:
celery -A src.celery_worker worker --loglevel=info
- PostgreSQL: localhost:5432
- Redis: localhost:6379
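The DB_* values and CELERY_BROKER_URL from .env are the kind of settings that usually get combined into a single connection URL. How this project actually loads them is not shown here, so the following is only a minimal sketch: `build_database_url` and `broker_location` are hypothetical helper names, and the `postgresql://user:password@host:port/name` form is the common connection-URL style rather than something confirmed by the source. The `host_override` parameter is also an assumption, added because DB_HOST=postgres looks like a Docker service name while the endpoints listed above use localhost.

```python
import os
from urllib.parse import quote, urlsplit

# Variable names taken from the .env example; everything else is illustrative.
REQUIRED_DB_VARS = ("DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT", "DB_NAME")


def build_database_url(env=None, host_override=None):
    """Build a PostgreSQL URL from DB_* settings (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_DB_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    host = host_override or env["DB_HOST"]
    return f"postgresql://{user}:{password}@{host}:{env['DB_PORT']}/{env['DB_NAME']}"


def broker_location(broker_url):
    """Split a Redis broker URL into (host, port, database index as a string)."""
    parts = urlsplit(broker_url)
    return parts.hostname, parts.port, parts.path.lstrip("/")
```

Calling `build_database_url(host_override="localhost")` would target the locally published port instead of the Docker service name, assuming the variables are already present in the process environment.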
Tip: Use init_db.sh to initialize the database if needed.
./init_db.sh
Start the FastAPI backend:
uvicorn app_fastapi:app --host 0.0.0.0 --port 8000 --reload
- API Docs: http://localhost:8000/docs
Start the Streamlit UI:
streamlit run app_streamlit.py --server.port=8501 --server.address=0.0.0.0
Create a database migration:
alembic revision --autogenerate -m "message"
alembic upgrade head
Build and start the development stack:
docker compose -f docker-compose.dev.yaml build
docker compose -f docker-compose.dev.yaml up
- FastAPI backend: http://localhost:8000
- Streamlit UI: http://localhost:8501
- PostgreSQL: localhost:5432
- Redis: localhost:6379
- Healthcheck for Backend: http://localhost:8000/health
- Healthcheck for all services:
make healthcheck-dev
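What make healthcheck-dev runs internally is not shown here. As a rough illustration of the backend check only, the sketch below polls the /health URL listed above until it answers with a 2xx status. The function names `is_healthy` and `wait_until_healthy`, the retry counts, and the choice to bypass any configured HTTP proxy are all assumptions made for this example.

```python
import http.client
import time
import urllib.error
import urllib.request

# Ignore proxy settings from the environment: the target is a local service.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_healthy(url, timeout=2.0):
    """Return True if a GET request to `url` answers with a 2xx status."""
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx status, or a malformed reply.
        return False


def wait_until_healthy(url, attempts=10, delay=1.0):
    """Poll `url` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if is_healthy(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_until_healthy("http://localhost:8000/health")` could be called right after the containers start, since the backend may need a few seconds before it accepts requests.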
make db-migrate-dev
make db-upgrade-dev
Stop the containers:
docker-compose down
Build and start the production stack:
docker compose -f docker-compose.prod.yaml build
docker compose -f docker-compose.prod.yaml up -d
- Healthcheck for Backend: http://localhost:8000/health
- Healthcheck for all services:
make healthcheck-prod
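PostgreSQL and Redis do not speak HTTP, so a check for those services has to work at a lower level than a URL request. The sketch below simply tests whether a TCP connection can be opened. It assumes the default ports 5432 and 6379 are published on localhost, which may not hold for the production compose file, and the names `port_is_open`, `check_services`, and `SERVICES` are hypothetical. It is not a description of what make healthcheck-prod does.

```python
import socket

# Assumed endpoints: the default PostgreSQL and Redis ports on the local host.
SERVICES = {
    "PostgreSQL": ("localhost", 5432),
    "Redis": ("localhost", 6379),
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_services(services=None):
    """Map each service name to True or False depending on reachability."""
    services = SERVICES if services is None else services
    return {name: port_is_open(host, port) for name, (host, port) in services.items()}
```

An open port only shows that something is listening. It does not confirm that the database accepts credentials or that Redis responds to commands, so treat it as a first-pass check.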
make db-migrate-prod
make db-upgrade-prod
Development:
make build-dev # Build dev images
make up-dev # Start dev containers
make deploy-dev # Rebuild and start containers
make healthcheck-dev # Check service health
make db-migrate-dev # Create DB migration with timestamp
make db-upgrade-dev # Apply latest DB migration
make bash-dev # Open shell inside backend container
make logs-dev # Show logs
Production:
make build-prod
make up-prod
make deploy-prod
make healthcheck-prod
make db-migrate-prod
make db-upgrade-prod
make bash-prod
make logs-prod
Planned improvements:
- Amazon S3 storage: Integrate with Amazon S3 for document storage.
- Streamlit UI: Enhance the Streamlit UI for a more user-friendly experience.
- Testing: Add comprehensive testing for the application.
- CI/CD: Set up CI/CD pipelines for automated testing and deployment.
- Monitoring: Implement monitoring and logging for better observability.
- Security: Enhance security measures, including authentication and authorization.
We welcome contributions of all kinds! Feel free to fork the repo, submit issues, or open a pull request.
This project is licensed under the MIT License.