📄 document-task-parser

An open-source Document Parser application that leverages:

  • 🚀 FastAPI for the backend API
  • 🧠 llama-index-core, Kotaemon, Markitdown for document understanding and parsing
  • Celery for asynchronous task processing
  • 🔧 Fully Dockerized environment with development and production modes

Built for intelligent document parsing at scale with modular design and API-first architecture.



⚙️ Installation

1. Clone the repository

git clone https://github.com/maowragvn-ai/document-task-parser.git
cd document-task-parser

2. (Optional) Create a virtual environment

  • Unix/macOS:

    python3 -m venv venv
    source venv/bin/activate
  • Windows:

    python -m venv venv
    .\venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

🔐 Environment Setup

  1. Copy the example environment file:
cp .env.example .env
  2. Update .env with your API keys and database credentials:
GOOGLE_GENAI_USE_VERTEXAI=FALSE
GOOGLE_API_KEY=
TAVILY_API_KEY=

## Local dev DB
DB_USER=myuser
DB_PASSWORD=1
DB_HOST=postgres
DB_PORT=5432
DB_NAME=dvp_database

## Redis broker URL
CELERY_BROKER_URL=redis://redis:6379/0
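
Applications usually combine the DB_* values above into a single connection URL. The sketch below shows one way to do that using only the standard library. The helper name `build_database_url` and the `postgresql://` scheme are assumptions for illustration; the project's own code may assemble its URL differently.

```python
import os
from urllib.parse import quote


def build_database_url(env=os.environ):
    """Assemble a PostgreSQL URL from the DB_* variables defined in .env.

    Hypothetical helper, not part of this repository.
    """
    user = quote(env["DB_USER"], safe="")
    # Percent-encode the password so characters such as "@" or "/" stay valid in a URL.
    password = quote(env["DB_PASSWORD"], safe="")
    host = env["DB_HOST"]
    port = env.get("DB_PORT", "5432")  # 5432 is the PostgreSQL default port
    name = env["DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


if __name__ == "__main__":
    example = {
        "DB_USER": "myuser",
        "DB_PASSWORD": "1",
        "DB_HOST": "postgres",
        "DB_PORT": "5432",
        "DB_NAME": "dvp_database",
    }
    print(build_database_url(example))
```

Note that `DB_HOST=postgres` and the `redis` host in the broker URL look like Docker Compose service names. If you run the services directly on your machine instead, you will probably need `localhost` in both places.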

🚀 Running the Application

1. Start Celery Worker

celery -A src.celery_worker worker --loglevel=info

2. Ensure PostgreSQL and Redis are running

💡 Use init_db.sh to initialize the database if needed.

./init_db.sh
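
The Celery worker and the backend both depend on PostgreSQL and Redis, so it can help to confirm that both accept TCP connections before going further. The following is a minimal sketch that reads the same variables as .env. The function names are hypothetical, and the `localhost` fallbacks are assumptions for running outside Docker.

```python
import os
import socket
from urllib.parse import urlparse


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False


def services_from_env(env=os.environ):
    """Derive (name, host, port) tuples from the variables in .env."""
    broker = urlparse(env.get("CELERY_BROKER_URL", "redis://localhost:6379/0"))
    return [
        ("PostgreSQL", env.get("DB_HOST", "localhost"), int(env.get("DB_PORT", "5432"))),
        ("Redis", broker.hostname, broker.port or 6379),
    ]


if __name__ == "__main__":
    for name, host, port in services_from_env():
        status = "reachable" if is_port_open(host, port) else "NOT reachable"
        print(f"{name} at {host}:{port} is {status}")
```

An open port only shows that something is listening. It does not verify credentials or that the database has been initialized.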

3. Start the FastAPI backend

uvicorn app_fastapi:app --host 0.0.0.0 --port 8000 --reload
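
Once uvicorn is running, a quick request can confirm that the backend is answering. The sketch below assumes the app keeps FastAPI's default `/openapi.json` route enabled and listens on port 8000 as in the command above. The helper name `backend_is_up` is hypothetical and not part of this repository.

```python
import urllib.request


def backend_is_up(base_url="http://localhost:8000", timeout=3.0):
    """Return True if the app serves its OpenAPI schema with HTTP 200.

    Assumes FastAPI's default /openapi.json route has not been disabled.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/openapi.json", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError both derive from OSError, as do socket timeouts.
        return False


if __name__ == "__main__":
    print("backend up:", backend_is_up())
```

With default settings, FastAPI also serves interactive API docs at http://localhost:8000/docs.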

4. (Optional) Start Streamlit frontend (currently not supported)

streamlit run app_streamlit.py --server.port=8501 --server.address=0.0.0.0

5. Run DB migrations

alembic revision --autogenerate -m "message"
alembic upgrade head

🐳 Run all services with Docker

1. Development mode

Build images

docker compose -f docker-compose.dev.yaml build

Start containers

docker compose -f docker-compose.dev.yaml up

DB migrations

make db-migrate-dev
make db-upgrade-dev

Stop containers

docker compose -f docker-compose.dev.yaml down

2. Production mode

Build images

docker compose -f docker-compose.prod.yaml build

Start containers

docker compose -f docker-compose.prod.yaml up -d

DB migrations

make db-migrate-prod
make db-upgrade-prod

🛠 Makefile Commands

Development

make build-dev        # Build dev images
make up-dev           # Start dev containers
make deploy-dev       # Rebuild and start containers
make healthcheck-dev  # Check service health
make db-migrate-dev   # Create DB migration with timestamp
make db-upgrade-dev   # Apply latest DB migration
make bash-dev         # Open shell inside backend container
make logs-dev         # Show logs

Production

make build-prod
make up-prod
make deploy-prod
make healthcheck-prod
make db-migrate-prod
make db-upgrade-prod
make bash-prod
make logs-prod

Future Improvements

  • Amazon S3 storage: Integrate with Amazon S3 for document storage.
  • Streamlit UI: Enhance the Streamlit UI for a more user-friendly experience.
  • Testing: Add comprehensive testing for the application.
  • CI/CD: Set up CI/CD pipelines for automated testing and deployment.
  • Monitoring: Implement monitoring and logging for better observability.
  • Security: Enhance security measures, including authentication and authorization.

🤝 Contributing

We welcome contributions of all kinds! Feel free to fork the repo, submit issues, or open a pull request.


📄 License

This project is licensed under the MIT License.
