Skip to content

KiranFiles/onboardIQ

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OnboardIQ

AI-Powered New Hire Knowledge Accelerator built on Azure OpenAI + pgvector.

OnboardIQ ingests your company's M365 content such as SharePoint documents, Teams meeting transcripts, and OneNote notebooks, then answers new-hire questions with grounded, source-cited responses.


Project Structure

onboardiq/
|-- app/                  # FastAPI backend
|   |-- api/routes.py     # REST endpoints
|   |-- config/           # pydantic-settings
|   |-- core/             # DB engine, ORM models
|   |-- ingestion/        # chunker, embedder, Graph API client
|   |-- migrations/       # Alembic + pgvector indexes/constraints
|   `-- query/            # retriever, answer generator
`-- frontend/             # React + Vite + Tailwind UI
    `-- src/
        |-- components/
        |   |-- ChatView.tsx
        |   |-- IngestView.tsx
        |   `-- Sidebar.tsx
        |-- api.ts
        `-- App.tsx

Backend State

Implemented

Area File(s) Status
FastAPI app + lifespan app/main.py Complete
Settings via pydantic-settings app/config/settings.py Complete
Async DB engine manager app/core/database.py Complete
ORM model (DocumentChunk) app/core/orm.py Complete
Alembic migrations (pgvector + unique chunk constraint) app/migrations/ Complete
Text chunker (token-based with tiktoken) app/ingestion/chunker.py Complete
Embedding service (retry + upsert support) app/ingestion/embedder.py Complete
Graph API client (OAuth2 token flow + demo fallback) app/ingestion/graph_client.py Partial
pgvector cosine retriever app/query/retriever.py Complete
GPT-4o answer generator (JSON confidence output) app/query/generator.py Complete
REST API routes (/api/ingest, /api/query, /api/health) app/api/routes.py Complete

Remaining Gaps

  1. MS Graph content ingestion (graph_client.py) OAuth2 token retrieval is implemented, but SharePoint file enumeration, content download, and Teams transcript retrieval still fall back to synthetic demo data.

  2. Background ingestion queue (routes.py /ingest) Ingestion now runs in-process for local development. For production-scale workloads, move it to ARQ, Celery, or Azure Queue Storage.

  3. Prompt tuning (generator.py _SYSTEM_PROMPT) The generator now returns structured JSON confidence, but the prompt can still be improved with few-shot examples for stronger grounding quality.


Setup

Prerequisites

  • Python 3.11+
  • uv package manager
  • Node.js 18+ and pnpm
  • PostgreSQL 15+ with the pgvector extension enabled
  • Azure OpenAI resource with text-embedding-3-large and gpt-4o deployments

The easiest way to run PostgreSQL locally is via Docker:

docker run -d \
  --name onboardiq-pg \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=onboardiq \
  -p 5432:5432 \
  pgvector/pgvector:pg16

1. Install backend dependencies

Run this from the project root:

uv sync

2. Configure environment

cp .env.example .env
# Edit .env with your credentials

Key variables:

Variable Example
DATABASE_URL postgresql://postgres:postgres@localhost:5432/onboardiq
LLM_PROVIDER azure or openai or gemini
AZURE_OPENAI_ENDPOINT https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY your key (if using azure provider)
AZURE_OPENAI_EMBEDDING_DEPLOYMENT embedding deployment name (default: text-embedding-3-large)
OPENAI_API_KEY your key (if using openai provider)
GEMINI_API_KEY your key (if using gemini provider)
AZURE_TENANT_ID your Azure tenant id (for Graph token flow)
AZURE_CLIENT_ID your Azure app client id
AZURE_CLIENT_SECRET your Azure app client secret

3. Run database migrations

uv run alembic -c app/alembic.ini upgrade head

This creates the tables, vector support, and the unique constraint used for chunk upserts.

4. Start the API server

uv run uvicorn app.main:app --reload

API: http://localhost:8000

Docs: http://localhost:8000/docs

Health: http://localhost:8000/api/health

5. Start the frontend

cd frontend
pnpm install
pnpm dev

Frontend: http://localhost:3000

The Vite dev server proxies /api/* requests to http://localhost:8000.


Quick Test

  1. Start PostgreSQL and make sure .env contains real values, not placeholders. At minimum, DATABASE_URL and the credentials for your selected LLM_PROVIDER must be valid.

  2. Install dependencies:

uv sync
  1. Apply migrations:
uv run alembic -c app/alembic.ini upgrade head
  1. Start the API:
uv run uvicorn app.main:app --reload
  1. Verify the server:
  • http://localhost:8000/
  • http://localhost:8000/api/health
  • http://localhost:8000/docs
  1. Test ingestion in Swagger with POST /api/ingest:
{
  "source_type": "sharepoint",
  "site_id": "demo-site"
}
  1. Test querying in Swagger with POST /api/query:
{
  "question": "When is the benefits enrollment deadline for new hires?"
}

Troubleshooting

  • If uv run alembic upgrade head fails with No 'script_location' key found in configuration, use uv run alembic -c app/alembic.ini upgrade head. This repo stores alembic.ini inside app/.

  • If http://localhost:8000/health returns 404, use http://localhost:8000/api/health. The API router is mounted under /api.

  • If POST /api/ingest returns 500, check the terminal running uvicorn. The most common causes are:

    • DATABASE_URL still points to placeholder values
    • Azure/OpenAI credentials are still placeholders
    • database migrations were not applied

Architecture

Ingestion Pipeline

MS Graph API / demo content
        |
        v
graph_client.py         (token flow + source loading)
        |
        v
chunker.py              (split into token-bounded chunks with overlap)
        |
        v
embedder.py             (embeddings + pgvector upsert)
        |
        v
PostgreSQL + pgvector   (store chunks + vector search index)

Query Pipeline

User question (POST /api/query)
        |
        v
embedder.embed_text()   (embed question)
        |
        v
retriever.search()      (cosine ANN search)
        |
        v
generator.generate()    (grounded answer + structured confidence)
        |
        v
QueryResponse           (answer, sources, confidence)

Next Steps

  1. Implement SharePoint file enumeration and content extraction via Microsoft Graph.
  2. Implement Teams transcript retrieval via Microsoft Graph.
  3. Move /ingest to a background task queue such as ARQ.
  4. Add authentication middleware (Azure AD JWT validation) for production use.
  5. Write integration tests against a local pgvector Docker container.
  6. Set up Azure Container Apps or AKS deployment with Helm charts.

About

AI-Powered New Hire Knowledge Accelerator built on Azure OpenAI + pgvector. OnboardIQ ingests your company's M365 content such as SharePoint documents, Teams meeting transcripts, and OneNote notebooks, then answers new-hire questions with grounded, source-cited responses.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors