This repository was archived by the owner on Nov 25, 2025. It is now read-only.

Conversation

@BurhanCantCode

Summary

Adds a new POST /extract/pdf endpoint to handle PDF text extraction server-side. This lets the React SDK offload PDF processing to the backend, reducing the client bundle size by ~2MB (by removing the pdfjs-dist dependency).

Changes

  • ✅ Add pdf-parse dependency (lightweight, 25KB vs 2MB pdfjs-dist)
  • ✅ Create ExtractPdfDto and ExtractPdfResponseDto following NestJS patterns
  • ✅ Add extractPdfText endpoint in ExtractorController with file validation
    • Validates file type (application/pdf)
    • Validates file size (max 50MB)
    • Returns extracted text and page count
  • ✅ Add unit tests for PDF file validation (size/type checks)
  • ✅ Follow existing audio.controller.ts pattern for consistency
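The validation rules described above (type check, 50MB cap) could be sketched roughly as follows. This is an illustrative standalone function, not the PR's actual controller code; the names `validatePdfUpload`, `UploadedFile`, and `MAX_PDF_BYTES` are assumptions for the sketch.

```typescript
// Hypothetical sketch of the upload validation described in the PR.
// Names and shapes are illustrative, not the actual ExtractorController code.
const MAX_PDF_BYTES = 50 * 1024 * 1024; // 50MB limit from the PR

interface UploadedFile {
  mimetype: string; // e.g. "application/pdf"
  size: number;     // bytes
}

// Returns an error message, or null when the file passes both checks.
function validatePdfUpload(file: UploadedFile): string | null {
  if (file.mimetype !== "application/pdf") {
    return "Only application/pdf files are accepted";
  }
  if (file.size > MAX_PDF_BYTES) {
    return "File exceeds the 50MB limit";
  }
  return null;
}
```

In a NestJS controller this kind of check would typically live in a file validation pipe or interceptor rather than a bare function.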

API Endpoint

```
POST /extract/pdf
Content-Type: multipart/form-data

Request:
  file: PDF file (max 50MB)

Response:
  { "text": "extracted text content...", "pages": 5 }
```

Testing

  • ✅ 3 unit tests added (file size validation, file type validation, valid file acceptance)
  • ✅ All tests passing
  • ✅ Follows NestJS best practices and project conventions

Related

This follows:
✅ Conventional Commits format (feat(api):)
✅ Clear summary of changes
✅ Technical details about the implementation
✅ Testing information
✅ Links to related PRs

BurhanCantCode and others added 3 commits October 26, 2025 19:02
Add POST /extract/pdf endpoint to handle PDF text extraction on the server.
This allows the React SDK to offload PDF processing, reducing client bundle
size by ~2MB (removal of pdfjs-dist dependency).

Changes:
- Add pdf-parse dependency (lightweight, 25KB)
- Create ExtractPdfDto and ExtractPdfResponseDto following NestJS patterns
- Add extractPdfText endpoint in ExtractorController with file validation
- Add unit tests for PDF file validation (size/type checks)
- Follow existing audio.controller.ts pattern for consistency

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Resolve package-lock.json conflicts by regenerating from package.json.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

feat(api): add server-side PDF text extraction endpoint
@vercel

vercel bot commented Oct 30, 2025

@BurhanCantCode is attempting to deploy a commit to the tambo ai Team on Vercel.

A member of the Team first needs to authorize it.

@michaelmagan
Collaborator

Hey @BurhanCantCode! Thanks for the PR. I reviewed the server-side PDF extraction implementation. After discussing internally, we do need to update our Tambo Cloud API, but we don't need any new API routes. Instead, we need to extend our current image handling to support more file types, using S3 instead of storing in the database.

Many of the frontier models support file inputs directly. File types to support (from OpenAI, Anthropic, and Gemini):

Common across all three providers:

  • Documents: PDF, TXT, HTML, JSON, MD
  • Images: JPG/JPEG, PNG, GIF, WEBP
  • Spreadsheets: CSV

Additional formats:

  • DOCX, PPTX (OpenAI & Anthropic)
  • RTF, ODT (Anthropic & Gemini)
  • TSV, XLSX (Gemini)
  • Code files (.py, .js, .java, .cpp, etc.)
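The per-provider support sets above suggest a simple allowlist check. This is a rough sketch under assumptions: the sets below are partial (they omit the code-file extensions, for instance), and `SUPPORTED`/`isSupported` are hypothetical names, not an existing Tambo API.

```typescript
// Illustrative (partial) allowlists per provider, based on the formats
// listed above. The exact sets and names are assumptions for this sketch.
const SUPPORTED: Record<string, Set<string>> = {
  openai:    new Set(["pdf", "txt", "html", "json", "md", "csv",
                      "jpg", "jpeg", "png", "gif", "webp", "docx", "pptx"]),
  anthropic: new Set(["pdf", "txt", "html", "json", "md", "csv",
                      "jpg", "jpeg", "png", "gif", "webp", "docx", "pptx",
                      "rtf", "odt"]),
  gemini:    new Set(["pdf", "txt", "html", "json", "md", "csv",
                      "jpg", "jpeg", "png", "gif", "webp", "rtf", "odt",
                      "tsv", "xlsx"]),
};

// Check a filename's extension against the provider's allowlist.
function isSupported(provider: string, filename: string): boolean {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  return SUPPORTED[provider]?.has(ext) ?? false;
}
```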

Implementation approach:

  • Frontend: Update the component to start uploading files to S3 immediately when attached
  • Upload handling: Ensure messages wait for uploads to complete before sending
  • Message content: Pass the message type and S3 location in the message content
  • API side: Update existing logic to fetch from S3 and pass to the LLM provider (similar to current image handling) - no new endpoints needed
  • Configuration: Allow engineers to specify which file types they want to support (since different models support different sets)
  • Documentation: Update docs to explain supported file types per model

No need for pdf-parse or any extraction logic - the LLM providers handle that themselves.

Here are the relevant docs for supported file types:
OpenAI File Search Supported Files
Anthropic Files API
Gemini Document Processing

Could you refactor this to follow the S3 upload pattern instead of adding a new extraction endpoint?

