
Improve PDF and Office translation robustness #4

Merged
bogerman1 merged 18 commits into main from cleanup/public-repo-dashboard-docs
May 30, 2026


bogerman1 (Owner) commented May 29, 2026

Summary

This PR brings the current local DocHarbor translation robustness work into a reviewable branch.

  • adds a built-in PyMuPDF PDF gate fallback and routes digital-text PDFs through the Acrobat bridge DOCX path when available
  • improves Office image translation reliability: non-blocking image preflight in auto mode, retry diagnostics, connection headers, JPEG/PNG output normalization, and OCR-gate padding fallback for tightly cropped text images
  • adds Anthropic-compatible thinking budget controls while keeping thinking blocks out of returned translation text
  • translates PDF output filenames through the delivery title translation path
  • removes a temporary DOCX investigation document that contained local file-system paths and should not remain in the public repo
  • adds focused tests for PDF gate routing, Acrobat bridge fallback, image OCR gate behavior, image provider retries/normalization, output filename translation, and thinking-output stripping
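
As an illustration of the thinking-output stripping mentioned in the list above, here is a minimal sketch, not the code from this PR. It assumes the provider response has already been parsed into a list of content-block dicts in the Anthropic Messages style, where each block has a `type` such as `text` or `thinking`. The function name `extract_translation_text` and the sample data are hypothetical.

```python
from typing import Any, Iterable


def extract_translation_text(content_blocks: Iterable[dict[str, Any]]) -> str:
    """Join only the answer text, ignoring every non-text block.

    Keeping an allow-list of "text" blocks (rather than a deny-list of
    thinking blocks) means "thinking", "redacted_thinking", and any block
    type added later are all excluded from the translation.
    """
    parts = []
    for block in content_blocks:
        if block.get("type") == "text":
            parts.append(block.get("text", ""))
    return "".join(parts).strip()


# Hypothetical response content: one thinking block followed by two text blocks.
sample = [
    {"type": "thinking", "thinking": "The source is German; keep the product name."},
    {"type": "text", "text": "Hello, "},
    {"type": "text", "text": "world."},
]

print(extract_translation_text(sample))
```

An allow-list is used here so that an unexpected block type can never leak into the delivered translation; whether the PR takes the same approach is not stated in the description.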

Root Cause / Impact

Several production-like files exposed two weak spots: digital-text PDFs still fell back to the scan/MinerU path when the external inspector was unavailable, and small, tightly cropped Office image snippets containing source-language text were skipped by the OCR gate. The changes keep the existing conservative paths but add safer fallbacks and better logs, so the translation agent can continue without silently skipping important content.
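
The PDF half of that fix could look roughly like the sketch below. This is an assumption-heavy illustration, not the gate shipped in this PR: the thresholds, the route names, and the functions `page_char_counts`, `is_digital_text`, and `choose_pdf_route` are all hypothetical. The only real library calls are PyMuPDF's `fitz.open` and `Page.get_text`, imported lazily so the decision logic can be exercised without PyMuPDF installed.

```python
from typing import Sequence

# Assumed thresholds, chosen only for illustration.
MIN_CHARS_PER_PAGE = 50    # a page with fewer extractable chars counts as "no text"
MIN_TEXT_PAGE_RATIO = 0.8  # share of pages that must carry text


def page_char_counts(path: str) -> list[int]:
    """Count extractable characters per page with PyMuPDF."""
    import fitz  # PyMuPDF; lazy import keeps the rest of the sketch stdlib-only

    with fitz.open(path) as doc:
        return [len(page.get_text().strip()) for page in doc]


def is_digital_text(counts: Sequence[int]) -> bool:
    """Treat the PDF as digital text when most pages have a real text layer."""
    if not counts:
        return False
    text_pages = sum(1 for c in counts if c >= MIN_CHARS_PER_PAGE)
    return text_pages / len(counts) >= MIN_TEXT_PAGE_RATIO


def choose_pdf_route(counts: Sequence[int], bridge_available: bool) -> str:
    """Pick a route name (hypothetical labels) from the gate result."""
    if is_digital_text(counts) and bridge_available:
        return "acrobat_bridge_docx"
    # Everything else stays on whatever path the pipeline already used.
    return "existing_pdf_path"


print(choose_pdf_route([1200, 900, 40, 1100, 1000], bridge_available=True))
print(choose_pdf_route([0, 0, 12], bridge_available=True))
```

Separating the character counting from the routing decision keeps the decision testable with plain lists of integers, which matches the spirit of the focused gate-routing tests listed above.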

Validation

  • python -m pytest tests/test_image_text_gate.py tests/test_image_translation_workflow.py tests/test_pdf_gate.py tests/test_office_delivery.py::test_run_office_delivery_translations_impl_routes_pdf tests/test_office_delivery.py::test_run_office_delivery_translations_impl_routes_digital_pdf_through_acrobat_bridge tests/test_office_delivery.py::test_run_office_delivery_translations_impl_falls_back_when_acrobat_bridge_fails tests/test_pdf_delivery.py::test_translate_pdf_document_translates_output_file_title tests/test_providers.py::test_anthropic_provider_supports_translation_thinking_effort tests/test_providers.py::test_anthropic_provider_does_not_mix_thinking_into_text -q -> 34 passed
  • after merging current main: python -m pytest tests/test_docx_delivery.py tests/test_image_text_gate.py tests/test_image_translation_workflow.py tests/test_pdf_gate.py tests/test_office_delivery.py::test_run_office_delivery_translations_impl_routes_digital_pdf_through_acrobat_bridge tests/test_pdf_delivery.py::test_translate_pdf_document_translates_output_file_title tests/test_providers.py::test_anthropic_provider_does_not_mix_thinking_into_text -q -> 37 passed
  • sensitive-string scan on staged/current diff found no customer/test-project strings or real API keys

Notes

Two unrelated local script edits remain unstaged and are not part of this PR.

bogerman1 merged commit a495463 into main on May 30, 2026
2 checks passed