
thiagomarini/pdftotext

 
 


PDF to Text Lambda Function

An AWS Lambda function written in Node.js that extracts text from PDFs. It is configured to run behind API Gateway.

CircleCI License: MIT

How it works

The Lambda function receives a POST request containing the URL of a PDF, downloads the file to /tmp, and then runs the pdftotext binary bundled in the repository root to extract the PDF's text.

For local development, tests and pipelines you need to install poppler. On macOS, run brew install poppler.

How to use

You will need to install the AWS CLI, Serverless and Yarn on your machine.

Run serverless config to initialise the project.

Run yarn to install dependencies.

Use serverless.yaml to define your routes, handlers, environment variables and domains.

To invoke the function locally, run serverless invoke local -f pdfToText --log --path __tests__/components/sample-event.json
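The file passed via --path is a JSON document describing an API Gateway event. The contents of the repository's sample-event.json are not shown here, so the following is only a sketch of what a minimal event of that kind might look like. The helper name `buildSampleEvent`, the `url` body field, the `/` path and the example URL are all hypothetical.

```javascript
// Hypothetical helper: builds a minimal API Gateway proxy-style POST event.
function buildSampleEvent(pdfUrl) {
  return {
    httpMethod: 'POST',
    path: '/', // assumption: the real route is defined in serverless.yaml
    headers: { 'Content-Type': 'application/json' },
    isBase64Encoded: false,
    // API Gateway delivers the request body as a string, not as an object.
    body: JSON.stringify({ url: pdfUrl }),
  };
}

// Print JSON that could be saved to a file and passed to --path.
const sampleEvent = buildSampleEvent('https://example.com/sample.pdf');
console.log(JSON.stringify(sampleEvent, null, 2));
```

Note that the body is a JSON string nested inside the event, so a handler has to call JSON.parse on event.body before reading any fields from it.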

Deploy

serverless deploy --stage <stage>

Testing

It's configured to use jest.

A sample API Gateway event is in the __tests__ folder.

Run yarn test to run the test suite. It currently contains a single functional test.

Pipelines

CircleCI pipelines need AWS credentials in order to deploy. As mentioned in How it works, the Docker image also needs poppler; the image referenced in .circleci/config.yml already includes it.

About

AWS Lambda function built on Serverless that extracts text from a PDF URL
