aula-id/pdf-tools

pdfnano

Pull the plots, figures, and diagrams out of a PDF and save each one as its own PNG image.

Point it at a datasheet or a research paper, and you get a folder of cropped images: every graph, every figure, every embedded picture, as a separate file. It works on a regular computer, no graphics card needed.

It runs two ways:

  • Command line — type one command, get a folder of images.
  • Web page — run one command, open your browser, drag in a PDF, click a button.

This guide assumes you have never used a terminal before. Follow it line by line.


What you need first

  • A computer running Linux (this tool does not run on Windows or macOS).
  • Python 3.9 or newer. Most Linux systems already have it.

Everything else gets installed automatically in the steps below.

Open a terminal

Press Ctrl + Alt + T. A dark window opens where you type commands. That window is "the terminal." Every command in this guide is typed there, and you press Enter after each one.

Check that Python is installed

Type this and press Enter:

python3 --version

If you see something like Python 3.10.12, you are good. The number just needs to be 3.9 or higher.

If instead you see command not found, install Python. On Ubuntu, Debian, and other systems that use apt, run:

sudo apt update
sudo apt install -y python3 python3-venv python3-pip

It will ask for your password. Type it (you will not see the characters as you type, that is normal) and press Enter.


Install pdfnano from source

"From source" means you download the code and set it up yourself. Five steps.

Step 1 — Get the code

If you already have the project folder on your computer, skip to Step 2.

Otherwise, download it with git:

git clone https://github.com/aula-id/pdf-nano.git

If git is not installed, run sudo apt install -y git first, then run the clone command again.

Step 2 — Go into the project folder

cd pdf-nano

cd means "change directory" — it moves you into the folder. If your folder has a different name, use that name instead of pdf-nano.

Step 3 — Make a private workspace for Python

This creates a clean, isolated space so pdfnano's parts do not interfere with the rest of your system. It is called a "virtual environment."

python3 -m venv .venv

This makes a hidden folder named .venv. You only do this once.

Step 4 — Turn the workspace on

source .venv/bin/activate

After this, your terminal line starts with (.venv). That tag means the workspace is active.

Important: every time you open a new terminal to use pdfnano, you must run this source command again first. If pdfnano ever says "command not found," this is almost always the reason — you forgot to turn the workspace on.
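
If you are ever unsure whether the workspace is on, Python itself can tell you. The snippet below is a small sketch that is not part of pdfnano; the function name in_virtualenv is made up for this example. It relies on a standard Python feature: inside a virtual environment, sys.prefix and sys.base_prefix point to different folders.

```python
import sys


def in_virtualenv():
    """Return True when Python is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the system-wide Python.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("workspace is on" if in_virtualenv() else "workspace is off")
```

Save it as a file and run it with python3 to see which of the two messages you get.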

Step 5 — Install pdfnano

pip install -e .

This downloads the libraries pdfnano needs and sets up the pdfnano command. It takes a minute or two the first time. When it finishes, check that it worked:

pdfnano --version

You should see pdfnano 0.1.0. That is it — pdfnano is installed.


Using pdfnano (command line)

The basic command

pdfnano extract yourfile.pdf

Replace yourfile.pdf with the name of your PDF. If the PDF is in a different folder, write the full path, for example:

pdfnano extract /home/you/Downloads/datasheet.pdf

By default the images are saved into a new folder called graphs, created in the folder where you ran the command. Open that folder in your file browser to see the results.

Choose where the images go

pdfnano extract yourfile.pdf --out my_images

This saves into a folder named my_images instead.

Only certain pages

If you only want pages 3 and 4:

pdfnano extract yourfile.pdf --pages 3,4

A range works too — pages 3 through 6:

pdfnano extract yourfile.pdf --pages 3-6
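
To make the two page formats concrete, here is a rough sketch of how a page list like 3,4 or a range like 3-6 could be turned into page numbers. It is an illustration only, not pdfnano's actual code; parse_pages is a made-up name. The sketch also accepts a mix such as 1,3-4, which this guide does not confirm pdfnano supports.

```python
def parse_pages(spec):
    """Turn a page spec such as "3,4" or "3-6" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "3,,4"
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
            if start > end:
                raise ValueError(f"bad page range: {part!r}")
            # A range includes both of its end pages.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("3,4"))
print(parse_pages("3-6"))
```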

What gets extracted

Out of the box, pdfnano pulls everything visual it can find on each page:

  • Plots — graphs with axes (current vs. voltage, that kind of thing).
  • Figures — diagrams and drawings made of lines and shapes.
  • Images — pictures embedded in the PDF.

You do not need to turn anything on. It is all on by default.

Turning things off

If you only want the axis graphs and nothing else:

pdfnano extract yourfile.pdf --no-figures --no-images

  • --no-figures skips diagrams and drawings.
  • --no-images skips embedded pictures.
  • --no-images skips embedded pictures.
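
If you want to run the same command from a Python script, for example over many PDFs, one way is to build the command as a list and hand it to subprocess. The sketch below uses only the options described so far (--out, --pages, --no-figures, --no-images). The helper name build_extract_command is made up for this example and is not part of pdfnano.

```python
import shlex


def build_extract_command(pdf, out=None, pages=None, figures=True, images=True):
    """Assemble the argument list for a `pdfnano extract` call."""
    cmd = ["pdfnano", "extract", str(pdf)]
    if out is not None:
        cmd += ["--out", str(out)]
    if pages is not None:
        cmd += ["--pages", str(pages)]
    if not figures:
        cmd.append("--no-figures")
    if not images:
        cmd.append("--no-images")
    return cmd


cmd = build_extract_command("yourfile.pdf", pages="3-6", figures=False, images=False)
# Show the command exactly as you would type it in the terminal.
print(shlex.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True) while the workspace is on.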

How the files are named

For graphs that have a printed code under them (common on datasheets, like IT16804), the file is named after that code: IT16804.png.

pdfnano figures out the code pattern from the PDF on its own — you do not have to tell it anything. If a graph has no code, the file is named by its position on the page instead, like p3_r1c2.png (page 3, row 1, column 2). Figures and pictures are named like p3_fig1.png and p3_img1.png.

If you would rather always use position-based names and ignore codes:

pdfnano extract yourfile.pdf --figid-pattern ''
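
The naming rules above can be summed up in a few lines of Python. This is only a sketch of the scheme as described in this section, not pdfnano's own code; the function names graph_name and numbered_name are made up for illustration.

```python
def graph_name(page, row, col, code=None):
    """Name a graph after its printed code, or after its grid position on the page."""
    if code:
        return f"{code}.png"
    return f"p{page}_r{row}c{col}.png"


def numbered_name(page, kind, number):
    """Name a figure ("fig") or embedded picture ("img") by page and running number."""
    if kind not in ("fig", "img"):
        raise ValueError(f"unknown kind: {kind!r}")
    return f"p{page}_{kind}{number}.png"


print(graph_name(3, 1, 2, code="IT16804"))  # graph with a printed code
print(graph_name(3, 1, 2))                  # graph without a code
print(numbered_name(3, "fig", 1))           # first figure on page 3
```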

Using pdfnano (web page)

If typing commands is not your thing, use the built-in web page.

pdfnano web

You will see:

pdfnano web GUI running at http://127.0.0.1:8000/
Press Ctrl+C to stop.

Open a web browser and go to http://127.0.0.1:8000 (you can copy and paste that). You will see a simple page where you:

  1. Choose your PDF file.
  2. Optionally type which pages you want.
  3. Click Extract.

The extracted images show up right on the page, and you can click any of them to download it.

When you are done, go back to the terminal and press Ctrl + C to stop the web page.

If port 8000 is already used by something else, pick another number:

pdfnano web --port 9000

Then open http://127.0.0.1:9000 instead.
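
If you are not sure whether a port is already taken, you can test it before starting the web page. The snippet below is a sketch using Python's standard socket module; port_is_free is a made-up helper name and not something pdfnano provides. It simply tries to claim the port for a moment and reports whether that worked.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # bind failed, so the port is taken (or not permitted)
    return True


for candidate in (8000, 9000):
    state = "free" if port_is_free(candidate) else "in use"
    print(candidate, state)
```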


PDFs you cannot copy or extract from

Some PDFs are locked so you cannot copy or extract their contents (you can still open and read them, but tools get blocked). pdfnano can unlock a working copy on the fly:

pdfnano extract locked.pdf --decrypt

For this to work you need a small helper program called qpdf. Install it once:

sudo apt install -y qpdf

pdfnano does not change your original file — it makes a temporary unlocked copy, uses it, and deletes it afterward.
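
For the curious, the "temporary unlocked copy" idea might look roughly like the sketch below. How pdfnano really does it is not shown in this guide, so treat this as an assumption. The names qpdf_decrypt_command and unlocked_copy are made up; the qpdf call itself (qpdf --decrypt input output) is a standard qpdf command.

```python
import os
import subprocess
import tempfile
from contextlib import contextmanager


def qpdf_decrypt_command(src, dst):
    """Build the qpdf call that writes an unlocked copy of src to dst."""
    return ["qpdf", "--decrypt", str(src), str(dst)]


@contextmanager
def unlocked_copy(src):
    """Yield the path of a temporary unlocked copy, then delete it."""
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)  # qpdf writes the file itself; we only needed a unique name
    try:
        subprocess.run(qpdf_decrypt_command(src, tmp_path), check=True)
        yield tmp_path
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # only the copy is deleted, never the original
```

You would use it as: with unlocked_copy("locked.pdf") as path: ... and work with path inside the block. The copy is removed as soon as the block ends.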


Using pdfnano from your own Python code

If you write Python, you can call pdfnano directly:

from pdfnano import extract_graphs

results = extract_graphs("datasheet.pdf", pages=[3, 4], out="figs")

for r in results:
    print(r["path"], r["width"], r["height"])

extract_graphs returns a list, one entry per image, each a dictionary with the file path, page number, size, and a few other details.
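
Because the result is a plain list of dictionaries, ordinary Python is enough to sort or filter it. The sketch below keeps only images above a minimum size, using the path, width, and height keys from the example above. The sample data is made up so the snippet runs without a PDF, and it assumes width and height are pixel counts, which this guide does not state explicitly.

```python
def larger_than(results, min_width, min_height):
    """Keep only entries whose image is at least min_width x min_height."""
    return [
        r for r in results
        if r["width"] >= min_width and r["height"] >= min_height
    ]


# Made-up sample data standing in for a real extract_graphs(...) result.
sample = [
    {"path": "figs/p3_r1c1.png", "width": 640, "height": 480},
    {"path": "figs/p3_img1.png", "width": 32, "height": 32},
]

for r in larger_than(sample, 100, 100):
    print(r["path"])
```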


When something goes wrong

pdfnano: command not found. Your workspace is off. Run source .venv/bin/activate from inside the project folder, then try again.

Nothing found. pdfnano did not detect anything on the pages you asked for. Double-check the page numbers, or add --debug to save a marked-up image showing what it looked at:

pdfnano extract yourfile.pdf --debug

The debug images land in your output folder. Red boxes are what it picked; thin blue boxes are everything it considered.

The wrong things got extracted, or too many. Turn off what you do not want with --no-figures and --no-images, or narrow down with --pages.

See every available option

pdfnano extract --help

Turning the workspace off

When you are completely finished, you can turn the workspace off with:

deactivate

The (.venv) tag disappears. Next time, just cd back into the folder and run source .venv/bin/activate again.


Installing with pip (coming later)

A simpler one-command install (pip install pdfnano) is planned. This section will be filled in once it is published.

About

working on pdf things... extended from just plumbers
