SFBioinformaticsGroup/paper2table

paper2table

Extract tables from papers

paper2table is a toolchain for extracting tabular information from scientific papers. It is composed of various command-line programs:

  • paper2table: the main command, which is used to extract data
  • filenorm: a command for preparing papers
  • tablemerge: a command for merging the results of multiple paper2table runs
  • tablestats: a command for querying paper2table and tablemerge results
  • table2html: a command for generating simple visualizations of the extracted data
  • tablevalidate: a command for validating TablesFile files

Installing

# install base dependencies
$ pip install -e .

# install testing dependencies
$ pip install -e ".[testing]"

# install tox build tool
$ pip install tox

File preparation

Before running paper2table, it is recommended that you normalize your input paper files so that you avoid duplicate work. A small program, filenorm, is provided for this purpose: it removes duplicate files and normalizes filenames.

# normalize all the given files
# will ask for confirmation before each change
filenorm -q PATH [PATH ...]

# don't ask for confirmation. a log with each change will be printed
filenorm -y PATH [PATH ...]
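
The README does not say how filenorm detects duplicates or which naming rules it applies. As an illustration of the general idea only, the following minimal sketch finds duplicates by content hash and applies a hypothetical naming rule (lowercase, non-alphanumeric runs collapsed to underscores). The function names `content_digest`, `find_duplicates` and `normalized_name` are assumptions made for this example and are not part of filenorm.

```python
import hashlib
import re
from pathlib import Path


def content_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_duplicates(paths: list[Path]) -> dict[Path, Path]:
    """Map each duplicate file to the first file (in sorted order) with the same content."""
    first_seen: dict[str, Path] = {}
    duplicates: dict[Path, Path] = {}
    for path in sorted(paths):
        digest = content_digest(path)
        if digest in first_seen:
            duplicates[path] = first_seen[digest]
        else:
            first_seen[digest] = path
    return duplicates


def normalized_name(path: Path) -> str:
    """Hypothetical rule (not filenorm's): lowercase, collapse other characters to '_'."""
    stem = re.sub(r"[^a-z0-9]+", "_", path.stem.lower()).strip("_")
    return f"{stem}{path.suffix.lower()}"
```

Hashing file contents rather than comparing names catches copies that were saved under different filenames, which is the case that would otherwise lead to the same paper being processed twice.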

Running

paper2table can read a paper's tables using three different backends:

  • the pdfplumber package (this is the default option)
  • the camelot package
  • an external generative agent. This option is usually more robust, but it is slower, less deterministic, and incurs additional costs

# basic usage
$ paper2table -p SCHEMA PATH [PATH ...]

# e.g. use the default pdfplumber reader backend
$ paper2table -q tests/data/demo_table.pdf

# e.g. use the pdfplumber reader specifying column name hints
$ paper2table -r pdfplumber -c tests/data/demo_column_hints.txt tests/data/demo_table.pdf

# e.g. use the camelot reader backend
$ paper2table -r camelot -q tests/data/demo_table.pdf

# by default paper2table outputs data to stdout
# but you can specify an output directory
$ paper2table -o . tests/data/demo_table.pdf
# result will be stored in demo_table.tables.json

# e.g. use the agent backend with the Gemini API
$ GEMINI_API_KEY=... paper2table -r agent -m google-gla:gemini-2.5-flash -p tests/data/demo_schema.txt tests/data/demo_table.pdf

Merging

paper2table also provides a table merging program called tablemerge. To use it, you first need to generate some metadata, which the paper2table command itself can produce:

# this command will create a new directory with the resultset, adding a metadata file
# suitable for use with tablemerge command
$ paper2table -t -o tests/data/tables tests/data/demo_table.pdf

After doing this, you can merge tables like this:

$ tablemerge -o tests/data/merges tests/data/demo_resultsets/*

Generating stats

A tool, tablestats, is provided for getting some stats about the extracted tables. It can query either the direct output of a paper2table run or the output of tablemerge.

# generate a json file with stats
tablestats -o tests/data/stats.json tests/data/demo_resultsets/08ba0033-8b20-4dbb-bf4a-e2be1f194bc7/

# pretty print stats to stdout
# you can optionally sort results by number of extracted tables
tablestats --sort desc tests/data/merges

# to report only empty files, use --empty
# this is useful for debugging your results
tablestats --empty tests/data/merges
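
For readers who want to post-process results themselves, the kind of summary shown above can be approximated with a few lines of Python over the .tables.json files described in the TablesFile format section. This is a rough sketch, not the tablestats implementation: the helper names are hypothetical, and treating a file as "empty" when its "tables" list has no entries is an assumption.

```python
import json
from pathlib import Path


def count_tables(directory: Path) -> dict[str, int]:
    """Return {relative file path: number of entries in its "tables" list}."""
    counts: dict[str, int] = {}
    for path in sorted(directory.rglob("*.tables.json")):
        document = json.loads(path.read_text(encoding="utf-8"))
        counts[path.relative_to(directory).as_posix()] = len(document.get("tables", []))
    return counts


def empty_files(counts: dict[str, int]) -> list[str]:
    """Assumption: a file is "empty" when it contains no tables at all."""
    return [name for name, total in counts.items() if total == 0]


def sorted_desc(counts: dict[str, int]) -> list[tuple[str, int]]:
    """Sort files by number of tables, largest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

For example, `sorted_desc(count_tables(Path("tests/data/merges")))` would list the files in that directory with the most tables first.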

Visualizing data

A tool table2html is provided for displaying a resultset:

# it can be used both with the raw resultset of a paper2table run
# or with the output of tablemerge
table2html tests/data/merges

Running tests

$ tox

TablesFile format

The paper2table and tablemerge commands output the extracted table data in the TablesFile format (with extension .tables.json). You can validate that those files follow the exact format using tablevalidate:

tablevalidate tests/data/demo_resultsets/*/*

The format is informally specified this way:

{
  "tables": [
    {
      "rows": [
        {
          "COLUMN_NAME_1": string | [{ "value": string, "agreement_level": integer }],
          "COLUMN_NAME_2": string | [{ "value": string, "agreement_level": integer }],
          "COLUMN_NAME_3": string | [{ "value": string, "agreement_level": integer }],
          "agreement_level_": integer // this is optional
        }
      ],
      "page": integer
    },
    {
      "table_fragments": [
        {
          "rows": ..., // same schema as previous "rows" attribute
          "page": integer
        }
      ]
    }
  ],
  "citation": string | [{ "value": string, "agreement_level": integer }],
  "metadata": { // optional
    "filename": string
  }
}

You can also find a proper JSON Schema definition in tablesfile.schema.json.
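
As an illustration of how the informal format above can be consumed, the following sketch walks an already-parsed TablesFile document and yields its rows, handling both plain tables and "table_fragments" entries. The helper names `resolve` and `iter_rows` are hypothetical. Choosing the candidate with the highest agreement_level when a value is a list is an assumption made for the example, not documented paper2table behavior.

```python
import json
from typing import Any, Iterator, Optional


def resolve(value: Any) -> Optional[str]:
    """Return a plain string for a cell or citation value.

    A value is either a string or a list of {"value", "agreement_level"}
    candidates. Picking the highest agreement_level is an assumption.
    """
    if isinstance(value, str):
        return value
    if not value:
        return None
    best = max(value, key=lambda candidate: candidate["agreement_level"])
    return best["value"]


def iter_rows(document: dict) -> Iterator[tuple]:
    """Yield (page, row) pairs, flattening "table_fragments" entries."""
    for table in document.get("tables", []):
        # A table either carries "rows" and "page" itself or a list of fragments.
        fragments = table.get("table_fragments", [table])
        for fragment in fragments:
            for row in fragment.get("rows", []):
                cells = {
                    name: resolve(value)
                    for name, value in row.items()
                    if name != "agreement_level_"  # optional row-level field
                }
                yield fragment.get("page"), cells
```

A file can be loaded with `json.load` and passed to `iter_rows`; the citation can be read with `resolve(document["citation"])`, since it uses the same string-or-candidates shape as a cell.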
