Extract tables from papers
paper2table is a toolchain for extracting tabular information from scientific papers. It is composed of various command-line programs:
- paper2table: the main command, which is used to extract data
- filenorm: a command for preparing papers
- tablemerge: a command for merging the results of multiple paper2table runs
- tablestats: a command for querying paper2table and tablemerge results
- table2html: a command for generating simple visualizations of the extracted data
- tablevalidate: a command for validating tables files
# install base dependencies
$ pip install -e .
# install testing dependencies
$ pip install -e .[testing]
# install tox build tool
$ pip install tox

Before running paper2table, it is recommended that you normalize your input paper files first to avoid duplicate work. To do so, a small program, filenorm, is provided that removes duplicate files and normalizes filenames.
# normalize all the given files
# will ask for confirmation before each change
filenorm -q PATH [PATH ...]
# don't ask for confirmation. a log with each change will be printed
filenorm -y PATH [PATH ...]

paper2table can read a paper's tables using three different backends:
- the pdfplumber package (this is the default option)
- the camelot package
- an external generative agent. This option is usually more robust, but slower, less deterministic and presents additional costs
# basic usage
$ paper2table -p SCHEMA PATH [PATH ...]
# e.g. use the default pdfplumber reader backend
$ paper2table -q tests/data/demo_table.pdf
# e.g. use the pdfplumber reader specifying column name hints
paper2table -r pdfplumber -c tests/data/demo_column_hints.txt tests/data/demo_table.pdf
# e.g. use the camelot reader backend
$ paper2table -r camelot -q tests/data/demo_table.pdf
# by default paper2table outputs data to stdout
# but you can specify an output directory
$ paper2table -o . tests/data/demo_table.pdf
# result will be stored in demo_table.tables.json
# e.g. use the agent backend with the Gemini API
$ GEMINI_API_KEY=... paper2table -r agent -m google-gla:gemini-2.5-flash -p tests/data/demo_schema.txt tests/data/demo_table.pdf

paper2table also provides a table merging program called tablemerge. To use it, you first need to generate some metadata. You can produce it with the same paper2table command:
# this command will create a new directory with the resultset, adding a metadata file
# suitable for use with tablemerge command
$ paper2table -t -o tests/data/tables tests/data/demo_table.pdf

After doing this, you can merge tables like this:
$ tablemerge -o tests/data/merges tests/data/demo_resultsets/*

A tool, tablestats, is provided for getting some stats about the extracted tables. It can be used to query either the direct output of a paper2table run or the results of a tablemerge run.
# generate a json file with stats
tablestats -o tests/data/stats.json tests/data/demo_resultsets/08ba0033-8b20-4dbb-bf4a-e2be1f194bc7/
# pretty print stats to stdout
# you can optionally sort results by number of extracted tables
tablestats --sort desc tests/data/merges
# if you only need to output empty files, use --empty
# this is useful for debugging your results
tablestats --empty tests/data/merges

A tool, table2html, is provided for displaying a resultset:
# it can be used both with the raw resultset of a paper2table run
# or with the output of tablemerge
table2html tests/data/merges

$ tox

The paper2table and tablemerge commands output the extracted tables data in the TablesFile format (with extension .tables.json). You can validate that those files follow the exact format using tablevalidate:
tablevalidate tests/data/demo_resultsets/*/*

The format is informally specified this way:
{
  "tables": [
    {
      "rows": [
        {
          "COLUMN_NAME_1": string | [{ "value": string, "agreement_level": integer }],
          "COLUMN_NAME_2": string | [{ "value": string, "agreement_level": integer }],
          "COLUMN_NAME_3": string | [{ "value": string, "agreement_level": integer }],
          "agreement_level_": integer // this is optional
        }
      ],
      "page": integer
    },
    {
      "table_fragments": [
        {
          "rows": ..., // same schema as previous "rows" attribute
          "page": integer
        }
      ]
    }
  ],
  "citation": string | [{ "value": string, "agreement_level": integer }],
  "metadata": { // optional
    "filename": string
  }
}

You can also find a proper JSON schema definition in tablesfile.schema.json
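
To make the informal specification more concrete, here is a minimal sketch of how a consumer might walk a TablesFile document in Python. The helper names `resolve` and `iter_rows` are hypothetical and are not part of the paper2table toolchain, and the sample data is made up for illustration. The sketch also assumes that, when a value is a list of candidates, the one with the highest `agreement_level` is the one you want; the format itself does not prescribe how to choose.

```python
from typing import Any, Dict, Iterator, Tuple


def resolve(value: Any) -> str:
    """Return a plain string for a cell or citation value.

    A value is either a string or a list of
    {"value": ..., "agreement_level": ...} candidates. For a list, this
    picks the candidate with the highest agreement level (an assumption
    of this sketch, not something the format mandates).
    """
    if isinstance(value, str):
        return value
    best = max(value, key=lambda c: c["agreement_level"], default=None)
    return best["value"] if best is not None else ""


def iter_rows(tables_file: Dict[str, Any]) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (page, row) pairs, flattening fragmented tables."""
    for table in tables_file["tables"]:
        # A table is either a single {"rows", "page"} object
        # or a {"table_fragments": [...]} wrapper around several of them.
        fragments = table.get("table_fragments", [table])
        for fragment in fragments:
            for row in fragment["rows"]:
                cells = {
                    name: resolve(value)
                    for name, value in row.items()
                    # optional row-level key, not a column
                    if name != "agreement_level_"
                }
                yield fragment["page"], cells


# Made-up document that follows the informal specification above.
sample = {
    "tables": [
        {"rows": [{"name": "alpha", "value": "1.0"}], "page": 1},
        {
            "table_fragments": [
                {
                    "rows": [
                        {
                            "name": [
                                {"value": "beta", "agreement_level": 2},
                                {"value": "bela", "agreement_level": 1},
                            ],
                            "value": "2.0",
                            "agreement_level_": 2,
                        }
                    ],
                    "page": 2,
                },
                {"rows": [{"name": "gamma", "value": "3.0"}], "page": 3},
            ]
        },
    ],
    "citation": "Example citation",
}

if __name__ == "__main__":
    print("citation:", resolve(sample["citation"]))
    for page, cells in iter_rows(sample):
        print(page, cells)
```

To read a real file instead of the inline sample, load it with `json.load` and pass the resulting dictionary to `iter_rows`. For strict checking, prefer tablevalidate or the JSON schema file over this sketch, which does no validation.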