SakanaAI/edinet2dataset

edinet2dataset

Paper | Blog | Dataset | Code

edinet2dataset is a tool to construct financial datasets using EDINET.

edinet2dataset provides two classes for building Japanese financial datasets from EDINET:

  • Downloader: Download financial reports of Japanese listed companies using the EDINET API.
  • Parser: Extract key items such as the balance sheet (BS), cash flow statement (CF), profit and loss statement (PL), summary, and text from the downloaded TSV reports.

edinet2dataset is used to construct EDINET-Bench, a challenging Japanese financial benchmark dataset.

Installation

Install the dependencies using uv.

uv sync

To use the EDINET API, set your EDINET API key in a .env file. Refer to the official documentation for how to obtain an API key.
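As a rough illustration of the .env setup, the sketch below reads KEY=VALUE pairs from a file using only the standard library. The variable name EDINET_API_KEY and the helper read_env_file are assumptions made for this example, not names confirmed by this README; check the repository source for the variable the tool actually reads.

```python
import tempfile
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demo with a throwaway file; EDINET_API_KEY is a hypothetical variable name.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "# EDINET credentials\nEDINET_API_KEY=your-key-here\n", encoding="utf-8"
    )
    settings = read_env_file(env_path)

print(settings["EDINET_API_KEY"])
```

Keeping the key in a .env file, rather than on the command line, also keeps it out of shell history and out of version control if the file is ignored.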

Basic Usage

  • Search for a company name using a substring match query.
$ python src/edinet2dataset/downloader.py --query トヨタ
提出者名 ＥＤＩＮＥＴコード 提出者業種
トヨタ紡織株式会社 E00540 輸送用機器
トヨタ自動車株式会社 E02144 輸送用機器
トヨタファイナンス株式会社 E05031 サービス業
トヨタ モーター クレジット コーポレーション E05904 外国法人・組合
トヨタ ファイナンス オーストラリア リミテッド E05954 外国法人・組合
トヨタ モーター ファイナンス（ネザーランズ）ビーブイ E20989 外国法人・組合
トヨタファイナンシャルサービス株式会社 E23700 内国法人・組合（有価証券報告書等の提出義務者以外）
  • Download the annual report submitted by Toyota Motor Corporation for the period from June 1, 2024, to June 28, 2024.
$ uv run python src/edinet2dataset/downloader.py --start_date 2024-06-01 --end_date 2024-06-28 --company_name "トヨタ自動車株式会社" --doc_type annual
Downloading documents (2024-06-01 - 2024-06-28): 100%|███████████████████████████████████████████| 28/28 [00:02<00:00,  9.76it/s]
  • Extract balance sheet (BS) items from the annual report.
$ uv run python src/edinet2dataset/parser.py --file_path data/E02144/S100TR7I.tsv --category_list BS
2025-04-26 22:03:16.026 | INFO     | __main__:parse_tsv:130 - Found 2179 unique elements in data/E02144/S100TR7I.tsv
{'現金及び預金': {'Prior1Year': '2965923000000', 'CurrentYear': '4278139000000'}, '現金及び現金同等物': {'Prior2Year': '6113655000000', 'Prior1Year': '1403311000000', 'CurrentYear': '9412060000000'}, '売掛金': {'Prior1Year': '1665651000000', 'CurrentYear': '1888956000000'}, '有価証券': {'Prior1Year': '1069082000000', 'CurrentYear': '3938698000000'}, '商品及び製品': {'Prior1Year': '271851000000', 'CurrentYear': '257113000000'}
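
The parser prints each item as a mapping from a period label to an amount stored as a string. A minimal sketch of post-processing such output follows, assuming the dictionary shape shown in the sample above. The helper yoy_change is hypothetical and not part of edinet2dataset.

```python
# Two entries copied from the sample parser output; amounts are strings.
bs_items = {
    "現金及び預金": {"Prior1Year": "2965923000000", "CurrentYear": "4278139000000"},
    "売掛金": {"Prior1Year": "1665651000000", "CurrentYear": "1888956000000"},
}


def yoy_change(periods):
    """Relative change from Prior1Year to CurrentYear, or None if it cannot be computed."""
    prior = periods.get("Prior1Year")
    current = periods.get("CurrentYear")
    if prior is None or current is None or int(prior) == 0:
        return None
    return (int(current) - int(prior)) / int(prior)


changes = {name: yoy_change(periods) for name, periods in bs_items.items()}
for name, change in changes.items():
    # Both sample items have both periods, so change is never None here.
    print(f"{name}: {change:+.1%}")
```

Converting with int rather than float keeps the yen amounts exact until the final division.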

Reproduce EDINET-Bench

You can reproduce EDINET-Bench by running the following commands.

Note

Since only the past 10 years of annual reports are available via the EDINET API, the time window used to construct the dataset shifts with each execution. As a result, datasets generated at different times may not be identical.
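
To make the shifting window concrete, the sketch below computes the earliest date covered for a given run date. It assumes the window reaches exactly ten calendar years back from the run date, which is an assumption for illustration; the exact cutoff applied by EDINET may differ. The function window_start is hypothetical.

```python
from datetime import date


def window_start(run_date, years=10):
    """Return the date `years` before run_date (Feb 29 falls back to Feb 28)."""
    try:
        return run_date.replace(year=run_date.year - years)
    except ValueError:
        # Only raised when run_date is Feb 29 and the target year is not a leap year.
        return run_date.replace(year=run_date.year - years, day=28)


first_run = window_start(date(2025, 4, 26))
later_run = window_start(date(2025, 10, 26))
print(first_run, later_run)
```

Under this assumption, reports filed between the two start dates are reachable on the first run but not on the later one, which is why corpora built at different times can differ.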

Construct EDINET-Corpus

Download all annual reports for the year 2024.

$ python scripts/prepare_edinet_corpus.py --doc_type annual --start_date 2024-01-01 --end_date 2025-01-01

Download securities reports spanning 10 years for approximately 4,000 companies from EDINET.

$ bash edinet_corpus.sh

Note

Please be careful not to send too many requests in parallel, as downloading reports from the past 10 years could place a significant load on EDINET.
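
One way to keep the load low is to issue one download command per year and run them strictly one after another. The sketch below only builds those commands, reusing the flags from the prepare_edinet_corpus.py example above. It is not the content of edinet_corpus.sh, and the helper names and the pause length are assumptions.

```python
from datetime import date


def yearly_ranges(start_year, end_year):
    """Yield (start_date, end_date) pairs, one per calendar year."""
    for year in range(start_year, end_year + 1):
        yield date(year, 1, 1), date(year + 1, 1, 1)


def build_commands(start_year, end_year):
    """Build one prepare_edinet_corpus.py invocation per year."""
    return [
        [
            "python", "scripts/prepare_edinet_corpus.py",
            "--doc_type", "annual",
            "--start_date", start.isoformat(),
            "--end_date", end.isoformat(),
        ]
        for start, end in yearly_ranges(start_year, end_year)
    ]


commands = build_commands(2022, 2024)
for cmd in commands:
    print(" ".join(cmd))
    # To execute sequentially with a pause (the 5-second delay is arbitrary):
    # subprocess.run(cmd, check=True); time.sleep(5)
```

Running the commands in a plain loop, rather than launching them as background jobs, guarantees that at most one download process is active at a time.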

The downloaded corpus has the following directory structure:

edinet_corpus
├── annual
│   ├── E00004
│   │   ├── S1005SBA.json
│   │   ├── S1005SBA.pdf
│   │   ├── S1005SBA.tsv
│   │   ├── S1008JYI.json
│   │   ├── S1008JYI.pdf
│   │   ├── S1008JYI.tsv
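
Each document ID appears with three files (.json, .pdf, .tsv) under its company's EDINET code. As a hedged sketch of how one might check a downloaded corpus for completeness, the code below indexes a directory with that layout. The function index_corpus is hypothetical and not part of edinet2dataset, and the demo runs on a temporary directory rather than real data.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def index_corpus(root):
    """Map (EDINET code, document ID) to the set of file extensions present."""
    index = defaultdict(set)
    # Layout: <root>/<doc_type>/<edinet_code>/<doc_id>.<ext>
    for path in Path(root).glob("*/*/*"):
        if path.is_file():
            index[(path.parent.name, path.stem)].add(path.suffix)
    return dict(index)


# Demo on a throwaway directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    company_dir = Path(tmp) / "edinet_corpus" / "annual" / "E00004"
    company_dir.mkdir(parents=True)
    for name in ("S1005SBA.json", "S1005SBA.pdf", "S1005SBA.tsv", "S1008JYI.json"):
        (company_dir / name).touch()
    index = index_corpus(Path(tmp) / "edinet_corpus")

# Documents missing any of the three expected files.
incomplete = [key for key, exts in index.items() if exts != {".json", ".pdf", ".tsv"}]
print(incomplete)
```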

Construct Accounting Fraud Detection Task

Build a benchmark to detect accounting fraud in the securities report of a given fiscal year.

$ python scripts/fraud_detection/prepare_fraud.py
$ python scripts/fraud_detection/prepare_nonfraud.py
$ python scripts/fraud_detection/prepare_dataset.py

You can analyze the amended reports classified as fraud-related by running the following command:

$ python scripts/fraud_detection/analyze_fraud_explanation.py 

Construct Earnings Forecasting Task

Build a benchmark to forecast the following year's profit based on the securities report of a given fiscal year.

$ python scripts/profit_forecast/prepare_dataset.py

Construct Industry Prediction Task

Build a benchmark to predict a company's industry from its annual report.

$ python scripts/industry_prediction/prepare_dataset.py 

Citation

@inproceedings{
sugiura2026edinetbench,
title={{EDINET}-Bench: Evaluating {LLM}s on Complex Financial Tasks using Japanese Financial Statements},
author={Issa Sugiura and Takashi Ishida and Taro Makino and Chieko Tazuke and Takanori Nakagawa and Kosuke Nakago and David Ha},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=Dxns0cj15A}
}

Acknowledgement

We acknowledge edgar-crawler as an inspiration for our tool. We also thank EDINET, which served as the primary resource for constructing our benchmark.
