Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning

Giner-Miguelez, Joan; Gómez, Abel; Cabot, Jordi

arXiv:2404.15320 (cs)

[Submitted on 4 Apr 2024 (v1), last revised 24 May 2024 (this version, v2)]

Abstract: Recent regulatory initiatives like the European AI Act and relevant voices in the Machine Learning (ML) community stress the need to describe datasets along several key dimensions for trustworthy AI, such as the provenance processes and social concerns. However, this information is typically presented as unstructured text in accompanying documentation, hampering its automated analysis and processing. In this work, we explore using large language models (LLMs) and a set of prompting strategies to automatically extract these dimensions from documents and enrich the dataset description with them. Our approach could aid data publishers and practitioners in creating machine-readable documentation to improve the discoverability of their datasets, assess their compliance with current AI regulations, and improve the overall quality of ML models trained on them.
In this paper, we evaluate the approach on 12 scientific dataset papers published in two scientific journals (Nature's Scientific Data and Elsevier's Data in Brief) using two different LLMs (GPT-3.5 and Flan-UL2). Results show good accuracy with our prompt extraction strategies. Concrete results vary depending on the dimensions, but overall, GPT-3.5 shows better accuracy (81.21%) than Flan-UL2 (69.13%), although it is more prone to hallucinations. We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results, in an open-source repository.
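
The extraction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' tool or their prompts: the dimension names, the question wording, the helper names (`build_prompt`, `enrich_description`), and the `stub_llm` stand-in are all assumptions made for illustration. The sketch only shows the general shape of asking one question per dimension against a document and collecting the answers into a machine-readable description.

```python
import json
from typing import Callable, Dict

# Hypothetical dimensions and questions. The abstract names only provenance
# and social concerns as examples; the wording here is an assumption.
DIMENSION_QUESTIONS: Dict[str, str] = {
    "provenance": "How was the data collected, and by whom?",
    "social_concerns": "Which social concerns or biases are reported?",
}


def build_prompt(question: str, document: str) -> str:
    """Combine one dimension question with the document text."""
    return (
        "Answer using only the document below. "
        "If the document does not say, answer 'Not provided'.\n\n"
        f"Document:\n{document}\n\nQuestion: {question}\nAnswer:"
    )


def enrich_description(document: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Ask the model one question per dimension and collect the answers."""
    return {
        name: llm(build_prompt(question, document)).strip()
        for name, question in DIMENSION_QUESTIONS.items()
    }


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; it always abstains."""
    return " Not provided "


if __name__ == "__main__":
    description = enrich_description("Example dataset paper text.", stub_llm)
    print(json.dumps(description, indent=2))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client. Asking the model to abstain when the document is silent is one simple way to limit hallucinated answers, which the abstract reports as a weakness of the more accurate model.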

Subjects:	Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
ACM classes:	H.4.4
Cite as:	arXiv:2404.15320 [cs.DL]
	(or arXiv:2404.15320v2 [cs.DL] for this version)
	https://doi.org/10.48550/arXiv.2404.15320

Submission history

From: Joan Giner-Miguelez
[v1] Thu, 4 Apr 2024 10:09:28 UTC (1,959 KB)
[v2] Fri, 24 May 2024 11:25:49 UTC (1,987 KB)
