RFC111 AI/LLM tool policy: significant revision, limiting drastically their use #14500

Merged
rouault merged 15 commits into OSGeo:master from rouault:rfc111_revision
May 18, 2026

Conversation

@rouault
Member

@rouault rouault commented May 6, 2026

No description provided.

Contributor

@lnicola lnicola left a comment

(placeholder comment because GitHub threw a tantrum)

@lnicola
Contributor

lnicola commented May 6, 2026

Argh, I can't comment. Regarding the "not copyrightable" part:

I'm not convinced this is true. I suspect it refers to a common misinterpretation of Thaler v. Perlmutter. Code developed by an LLM with substantial input from a human is likely still copyrightable. Thaler specifically tried to disclaim any contribution to the work discussed there.

Comment thread doc/source/community/ai_tool_policy.rst Outdated
Warning
-------
Commit messages and pull request messages must be fully written by the author,
besides potential translation to English and typo/grammar fixing. This is
Contributor

We'll see how this develops, but I suspect it will encourage contributors to post LLM-(re)written walls of text under the pretense of fixing the grammar.

@lnicola
Contributor

lnicola commented May 6, 2026

Typo: "The content must be written by a human. Use of AI/LL tool for translation"

@lnicola
Contributor

lnicola commented May 6, 2026

with the general principle that there must be a human in the loop

I think this is a bit misleading as part of the summary. If I prompt an LLM to make a focused change, review the code, drop half of it because it's useless, I'll argue that I'm very much in the loop. But this is still disallowed under the new policy.

A different approach I've seen some projects take is to require contributors to understand (or be able to explain) their changes. No opinion on how useful this is in practice.

@rouault
Member Author

rouault commented May 6, 2026

I think this is a bit misleading as part of the summary. If I prompt an LLM to make a focused change, review the code, drop half of it because it's useless, I'll argue that I'm very much in the loop. But this is still disallowed under the new policy.

Please propose a better wording

@rouault rouault force-pushed the rfc111_revision branch from f79090c to e487fb0 Compare May 6, 2026 18:20
@rouault
Member Author

rouault commented May 6, 2026

Typo: "The content must be written by a human. Use of AI/LL tool for translation"

fixed

@lnicola
Contributor

lnicola commented May 6, 2026

Please propose a better wording

Maybe

  • courts (in particular the US ones) have not definitely determined whether LLM outputs are derived works of the training data, or whether LLM-written code can even be copyrighted by a human

… their use

Co-authored-by: Laurențiu Nicola <lnicola@dend.ro>
@rouault rouault force-pushed the rfc111_revision branch from e487fb0 to 79bb580 Compare May 6, 2026 18:29
@rouault
Member Author

rouault commented May 6, 2026

  • courts (in particular the US ones) have not definitely determined whether LLM outputs are derived works of the training data, or whether LLM-written code can even be copyrighted by a human

thanks, adopted

Comment thread doc/source/community/ai_tool_policy.rst Outdated
@rouault
Member Author

rouault commented May 7, 2026

changed "there must be a human in the loop" to "the human must be the (primary) author"

Collaborator

@elpaso elpaso left a comment

I just left a few minor remarks but I fully agree with the proposal.

Comment thread doc/source/community/ai_tool_policy.rst Outdated
Comment thread doc/source/community/ai_tool_policy.rst Outdated
Comment thread doc/source/community/ai_tool_policy.rst Outdated
Co-authored-by: Alessandro Pasotti <elpaso@itopen.it>
@ldesousa

ldesousa commented May 7, 2026

If you allow me to intrude into this discussion: the European Directive on the copyright of computer programmes limits protection to the "author’s own intellectual creation". It also states clearly that only a "person or group of people" can hold copyright.

Beyond copyright, if you believe the CRA will ever be enforced, then you should assume distribution of a programme whose source is not understood by a legal entity responsible for its distribution or manufacturing will no longer be legal.

@lnicola
Contributor

lnicola commented May 7, 2026

It also states clearly that only a "person or group of people" can hold copyright.

That matches Thaler v. Perlmutter. The human holds copyright, not the LLM. This has been a very popular strawman among anti-AI activists lately.

The European Directive on the copyright of computer programmes limits protection to the "author’s own intellectual creation".

If I implement a merge sort, it's not meaningfully my intellectual creation, but nobody will argue that I've stolen the work of von Neumann.

But if I ask an LLM to implement an external merge sort for my large table, you'll say that I have no intellectual contribution, and it's not copyrightable.

@ldesousa

ldesousa commented May 7, 2026

That matches Thaler v. Perlmutter.

That is going on in the US, I guess they don't care much about European directives over there. Also note that in the US there is the figure of "Public Domain" which does not exist in the EU. When eventually something of the like reaches a European court, the decision will be between assigning copyright to the legal entity responsible for the LLM and that owning copyright over the training data.

Beyond that, you are conflating copyright of a computer programme with the copyright of an algorithm.

@gdt
Contributor

gdt commented May 7, 2026

That is going on in the US, I guess they don't care much about European directives over there. Also note that in the US there is the figure of "Public Domain" which does not exist in the EU. When eventually something of the like reaches a European court, the decision will be between assigning copyright to the legal entity responsible for the LLM and that owning copyright over the training data.

Projecting out the observation that different legal jurisdictions have different rules, I interpret your comment as agreeing that the situation is unclear now and that there is no basis for confident predictions about how it will be resolved, if ever. Thus I would expect you would approve of the suggested text from @lnicola

@lnicola
Contributor

lnicola commented May 8, 2026

the decision will be between assigning copyright to the legal entity responsible for the LLM and that owning copyright over the training data

Or to the LLM user. There have been some conflicting statements from the EU, but extrapolating from what I've seen, the people working in those institutions tend to be heavy users of US LLMs. I doubt they'll legislate that LLMs infringe on the copyright of others.

@jedbrown

jedbrown commented May 8, 2026

I doubt they'll legislate that LLMs infringe on the copyright of others.

A German court ruled last fall that OpenAI infringed copyright from training data (and OpenAI tried to throw users under the bus for prompting, which the court rejected). Note that LLMs can emit entire books verbatim, thousand-word passages via commercial models [demo], and with organic prompting. The question is not whether LLMs are capable of infringing copyright of the training data, but who will be liable for that infringement and what due diligence would be necessary to mitigate the risk to levels that a project can tolerate (and I guess, whether the project adopts the smol bean hypothesis that the project and its users are not worth suing).

@lnicola
Contributor

lnicola commented May 8, 2026

Note that LLMs can emit entire books verbatim

As long as you put part of the book in the input prompt.

@jedbrown

jedbrown commented May 8, 2026

As long as you put part of the book in the input prompt.

A few words (in that study), but see also the organic prompting study, the German case, and others such as typing //sparse matrix transpose<TAB> and getting a page of near-verbatim code, which is still working its way through the courts. There is no simple procedure to ensure that output is non-infringing.

Comment thread doc/source/community/ai_tool_policy.rst Outdated
rouault and others added 2 commits May 12, 2026 16:03
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Comment thread doc/source/community/ai_tool_policy.rst Outdated
@rouault rouault marked this pull request as ready for review May 13, 2026 01:48
Comment thread doc/source/community/ai_tool_policy.rst
Comment thread doc/source/community/ai_tool_policy.rst Outdated
Comment thread doc/source/community/ai_tool_policy.rst Outdated
Comment thread doc/source/community/ai_tool_policy.rst Outdated
rouault and others added 2 commits May 13, 2026 19:03
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Comment thread doc/source/community/ai_tool_policy.rst Outdated
Comment thread doc/source/community/ai_tool_policy.rst Outdated
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
Comment thread doc/source/community/ai_tool_policy.rst Outdated
behavior to play nicely with AI will involve natural selection over many generations,
be prepared for not getting a very warm welcome if you misuse those tools.
You have been warned!
* Submission of `vibe-coded <https://en.wikipedia.org/wiki/Vibe_coding>`__ contributions is *banned*.
Contributor

It seems to me that an explicit definition of what constitutes "vibe coding" should be included directly in this policy rather than as a reference to Wikipedia.

There are probably also several degrees of "vibe coding".
A developer might start a work item as "vibe coding" and then spend several hours or days refining the code (with various degrees of AI assistance) to the point that the problems inherent to "vibe coding" may no longer be problems?

I am really concerned that a blanket ban on vibe-coding for open-source projects is problematic, as closed-source solutions would likely allow whatever achieves best development velocity (still taking maintenance cost into consideration).

Member Author

It seems to me that an explicit definition of what constitutes "vibe coding" should be included directly in this policy rather than as a reference to Wikipedia.

We may argue about the exact definition, but fundamentally using LLMs is an anti-social / anti-ethical behavior in the context of open source software where maintainers are still human. That makes them even more of a contention point than before. As raised by @dbaston (https://lists.osgeo.org/pipermail/gdal-dev/2026-May/061636.html), we review code from others in the hope that a small percentage of them will grow into maintainers and take on a bit of that burden. Reviewing LLM generated code is just helping tech giants improve their model and doesn't grow any new maintainer.

as closed-source solutions would likely allow whatever achieves best development velocity (still taking maintenance cost into consideration).

By that metric, closed-source solutions have always been "better" than us because they can align more developers. Let them do their thing and we'll talk again about the end result they've achieved in 5 years.

Why is velocity so important? Are there so many missing features in GDAL that they need to be rushed?
Code added is more a long-term liability than an asset.

If velocity of development is so important then people can contribute to Oxigdal, and it is written in Rust. Or create their forked GDAL with agents reviewing & automatically merging code.

By the way, it is great to see so many people reviewing this PR. I'd wish they'd do the same for the more "boring" ones 😜

Contributor

@jerstlouis jerstlouis May 15, 2026

fundamentally using LLMs is an anti-social / anti-ethical behavior in the context of open source software where maintainers are still human

I strongly disagree with that statement (and indicated as much in the OSGeo charter member survey about LLM usage), but I fully understand where you're coming from as a key maintainer with an unbelievably large number of things to review.

Reviewing LLM generated code is just helping tech giants improve their model and doesn't grow any new maintainer.

While this is true, this should also improve the quality of future PRs.

Isn't maintainers' productivity also improved by LLMs?

Why is velocity so important?

Long term if a closed-source solution becomes much more capable than an open-source alternative, the recent open-source adoption gain trend might reverse. It's also about being able to do more with one's time and being able to make more contributions (which can still be of a certain quality).

I'd wish they'd do the same for the more "boring" ones

I think part of the solution to get more maintainers may actually be through LLM usage for performing human-in-the-loop reviews.

I am not against rejecting contributions where there's clearly no human in the loop, but I think "vibe coding" can actually be a very efficient way to get started on implementing new features in particular (even if careful self-review / refinements should be essential).

Contributor

Going back to the original question of definitions, I would argue that by the time we come up with a good definition for vibe coding, it will already be obsolete. We should also appreciate how quickly this landscape is evolving and treat any policies we come up with as living documents.

Member Author

Adopting those new toys under the pressure of a feeling of urgency would be a terrible idea. That's exactly the narrative that tech giants want to instigate in our minds. My personal opinion regarding e.g. adopting a new programming language has been "let's wait 15 years and see if it is still there" (Rust is almost at that point :-))

GDAL has always been a very conservative project in terms of tech adoption (I have to fight each time I want to bump the C++ version!). I guess if it is still there after 28 years, this must be part of the reason for its success.

I see. In that case I suggest dropping the term "vibe coding" altogether and focusing directly on LLM assistance. The wording used by Oracle is a good starting point, making it clear which activities are allowed and which are not.

Contributor

has been a terrible experience and showed me that it would ultimately lead to burn out if becoming the norm.

I really believe the solution for this should not be banning LLM assistance.

A possible solution is to have parallel tracks of pull requests for different levels of LLM assistance.

Perhaps indicating the amount of human time invested by the contributor(s) in preparing a PR.

Each maintainer could decide how much effort they invest in each of these tracks.

Member Author

@rouault rouault May 15, 2026

Each maintainer could decide how much effort they invest in each of these tracks.

I haven't heard of any existing GDAL maintainer who was keen on reviewing LLM generated PRs. As proof of that, notice the 3 ones flagged as such at the bottom of https://github.com/OSGeo/gdal/pulls that have been sitting there for many weeks. It is disingenuous to say to potential contributors that they may use LLM assisted coding if there are no maintainers willing to review such PRs.

GDAL is already much too big a beast compared to the size of its maintainer community. We don't need more code coming from people interested in drive-by contributions.

Contributor

There was an initial version of this policy, and it was confirmed that it was not enough. If I remember correctly it was "softened" after some discussions. Now we have to harden it.

It has been explained many times: the reviewing and maintenance resources are very, very limited. The experience is that throwing LLM generated code saturates those resources. So we have to cut this. Immediately. Otherwise the project CANNOT CONTINUE.

GDAL is not in a hurry to implement anything. It is mature, stable and well known (and good quality). And these qualities are valued out there, by open and proprietary code.

For sure in some time we will review this policy. Maybe to make it even harder, or softer. Let's see. But now we have to act to protect the project and its maintainers.

https://dl.acm.org/doi/full/10.1145/3807518 might be interesting to this discussion. They offer their own definition, and explain why, based on current literature, vibe coding in large, long-lived projects can be problematic.

Additionally, legal systems across the world (including US and EU) have not
definitely determined whether LLM outputs are derived works of training data or
if LLM-written code can even be copyrighted by a human. This is despite it
being latently extracted and originated from open source software in the first

Is "latently" the right word here? Possibly should be "largely"?

@rouault
Member Author

rouault commented May 18, 2026

Adopted with +1 from PSC members KurtS, NormanB, MikeS, JavierJS, DanB, HowardB, JukkaR, DanielM and EvenR.

@rouault rouault merged commit 314bb15 into OSGeo:master May 18, 2026
2 checks passed
rouault added a commit that referenced this pull request May 18, 2026