@wolfv
Contributor

@wolfv wolfv commented Apr 12, 2024

Fill in the technical details of the OCI registry storage for reference.

@wolfv
Contributor Author

wolfv commented Apr 12, 2024

  • we should link reference implementations.
  • we should say that conda / mamba / rattler use oci:// as the scheme.

cep-oci.md Outdated

A conda package, in an OCI registry, should ship up to 3 layers:

- The package itself, as a tarball. (mandatory)
Contributor

You mean the contents of the package, or the compressed artifact (tarball or not)? In .conda the outer layer is actually a ZIP file, and the inner ones are zstd tarballs. Would be helpful to clarify.

Contributor Author

Right, should leave the tarball out of the sentence. It's the package data itself.

Contributor

Also, some artifacts in defaults are available both as .tar.bz2 and .conda. Should we restrict one package layer for each label? I don't think we need to.

Contributor Author

Everything would work fine with two layers; they just must not share the same mediaType.

cep-oci.md Outdated

For example, a package like `xtensor-0.10.4-h431234.conda` would map to a OCI registry `conda-forge/linux-64/xtensor:0.10.4-h431234`.

### Layers
Contributor

A table with the different layer mediaTypes and the expected contents would make this section super easy to read.

Contributor Author

Feel free to format and push to this PR! :) This is a collaborative document

@jaimergp
Contributor

> we should say that conda / mamba / rattler use oci:// as the scheme.

What does an oci:// URL look like? Is it a direct translation of the anaconda.org artifact that gets manipulated? How is the server referred to? Basically, which one is correct:

  • oci://ghcr.io/channel-mirrors/conda-forge/linux-64/xtensor:0.10.4-h431234
  • oci://ghcr.io/channel-mirrors/conda-forge/linux-64/xtensor-0.10.4-h431234
  • oci://ghcr.io/channel-mirrors/conda-forge/linux-64/xtensor-0.10.4-h431234.tar.bz2

Is it possible to refer to a layer directly by URL? In case of repodata:

  • oci://ghcr.io/channel-mirrors/conda-forge/linux-64/repodata.json.zstd
  • oci://ghcr.io/channel-mirrors/conda-forge/linux-64/repodata.json:zstd

I think this needs to be standardized too.


We want to use OCI registries as a storage for conda packages. This CEP specifies how we lay out conda packages on an OCI registry.

## Specification
Contributor

The distribution-spec contains useful definitions for these terms:

https://github.com/opencontainers/distribution-spec/blob/v1.0/spec.md#definitions

Contributor Author

I linked it :)

cep-oci.md Outdated

- for a .tar.bz2 package, the mediaType is `application/vnd.conda.package.v1`
- for a .conda package, the mediaType is `application/vnd.conda.package.v2`
- for the `info` folder as gzip the mediaType is `application/vnd.conda.info.v1.tar+gzip`
Contributor

The info folder can get heavy with licenses, test files and what not. Will we allow .tar+zstd, for example?

Contributor Author

Since we have already uploaded the whole of conda-forge with the gzip one, I would suggest keeping it like that for now. We could move to a zstd-encoded one in the future.

Or we could just allow for any encoding (+gzip or +zstd)?

Contributor

It would be nice to be able to use the unmodified info-pip-23.3.1-py312haa95532_0.tar.zst out of the .conda

Contributor Author

Indeed. I used the tar.gz approach because of how easy it is to open and explore it with pure Python. But the unmodified one makes sense as well. We could specify that application/vnd.conda.info.v1.tar+zst will also be accepted?

Contributor

Makes sense.
.tar.gz just seemed odd: although gzip is very ordinary, it hasn't been used in conda packaging.
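
For reference, the package-layer mediaTypes quoted in this thread can be collected into a small lookup. This is only a sketch: the function name `package_media_type` is made up for illustration, and the zstd variant of the `info` layer is left out because its exact spelling was still being discussed.

```python
# mediaTypes as quoted from the draft CEP text in this thread.
PACKAGE_MEDIA_TYPES = {
    ".tar.bz2": "application/vnd.conda.package.v1",
    ".conda": "application/vnd.conda.package.v2",
}
INFO_MEDIA_TYPE_GZIP = "application/vnd.conda.info.v1.tar+gzip"


def package_media_type(filename: str) -> str:
    """Return the package-layer mediaType for a conda artifact filename.

    `package_media_type` is a hypothetical helper, not part of any tool.
    """
    for ext, media_type in PACKAGE_MEDIA_TYPES.items():
        if filename.endswith(ext):
            return media_type
    raise ValueError(f"not a conda package filename: {filename!r}")
```

Because the two package formats map to different mediaTypes, a manifest could carry both a `.tar.bz2` and a `.conda` layer without clashing, which is the situation discussed earlier in the review.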

wolfv and others added 2 commits April 12, 2024 17:50
Co-authored-by: Matthew R. Becker <beckermr@users.noreply.github.com>
Co-authored-by: jaimergp <jaimergp@users.noreply.github.com>
@wolfv
Contributor Author

wolfv commented Apr 12, 2024

Yeah, the nice thing is that the URLs map directly to the OCI registry. There just needs to be some post-processing for the tags.

E.g. instead of

https://conda.anaconda.org/conda-forge/linux-64/repodata.json

We ask for

oci://ghcr.io/channel-mirrors/conda-forge/linux-64/repodata.json:latest

And for packages

https://conda.anaconda.org/conda-forge/linux-64/numpy-1.23.1-h123123.tar.bz2

We ask for

oci://ghcr.io/channel-mirrors/conda-forge/linux-64/numpy:1.23.1-h123123
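
The two mappings above can be sketched in a few lines of Python. The helper name `conda_url_to_oci` and the constant `OCI_PREFIX` are assumptions made for illustration, and the sketch deliberately skips the post-processing of names and tags that some packages need.

```python
from urllib.parse import urlsplit

# Assumption: the mirror lives under this registry and organization, as in the
# examples given in this thread.
OCI_PREFIX = "oci://ghcr.io/channel-mirrors"
PACKAGE_EXTENSIONS = (".tar.bz2", ".conda")


def conda_url_to_oci(url: str) -> str:
    """Map a conda channel URL to an oci:// reference (hypothetical helper)."""
    # The URL path looks like /<channel>/<subdir>/<filename>.
    channel, subdir, filename = urlsplit(url).path.strip("/").split("/")
    for ext in PACKAGE_EXTENSIONS:
        if filename.endswith(ext):
            # <name>-<version>-<build>: the name may itself contain dashes,
            # so split from the right.
            name, version, build = filename[: -len(ext)].rsplit("-", 2)
            return f"{OCI_PREFIX}/{channel}/{subdir}/{name}:{version}-{build}"
    # Anything else (e.g. repodata.json) is addressed with the "latest" tag.
    return f"{OCI_PREFIX}/{channel}/{subdir}/{filename}:latest"
```

Note that the file extension is dropped from the reference, so the `.tar.bz2` and `.conda` variants of a build end up under the same name and tag.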

@wolfv
Contributor Author

wolfv commented Apr 12, 2024

One thing that also needs to go into this document is how tags are formatted. Some versions cannot be translated to tags directly because of the rules of OCI registries.

The functions are here:

https://github.com/mamba-org/rattler/blob/a2073aa6b92196c50208d39fc6b6c67469bf7810/crates/rattler_networking/src/oci_middleware.rs#L77-L92

And when a package name starts with _ (also illegal on OCI registry) then we map it to _ -> zzz: https://github.com/mamba-org/rattler/blob/a2073aa6b92196c50208d39fc6b6c67469bf7810/crates/rattler_networking/src/oci_middleware.rs#L158-L161
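
As a rough illustration of the mangling being described, here is a Python sketch. It is not a transcription of the linked Rust functions: the helper names are hypothetical, and the replacement table is the one from the draft CEP text quoted further down in this conversation.

```python
# Replacements as listed in the draft CEP text quoted in this thread.
TAG_REPLACEMENTS = {"+": "__p__", "!": "__e__", "=": "__eq__"}


def mangle_name(name: str) -> str:
    """Prefix names starting with an underscore so they start alphanumerically."""
    return f"zzz{name}" if name.startswith("_") else name


def mangle_tag(version: str, build: str) -> str:
    """Join version and build into a tag, replacing characters OCI tags reject."""
    tag = f"{version}-{build}"
    for char, replacement in TAG_REPLACEMENTS.items():
        tag = tag.replace(char, replacement)
    return tag
```

For example, `mangle_name("_libgcc_mutex")` gives `zzz_libgcc_mutex`, matching the name used in the draft text.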

@jaimergp
Contributor

Food for thought: implement subdirs with platform metadata per layer.

@jaimergp
Contributor

> Some versions cannot be translated to tags directly because of the rules of OCI registries.

For reference, rules are here: https://github.com/opencontainers/distribution-spec/blob/v1.0/spec.md#pulling-manifests

@Hind-M
Contributor

Hind-M commented Apr 15, 2024

> we should say that conda / mamba / rattler use `oci://` as the scheme.

Hmm, AFAIK we are rather using https as scheme in mamba: https://ghcr.io/...
Last time I checked, I think curl was complaining about unknown oci protocol...

@wolfv
Contributor Author

wolfv commented Apr 15, 2024

Yeah, but regular HTTPS requests don't work. That's why we need a middleware or some other layer that converts oci://... requests to https:// before sending them to cURL.
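
The URL-rewriting part of such a middleware could look roughly like the sketch below. The function name is hypothetical; the `/v2/<name>/manifests/<reference>` path is the manifest endpoint defined by the OCI distribution spec. A real client would additionally need authentication and a second request for the layer blob, both of which are omitted here.

```python
def oci_to_https_manifest_url(oci_url: str) -> str:
    """Rewrite oci://<registry>/<name>:<tag> into the HTTPS manifest endpoint."""
    if not oci_url.startswith("oci://"):
        raise ValueError(f"not an oci:// URL: {oci_url!r}")
    # The registry host is everything up to the first slash.
    registry, _, rest = oci_url[len("oci://"):].partition("/")
    # The tag follows the last colon; repository names cannot contain colons.
    name, _, tag = rest.rpartition(":")
    if not name or not tag:
        raise ValueError(f"missing repository name or tag: {oci_url!r}")
    # OCI distribution spec: manifests are pulled from
    # /v2/<name>/manifests/<reference>.
    return f"https://{registry}/v2/{name}/manifests/{tag}"
```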

@jaimergp jaimergp linked an issue Apr 15, 2024 that may be closed by this pull request
wolfv and others added 2 commits April 25, 2024 16:06
Co-authored-by: jaimergp <jaimergp@users.noreply.github.com>
@Hind-M
Contributor

Hind-M commented Jun 3, 2024

Added a commit but couldn't push it here.

@wolfv
Contributor Author

wolfv commented Jun 3, 2024

Looks good, @Hind-M. Maybe you can make a PR against my branch (https://github.com/wolfv/ceps/tree/oci-cep). I can then merge it there, and the PR will be updated.


The regex expresses that names can only start with an alphanumeric letter.

In `conda`, names can start with an underscore and it is used by conda-forge (e.g. `_libgcc_mutex`). For this reason, we prepend packages with a leading underscore with the string `zzz`. The name would thus be changed to `zzz_libgcc_mutex`.
Member

Suggested change

Original:
In `conda`, names can start with an underscore and it is used by conda-forge (e.g. `_libgcc_mutex`). For this reason, we prepend packages with a leading underscore with the string `zzz`. The name would thus be changed to `zzz_libgcc_mutex`.

Suggested:
In `conda`, names can start with an underscore and it is used by conda-forge (e.g. `_libgcc_mutex`). For this reason, we prepend packages with a leading underscore with the string `internal`. The name would thus be changed to `internal_libgcc_mutex`.

internal seems more fitting than zzz 💤

Contributor

I'm afraid we might be late here. ~74 packages are mirrored like this already. We would need to delete and re-mirror? Or can we just copy images and labels?

Contributor

Also, how can we safely remove internal_ from the names? It might be something being used already. What about internal___ (triple underscore)?

Contributor

See my comment below on making this not ambiguous by merging this part of the spec with other encodings.

Member

@jezdez jezdez Aug 20, 2024

> I'm afraid we might be late here. ~74 packages are mirrored like this already. We would need to delete and re-mirror? Or can we just copy images and labels?

The mirrors are proofs of concept from the conda community's perspective (I think from conda-forge as well), so I don't think these matter. Re-mirroring 74 packages seems a small price to pay for a better spec.

Contributor

100% agreed, @jezdez; re-adding 74 packages is simple enough.

Contributor

Thinking again about this, the leading underscore doesn't really affect package names but channel names, given our image name scheme: channel/subdir/package. So maybe we don't have to worry too much about this limitation?

Contributor

I had the same thought, but it turns out the underscore is not allowed after any other separator according to the regex in the spec.

Contributor

Ugh... Ok, then I said nothing! 😬

Comment on lines +125 to +127
- `+` is replaced by `__p__`
- `!` is replaced by `__e__`
- `=` is replaced by `__eq__`
Member

That list seems a little random. It looks like Python dunder methods, but not quite valid ones. How about this instead?

Suggested change

Original:
- `+` is replaced by `__p__`
- `!` is replaced by `__e__`
- `=` is replaced by `__eq__`

Suggested:
- `+` is replaced by `__add__`
- `!` is replaced by `__not__`
- `=` is replaced by `__eq__`

Contributor

p for plus and e for exclamation, I guess? These characters might not mean "adding" or "negating" in a version/build string, so I'm not sure whether this is a good way to solve the "randomness" here. These names are only for the OCI names, anyway, right? They are renamed once downloaded.

Contributor

@Hind-M Hind-M Aug 20, 2024

> p for plus and e for exclamation, I guess?

Yes!
These mappings (prepending the package names starting with _ with zzz and replacing + etc for tags) are really only intended to make them valid in order to be stored in an OCI registry (correct me if I'm wrong @wolfv). It doesn't impact anything elsewhere as the names and tags are still visible (by users when downloaded for example) as their original strings.
So I don't think having internal instead of zzz (same for the tags suggestions) would really matter (apart from making sure that it wouldn't conflict with potential similar existing strings).
While something more explicit would be preferable, given that a bunch of packages have already been mirrored that way as @jaimergp mentioned, I would say to keep it as is?

Contributor

Is there a standard encoding scheme we can rely on here instead of inventing our own?

Contributor

@beckermr beckermr Aug 20, 2024

Also, if we insist on a special encoding, we can add underscores themselves as __under__, which would render our scheme unambiguous, since an existing package with __under__ in it would be mapped to __under____under__under__under____under__.
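
To make the idea concrete, here is a minimal Python sketch of such an escaping scheme with a matching decoder. The token table combines the three replacements from the draft with the proposed `__under__` escape; `encode`, `decode`, and the table names are hypothetical. Decoding scans left to right rather than using plain string replacement, because adjacent tokens can otherwise be misread.

```python
# Hypothetical token table: the replacements from the draft plus the proposed
# `__under__` escape for the underscore itself.
TOKENS = {"_": "__under__", "+": "__p__", "!": "__e__", "=": "__eq__"}
REVERSE = {token: char for char, token in TOKENS.items()}


def encode(text: str) -> str:
    """Escape every special character, including each underscore."""
    return "".join(TOKENS.get(char, char) for char in text)


def decode(encoded: str) -> str:
    """Invert encode() by scanning left to right for escape tokens."""
    out, i = [], 0
    while i < len(encoded):
        for token, char in REVERSE.items():
            if encoded.startswith(token, i):
                out.append(char)
                i += len(token)
                break
        else:
            # Not at a token: copy the literal character unchanged.
            out.append(encoded[i])
            i += 1
    return "".join(out)
```

Since every underscore in the encoded form belongs to a token, no two distinct inputs share an encoding. An encoded name such as `__under__libgcc__under__mutex` still begins with an underscore, though, so this alone would not satisfy the rule that names start with an alphanumeric character.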

Contributor

Are conda package names case sensitive?

Contributor

Based on your findings do you want to change anything with regard to the name mangling suggested above?

Contributor

Conda package names are always lower case, so presumably case insensitive.

Contributor

@baszalmstra we might as well have everything lowercase since that is what the OCI registry will look like.

Contributor

Case insensitive filesystems are usually case preserving though.


A given conda-package is identified by a URL like `<subdir>/<package-name>-<version>-<build>.<ext>` where `<subdir>` is the platform and architecture, `<package-name>` is the name of the package, `<version>` is the version of the package, `<build>` is the build string of the package, and `<ext>` is the extension of the package file.

#### Mapping the package name
Contributor

Are these "mappings" reverted to their original names once downloaded/extracted? I would assume so and I think that should be part of the CEP.

Contributor

I added a note for that.

-Add me as author
-Remove jlap as not relevant anymore
@beckermr
Contributor

I am emeritus so I cannot vote here, but IMHO the issues around encoding/decoding above need to be resolved before this spec can be passed.

@jakirkham
Member

It seems like there is a fair amount of discussion still happening here for an active vote. Should we cancel the vote and reschedule after that discussion has reached a conclusion?


The regex expresses that names can only start with an alphanumeric letter.

In `conda`, names can start with an underscore and it is used by conda-forge (e.g. `_libgcc_mutex`). For this reason, we prepend packages with a leading underscore with the string `zzz`. The name would thus be changed to `zzz_libgcc_mutex`.
Contributor

What happens if there is also a package called zzz_libgcc_mutex?

This seems like a possible attack vector.

Contributor

This is similar to symbol name mangling I guess? Perhaps we should prefix all packages with a common prefix and a special one for underscores?

Comment on lines +125 to +127
- `+` is replaced by `__p__`
- `!` is replaced by `__e__`
- `=` is replaced by `__eq__`
Contributor

Is there a standard encoding scheme we can rely on here instead of inventing our own?

The text itself mentions that package names can start with an underscore. _libgcc_mutex is the example given.

Perhaps we can learn from symbol name mangling instead of coming up with a unique scheme?

@Hind-M
Contributor

Hind-M commented Aug 21, 2024

> It seems like there is a fair amount of discussion still happening here for an active vote. Should we cancel the vote and reschedule after that discussion has reached a conclusion?

I think we can wait and see if we reach a conclusion before the voting deadline, and possibly extend it if necessary. If not, we would definitely reschedule yes.
I'm not familiar with the usual processes (of voting and such), so if this isn't the correct approach, please let me know and I'll adjust accordingly.

@jaimergp
Contributor

jaimergp commented Aug 27, 2024

> This is similar to symbol name mangling I guess? Perhaps we should prefix all packages with a common prefix and a special one for underscores?

I like this proposal by @baszalmstra, but we will need to remirror the whole thing.

Another alternative, as proposed by @wolfv, is to SHA256-hash the names and forget about it. You can map back by checking the internal metadata. This will also require a complete remirror.

But we can also simply SHA256-hash the names of container images that start with an underscore and leave everything else untouched. This wouldn't require a full remirror.

@beckermr
Contributor

I think the name length limit is the bigger issue here though. I guess we'll need a fixed length hash of the name.

@wolfv
Contributor Author

wolfv commented Aug 27, 2024

Let me clarify some things:

Name attacks are not really possible (for now) since we also mirror the repodata from conda-forge / Anaconda.org. From the repodata, we just use the SHA256 hash to directly reference the right blob. The names & tags for the individual packages are mainly "cosmetic".

Even if we would run our own indexing, we would refer back to the stored index.json file (and the name in there) and not derive back the original name from the OCI image.
So I am personally not super concerned by this.

But I can sympathize with a mapping that would disallow any such overlaps.

I also think that re-mirroring might not be a big issue, since we just need to move the names / tags in the OCI registry to the right places (SHA hashes and packages will stay the same).
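
A short sketch of what "use the SHA256 hash to directly reference the right blob" can mean in practice. The helper name `blob_url` is hypothetical; the `/v2/<name>/blobs/<digest>` path is the blob endpoint defined by the OCI distribution spec, and the sketch assumes the package layer is stored unmodified, so that its digest equals the `sha256` recorded in repodata.

```python
import hashlib


def blob_url(registry: str, repository: str, sha256_hex: str) -> str:
    """Build the OCI blob endpoint for a layer identified by its SHA256 digest."""
    if len(sha256_hex) != 64 or any(c not in "0123456789abcdef" for c in sha256_hex):
        raise ValueError("expected a lowercase hex SHA256 digest")
    # OCI distribution spec: blobs are pulled from /v2/<name>/blobs/<digest>.
    return f"https://{registry}/v2/{repository}/blobs/sha256:{sha256_hex}"


# Stand-in payload for illustration only; a real client would read the
# "sha256" field from the package's repodata record instead of hashing here.
digest = hashlib.sha256(b"stand-in package bytes").hexdigest()
example = blob_url("ghcr.io", "channel-mirrors/conda-forge/linux-64/numpy", digest)
```

With this approach the repository name still has to be a valid OCI name, but the tag never needs to be derived back into a version or build string.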

@Hind-M
Contributor

Hind-M commented Aug 28, 2024

So just to wrap up here:

  • Since we are constrained by the limit of characters, we could say that we should avoid mapping every underscore.
  • The names should be easily identifiable by users (useful when searching for packages in the registry), so using a hash instead of the name would be cryptic.
  • Appending the name hash, or part of it, sounds odd (and somewhat redundant?), even if we do it for every package for consistency (not only those starting with an underscore). It remains an option, though.
  • Prepending a few characters like the first strategy (zzz) seems to be a good compromise.
    Maybe use something else instead like @jezdez suggested? (pkg, oci, int (for internal)...)
    I have a preference for oci, since it's suggesting that this is an oci specific thing.
    We can do this for all packages or just for the package names starting with _.

Should we also handle names that exceed the maximum character limit? (I'm not sure if this is something already taken care of.)

@beckermr
Contributor

beckermr commented Aug 28, 2024

We CANNOT only prepend to packages that start with an underscore. This would declare part of the package namespace off limits for all conda users. That action requires at minimum a separate CEP where we codify package names formally. We have to prepend to everything.

And yes we need to handle the max characters limit properly in this CEP.

@dholth
Contributor

dholth commented Aug 28, 2024

Where is the specification for the length limit?
Do users have business browsing through the raw OCI anyway?

@beckermr
Contributor

beckermr commented Aug 28, 2024

Tags have a limit of 128 characters: https://docs.docker.com/reference/cli/docker/image/tag/
Names have practical limits. See the implementers note in the OCI spec for pull: https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pull.

It is hard to know if people want to browse an OCI registry. For sure github provides searches based on image names. See for example the existing OCI mirror (https://github.com/orgs/channel-mirrors/packages) based on a beta spec.

@Hind-M
Contributor

Hind-M commented Sep 3, 2024

FYI, and considering the still ongoing discussions, the vote has been postponed to a later date that is yet to be determined.

@thomasjpfan

> It is hard to know if people want to browse an OCI registry. For sure github provides searches based on image names.

As a user, as long as https://conda-forge.org/packages/ (or https://bioconda.github.io/conda-package_index.html) has all the correct metadata and points to the correct images in the oci registry, I do not mind if the oci image name itself is hashed.

Concretely, I would be okay with

conda-forge/linux-64/43asfdsafr9434234:321424fdasfsdf

which has the format <hash_package_name>:<hash_version+build>

If it's important to keep some of the original package name in the oci image name, another option is to strip away the invalid characters in the names/tags and add the hash next to it:

<stripped_package_name><hashed_package_name>:<stripped_version+build>-<hash_version+build>

Something like this:

conda-forge/linux-64/numpy-fsafsdaffs:2.2.2-py313hc518a0f_0-43asfdsafr9434234

The stripped package names and tags can also be trimmed to make sure they are under the character limits.
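
One way to read the "stripped name plus hash" option is sketched below in Python. Everything specific here is an assumption for illustration: the helper names, keeping 12 hex digits of the SHA256, the 64-character budget for the name, and stripping names down to lowercase alphanumerics. Only the 128-character tag limit comes from this thread.

```python
import hashlib
import re

HASH_LEN = 12      # assumption: number of hex digits kept from the SHA256
MAX_TAG_LEN = 128  # tag limit cited in this thread


def _short_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:HASH_LEN]


def _strip(text: str, allowed: str, max_len: int) -> str:
    """Drop characters outside `allowed`, then trim to at most `max_len`."""
    return re.sub(f"[^{allowed}]", "", text)[:max_len]


def hashed_name(name: str, max_len: int = 64) -> str:
    # The hash is taken over the original name, so two names that strip to
    # the same text still end up under different image names.
    stripped = _strip(name.lower(), "a-z0-9", max_len - HASH_LEN - 1)
    return f"{stripped}-{_short_hash(name)}" if stripped else _short_hash(name)


def hashed_tag(version: str, build: str) -> str:
    original = f"{version}-{build}"
    # Tags may not start with "." or "-", so drop those from the front.
    stripped = _strip(original, "a-zA-Z0-9._-", MAX_TAG_LEN - HASH_LEN - 1).lstrip(".-")
    return f"{stripped}-{_short_hash(original)}" if stripped else _short_hash(original)
```

The readable part helps when browsing a registry, while the hash suffix carries the uniqueness; the original name and version would still have to be recovered from the package metadata rather than from the image name.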

@beckermr
Contributor

beckermr commented Mar 8, 2025

I wrote some code to examine all packages from conda-forge for invalid names, the length of the OCI-encoded package, etc. Here are the results

invalid conda dist: pyside2-2.0.0~alpha0-py27_0.tar.bz2 (label broken-and-invalid)
invalid oci dist: conda-forge/linux-64/cpyside2:2.0.0~alpha0-py27__0 (label: broken-and-invalid)
invalid conda dist: pyside2-2.0.0~alpha0-py35_0.tar.bz2 (label broken-and-invalid)
invalid oci dist: conda-forge/linux-64/cpyside2:2.0.0~alpha0-py35__0 (label: broken-and-invalid)
invalid conda dist: pyside2-2.0.0~alpha0-py36_0.tar.bz2 (label broken-and-invalid)
invalid oci dist: conda-forge/linux-64/cpyside2:2.0.0~alpha0-py36__0 (label: broken-and-invalid)
invalid conda dist: pyside2-2.0.0~alpha0-py27_0.tar.bz2 (label broken-and-invalid)
invalid oci dist: conda-forge/osx-64/cpyside2:2.0.0~alpha0-py27__0 (label: broken-and-invalid)
invalid conda dist: pyside2-2.0.0~alpha0-py35_0.tar.bz2 (label broken-and-invalid)
invalid oci dist: conda-forge/osx-64/cpyside2:2.0.0~alpha0-py35__0 (label: broken-and-invalid)
invalid conda dist: pyside2-2.0.0~alpha0-py36_0.tar.bz2 (label broken-and-invalid)
invalid oci dist: conda-forge/osx-64/cpyside2:2.0.0~alpha0-py36__0 (label: broken-and-invalid)
max oci name length: 71
max oci tag length: 81

conda-forge has a few invalid artifact names that we've known about for a while.

Overall, we're pretty far from the oci limits which is good.

However, we should put something in the spec to handle cases where the limits are exceeded. I will write a proposal for that.
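
The script used for the scan is not shown in this conversation. A check of this kind might look like the following Python sketch, which validates a `name:tag` reference against the name and tag grammar of the OCI distribution spec (v1.0). The regular expressions should be double-checked against the spec before being relied on, and `check_reference` is a hypothetical helper.

```python
import re

# Name and tag grammar per the OCI distribution spec (v1.0); verify against
# the spec text before relying on these patterns.
NAME_RE = re.compile(r"[a-z0-9]+([._-][a-z0-9]+)*(/[a-z0-9]+([._-][a-z0-9]+)*)*")
TAG_RE = re.compile(r"[a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}")


def check_reference(reference: str) -> list[str]:
    """Return a list of problems found in an OCI `name:tag` reference."""
    # The tag follows the last colon; names cannot contain colons.
    name, _, tag = reference.rpartition(":")
    problems = []
    if not NAME_RE.fullmatch(name):
        problems.append(f"invalid name: {name!r}")
    if not TAG_RE.fullmatch(tag):
        problems.append(f"invalid tag: {tag!r}")
    return problems
```

Applied to one of the entries reported above, `conda-forge/linux-64/cpyside2:2.0.0~alpha0-py27__0`, the name passes but the tag is rejected because of the `~` character.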

@beckermr
Contributor

beckermr commented Mar 8, 2025

cc @jaimergp

@jaimergp

This comment was marked as outdated.

@beckermr
Contributor

beckermr commented Mar 10, 2025

I have a working version of the spec with the updates from @baszalmstra and the hashing. Should we make a new PR, or should I edit this one directly? There is a failed vote on this PR, so I think a new one might be cleaner, but I don't want to erase history either.

Thoughts @jaimergp ?

@jaimergp
Contributor

> There is a failed vote on this PR, so I think a new one might be cleaner, but I don't want to erase history either.

I'm partial to opening a new PR with sufficient references to this PR and the open discussions.

@jaimergp
Contributor

Closing in favor of #115. Thank you everyone for your comments and feedback!

@jaimergp jaimergp closed this Mar 11, 2025

Labels

vote Voting following governance policy

Development

Successfully merging this pull request may close these issues.

CEP request: OCI packaging

9 participants