Skip to content

Conversation

@fridex
Copy link
Owner

@fridex fridex commented Mar 9, 2023

Replaced with python#3076

@fridex fridex force-pushed the provenance branch 3 times, most recently from a7c1bd6 to 0cd4065 Compare March 9, 2023 11:49
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
@fridex
Copy link
Owner Author

fridex commented Mar 9, 2023

A PoC patch for pip to bring this functionality in fridex/pip#1

Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
@fridex fridex force-pushed the provenance branch 2 times, most recently from a7ab14b to ec71fac Compare March 13, 2023 19:03
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Co-authored-by: Brett Cannon <brett@python.org>
pep-9999.rst Outdated
``hashlib.algorithms_guaranteed`` SHOULD always be included. At the time of
writing, ``sha256`` specifically is recommended.

- A ``hash`` key (type ``string``) MAY be present for compatibility purposes
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what compatibility purposes? List why those are important in the context of this file? What existing code is expected to be processing these that will not look at archive_info.hashes? Link to it and explain why expecting up to date versions of that to exist for parsing these cannot be achieved.

ie, if pip is the reason, explain how an old version of pip not supporting this file would ever be run in the presence of an installation with these files put in place by an installer and how it would ever wind up reading them.

I personally think a new PEP and data file is an opportunity to not allow/support the legacy in this new file.

Copy link
Owner Author

@fridex fridex Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think a new PEP and data file is an opportunity to not allow/support the legacy in this new file.

The PEP tried to respect what is already implemented in pip - not to introduce large changes that would eventually be harder to get accepted. If the PEP could throw away what is legacy here, as you suggest in your comments, that would be great.

I'll try to incorporate your suggestions, they look very reasonable. Thank you.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what compatibility purposes? List why those are important in the context of this file? What existing code is expected to be processing these that will not look at archive_info.hashes? Link to it and explain why expecting up to date versions of that to exist for parsing these cannot be achieved.

ie, if pip is the reason, explain how an old version of pip not supporting this file would ever be run in the presence of an installation with these files put in place by an installer and how it would ever wind up reading them.

I personally think a new PEP and data file is an opportunity to not allow/support the legacy in this new file.

Thinking about this, I'm not sure whether this should go into this PEP. As the PoC reused DirectUrl to understand what is needed to change in sources, the actual implementation might introduce a new class or at least adjust the current one to better respect provenance_url.json and its proposed implementation in this PEP, not to create confusion with direct_url.json (also, its internal pip API).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the hash key is deprecated in direct_url.json, I don't think it's worth keeping here (which it looks like the latest draft removed!).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the hash key has been removed in the latest draft as suggested by @gpshead. This conversation was kept open to clarify whether any details for pip implementation should be included in the PEP. I see @gpshead approved the changes already so we might eventually resolve this (?).

Thanks to gpshead for this suggestion.

Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Thanks to gpshead for these suggestions.

Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Fridolin Pokorny added 2 commits March 16, 2023 15:31
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Fridolin Pokorny added 2 commits March 16, 2023 18:29
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
@sbidoul
Copy link

sbidoul commented Mar 18, 2023

@fridex you can now simplify the specification part by referencing https://packaging.python.org/en/latest/specifications/direct-url-data-structure/.

Fridolin Pokorny added 3 commits March 20, 2023 14:07
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
@fridex
Copy link
Owner Author

fridex commented Mar 20, 2023

@fridex you can now simplify the specification part by referencing https://packaging.python.org/en/latest/specifications/direct-url-data-structure/.

Thanks, added to the references section for now - not sure if it should be directly used in the specification section as provenance_url.json diverges from direct_url.json as stated in the PEP-610/PyPA's spec page. The biggest difference are mostly fields that MUST/SHOULD be present in the provenance_url.json. See @gpshead's comment here, specifically hash and hashes differences. This might open a question whether DirectUrl should be reused in pip and whether there is a need to provide similar spec to the Direct URL Data Structure.

EDIT: The hash field is completely dropped in this PEP.

@sbidoul
Copy link

sbidoul commented Mar 20, 2023

@fridex thanks for the update. I personally think it is better to reference the direct url data structure in the specification section, mentioning that only archive_info is relevant and applicable for provenance_url.json.

By reusing the same specification, implementers can safely assume they can reuse the same code without having to dissect and compare the specs. It will also simplify maintenance if/when the spec evolves.

The hash key is clearly mentioned there as deprecated so it should not be an issue.

Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
@fridex
Copy link
Owner Author

fridex commented Mar 21, 2023

@fridex thanks for the update. I personally think it is better to reference the direct url data structure in the specification section, mentioning that only archive_info is relevant and applicable for provenance_url.json.

By reusing the same specification, implementers can safely assume they can reuse the same code without having to dissect and compare the specs. It will also simplify maintenance if/when the spec evolves.

The hash key is clearly mentioned there as deprecated so it should not be an issue.

Thanks, this seems reasonable as it follows what is already implemented in pip and is closer to the proposed pip PR and PEP-610. On the other hand, the proposed PEP is shifting away from this based on these two discussions with @gpshead:


Old text in this draft PEP, taken from PEP-610:

  • The hashes key SHOULD be present as a dictionary mapping a hash name to a hex encoded digest of the file.

A comment by @gpshead (link):

I'd change this to MUST. There is no point in this data if hash information is not available. This being a new file, it shouldn't need carry along the legacy "hash": "algorithm_name=hex_value" baggage.

Which resulted in a change to this wording:

The value of archive_info MUST be a dictionary with a single key hashes.


Old text in this draft PEP, taken from PEP-610:

When both the hash and hashes keys are present, the hash represented in
the hash key MUST also be present in the hashes dictionary, so consumers can consider the hashes key only if it is present, and fall back to hash otherwise.

A comment by @gpshead (link):

Very good! This is important. But this is the implementation complexity I'd rather we not even have. Code can be written that deals with this JSON but does not do this and does not validate equality of hashes specified in more than one place, and blindly uses the simpler hash field... which is then a security problem. get rid of the legacy field entirely in this file and the chance of that drops.

granted someone is still going to include the field. and will pass these data structures elsewhere to code that uses that instead of a hashes dict. but unless we require that code parsing these validate explicitly that the legacy "hash" key is not present, that unintended consequences bug cannot be avoided.

This led to a complete removal of the hash field.

It looks like we will need to make a consensus here.


BTW we should move this discussion to discuss.python.org as requested by Paul. It seems like it will make it easier for Paul to review it and also include other folks. I try to keep the draft PEP text up to date there.

@sbidoul
Copy link

sbidoul commented Mar 21, 2023

BTW we should move this discussion to discuss.python.org

@fridex I already raised this point at https://discuss.python.org/t/draft-pep-recording-provenance-of-installed-packages/24838/15?u=sbidoul

I don't think there is any incompatibility with what @gpshead said, though:

  • Regarding hash: the direct URL data structure spec already says A deprecated hash key (type string) MAY be present for backwards compatibility purposes,, so it is already implicit that new tools should not generate it. But it does not harm to reinforce this here.
  • Whether hashes is mandatory is a matter of opinion I guess. I personally think it's better if direct_url.json and provenance_url.json have the same behaviour in this respect. But even if we decide that a difference is desirable that does not prevent referencing the direct url data structure spec and add an additional requirement here.

This section now states Conda installer and other downstream installers.

Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
@fridex
Copy link
Owner Author

fridex commented Mar 26, 2023

BTW we should move this discussion to discuss.python.org

@fridex I already raised this point at https://discuss.python.org/t/draft-pep-recording-provenance-of-installed-packages/24838/15?u=sbidoul

I don't think there is any incompatibility with what @gpshead said, though:

  • Regarding hash: the direct URL data structure spec already says A deprecated hash key (type string) MAY be present for backwards compatibility purposes,, so it is already implicit that new tools should not generate it. But it does not harm to reinforce this here.
  • Whether hashes is mandatory is a matter of opinion I guess. I personally think it's better if direct_url.json and provenance_url.json have the same behaviour in this respect. But even if we decide that a difference is desirable that does not prevent referencing the direct url data structure spec and add an additional requirement here.

Let's have a sync once the PEP is actually submitted. I kept the hash key out of the spec and included it in the rejected ideas section. The hashes field is mandatory in the spec now. The compatibility with direct_url.json might be nice, then the question might be enforcing availability of these fields, which might be worth discussion.

@sbidoul thanks again for your very valuable feedback!

Fridolin Pokorny added 4 commits March 27, 2023 00:35
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
Signed-off-by: Fridolin Pokorny <fridolin.pokorny@datadoghq.com>
@fridex
Copy link
Owner Author

fridex commented Mar 29, 2023

Replaced with python#3076

@fridex fridex closed this Mar 29, 2023
@fridex fridex deleted the provenance branch March 29, 2023 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants