When doing lfs conversion, allow keeping protected files in the repo, instead of moving to lfs and re-committing them at HEAD #138

Open
miszobi opened this issue Apr 11, 2016 · 3 comments

Comments

@miszobi

miszobi commented Apr 11, 2016

First of all thanks for the great tool!

When converting a repository to LFS that has protected files (in HEAD or any other protected branch), in the rewritten repository these files get converted to LFS initially, then on the HEAD commit they get converted back to their non-LFS counterparts.

This, along with creating a .gitattributes file in each affected directory and deleting that file in HEAD, makes the converted history messy and unnatural to review later on.

It'd be great if for protected files that will be kept there was a possibility to keep them as non-LFS throughout their history instead.

@rtyley
Owner

rtyley commented Apr 11, 2016

> It'd be great if for protected files that will be kept there was a possibility to keep them as non-LFS throughout their history instead.

Object protection in the BFG works in a rather more subtle way than you might think...! Any folder (called a 'tree' in Git) in your history that is exactly identical (has an identical Git SHA) to a folder in a protected commit will be protected (i.e. stay non-LFS). This is because of the way the BFG works: any object (folder or file) is only cleaned once, and the result is stored against its Git SHA, so 'protecting' an object just means pre-storing the ids of protected objects with a 'cleaned' value that is identical to the 'dirty' value. So slightly more than you might think gets protected, because old commits are sometimes left with unchanged folders as a result of protecting the head commit.
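To make the "clean once, store the result against the id" idea concrete, here is a minimal Python sketch. It is not BFG code (the BFG is written in Scala); the class name `MemoizedCleaner`, the `to_lfs_pointer` stand-in and the object ids are all made up for illustration.

```python
from typing import Callable, Dict


class MemoizedCleaner:
    """Cleans each object id at most once, caching dirty-id -> clean-id."""

    def __init__(self, clean_fn: Callable[[str], str]) -> None:
        self._clean_fn = clean_fn
        self._memo: Dict[str, str] = {}
        self.calls = 0  # how many times the real cleaning step actually ran

    def protect(self, object_id: str) -> None:
        # "Protecting" is just pre-storing a result identical to the input.
        self._memo[object_id] = object_id

    def clean(self, object_id: str) -> str:
        if object_id not in self._memo:
            self.calls += 1
            self._memo[object_id] = self._clean_fn(object_id)
        return self._memo[object_id]


def to_lfs_pointer(object_id: str) -> str:
    # Hypothetical stand-in for a real cleaning step.
    return "lfs-pointer-for-" + object_id


cleaner = MemoizedCleaner(to_lfs_pointer)
cleaner.protect("tree-aaa")                  # folder present in a protected commit
unchanged = cleaner.clean("tree-aaa")        # same id in an old commit: left alone
rewritten = cleaner.clean("tree-bbb")        # any other id is cleaned as usual
rewritten_again = cleaner.clean("tree-bbb")  # second lookup is served from the memo
```

Because the lookup is keyed purely on the id, an old commit that happens to contain a byte-identical folder is protected "for free", while a folder that differs in any way is not.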

However, the very model of Git itself doesn't really recognise the 'history' of a file, and this means the BFG cannot carefully preserve the state of your file throughout history, at least not in a performant way. As far as the BFG and Git are concerned, if a file contains even slightly different content, it's just a flat-out different file, and there's no concept of a file's 'history' to link them.
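A small Python sketch of why that is: Git names a file's content (a 'blob') by hashing a short header plus the bytes, so two versions that differ by a single character get unrelated ids. This assumes a repository using the default SHA-1 object format; the sample contents are invented.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id Git assigns to a blob with this content (SHA-1 repos)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


v1 = git_blob_id(b"frame data v1\n")
v2 = git_blob_id(b"frame data v2\n")  # one character changed
empty = git_blob_id(b"")              # the well-known id of the empty blob

print(v1)
print(v2)
print(empty)
```

Nothing in either id records the file name or which version came first, so a protection scheme keyed on ids has no way to follow "the same file" back through its edits.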

See also:

https://www.youtube.com/watch?v=4XpnKHJAok8#t=2625 and #53 (comment)

Perhaps I'm being too hard on the concept, because rather than protecting a file's Git SHA id, it should be possible to protect all files with the same name (not file path, but name). This still wouldn't handle renames (i.e. surprise.mp4 wouldn't protect all versions of suprise.mp4), and it might lead to too much stuff being protected, but in your case at least it might get closer to what you're expecting.
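A rough sketch of what name-based protection could look like, in Python. This is purely illustrative and not an existing BFG option; the function names and paths are hypothetical.

```python
import posixpath
from typing import Iterable, Set


def protected_names(protected_paths: Iterable[str]) -> Set[str]:
    """Collect the bare file names (not full paths) found in a protected commit."""
    return {posixpath.basename(p) for p in protected_paths}


def is_protected(path: str, names: Set[str]) -> bool:
    """A file in history is protected if its bare name matches a protected name."""
    return posixpath.basename(path) in names


names = protected_names(["media/surprise.mp4"])

same_name_old_path = is_protected("old/videos/surprise.mp4", names)     # moved, same name
renamed = is_protected("media/suprise.mp4", names)                      # renamed earlier version
unrelated_same_name = is_protected("vendor/demo/surprise.mp4", names)   # different file, same name
```

The three lookups show both trade-offs: a moved file is still covered, the renamed version is missed, and an unrelated file that merely shares the name gets protected too.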

From a purist point of view it's a bit dubious tho' and I'm not convinced it's worth the extra code!

@javabrett
Contributor

Accepting that BFG memoizes trees and does not need to do so for blobs: would it be technically feasible to start a BFG run with a tree-walk of the trees of all protected refs, pre-memoizing the blobs they contain, and then, during the commit-walk, perform a quick check that each blob being cleaned isn't in that set?

BFG would still not memoize blobs as it walked and memoized commits/trees - this blob-protection-list would be read-only. We can be 100% sure that we won't be removing that blob from the repo, since it is in a protected ref's tree, so we can just not clean it wherever it is found, which will have the side-effect of appearing to preserve its history (in the way that Git fashions this).

  • I haven't coded it, but I imagine this shouldn't carry much of a performance penalty: one walk of the protected trees up front, and a lookup in that map for each blob's sha1 before it undergoes any other cleaning tests.
  • It's what users expect. Although, as has been nicely explained here and elsewhere, Git does not track the history of a file per se, it can mimic it by showing the difference between trees. Because the protected tree's blob is no longer in the protected head's parent (it was scrubbed from some ancestor), it appears to be added in the protected commit.
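To make the proposal concrete, here is a minimal Python sketch of the two steps: walk the protected trees once to collect blob ids, then consult that read-only set before cleaning any blob. The in-memory `TreeStore`, the `"lfs:"` prefix and all ids are assumptions for illustration; the real BFG works on Git objects via JGit and is written in Scala.

```python
from typing import Callable, Dict, List, Set, Tuple

Entry = Tuple[str, str, str]  # (kind, name, object id); kind is "blob" or "tree"
TreeStore = Dict[str, List[Entry]]


def collect_protected_blobs(store: TreeStore, root_tree_ids: List[str]) -> Set[str]:
    """Walk every tree reachable from the protected roots, gathering blob ids."""
    protected: Set[str] = set()
    seen_trees: Set[str] = set()
    stack = list(root_tree_ids)
    while stack:
        tree_id = stack.pop()
        if tree_id in seen_trees:
            continue  # identical subtrees only need visiting once
        seen_trees.add(tree_id)
        for kind, _name, object_id in store[tree_id]:
            if kind == "tree":
                stack.append(object_id)
            else:
                protected.add(object_id)
    return protected


def clean_tree(entries: List[Entry], protected: Set[str],
               should_convert: Callable[[str], bool]) -> List[Entry]:
    """Clean one tree level, skipping any blob found in the protected set."""
    cleaned: List[Entry] = []
    for kind, name, object_id in entries:
        if kind == "blob" and object_id not in protected and should_convert(name):
            cleaned.append((kind, name, "lfs:" + object_id))  # hypothetical conversion
        else:
            cleaned.append((kind, name, object_id))
    return cleaned


# Trees of a protected HEAD (hypothetical ids).
store: TreeStore = {
    "head-root": [("tree", "media", "head-media"), ("blob", "README", "b-readme")],
    "head-media": [("blob", "logo.png", "b-logo-v2")],
}
protected = collect_protected_blobs(store, ["head-root"])

# A folder from an older, unprotected commit.
old_media: List[Entry] = [("blob", "logo.png", "b-logo-v2"),
                          ("blob", "intro.png", "b-intro-v1")]
result = clean_tree(old_media, protected, lambda name: name.endswith(".png"))
```

In this toy run the old copy of `logo.png` survives because its blob id also appears under the protected HEAD, while `intro.png` is converted as usual. Only versions whose content is byte-identical to a protected blob are kept; older, different versions of the same file would still be cleaned.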

If that approach sounds feasible and there's conceptual support for it, perhaps a few folks can collaborate on the code-change.

javabrett added a commit to javabrett/bfg-repo-cleaner that referenced this issue May 17, 2016
…ot just in protected refs' trees. Fixed rtyley#49, rtyley#53, rtyley#138.

Added objectId exclusion filters during tree blob-cleaning, so that blobs that exist in the trees of protected refs, as stored in the census (AKA dirt), are protected not only in those trees but in any other trees in which they occur in the walked history. This prevents the perception that those files are deleted in history and then re-added in the final rewritten commit, which is the protected HEAD (with dirt) and its untouched tree. It is only a perception, because Git does not track the history of individual files, but it does show diffs and logs that indicate such changes. Prior to this change, protected blobs were removed from non-protected history trees and retained only in the final HEAD ref, which then showed as an add in that commit when logs/diffs were taken.

This change is convenient if your clean-up selectors (by name, or size) do select some dirt (files still in HEADs that you want to keep) but you don't want those files to appear as if they were recently added to HEAD.
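The "appears as an add" effect can be illustrated with a small Python sketch that compares two flattened trees the way a diff does. The path-to-blob-id dictionaries and the ids are invented for the example; this is not how the BFG or Git stores trees.

```python
from typing import Dict


def diff_trees(parent: Dict[str, str], child: Dict[str, str]) -> Dict[str, str]:
    """Classify each path as added, deleted or modified between two trees."""
    changes: Dict[str, str] = {}
    for path in sorted(set(parent) | set(child)):
        before, after = parent.get(path), child.get(path)
        if before == after:
            continue
        if before is None:
            changes[path] = "added"
        elif after is None:
            changes[path] = "deleted"
        else:
            changes[path] = "modified"
    return changes


# Protected HEAD tree, untouched by the rewrite.
head = {"src/main.c": "b-main", "assets/big.bin": "b-big"}

# Parent of HEAD after a rewrite that stripped the big file from history.
parent_stripped = {"src/main.c": "b-main"}

# Parent of HEAD when blobs found in protected trees are left alone everywhere.
parent_kept = {"src/main.c": "b-main", "assets/big.bin": "b-big"}

old_behaviour = diff_trees(parent_stripped, head)
new_behaviour = diff_trees(parent_kept, head)
```

With the file stripped from the parent, the diff reports `assets/big.bin` as added in HEAD; with the blob kept in the parent, the diff between the two commits is empty for that path.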
javabrett added a commit to javabrett/bfg-repo-cleaner that referenced this issue Feb 6, 2018
…ot just in protected refs' trees. Fixed rtyley#49, rtyley#53, rtyley#138.

@masalati1122

"git... never ever tracks a single file" youtu.be/4XpnKHJAok8#t=… So it models no direct link between different versions of the 'same' file.
