Git pre-commit hook: strip trailing whitespace #50

@gwern

Haskell source files are usually clean because I have an Emacs save-hook which strips trailing whitespace on Haskell/R/Bash/Python files. This doesn't help with the CSS/JS/PHP files, because those are usually edited by someone else (like Said Achmiz or an LLM). The result is noisy diffs (and slightly wasteful token use).

We should add a simple git hook to strip trailing whitespace from all text files in the repo before each commit is saved. (As far as I know, no text files in it depend on trailing whitespace.)
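Before enabling such a hook, it may be worth listing which tracked files currently carry trailing whitespace, to sanity-check the assumption that none depend on it. A minimal sketch; the function name `list_trailing_ws` is made up for illustration, and it only looks for trailing spaces and tabs (not stray carriage returns):

```shell
#!/usr/bin/env bash
# List tracked text files that contain trailing spaces or tabs.
#   -I  skip files that git considers binary
#   -l  print file names only
#   -E  extended regex; [[:blank:]] matches space and tab
# `git grep` exits 1 when nothing matches, hence the `|| true`.
list_trailing_ws() {
    git -C "${1:-.}" grep -I -l -E '[[:blank:]]+$' || true
}
```

Running `list_trailing_ws .` from the repo root prints one path per line; an empty result means the hook would have nothing to strip.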


Claude-4.7-opus suggests a version-controlled Bash script at build/git-hooks/pre-commit, enabled by git config core.hooksPath build/git-hooks, which operates on the git index directly (rather than the working tree). The same hook can also check that text files are valid UTF-8 with Unix line endings and no legacy BOMs; I believe this already holds for every text file in the repo, and it is worth enforcing.
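Since core.hooksPath is local configuration and is not propagated by cloning, each checkout would need to opt in once. A small sketch of that step, assuming the script lives at build/git-hooks/pre-commit as proposed; the helper name `enable_repo_hooks` is hypothetical:

```shell
#!/usr/bin/env bash
# Point one clone at the version-controlled hooks directory.
# A relative core.hooksPath is resolved from the directory where git
# runs the hook, i.e. the top of the working tree for pre-commit.
enable_repo_hooks() {
    local repo=${1:-.}
    chmod +x "$repo/build/git-hooks/pre-commit"   # git ignores non-executable hooks
    git -C "$repo" config core.hooksPath build/git-hooks
}
```

After `enable_repo_hooks .`, `git config --get core.hooksPath` should report build/git-hooks, and hooks in .git/hooks are no longer consulted for that clone.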

Claude's prototype:

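#!/usr/bin/env bash
# build/git-hooks/pre-commit: sanitize staged text files.
# - validate UTF-8 (abort commit on invalid bytes)
# - strip trailing whitespace
# - normalize CRLF -> LF
# - strip leading UTF-8 BOM
# Operates on the git index only; working tree is never touched.
set -euo pipefail

fail=0

while IFS= read -r -d '' file; do
    read -r mode oid _ < <(git ls-files --stage -- "$file") || true
    [[ -z "${oid:-}" ]] && continue

    # Regular files only: skip symlinks (120000) and submodules (160000),
    # whose index entries are not ordinary text blobs.
    [[ "$mode" == 100644 || "$mode" == 100755 ]] || continue

    # Binary sniff: NUL byte in the first 8 KiB of the blob.
    # Bash strings cannot hold a NUL, so $'\0' would hand grep an empty
    # pattern that matches every line; Perl does the test instead. It reads
    # the whole blob so `git cat-file` never hits SIGPIPE under pipefail.
    if git cat-file -p "$oid" |
         perl -e 'local $/; my $buf = <STDIN> // ""; exit(index(substr($buf, 0, 8192), "\0") >= 0 ? 0 : 1)'; then
        continue
    fi

    # UTF-8 validation: iconv exits nonzero at the first invalid byte.
    if ! git cat-file -p "$oid" |
         iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        printf 'pre-commit: %s contains invalid UTF-8; aborting commit.\n' "$file" >&2
        fail=1
        continue
    fi

    # Strip leading BOM, normalize CRLF, strip trailing whitespace.
    # (hash-object takes -w to write the blob; it has no --write long form.)
    new_oid=$(git cat-file -p "$oid" |
              perl -0777 -pe 's/^\xEF\xBB\xBF//; s/[ \t\r]+$//mg' |
              git hash-object -w --stdin)

    [[ "$new_oid" == "$oid" ]] && continue
    git update-index --cacheinfo "$mode,$new_oid,$file"
    printf 'pre-commit: cleaned %s\n' "$file" >&2
# --no-renames reports a renamed-and-edited file as an addition, so the
# AM filter still picks it up.
done < <(git diff --cached --name-only --no-renames --diff-filter=AM -z)

exit "$fail"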

Labels

AI: Issue which can probably be resolved by agentic LLM AI coding, rather than scarce humans.
Backend: Original content, Hakyll/Haskell/scripts, Markdown/HTML etc. Usually not CSS/JS. Owner: Gwern.
enhancement: New feature or request
good first issue: Good for newcomers
