@mrdb mrdb commented Mar 17, 2025

Claude 3.7 Sonnet (I apologise if I am doing this incorrectly; I am new to GitHub and development and am learning as I go):

Based on my analysis, I've found the root cause of your issue with Buildware. The error "invalid byte sequence for encoding UTF8: 0x00" occurs when the application tries to store binary data with null bytes (0x00) in a PostgreSQL text field.

Here's the issue and solution:

Problem Diagnosis

  1. The error happens during the embedBranch function when updating the embedded_files table in your Supabase database.

  2. Specifically, in actions/github/fetch-files.ts, every fetched file is decoded as UTF-8 text, whether or not it is actually a text file:

    content: Buffer.from(data.content, "base64").toString("utf-8")

  3. While there's some filtering to exclude common image files, there are likely other binary files in your repository that contain null bytes (0x00). PostgreSQL text columns cannot store null bytes, so the insert fails.

  4. The current file exclusion list is incomplete and doesn't catch all binary file types.

Solution

I recommend improving the binary file detection and handling in the fetch-files.ts file:
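The modified file itself is not shown in this comment, so the following is only a minimal sketch of the kind of check described, not the actual change. The helper names `hasBinaryExtension` and `decodeTextContent` and the extension list are illustrative assumptions, not code from the Buildware repository. The idea is to skip files by extension first, then scan the decoded bytes for 0x00 before converting them to a string.

```typescript
// Hypothetical extension list; extend it to match the repository's contents.
const BINARY_EXTENSIONS = new Set([
  ".png", ".jpg", ".jpeg", ".gif", ".ico", ".pdf",
  ".zip", ".gz", ".woff", ".woff2", ".ttf", ".exe", ".bin"
]);

// Hypothetical helper: true when the file name ends in a known binary extension.
function hasBinaryExtension(path: string): boolean {
  const fileName = path.slice(path.lastIndexOf("/") + 1);
  const dot = fileName.lastIndexOf(".");
  if (dot === -1) return false;
  return BINARY_EXTENSIONS.has(fileName.slice(dot).toLowerCase());
}

// Hypothetical helper: decodes base64 content, returning null when the
// bytes contain 0x00, since such content cannot go into a text column.
function decodeTextContent(base64Content: string): string | null {
  const bytes = Buffer.from(base64Content, "base64");
  if (bytes.includes(0)) return null;
  return bytes.toString("utf-8");
}
```

A caller could drop any file for which `hasBinaryExtension` returns true or `decodeTextContent` returns null, so that only text content reaches the database.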

Additionally, I recommend modifying the embedFiles function to handle empty file arrays and provide better error handling:
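The modified embedFiles function is likewise not shown here. As a rough sketch of the behaviour described (return early on an empty array, and skip a failing file instead of aborting the whole run), something like the following could work. The name `embedFilesSafely`, the `FetchedFile` and `EmbedResult` types, and the `embedOne` callback are all hypothetical; the real embedFiles signature may differ.

```typescript
// Hypothetical shapes, not taken from the Buildware code.
interface FetchedFile {
  path: string;
  content: string;
}

interface EmbedResult {
  embedded: string[];
  skipped: { path: string; reason: string }[];
}

// Embeds files one at a time; a failure on one file is recorded and skipped.
async function embedFilesSafely(
  files: FetchedFile[],
  embedOne: (file: FetchedFile) => Promise<void>
): Promise<EmbedResult> {
  const result: EmbedResult = { embedded: [], skipped: [] };

  // Nothing to do for an empty input; avoid issuing an empty database write.
  if (files.length === 0) return result;

  for (const file of files) {
    try {
      await embedOne(file);
      result.embedded.push(file.path);
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      result.skipped.push({ path: file.path, reason });
    }
  }
  return result;
}
```

Returning the skipped paths together with their reasons also gives a natural place to hook in logging later.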

Implementation Steps

  1. Replace the existing actions/github/fetch-files.ts with the modified version I provided.

  2. Replace the existing actions/github/embed-files.ts with the modified version I provided.

  3. Restart your application to apply these changes.

Explanation

The primary issue is that binary files with null bytes (0x00) can't be stored in PostgreSQL UTF-8 text fields. My solution:

  1. Enhanced Binary File Detection: Added more thorough binary file type detection based on file extensions.

  2. Null Byte Scanning: Added explicit checks for null bytes in file content before attempting to store them.

  3. Error Handling: Improved error handling to skip problematic files rather than failing the entire operation.

  4. Content Sanitization: Added content sanitization as a fallback to remove any null bytes that might have slipped through.
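The sanitization fallback in step 4 can be illustrated with a very small sketch. The helper name `stripNullBytes` is an assumption made for illustration, not a function from the codebase.

```typescript
// Hypothetical helper: removes U+0000 characters, which PostgreSQL
// text columns reject, and leaves every other character untouched.
function stripNullBytes(content: string): string {
  return content.replace(/\u0000/g, "");
}

// Example: a string with an embedded null character.
const cleaned = stripNullBytes("a\u0000b"); // "ab"
```

Because stripping changes the stored content, this is best kept as a last resort behind the detection steps rather than as the main defence.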

This approach should prevent the UTF-8 encoding errors while still allowing the system to process and embed the relevant text files in your codebase.

Additional Recommendations

  1. Consider using a BYTEA column type in PostgreSQL to store binary content if you need to analyze binary files in the future.

  2. You might want to add a database migration to handle any existing corrupted records.

  3. Consider adding a logging system to track which files are being skipped to help debug any future issues.

Would you like me to explain any part of the solution in more detail?
