@derduher
Collaborator

Summary

Comprehensive security audit and fixes for lib/sitemap-parser.ts to protect against DoS attacks and handle untrusted XML inputs safely. This PR includes 12 security fixes (Critical → Medium priority) and adds 30 comprehensive security tests.


🔴 Critical Security Fixes

1. Fixed Data Corruption Bug

  • Issue: The dontpushCurrentLink flag was never reset to false, so every xhtml:link element after the first alternate/amphtml link was silently dropped
  • Impact: Data loss. Those links were ignored for the rest of the document
  • Fix: Reset the flag to false in the closetag handler
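
The bug pattern can be illustrated with a simplified sketch. This is not the code in lib/sitemap-parser.ts: `LinkCollector`, `onOpenTag`, and `onCloseTag` are hypothetical names, and which rel values set the flag in the real parser is an assumption here (the sketch uses amphtml only).

```typescript
// Hypothetical, simplified stand-in for the parser's xhtml:link handling.
interface LinkAttrs {
  rel: string;
  href: string;
}

class LinkCollector {
  links: string[] = [];
  private dontpushCurrentLink = false;
  private currentHref: string | null = null;

  onOpenTag(attrs: LinkAttrs): void {
    this.currentHref = attrs.href;
    // Assumption: amphtml links are stored elsewhere, so they are not
    // pushed onto the generic links list.
    if (attrs.rel === 'amphtml') {
      this.dontpushCurrentLink = true;
    }
  }

  onCloseTag(): void {
    if (!this.dontpushCurrentLink && this.currentHref !== null) {
      this.links.push(this.currentHref);
    }
    // The fix: reset per-element state so the flag cannot leak into
    // the next xhtml:link element.
    this.dontpushCurrentLink = false;
    this.currentHref = null;
  }
}
```

Without the reset in `onCloseTag`, the first amphtml link would leave the flag set and every later link would be skipped.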

2. Fixed Type Check Logic Bug

  • Issue: Incorrect validation of xhtml:link attributes (lines 122-125 checked the wrong types)
  • Impact: Malformed attributes could bypass validation
  • Fix: Added proper attribute validation using getAttrValue() helper

3. DoS Protection via Resource Limits

Added limits to prevent memory exhaustion attacks:

  • Max 50,000 URL entries per sitemap (protocol compliance)
  • Max 1,000 images per URL
  • Max 100 videos per URL
  • Max 100 links per URL
  • Max 32 tags per video
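
A minimal sketch of how such limits might be enforced when appending to an array. The numbers come from the list above; `LIMITS` and `pushWithLimit` are hypothetical names, and recording an error rather than silently dropping the item is an assumption.

```typescript
// Limits quoted from the PR description; the names are illustrative only.
const LIMITS = {
  maxUrlEntries: 50000,
  maxImagesPerUrl: 1000,
  maxVideosPerUrl: 100,
  maxLinksPerUrl: 100,
  maxTagsPerVideo: 32,
} as const;

// Appends item to target unless the limit has been reached.
// Returns true when the item was stored, false when it was rejected.
function pushWithLimit<T>(
  target: T[],
  item: T,
  max: number,
  label: string,
  errors: Error[],
): boolean {
  if (target.length >= max) {
    errors.push(new Error(`Too many ${label}: limit is ${max}`));
    return false;
  }
  target.push(item);
  return true;
}
```

Because the check runs before the push, memory use per collection stays bounded no matter how many elements the input contains.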

4. String Bombing Protection

Added length limits to prevent unbounded string concatenation:

  • Video title: 100 chars
  • Video description: 2,048 chars
  • News title: 200 chars
  • News name: 256 chars
  • Image caption/title: 512 chars
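
SAX parsers can deliver text in several chunks, so a length limit has to be applied while accumulating. The sketch below truncates at the limit; whether the real parser truncates or rejects over-length values is not stated here, and `appendBounded` is a hypothetical helper.

```typescript
// Appends chunk to current without letting the result exceed maxLength.
// Assumption: over-length input is truncated rather than rejected.
function appendBounded(current: string, chunk: string, maxLength: number): string {
  const remaining = maxLength - current.length;
  if (remaining <= 0) {
    return current; // already at the limit, ignore further text
  }
  return current + chunk.slice(0, remaining);
}

// Example with the video title limit from the list above.
const VIDEO_TITLE_MAX = 100;
let title = '';
for (const chunk of ['A'.repeat(80), 'B'.repeat(80)]) {
  title = appendBounded(title, chunk, VIDEO_TITLE_MAX);
}
```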

🟡 High Priority Fixes

5. Numeric Validation

Reject invalid numeric values to prevent NaN/Infinity injection:

  • Priority: 0-1 range validation
  • Video duration: 0-28,800 seconds (0-8 hours)
  • Video rating: 0-5 range
  • View count: Positive integers only
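
One way to express these checks, as a sketch only. `parseNumberInRange` and `parsePositiveInteger` are hypothetical helpers, and restricting input to plain decimal notation is an assumption made so that strings such as "Infinity", "0x10", or an empty string are rejected before conversion.

```typescript
const DECIMAL = /^-?\d+(\.\d+)?$/;

// Returns the parsed number when it is finite and within [min, max],
// otherwise undefined.
function parseNumberInRange(raw: string, min: number, max: number): number | undefined {
  const text = raw.trim();
  if (!DECIMAL.test(text)) {
    return undefined; // rejects "NaN", "Infinity", "", hex, exponent forms
  }
  const value = Number(text);
  if (!Number.isFinite(value) || value < min || value > max) {
    return undefined;
  }
  return value;
}

// Returns the parsed integer when it is strictly positive, otherwise undefined.
function parsePositiveInteger(raw: string): number | undefined {
  const text = raw.trim();
  if (!/^\d+$/.test(text)) {
    return undefined;
  }
  const value = Number(text);
  return Number.isSafeInteger(value) && value > 0 ? value : undefined;
}
```

With these helpers, priority would be checked as `parseNumberInRange(raw, 0, 1)`, duration as `parseNumberInRange(raw, 0, 28800)`, and rating as `parseNumberInRange(raw, 0, 5)`.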

6. URL Validation

Prevent malicious URLs:

  • Max length: 2,048 characters
  • Protocol: Must start with http:// or https://
  • Blocks: javascript:, file:, data: URLs
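
The three rules above reduce to a length check plus a prefix allowlist. A minimal sketch follows; `isAllowedUrl` is a hypothetical name, and the case-sensitive prefix match is an assumption.

```typescript
const MAX_URL_LENGTH = 2048;

// Accepts only non-empty URLs of at most 2,048 characters that start with
// http:// or https://. Anything else (javascript:, file:, data:, ...) fails
// the prefix check, so no explicit blocklist is needed.
function isAllowedUrl(raw: string): boolean {
  if (raw.length === 0 || raw.length > MAX_URL_LENGTH) {
    return false;
  }
  return raw.startsWith('http://') || raw.startsWith('https://');
}
```

An allowlist is generally the safer design here, since unknown schemes are rejected by default instead of having to be enumerated.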

7. Date Validation

Enforce ISO 8601 format for all date fields:

  • lastmod
  • video:publication_date
  • video:expiration_date
  • news:publication_date
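
A format check along these lines could be written as a regular expression. The sketch assumes the W3C Datetime profile of ISO 8601 (the profile used by the sitemap protocol for lastmod); the pattern the PR actually uses is not shown here, and `isIsoDate` is a hypothetical name.

```typescript
// W3C Datetime forms: YYYY, YYYY-MM, YYYY-MM-DD, and
// YYYY-MM-DDThh:mm[:ss[.s+]]TZD, where TZD is Z or +hh:mm / -hh:mm.
// A time component without a timezone designator is rejected.
const W3C_DATETIME =
  /^\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01])(T([01]\d|2[0-3]):[0-5]\d(:[0-5]\d(\.\d+)?)?(Z|[+-]([01]\d|2[0-3]):[0-5]\d))?)?)?$/;

// Checks the textual format only; it does not verify days per month
// (for example, 2025-02-30 passes the pattern).
function isIsoDate(raw: string): boolean {
  return W3C_DATETIME.test(raw);
}
```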

8. Enum Validation

Strict validation for news:access values (Registration | Subscription)
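
In TypeScript, this kind of strict check is commonly written as a type guard over a fixed list of values. The sketch below uses the two values named above; `NEWS_ACCESS_VALUES` and `isNewsAccess` are hypothetical names, and the case-sensitive comparison is an assumption.

```typescript
const NEWS_ACCESS_VALUES = ['Registration', 'Subscription'] as const;
type NewsAccess = (typeof NEWS_ACCESS_VALUES)[number];

// Type guard: narrows an arbitrary string to the NewsAccess union
// only when it exactly matches one of the allowed values.
function isNewsAccess(value: string): value is NewsAccess {
  return (NEWS_ACCESS_VALUES as readonly string[]).indexOf(value) !== -1;
}
```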


🔧 Error Handling Improvements

9. Multiple Error Collection

  • Before: Only first error stored in this.error
  • After: All errors collected in this.errors[] array
  • Benefit: Comprehensive error reporting for debugging
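
The change from a single stored error to a list can be sketched as follows. `ErrorCollectingParser`, `report`, and `finish` are hypothetical names used only for illustration; the real class is XMLToSitemapItemStream, and only its errors array is taken from this PR.

```typescript
// Simplified stand-in showing the collection pattern, not the real parser.
class ErrorCollectingParser {
  readonly errors: Error[] = [];

  // Every validation failure is appended instead of overwriting a single slot.
  report(message: string): void {
    this.errors.push(new Error(message));
  }

  // Assumption: in a strict mode, the first collected error is thrown at the end.
  finish(throwOnError: boolean): void {
    const first = this.errors[0];
    if (throwOnError && first !== undefined) {
      throw first;
    }
  }
}
```

Callers that previously inspected one error can read the first element of the array, while tooling that wants a full report can iterate over all of them.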

⚠️ Breaking Changes

Removed XMLToSitemapItemStream.error Property

  • Old: parser.error (single Error | null)
  • New: parser.errors (Error[])
  • Migration: Replace parser.error with parser.errors[0] or parser.errors
  • Reason: Support comprehensive error collection; simpler API

✅ Test Coverage

  • Added: 30 comprehensive security tests
  • Total tests: 207 (all passing)
  • Coverage:
    • Lines: 90.37%
    • Statements: 90.23%
    • Branches: 84.13%

Test Categories

  • ✅ URL validation (oversized, malicious protocols)
  • ✅ Resource limits (images, videos, links, tags)
  • ✅ String length limits (title, description, caption)
  • ✅ Numeric validation (NaN, Infinity, range checks)
  • ✅ Date validation (ISO 8601 format)
  • ✅ Enum validation (news:access)
  • ✅ Attribute handling (video player_loc, price, etc.)
  • ✅ Bug fix verification (dontpushCurrentLink)
  • ✅ Error collection (multiple errors)

📊 Security Impact

| Attack Vector | Before | After |
| --- | --- | --- |
| Billion Laughs (entity expansion) | ⚠️ Partial protection | ⚠️ Partial (SAX limitation) |
| Memory exhaustion (unbounded arrays) | ❌ Vulnerable | ✅ Protected (limits enforced) |
| String bombing (unbounded concat) | ❌ Vulnerable | ✅ Protected (length limits) |
| URL injection (javascript:, file:) | ❌ No validation | ✅ Protected (protocol check) |
| Invalid data (NaN, Infinity) | ⚠️ Accepted | ✅ Rejected |
| Protocol violations (>50k URLs) | ⚠️ No limit | ✅ Enforced |

📝 Notes

SAX Parser Limitations

The underlying SAX parser (sax@1.4.1) has limited built-in protection against XXE/entity expansion attacks. Application-level validation has been added, but complete protection also depends on:

  1. Input size limits (handled by the 50MB constraint)
  2. Resource limits (✅ implemented)
  3. Upgrading SAX or switching to an alternative parser (to be considered in future)

Coverage Note

The uncovered 9.77% of statements consists primarily of defensive error handlers for malformed SAX attribute objects. These edge cases are difficult to trigger, but the handlers are kept as defensive programming.


🔍 Files Changed

  • lib/sitemap-parser.ts - Security validation & bug fixes
  • tests/sitemap-parser-security.test.ts - New comprehensive security test suite

🤝 Migration Guide

// Before (v8.x)
const parser = new XMLToSitemapItemStream();
// ... parse XML
if (parser.error) {
  console.error('First error:', parser.error);
}

// After (v9.x)
const parser = new XMLToSitemapItemStream();
// ... parse XML
if (parser.errors.length > 0) {
  console.error('Errors found:', parser.errors);
  // or just first error:
  console.error('First error:', parser.errors[0]);
}

🤖 Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com

## Critical Security Fixes

- Fix critical logic bug in dontpushCurrentLink flag that caused data loss
- Fix incorrect type check for xhtml:link attributes
- Add validation limits to prevent DoS attacks via resource exhaustion
- Remove legacy error property (breaking change - use errors array)

## Validation Added

### Resource Limits
- Max 50,000 URL entries per sitemap (protocol compliance)
- Max 1,000 images per URL
- Max 100 videos per URL
- Max 100 links per URL
- Max 32 tags per video

### String Length Limits
- Video title: 100 chars
- Video description: 2,048 chars
- News title: 200 chars
- News name: 256 chars
- Image caption/title: 512 chars

### Input Validation
- URL format validation (http/https only, max 2,048 chars)
- Numeric validation (reject NaN, Infinity, enforce ranges)
- Date validation (ISO 8601 format)
- Enum validation (news:access values)

## Error Handling Improvements

- Collect all errors in errors[] array instead of just first error
- Enhanced error messages with context
- Support for comprehensive error reporting

## Test Coverage

- Added 30 comprehensive security tests
- All 207 tests passing
- Coverage: 90.37% lines, 90.23% statements, 84.13% branches
- Tests cover: URL validation, resource limits, string limits,
  numeric validation, date validation, enum validation, attribute
  handling, and bug fixes

## Breaking Changes

- Removed XMLToSitemapItemStream.error property
- Use XMLToSitemapItemStream.errors array instead
- ErrorLevel.THROW now throws first error from errors array

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@derduher derduher merged commit 55f65d5 into master Oct 12, 2025
6 checks passed
@derduher derduher deleted the security/sitemap-parser-validation-fixes branch October 12, 2025 23:31