Illegal control characters in XML output #365

@jeroen

Hi, I maintain the R bindings for cmark. One popular use case is converting CommonMark to XML in order to process the AST.

We are running into a problem when the input markdown contains control characters (often captured from a tty), which makes the XML output invalid. For example, if the markdown text contains \033 (ESC) and we convert it to XML, we get:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text xml:space="preserve">�</text>
  </paragraph>
</document>

However, trying to parse this with libxml2 fails:

 Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
  PCDATA invalid Char value 27 [9] 
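
For context, XML 1.0 only permits tab, LF, CR, and code points from U+0020 upward (minus surrogates, U+FFFE and U+FFFF) as character data, which is why value 27 is rejected. Below is a minimal sketch of a possible downstream workaround, written in Python rather than R. The helper name `sanitize_xml_text` and the choice of U+FFFD as the replacement are assumptions for illustration; this is not something cmark itself does.

```python
import re
import xml.etree.ElementTree as ET

# Matches anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_xml_text(text, replacement="\uFFFD"):
    """Hypothetical helper: replace characters that XML 1.0 forbids.

    The replacement character is an assumption; dropping the character
    entirely (replacement="") would be another reasonable choice.
    """
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    raw = "before\x1bafter"  # contains ESC (value 27)
    clean = sanitize_xml_text(raw)
    # Once the ESC is replaced, the document parses.
    node = ET.fromstring("<text>" + clean + "</text>")
    print(repr(node.text))
```

Applying this to the XML after rendering would not distinguish markup from text, so in practice it is safer to sanitize the markdown input, or to have the renderer handle such characters when it escapes text nodes.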

A real-world example is this readme file. This was done with the gfm fork, but I think the problem is the same in cmark.

Is this a bug in cmark, or is markdown text not supposed to contain C0 control characters in the first place?

cc @nwellnhof
