Hi, I maintain the R bindings for cmark. One popular use case is converting CommonMark to XML for processing the AST.
We are running into a problem when the input Markdown contains control characters (often captured from a tty), which makes the XML output invalid. For example, if the Markdown text contains \033 (ESC, U+001B) and we convert that to XML, we get:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
<paragraph>
<text xml:space="preserve">�</text>
</paragraph>
</document>
However, trying to parse this with libxml2 fails:
Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html, :
PCDATA invalid Char value 27 [9]
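The rejection is not specific to libxml2: XML 1.0 only allows #x9, #xA, #xD and characters from #x20 upwards (excluding surrogates, #xFFFE and #xFFFF), so a conforming parser is expected to refuse a raw U+001B. Below is a minimal sketch in Python rather than R, using the standard library's expat-based parser instead of libxml2. The helper name `is_well_formed` is made up for illustration, and the document is a trimmed version of the output above without the XML declaration and DOCTYPE.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the stdlib (expat-based) parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Same shape as the cmark output above, with a raw ESC (U+001B) in the text node.
doc_with_esc = (
    '<document xmlns="http://commonmark.org/xml/1.0">'
    '<paragraph><text xml:space="preserve">\x1b</text></paragraph>'
    '</document>'
)
# Identical document, but with ordinary text instead of the control character.
doc_plain = doc_with_esc.replace("\x1b", "ok")

print(is_well_formed(doc_with_esc))
print(is_well_formed(doc_plain))
```

The only difference between the two documents is the single control character, which is enough to flip the result.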
A real-world example is this readme file. This was done with the gfm fork, but I think the problem is the same there.
Is this a bug in cmark, or is Markdown text not supposed to contain C0 control characters in the first place?
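Whichever way that is answered, a caller can filter the input before handing it to the converter. The following is only a sketch of such a caller-side filter, written in Python rather than R. The function name `strip_invalid_xml_chars` and the choice of U+FFFD as the default replacement are assumptions for illustration, not something cmark or the bindings provide.

```python
import re

# Matches everything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str, replacement: str = "\ufffd") -> str:
    """Replace characters that XML 1.0 cannot represent (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# ESC is replaced; tab and newline are legal in XML 1.0 and pass through.
cleaned = strip_invalid_xml_chars("before\x1b[31mafter\tend\n")
print(repr(cleaned))
```

Passing `replacement=""` drops the offending characters instead of substituting them. Note that this changes the text seen by the Markdown parser, so it is a workaround and not an answer to whether the XML renderer should handle these characters itself.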
cc @nwellnhof