Illegal control characters in XML output #365

Hi, I maintain the R bindings for cmark. One popular use case is converting CommonMark to XML in order to process the AST.

We are running into a problem when the input markdown contains control characters (often captured from a tty), which makes the XML output invalid. For example, if the markdown text contains `\033` (ESC) and we convert it to XML, we get the following (the raw ESC byte sits inside the `<text>` element, where it is not visible):


```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text xml:space="preserve"></text>
  </paragraph>
</document>
```

However, trying to parse this with libxml2 fails: 

```
 Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
  PCDATA invalid Char value 27 [9] 
```
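
I don't think the failure is specific to libxml2: XML 1.0 allows only tab, LF and CR below U+0020, so any conforming parser should reject the document. A minimal sketch in Python (used here only for illustration, it is not part of cmark or the R bindings) that checks this with the expat-based standard library parser. The two sample strings are hypothetical reductions of the output above:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical reduced documents modelled on the cmark output above.
# The first contains a raw ESC (0x1B) as captured from a terminal.
with_esc = '<paragraph><text xml:space="preserve">\x1b[31mred</text></paragraph>'
without_esc = '<paragraph><text xml:space="preserve">red</text></paragraph>'

print(is_well_formed(with_esc))
print(is_well_formed(without_esc))
```

Only the first string should be rejected; the two differ by nothing except the ESC byte.
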
A real-world example is [this readme file](https://github.com/ropensci/rgnparser/tree/bc4703248e4a2b4d67fa962f0cbe571fef987c76#gn_debug). That was done with the gfm fork, but I think the problem is the same in cmark itself.

Is this a bug in cmark, or is markdown text not supposed to contain C0 control characters in the first place?
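
Until that is settled, one possible workaround on the caller's side is to remove the characters that XML 1.0 forbids before (or after) conversion. A rough sketch, again in Python for illustration only: the function name `strip_illegal_xml_chars` and its `replacement` parameter are made up for this example and are not a cmark or R API. The character class is the complement of the XML 1.0 `Char` production.

```python
import re

# Code points that are NOT allowed by the XML 1.0 "Char" production:
# C0 controls except tab (0x09), LF (0x0A) and CR (0x0D),
# lone surrogates, and the noncharacters U+FFFE / U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_illegal_xml_chars(text: str, replacement: str = "") -> str:
    """Replace every character that XML 1.0 cannot represent.

    Hypothetical helper for illustration; by default the character is dropped.
    """
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


# Example: an ANSI colour sequence captured from a tty.
raw = "\x1b[31mred\x1b[0m text"
print(repr(strip_illegal_xml_chars(raw)))
print(repr(strip_illegal_xml_chars(raw, "\ufffd")))
```

Dropping the character changes the text, so passing U+FFFD as the replacement may be preferable when the position of the original byte matters. Which of these (or raising an error) would be right for cmark itself is part of the question above.
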

cc @nwellnhof
