-
Notifications
You must be signed in to change notification settings - Fork 207
Convert pasted HTML to Markdown in Markdown tabs #673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
mitya57
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this pull request!
I like the idea, but left some comments.
ReText/editor.py
Outdated
| QTextCursor, | ||
| QTextFormat, | ||
| QTextOption, | ||
| QTextDocument, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep this list sorted. QTextDocument should be between QTextCursor and QTextFormat.
ReText/editor.py
Outdated
| document = QTextDocument() | ||
| document.setHtml(html) | ||
| to_markdown = getattr(document, 'toMarkdown', None) | ||
| if callable(to_markdown): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this check needed?
toMarkdown seems to be available since Qt 5.14, and we are requiring Qt 6.
ReText/editor.py
Outdated
| imageText = self.getImageMarkup(url) | ||
| self.textCursor().insertText(imageText) | ||
| return | ||
| if source.hasHtml() and getattr(self, 'tab', None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.tab should be always there.
ReText/editor.py
Outdated
| super().inputMethodEvent(event) | ||
|
|
||
| def _convertHtmlToMarkdown(self, html): | ||
| if not html: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this really happen? I see you are passing source.html() to this function but only if source.hasHtml() returns True, so it should be never None?
ReText/mdx_tablength.py
Outdated
| from markdown.preprocessors import Preprocessor | ||
|
|
||
|
|
||
| class _ListIndentPreprocessor(Preprocessor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a potentially breaking change, since it changes how existing documents are interpreted.
Also, it makes ReText deviate from the Markdown specification, which mandates that “Each subsequent paragraph in a list item must be indented by either 4 spaces or one tab”.
If the problem is Qt, which uses different indentation, can we normalize it on insert instead? It also will be better performance-wise, since we normalize indentation only once, not on each render.
…ng, not to the rendering logic
|
I tried to tackle all the things you mentioned. Please review again |
ReText/converterprocess.py
Outdated
| current_markup = markup_class(job['filename']) | ||
| current_markup.requested_extensions = job['requested_extensions'] | ||
|
|
||
| current_markup.requested_extensions = job['requested_extensions'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we don't have a new extension, changes in this file are probably not needed anymore?
ReText/tab.py
Outdated
|
|
||
| PreviewDisabled, PreviewLive, PreviewNormal = range(3) | ||
|
|
||
| def get_markdown_requested_extensions(sync_scroll_enabled): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And changes in this file seem also not needed anymore?
| def test_pasteHtmlConvertedToMarkdown(self): | ||
| mimeData = QMimeData() | ||
| mimeData.setText('plain text fallback') | ||
| mimeData.setHtml('<p>Hello <strong>world</strong></p>') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a test that will check indentation normalization?
ReText/editor.py
Outdated
| markdown = self._normalize_markdown_list_indentation_text(markdown) | ||
| return markdown.rstrip('\n') | ||
|
|
||
| def _normalize_markdown_list_indentation_text(self, text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please use camelCase? I know it's not PEP 8 compliant, but I use it in ReText because PyQt uses it.
Also I think the word text in this function name is not needed?
| def _normalize_markdown_list_indentation_text(self, text): | |
| def _normalizeMarkdownListIndentation(self, text): |
|
This is the squashed diff. I hope, it fits this time. diff --git a/ReText/converterprocess.py b/ReText/converterprocess.py
index c91ec34..389e7e1 100644
--- a/ReText/converterprocess.py
+++ b/ReText/converterprocess.py
@@ -191,4 +191,3 @@ class ConverterProcess(QObject):
def stop(self):
sendObject(self.conn, {'command': 'quit'})
self.conn.close()
-
diff --git a/ReText/editor.py b/ReText/editor.py
index 8b10bac..3e6f023 100644
--- a/ReText/editor.py
+++ b/ReText/editor.py
@@ -46,6 +46,7 @@ from PyQt6.QtGui import (
QPainter,
QPalette,
QTextCursor,
+ QTextDocument,
QTextFormat,
QTextOption,
QWheelEvent,
@@ -404,11 +405,31 @@ class ReTextEdit(QTextEdit):
matchedText = matchedPrefix + str(nextNumber) + ". "
else:
matchedText = ''
+ # For Markdown, normalize nested list indentation to multiples of 4 spaces
+ # to comply with the Markdown specification for list indentation.
+ markupClass = self.tab.getActiveMarkupClass()
+ if matchedText and markupClass == MarkdownMarkup:
+ matchedText = self._normalizeListIndent(matchedText)
# Reset the cursor
cursor = self.textCursor()
cursor.insertText('\n' + matchedText)
self.ensureCursorVisible()
+ def _normalizeListIndent(self, prefix):
+ # Split prefix into indentation and the rest (e.g., " - ", " 1. ")
+ i = 0
+ while i < len(prefix) and prefix[i] in (' ', '\t'):
+ i += 1
+ indent = prefix[:i]
+ rest = prefix[i:]
+ # Only normalize when the rest looks like a list marker
+ if re.match(r'(?:[-+*]|\d+\.)\s', rest):
+ indent_len = len(indent.expandtabs(4))
+ if indent_len and (indent_len < 4 or indent_len % 4 != 0):
+ new_len = max(4, ((indent_len + 3) // 4) * 4)
+ return (' ' * new_len) + rest
+ return prefix
+
def moveLineUp(self):
self.moveLine(QTextCursor.MoveOperation.PreviousBlock)
@@ -594,6 +615,38 @@ class ReTextEdit(QTextEdit):
if event.preeditString() or event.commitString() or event.attributes():
super().inputMethodEvent(event)
+ def _convertHtmlToMarkdown(self, html):
+ document = QTextDocument()
+ document.setHtml(html)
+ markdown = document.toMarkdown()
+ markdown = self._normalizeMarkdownListIndentation(markdown)
+ return markdown.rstrip('\n')
+
+ def _normalizeMarkdownListIndentation(self, text):
+ # Normalize nested list indentation in a multi-line Markdown string.
+ # Ensures sublists are indented by multiples of 4 spaces.
+ fence_re = re.compile(r'^\s*`{3,}.*$')
+ item_re = re.compile(r'^(?P<indent>[ \t]+)(?P<marker>(?:[-+*]|\d+\.))\s+')
+ in_fence = False
+ lines = text.splitlines(True)
+ out = []
+ for line in lines:
+ if fence_re.match(line):
+ in_fence = not in_fence
+ out.append(line)
+ continue
+ if not in_fence:
+ m = item_re.match(line)
+ if m:
+ indent = m.group('indent')
+ indent_len = len(indent.expandtabs(4))
+ if indent_len and (indent_len < 4 or indent_len % 4 != 0):
+ new_len = max(4, ((indent_len + 3) // 4) * 4)
+ new_indent = ' ' * new_len
+ line = new_indent + line[len(indent):]
+ out.append(line)
+ return ''.join(out)
+
def insertFromMimeData(self, source):
if source.hasUrls():
url = source.urls()[0]
@@ -604,6 +657,13 @@ class ReTextEdit(QTextEdit):
imageText = self.getImageMarkup(url)
self.textCursor().insertText(imageText)
return
+ if source.hasHtml():
+ markupClass = self.tab.getActiveMarkupClass()
+ if markupClass == MarkdownMarkup:
+ markdown = self._convertHtmlToMarkdown(source.html())
+ if markdown:
+ self.textCursor().insertText(markdown)
+ return
return super().insertFromMimeData(source)
diff --git a/tests/test_editor.py b/tests/test_editor.py
index cbdcecf..37e8772 100644
--- a/tests/test_editor.py
+++ b/tests/test_editor.py
@@ -112,6 +112,17 @@ class TestClipboardHandling(unittest.TestCase):
self.editor.insertFromMimeData(mimeData)
self.assertTrue('pasted text' in self.editor.toPlainText())
+ def test_pasteHtmlConvertedToMarkdown(self):
+ mimeData = QMimeData()
+ mimeData.setText('plain text fallback')
+ mimeData.setHtml('<p>Hello <strong>world</strong></p>')
+ self.dummytab.markupClass = MarkdownMarkup
+
+ self.editor.insertFromMimeData(mimeData)
+
+ pasted = self.editor.toPlainText()
+ self.assertIn('Hello **world**', pasted)
+
@patch.object(ReTextEdit, 'getImageFilename', return_value='/tmp/myimage.jpg')
@patch.object(QImage, 'save')
def test_pasteImage_Markdown(self, _mock_image, _mock_editor):
@@ -291,5 +302,15 @@ class TestOrderedListMode(unittest.TestCase):
QTest.keyClicks(editor, 'World')
self.assertEqual(editor.document().toPlainText(), '1. Hello\n1. World')
+ def test_markdown_list_indentation_normalization(self):
+ editor = ReTextEdit(self)
+ editor.tab = self.DummyReTextTab()
+ editor.tab.markupClass = MarkdownMarkup
+ # Type a list item with two-space indent, then press Return.
+ QTest.keyClicks(editor, ' - Foo')
+ QTest.keyClick(editor, Qt.Key.Key_Return)
+ QTest.keyClicks(editor, 'Bar')
+ self.assertEqual(editor.document().toPlainText(), ' - Foo\n - Bar')
+
if __name__ == '__main__':
unittest.main() |
tests/test_editor.py
Outdated
| QTest.keyClicks(editor, 'World') | ||
| self.assertEqual(editor.document().toPlainText(), '1. Hello\n1. World') | ||
|
|
||
| def test_markdown_list_indentation_normalization(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the updates!
This test is good, but does it really test your _normalizeMarkdownListIndentation function? It looks like no HTML to markdown conversion is happening here, and this test is testing the editor functionality of preserving indentation instead?
If I'm right, you may keep this test and add another one. Or, maybe just add more complex content (that requires normalization) to your test_pasteHtmlConvertedToMarkdown function above?
Also, please rename this test to camelCase if you keep it.
|
Hi, can you merge and throw out the test if it'd be fine? |
|
Done, thank you again! I will look at the new PR a bit later. |
|
Oh, I saw you threw out the list normalization algo. This is needed since the html-to-markdown helper function does not produce correct markdown with nested lists. you can try to paste nested lists |
|
I kept the Isn't it enough? What HTML do I need to insert to make it broken? |
|
This is the example I used: <meta http-equiv="content-type" content="text/html; charset=utf-8"><h2 data-start="1203" data-end="1250">🔑 Warum der Wechsel zu den Offenen Brüdern?</h2>
<ol data-start="1251" data-end="2463">
<li data-start="1251" data-end="1660">
<p data-start="1254" data-end="1282"><strong data-start="1254" data-end="1280">Strenge der Exklusiven</strong></p>
<ul data-start="1286" data-end="1660">
<li data-start="1286" data-end="1551">
<p data-start="1288" data-end="1384">In Neugersdorf (wie in vielen Orten) erlebten die Leute die <strong data-start="1348" data-end="1381">exklusiven Regeln als zu hart</strong>:</p>
<ul data-start="1390" data-end="1551">
<li data-start="1390" data-end="1426">
<p data-start="1392" data-end="1426">Absonderung von „Nicht-Brüdern“.</p>
</li>
<li data-start="1432" data-end="1480">
<p data-start="1434" data-end="1480">Ausschluss wegen kleinster Lehrabweichungen.</p>
</li>
<li data-start="1486" data-end="1551">
<p data-start="1488" data-end="1551">kaum Kontakt mit Landeskirchlern, Baptisten oder Methodisten.</p>
</li>
</ul>
</li>
<li data-start="1555" data-end="1660">
<p data-start="1557" data-end="1660">Das passte schlecht in eine kleine Stadt, wo man auf Nachbarn, Kollegen und Verwandte angewiesen war.</p>
</li>
</ul>
</li>
<li data-start="1662" data-end="1955">
<p data-start="1665" data-end="1710"><strong data-start="1665" data-end="1708">Einfluss missionarisch geprägter Brüder</strong></p>
<ul data-start="1714" data-end="1955">
<li data-start="1714" data-end="1842">
<p data-start="1716" data-end="1842">Durch Kontakte zu Offenen Brüdern in Sachsen und Thüringen sah man, dass <strong data-start="1789" data-end="1831">Gemeinden auch ohne Absonderung blühen</strong> konnten.</p>
</li>
<li data-start="1846" data-end="1955">
<p data-start="1848" data-end="1955">Offene Brüder waren evangelistisch aktiv → passte besser zur pietistisch-erweckten Mentalität der Region.</p>
</li>
</ul>
</li>
<li data-start="1957" data-end="2192">
<p data-start="1960" data-end="1998"><strong data-start="1960" data-end="1996">Familien- und Gemeindespaltungen</strong></p>
<ul data-start="2002" data-end="2192">
<li data-start="2002" data-end="2083">
<p data-start="2004" data-end="2083">Die Exklusiven hatten oft <strong data-start="2030" data-end="2053">innere Zerwürfnisse</strong> (wer ist rein, wer nicht?).</p>
</li>
<li data-start="2087" data-end="2192">
<p data-start="2089" data-end="2192">In Neugersdorf führte das zu Spannungen – die <strong data-start="2135" data-end="2189">Offenen Brüder wirkten friedlicher und praktischer</strong>.</p>
</li>
</ul>
</li>
<li data-start="2194" data-end="2463">
<p data-start="2197" data-end="2223"><strong data-start="2197" data-end="2221">DDR-Zeit (nach 1945)</strong></p>
<ul data-start="2227" data-end="2463">
<li data-start="2227" data-end="2306">
<p data-start="2229" data-end="2306">Unter staatlicher Beobachtung war strikte Absonderung schwer durchzuhalten.</p>
</li>
<li data-start="2310" data-end="2463">
<p data-start="2312" data-end="2463">Gemeinden, die <strong data-start="2327" data-end="2338">offener</strong> waren, konnten leichter bestehen, weil sie besser vernetzt waren und nicht in endlosen internen Streitigkeiten erstarrten.</p>
</li>
</ul>
</li>
</ol> |
|
ok nice :) for some reasons my test during implementation didn't work out that well but nice when it works |
With this commit, I can paste formatted text now which automatically gets converted to Markdown.
This way, copy&paste from ChatGPT won't destroy document structure like ordered/unordered lists and headings