Convert HTML to DOCX. This project is a fork of the discontinued pqzx/html2docx.
pip install html-for-docx
Add HTML-formatted content to an existing .docx document
from html4docx import HtmlToDocx
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, filename_docx)
You can use python-docx to manipulate the file directly. Here is an example:
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)
document.save('your_file_name.docx')
You can also add HTML to the document incrementally and save it when finished. New content is always appended at the end:
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
for part in ['First', 'Second', 'Third']:
    parser.add_html_to_document(f'<h1>{part} Part</h1>', document)
parser.save('your_file_name.docx')
When you pass a Document object, you can save with either document.save() from python-docx or parser.save() from html4docx; both work.
Both support saving in memory, using BytesIO.
from io import BytesIO
from docx import Document
from html4docx import HtmlToDocx
buffer = BytesIO()
document = Document()
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)
# Save the document to the in-memory buffer
parser.save(buffer)
# If you need to read from the buffer again after saving,
# you might need to reset its position to the beginning
buffer.seek(0)
Convert an HTML file directly to a .docx file
from html4docx import HtmlToDocx
parser = HtmlToDocx()
parser.parse_html_file(input_html_file_path, output_docx_file_path)
# You can also specify an encoding; the default is utf-8
parser.parse_html_file(input_html_file_path, output_docx_file_path, 'utf-8')
Convert an HTML string
from html4docx import HtmlToDocx
parser = HtmlToDocx()
docx = parser.parse_html_string(input_html_file_string)
Tables are not styled by default. Set the table_style attribute on the parser before converting the HTML to choose a table style. The style is applied to all tables.
from html4docx import HtmlToDocx
parser = HtmlToDocx()
parser.table_style = 'Light Shading Accent 4'
docx = parser.parse_html_string(input_html_file_string)
To add borders to tables, use the Table Grid style:
parser.table_style = 'Table Grid'
All supported table styles can be found here.
There are 7 options you can use to customize the conversion:
- Disable Images: Ignore all images.
- Disable Tables: Render only the raw content of tables.
- Disable Styles: Ignore all CSS styles. Also disables Style-Map.
- Disable Fix-HTML: Skip the BeautifulSoup pass that repairs missing HTML tags.
- Disable Style-Map: Ignore the mapping of CSS classes to Word styles.
- Disable Tag-Override: Ignore HTML tag overrides.
- Disable HTML-Comments: Ignore all "<!-- ... -->" comments in the HTML.
This is how to disable them:
from html4docx import HtmlToDocx
parser = HtmlToDocx()
parser.options['images'] = False # Default True
parser.options['tables'] = False # Default True
parser.options['styles'] = False # Default True
parser.options['fix-html'] = False # Default True
parser.options['html-comments'] = False # Default False
parser.options['style-map'] = False # Default True
parser.options['tag-override'] = False # Default True
docx = parser.parse_html_string(input_html_file_string)
Map HTML CSS classes to Word document styles:
from html4docx import HtmlToDocx
style_map = {
    'code-block': 'Code Block',
    'numbered-heading-1': 'Heading 1 Numbered',
    'finding-critical': 'Finding Critical'
}
parser = HtmlToDocx(style_map=style_map)
parser.add_html_to_document(html, document)
Override default tag-to-style mappings:
tag_overrides = {
    'h1': 'Custom Heading 1',  # All <h1> use this style
    'pre': 'Code Block'        # All <pre> use this style
}
parser = HtmlToDocx(tag_style_overrides=tag_overrides)
Set a custom default paragraph style:
# Use 'Body' as default (default behavior)
parser = HtmlToDocx(default_paragraph_style='Body')
# Use Word's default 'Normal' style
parser = HtmlToDocx(default_paragraph_style=None)
Full support for inline CSS styles on any element:
<p style="color: red; font-size: 14pt">Red 14pt paragraph</p>
<span style="font-weight: bold; color: blue">Bold blue text</span>
Supported CSS properties:
- color
- font-size
- font-weight (bold)
- font-style (italic)
- text-decoration (underline, line-through)
- font-family
- text-align
- background-color
- border (for tables)
- vertical-align (for tables)
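To make the shape of an inline style attribute concrete, here is a minimal sketch of splitting one into individual declarations. The helper `parse_inline_style` is hypothetical and is not part of html4docx; it only illustrates the idea, including splitting on the first colon so that values containing colons survive, and detecting a trailing `!important` flag.

```python
def parse_inline_style(style):
    """Split a style attribute into {property: (value, is_important)}.

    Hypothetical helper for illustration; not the html4docx implementation.
    """
    declarations = {}
    for chunk in style.split(';'):
        # Split on the first colon only, so values such as URLs keep theirs.
        name, sep, value = chunk.partition(':')
        name, value = name.strip().lower(), value.strip()
        if not sep or not name or not value:
            continue  # skip empty or malformed declarations
        important = value.lower().endswith('!important')
        if important:
            value = value[:-len('!important')].strip()
        declarations[name] = (value, important)
    return declarations


parsed = parse_inline_style(
    'color: red !important; font-size: 14pt; '
    'background-image: url(http://example.com/a.png)'
)
```

The `background-image` declaration is there only to show a value that contains a colon; it is not in the list of supported properties.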
Proper CSS precedence with !important:
<span style="color: gray">
Gray text with <span style="color: red !important">red important</span>.
</span>
The !important flag ensures the highest priority.
Styles are applied in this order (lowest to highest priority):
- Base HTML tag styles (<b>, <em>, <code>)
- Parent span styles
- CSS class-based styles (from style_map)
- Inline CSS styles (from style attribute)
- !important inline CSS styles (highest priority)
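The priority order above can be sketched as a simple merge, assuming each source of styling has already been reduced to a dictionary of CSS properties. The function `resolve_styles` and the layer layout are assumptions made for illustration and do not describe the library's internals. In particular, the sketch treats any `!important` declaration as top priority regardless of the layer it comes from.

```python
def resolve_styles(layers):
    """Merge style layers given lowest priority first.

    Each layer maps a CSS property to (value, is_important).
    Hypothetical sketch; not the html4docx implementation.
    """
    normal, important = {}, {}
    for layer in layers:
        for prop, (value, is_important) in layer.items():
            target = important if is_important else normal
            target[prop] = value  # later layers overwrite earlier ones
    # Declarations flagged !important win over all normal declarations.
    return {**normal, **important}


layers = [
    {'font-weight': ('bold', False)},  # base HTML tag style, e.g. <b>
    {'color': ('gray', False)},        # parent span style
    {'color': ('red', True)},          # a declaration marked !important
    {'color': ('blue', False)},        # plain inline style, applied later
]
resolved = resolve_styles(layers)
```

Here `resolved['color']` is `'red'`: the plain inline `blue` overrides the parent's `gray`, but the `!important` declaration overrides both.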
You're able to read or set docx metadata:
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
parser.set_initial_attrs(document)
metadata = parser.metadata
# You can get metadata as dict
metadata_json = metadata.get_metadata()
print(metadata_json['author'])  # the author currently stored in the document
# or just print all metadata if you want
metadata.get_metadata(print_result=True)
# Set new metadata
metadata.set_metadata(author="Jane", created="2025-07-18T09:30:00")
document.save('your_file_name.docx')
You can find all available metadata attributes here.
My goal in forking and fixing/updating this package was to complete my current task at work, which involves converting HTML to DOCX. The original package lacked a few features and had some bugs, preventing me from completing the task. Instead of creating a new package from scratch, I preferred to update this one.
Fixes
- Fix table_style not working | Dfop02 from Issue
- Handle missing run for leading br tag | dashingdove from PR
- Fix base64 images | djplaner from Issue
- Handle img tag without src attribute | johnjor from PR
- Fix bug when any style has !important | Dfop02
- Fix 'style lookup by style_id is deprecated.' | Dfop02
- Fix background-color not working | Dfop02
- Fix crashes when img or bookmark is created without paragraph | Dfop02
- Fix Ordered and Unordered Lists | TaylorN15 from PR
- Fix styles only being applied to span tags | Dfop02 from Issue
- Fix style parsing when a style contains multiple colons | Dfop02
- Fix highlighting of a single word | Lynuxen
- Fix color parsing failing due to invalid colors, falling back to black. | dfop02 from Issue
New Features
- Add Width/Height style to images | maifeeulasad from PR
- Support px, cm, pt, in, rem, em, mm, pc and % units for styles | Dfop02
- Improve performance on large tables | dashingdove from PR
- Support for HTML Pagination | Evilran from PR
- Support Table style | Evilran from PR
- Support alternative encoding | HebaElwazzan from PR
- Support colors by name | Dfop02
- Support font-size given as keywords, e.g. small, medium, etc. | Dfop02
- Support for internal links (Anchor) | Dfop02
- Support for rowspan and colspan in tables | Dfop02 from Issue
- Support for 'vertical-align' in table cells | Dfop02
- Support for metadata | Dfop02
- Add support for table cell styles (border, background-color, width, height, margin) | Dfop02
- Support inline images in the same paragraph | Dfop02
- Refactor tests to be more consistent and rely less on 'human validation' | Dfop02
- Support for common CSS properties for text | Lynuxen
- Support for CSS classes to Word Styles | raithedavion
- Support for HTML tag style overrides | raithedavion
These are the ideas I'm planning to work on in the future to make this project even better:
- Add support for the <style> tag, including all classes, and ensure they are correctly applied throughout the file.
- Add support for the <link> tag to load external CSS and apply it properly across the file.
- Maximum Nesting Depth: Ordered lists support up to 3 nested levels. Any additional depth beyond level 3 will be treated as level 3.
- Counter Reset Behavior:
  - At level 1, starting a new ordered list will reset the counter.
  - At levels 2 and 3, the counter will continue from the previous item unless explicitly reset.
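The nesting and counter rules above can be modeled with a small sketch. The `ListCounters` class and its method names are hypothetical and only illustrate one reading of the rules; how an explicit reset is triggered in the library is not stated here, so `reset` stands in for it.

```python
MAX_LEVEL = 3  # deeper lists are treated as level 3


class ListCounters:
    """Track ordered-list numbers per level (illustrative only)."""

    def __init__(self):
        self.counters = {level: 0 for level in range(1, MAX_LEVEL + 1)}

    def start_list(self, depth):
        level = min(depth, MAX_LEVEL)
        if level == 1:
            self.counters[1] = 0  # a new top-level list restarts at 1
        return level              # levels 2 and 3 keep their counter

    def reset(self, depth):
        # Stand-in for an explicit reset at any level.
        self.counters[min(depth, MAX_LEVEL)] = 0

    def next_item(self, depth):
        level = min(depth, MAX_LEVEL)
        self.counters[level] += 1
        return level, self.counters[level]


counters = ListCounters()
numbers = []
counters.start_list(1)
numbers.append(counters.next_item(1))  # first top-level item
numbers.append(counters.next_item(1))  # second top-level item
counters.start_list(2)
numbers.append(counters.next_item(2))  # first nested item
counters.start_list(2)
numbers.append(counters.next_item(2))  # new nested list continues counting
counters.start_list(5)
numbers.append(counters.next_item(5))  # depth 5 is clamped to level 3
counters.start_list(1)
numbers.append(counters.next_item(1))  # new top-level list restarts
```

Each entry in `numbers` is a `(level, number)` pair, which makes the clamping to level 3 and the continued numbering at level 2 visible.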
This project is primarily designed for compatibility with Microsoft Word, but it currently works well with LibreOffice and Google Docs, based on our testing. The goal is to maintain this cross-platform harmony while continuing to implement fixes and updates.
⚠️ However, please note that Microsoft Word is the priority. Bugs or issues specific to other editors (e.g., LibreOffice or Google Docs) may be considered, but fixing them is secondary to maintaining full compatibility with Word.
This project is licensed under the MIT License - see the LICENSE file for details