Fix List Formatting #16

taylorn-ai · 2025-06-02T06:42:27Z

Description

This PR enhances the handling of HTML lists during DOCX generation. Key improvements include:

✅ List Style Improvements

Introduced support for depth-based list styles:
- ul, ul2, ul3 → List Bullet, List Bullet 2, List Bullet 3
- ol, ol2, ol3 → List Number, List Number 2, List Number 3
Automatically applies correct styles based on nesting depth (capped at 3 levels).
Defaults to List Bullet when no wrapping tag is present.

🛠 Bug Fixes & Edge Case Handling

Fixes orphaned <li> tags (those not wrapped in <ul> or <ol>) by:
- Wrapping them in a new <ul> before parsing.
Prevents incorrect blank paragraphs when block-level tags (e.g. <p>) appear inside <li>.

🧹 Code Cleanup

Removed unused imports: base64, urllib, BytesIO, Inches.
Refactored handle_li() for improved clarity and maintainability.
Added in_li flag to track context inside list items.

Checklist Before Requesting a Review

I have performed a self-review of my code.
My code follows the project's coding style and guidelines.
I have run tests and verified that all existing and new tests pass.
I have added new tests to cover my changes.

- 🐛 Fix orphaned <li> tags by wrapping them in <ul> or <ol> if no parent is found - 🎨 Update list styles to support nested lists with specific styles for depth - ✏️ Modify paragraph handling to ensure proper formatting within list items - 📄 Add new test cases for nested list items in HTML input

- 🧹 Remove unused imports for base64, urllib, and BytesIO

dfop02 · 2025-06-02T17:01:25Z

Hello @TaylorN15 ! I'll take a look and review your PR soon, thanks for your support!

html4docx/h4d.py

dfop02 · 2025-06-04T20:44:42Z

Hello @TaylorN15 ! I did a review over your PR, unfortunately the same issue about ordered list counter still happens, but I like your approach, maybe you could find a way to solve this issue.

taylorn-ai · 2025-06-05T00:43:55Z

Hello @TaylorN15 ! I did a review over your PR, unfortunately the same issue about ordered list counter still happens, but I like your approach, maybe you could find a way to solve this issue.

I just tested with your example HTML, and the numbering looks different

I have a fix for restarting the list numbering, it's a bit hacky as you need to inject a numId into the XML

def restart_numbering(paragraph, num_id=1000):
    """Forces Word to treat this paragraph as the start of a new numbered list."""
    p = paragraph._p
    pPr = p.get_or_add_pPr()

    numPr = OxmlElement('w:numPr')

    numId = OxmlElement('w:numId')
    numId.set(qn('w:val'), str(num_id))
    numPr.append(numId)

    ilvl = OxmlElement('w:ilvl')
    ilvl.set(qn('w:val'), '0')  # top-level list
    numPr.append(ilvl)

    pPr.append(numPr)

This solves the numbering issue:

There is still an issue with the nested lists, because one of them is wrapped inside <li> it adds the new number (second list #3), then begins the new <ol>. I can fix this also by checking if the list only contains a child list and not creating a new list item, but that's getting a bit fiddly. Technically the HTML is valid, but it's a bit odd.

<ol>
  <li>first list, first level, item 1</li>
  <li>first list, first level, item 2</li>
  <li>first list, first level, item 3</li>
</ol>

<ol>
  <li>second list, first level, item 1</li>
  <li>
    second list, first level, item 2
    <ol>
      <li>second list, second level, item 1</li>
      <li>second list, second level, item 2</li>
      <li>
        <ol>
          <li>second list, third level, item 1</li>
          <li>second list, third level, item 2</li>
        </ol>
      </li>
      <li>
        <ul>
          <li>third level, unsorted list, item 1</li>
          <li>third level, unsorted list, item 2</li>
        </ul>
      </li>
    </ol>
  </li>
</ol>

If I change to the below, it works as expected:

<ol>
  <li>first list, first level, item 1</li>
  <li>first list, first level, item 2</li>
  <li>first list, first level, item 3</li>
</ol>

<ol>
  <li>second list, first level, item 1</li>
  <li>
    second list, first level, item 2
    <ol>
      <li>second list, second level, item 1</li>
      <li>second list, second level, item 2</li>

      <ol>
        <li>second list, third level, item 1</li>
        <li>second list, third level, item 2</li>
      </ol>

      <ul>
        <li>third level, unsorted list, item 1</li>
        <li>third level, unsorted list, item 2</li>
      </ul>
    </ol>
  </li>
</ol>

What do you think?

- 🐛 Fix list numbering for nested ordered lists - ✨ Add support for multiple levels of ordered and unordered lists - 🔧 Refactor list handling logic for improved clarity and functionality - 📜 Update test HTML to validate new list features

dfop02 · 2025-06-06T19:18:16Z

Hello @TaylorN15 ! I did a review over your PR, unfortunately the same issue about ordered list counter still happens, but I like your approach, maybe you could find a way to solve this issue.

I just tested with your example HTML, and the numbering looks different

I have a fix for restarting the list numbering, it's a bit hacky as you need to inject a numId into the XML

def restart_numbering(paragraph, num_id=1000):
    """Forces Word to treat this paragraph as the start of a new numbered list."""
    p = paragraph._p
    pPr = p.get_or_add_pPr()

    numPr = OxmlElement('w:numPr')

    numId = OxmlElement('w:numId')
    numId.set(qn('w:val'), str(num_id))
    numPr.append(numId)

    ilvl = OxmlElement('w:ilvl')
    ilvl.set(qn('w:val'), '0')  # top-level list
    numPr.append(ilvl)

    pPr.append(numPr)

This solves the numbering issue:

There is still an issue with the nested lists, because one of them is wrapped inside <li> it adds the new number (second list #3), then begins the new <ol>. I can fix this also by checking if the list only contains a child list and not creating a new list item, but that's getting a bit fiddly. Technically the HTML is valid, but it's a bit odd.

<ol>
  <li>first list, first level, item 1</li>
  <li>first list, first level, item 2</li>
  <li>first list, first level, item 3</li>
</ol>

<ol>
  <li>second list, first level, item 1</li>
  <li>
    second list, first level, item 2
    <ol>
      <li>second list, second level, item 1</li>
      <li>second list, second level, item 2</li>
      <li>
        <ol>
          <li>second list, third level, item 1</li>
          <li>second list, third level, item 2</li>
        </ol>
      </li>
      <li>
        <ul>
          <li>third level, unsorted list, item 1</li>
          <li>third level, unsorted list, item 2</li>
        </ul>
      </li>
    </ol>
  </li>
</ol>

If I change to the below, it works as expected:

<ol>
  <li>first list, first level, item 1</li>
  <li>first list, first level, item 2</li>
  <li>first list, first level, item 3</li>
</ol>

<ol>
  <li>second list, first level, item 1</li>
  <li>
    second list, first level, item 2
    <ol>
      <li>second list, second level, item 1</li>
      <li>second list, second level, item 2</li>

      <ol>
        <li>second list, third level, item 1</li>
        <li>second list, third level, item 2</li>
      </ol>

      <ul>
        <li>third level, unsorted list, item 1</li>
        <li>third level, unsorted list, item 2</li>
      </ul>
    </ol>
  </li>
</ol>

What do you think?

Ok, so I made some tests here to make sure the issues I have found.

First, I still have problems on order count... but maybe because I'm using LibreOffice for my tests, but I don't have a Word here to validate this.

I also tested on Google Docs, and was even worst than before

I believe there is something we could fix by understanding the counting order and try to "hack fix" this context, what you think? Try to reproduce on other editors like Google Docs to see what happens.

taylorn-ai · 2025-06-06T20:11:54Z

That’s odd, however, python-docx was designed to work specifically with Microsoft Word documents, making it work with all other applications will be difficult.

The numbered list order seems to be working fine in my Word documents, as it uses OXmlElement, I don’t know if LibreOffice and Google Docs work 100% with OpenXML based documents.

I can have a look into this a bit further, and I’ll see what I can do. But ultimately, as I said, it was designed to work with MS Word only.

dfop02 · 2025-06-10T14:47:12Z

That’s odd, however, python-docx was designed to work specifically with Microsoft Word documents, making it work with all other applications will be difficult.

The numbered list order seems to be working fine in my Word documents, as it uses OXmlElement, I don’t know if LibreOffice and Google Docs work 100% with OpenXML based documents.

I can have a look into this a bit further, and I’ll see what I can do. But ultimately, as I said, it was designed to work with MS Word only.

Yes, I agree that python-docx was designed specifically for Microsoft Word, but currently, there's good harmony between the applications, with most features working quite similarly. So, I’d like to keep it that way if possible, especially since I'm using LibreOffice for my own tests. If it’s not possible, then we can proceed with a Word-focused solution.

On that note, I ran a few new tests using your branch as inspiration and managed to get it working on LibreOffice as well, aside from one small issue (which you might be able to help me with).

As you can see, in my version it's working fine in most cases. However, when there's a restart on a nested list, I'm still struggling to get it to reset properly. The restart counter works correctly only at the base level. I tried fixing it, but so far without success. You might have some new ideas that could help me find a solution. I feel like we’re close to solving it completely.

I created a new branch called bug/fix-list-formatting-v2, check the commit here, take a look on my approach, that basically is a merge using your ideas and this StackOverflow Answer as reference.

I would appreciate it if you could test my approach on Microsoft Word to confirm that it’s still working well, and let me know if you have any ideas to fix the remaining issue.

taylorn-ai · 2025-06-10T20:25:38Z

Just to confirm, you're expecting 3. second list, third level, item 1 and 2 to restart? So instead of 1, 2, *, *, 3, 4, it should be 1, 2, *, *, 1, 2? I would imagine we just need to restart the numbering again if we detect a new ol? I will have a look.

taylorn-ai · 2025-06-10T23:15:08Z

OK I have tried so many things, from what i can tell, it's some issue with the abstractNumId. I did get the numbering working, but then all the indentation was broken. Honestly, I think we just leave it here as "good enough" :)

dfop02 · 2025-06-11T13:32:56Z

OK I have tried so many things, from what i can tell, it's some issue with the abstractNumId. I did get the numbering working, but then all the indentation was broken. Honestly, I think we just leave it here as "good enough" :)

Okay, I agree! I may include this as a known issue at least, before release new version I'll update docs with it.
Could you please update your branch with v2 version and include some unit test for this on test/test.py?
I'll take a look again and merge if it's all fine! :)

- 🧪 Implement test for ordered list to verify item presence - 🧪 Implement test for unordered list to verify item presence

- 🐛 Ensure 'href' is checked for None before processing links - 🔗 Update logic to handle cases where 'href' is not provided

taylorn-ai · 2025-06-12T21:51:01Z

I've added a couple of tests, and also a small fix for anchor/hrefs that I noticed was causing issues for me.

dfop02 · 2025-06-13T11:33:42Z

OK I have tried so many things, from what i can tell, it's some issue with the abstractNumId. I did get the numbering working, but then all the indentation was broken. Honestly, I think we just leave it here as "good enough" :)

@TaylorN15 I may have misunderstood. Do you have tested my branch and validate that works fine on Word, but you couldn't fix the issue I explained about, correct?

If so, do you mind updating your branch with my branch bug/fix-formatting-list-v2 because your branch currently only works for Word and mine works on Word, LibrreOffice, and Google Docs. That's what I meant with "update with v2", let me know if it's fine for you.

Also, thanks for find and fix about anchor/link, I appreciate it 🙏

taylorn-ai · 2025-06-13T11:35:40Z

I pulled your branch and added the unit tests but didn't merge back into mine. My bad.

tests/test.py

tests/assets/htmls/text1.html

dfop02 · 2025-06-16T18:04:44Z

@TaylorN15 today I checked your PR, just few adjustments I suggested to you, let me know when you update it!

- 🐛 Remove orphaned <li> tag handling to streamline list processing - ✅ Updated tests for ordered and unordered list scenarios to ensure proper conversion - ✨ Enhance test cases for better coverage of list items and styles

taylorn-ai · 2025-06-16T21:09:59Z

@TaylorN15 today I checked your PR, just few adjustments I suggested to you, let me know when you update it!

All done :)

dfop02 · 2025-06-16T22:37:34Z

Nice @TaylorN15 ! I'll merge it now and release the official version until the end of this week, because I still need update the docs, thanks for your help and support! If you find something else to fix/update let me know :)

taylorn-ai · 2025-06-17T00:46:30Z

Just realised I forgot to run flake8 afterward and there are some unused imports! Normally I use isort & black, but if i ran that over this codebase it would change too much :/

dfop02 · 2025-06-17T12:13:54Z

No worries, I'll update and let everything working fine before launch the new version, thx for let me know :)

Taylor added 2 commits June 2, 2025 16:33

Refactor imports in HTML to DOCX conversion

c3a52a1

- 🧹 Remove unused imports for base64, urllib, and BytesIO

dfop02 self-requested a review June 2, 2025 16:59

dfop02 self-assigned this Jun 2, 2025

dfop02 added bug Something isn't working New Feature New feature or request good first issue Good for newcomers labels Jun 2, 2025

dfop02 changed the base branch from main to release/1.0.7 June 4, 2025 19:52

dfop02 reviewed Jun 4, 2025

View reviewed changes

html4docx/h4d.py Outdated Show resolved Hide resolved

dfop02 reviewed Jun 4, 2025

View reviewed changes

html4docx/h4d.py Outdated Show resolved Hide resolved

dfop02 reviewed Jun 4, 2025

View reviewed changes

html4docx/h4d.py Outdated Show resolved Hide resolved

My approach to fix numeric list

525fc84

Taylor added 2 commits June 13, 2025 07:49

Add tests for ordered and unordered list handling

6d8422d

- 🧪 Implement test for ordered list to verify item presence - 🧪 Implement test for unordered list to verify item presence

Fix link handling to prevent errors with missing href

d8884ff

- 🐛 Ensure 'href' is checked for None before processing links - 🔗 Update logic to handle cases where 'href' is not provided

dfop02 reviewed Jun 16, 2025

View reviewed changes

tests/test.py Outdated Show resolved Hide resolved

dfop02 reviewed Jun 16, 2025

View reviewed changes

tests/test.py Show resolved Hide resolved

dfop02 reviewed Jun 16, 2025

View reviewed changes

tests/assets/htmls/text1.html Outdated Show resolved Hide resolved

Refactor list handling in HTML to DOCX conversion

4fba4b4

- 🐛 Remove orphaned <li> tag handling to streamline list processing - ✅ Updated tests for ordered and unordered list scenarios to ensure proper conversion - ✨ Enhance test cases for better coverage of list items and styles

dfop02 approved these changes Jun 16, 2025

View reviewed changes

dfop02 merged commit f5213c2 into dfop02:release/1.0.7 Jun 16, 2025
5 checks passed

dfop02 mentioned this pull request Jun 17, 2025

Release/1.0.7 #17

Merged

taylorn-ai deleted the bug/fix-list-formatting branch June 17, 2025 20:58

Uh oh!

Fix List Formatting #16

Fix List Formatting #16

Uh oh!

Conversation

taylorn-ai commented Jun 2, 2025

Description

✅ List Style Improvements

🛠 Bug Fixes & Edge Case Handling

🧹 Code Cleanup

Checklist Before Requesting a Review

Uh oh!

dfop02 commented Jun 2, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dfop02 commented Jun 4, 2025

Uh oh!

taylorn-ai commented Jun 5, 2025

Uh oh!

dfop02 commented Jun 6, 2025

Uh oh!

taylorn-ai commented Jun 6, 2025

Uh oh!

dfop02 commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taylorn-ai commented Jun 10, 2025

Uh oh!

taylorn-ai commented Jun 10, 2025

Uh oh!

dfop02 commented Jun 11, 2025

Uh oh!

taylorn-ai commented Jun 12, 2025

Uh oh!

dfop02 commented Jun 13, 2025

Uh oh!

taylorn-ai commented Jun 13, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dfop02 commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taylorn-ai commented Jun 16, 2025

Uh oh!

dfop02 commented Jun 16, 2025

Uh oh!

Uh oh!

taylorn-ai commented Jun 17, 2025

Uh oh!

dfop02 commented Jun 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dfop02 commented Jun 10, 2025 •

edited

Loading

dfop02 commented Jun 16, 2025 •

edited

Loading