Skip to content

Conversation

@taylorn-ai
Copy link
Contributor

Description

This PR enhances the handling of HTML lists during DOCX generation. Key improvements include:

✅ List Style Improvements

  • Introduced support for depth-based list styles:
    • ul, ul2, ul3List Bullet, List Bullet 2, List Bullet 3
    • ol, ol2, ol3List Number, List Number 2, List Number 3
  • Automatically applies correct styles based on nesting depth (capped at 3 levels).
  • Defaults to List Bullet when no wrapping tag is present.

🛠 Bug Fixes & Edge Case Handling

  • Fixes orphaned <li> tags (those not wrapped in <ul> or <ol>) by:
    • Wrapping them in a new <ul> before parsing.
  • Prevents incorrect blank paragraphs when block-level tags (e.g. <p>) appear inside <li>.

🧹 Code Cleanup

  • Removed unused imports: base64, urllib, BytesIO, Inches.
  • Refactored handle_li() for improved clarity and maintainability.
  • Added in_li flag to track context inside list items.

Checklist Before Requesting a Review

  • I have performed a self-review of my code.
  • My code follows the project's coding style and guidelines.
  • I have run tests and verified that all existing and new tests pass.
  • I have added new tests to cover my changes.

Taylor added 2 commits June 2, 2025 16:33
- 🐛 Fix orphaned <li> tags by wrapping them in <ul> or <ol> if no parent is found
- 🎨 Update list styles to support nested lists with specific styles for depth
- ✏️ Modify paragraph handling to ensure proper formatting within list items
- 📄 Add new test cases for nested list items in HTML input
- 🧹 Remove unused imports for base64, urllib, and BytesIO
@dfop02 dfop02 self-requested a review June 2, 2025 16:59
@dfop02 dfop02 self-assigned this Jun 2, 2025
@dfop02 dfop02 added bug Something isn't working New Feature New feature or request good first issue Good for newcomers labels Jun 2, 2025
@dfop02
Copy link
Owner

dfop02 commented Jun 2, 2025

Hello @TaylorN15 ! I'll take a look and review your PR soon, thanks for your support!

@dfop02 dfop02 changed the base branch from main to release/1.0.7 June 4, 2025 19:52
@dfop02
Copy link
Owner

dfop02 commented Jun 4, 2025

Hello @TaylorN15 ! I did a review over your PR, unfortunately the same issue about ordered list counter still happens, but I like your approach, maybe you could find a way to solve this issue.
Screenshot 2025-06-04 at 17 44 23

@taylorn-ai
Copy link
Contributor Author

Hello @TaylorN15 ! I did a review over your PR, unfortunately the same issue about ordered list counter still happens, but I like your approach, maybe you could find a way to solve this issue.

I just tested with your example HTML, and the numbering looks different
image

I have a fix for restarting the list numbering, it's a bit hacky as you need to inject a numId into the XML

def restart_numbering(paragraph, num_id=1000):
    """Forces Word to treat this paragraph as the start of a new numbered list."""
    p = paragraph._p
    pPr = p.get_or_add_pPr()

    numPr = OxmlElement('w:numPr')

    numId = OxmlElement('w:numId')
    numId.set(qn('w:val'), str(num_id))
    numPr.append(numId)

    ilvl = OxmlElement('w:ilvl')
    ilvl.set(qn('w:val'), '0')  # top-level list
    numPr.append(ilvl)

    pPr.append(numPr)

This solves the numbering issue:
image

There is still an issue with the nested lists, because one of them is wrapped inside <li> it adds the new number (second list #3), then begins the new <ol>. I can fix this also by checking if the list only contains a child list and not creating a new list item, but that's getting a bit fiddly. Technically the HTML is valid, but it's a bit odd.

<ol>
  <li>first list, first level, item 1</li>
  <li>first list, first level, item 2</li>
  <li>first list, first level, item 3</li>
</ol>

<ol>
  <li>second list, first level, item 1</li>
  <li>
    second list, first level, item 2
    <ol>
      <li>second list, second level, item 1</li>
      <li>second list, second level, item 2</li>
      <li>
        <ol>
          <li>second list, third level, item 1</li>
          <li>second list, third level, item 2</li>
        </ol>
      </li>
      <li>
        <ul>
          <li>third level, unsorted list, item 1</li>
          <li>third level, unsorted list, item 2</li>
        </ul>
      </li>
    </ol>
  </li>
</ol>

If I change to the below, it works as expected:

<ol>
  <li>first list, first level, item 1</li>
  <li>first list, first level, item 2</li>
  <li>first list, first level, item 3</li>
</ol>

<ol>
  <li>second list, first level, item 1</li>
  <li>
    second list, first level, item 2
    <ol>
      <li>second list, second level, item 1</li>
      <li>second list, second level, item 2</li>

      <ol>
        <li>second list, third level, item 1</li>
        <li>second list, third level, item 2</li>
      </ol>

      <ul>
        <li>third level, unsorted list, item 1</li>
        <li>third level, unsorted list, item 2</li>
      </ul>
    </ol>
  </li>
</ol>

What do you think?

- 🐛 Fix list numbering for nested ordered lists
- ✨ Add support for multiple levels of ordered and unordered lists
- 🔧 Refactor list handling logic for improved clarity and functionality
- 📜 Update test HTML to validate new list features
@dfop02
Copy link
Owner

dfop02 commented Jun 6, 2025

Hello @TaylorN15 ! I did a review over your PR, unfortunately the same issue about ordered list counter still happens, but I like your approach, maybe you could find a way to solve this issue.

I just tested with your example HTML, and the numbering looks different image

I have a fix for restarting the list numbering, it's a bit hacky as you need to inject a numId into the XML

def restart_numbering(paragraph, num_id=1000):
    """Forces Word to treat this paragraph as the start of a new numbered list."""
    p = paragraph._p
    pPr = p.get_or_add_pPr()

    numPr = OxmlElement('w:numPr')

    numId = OxmlElement('w:numId')
    numId.set(qn('w:val'), str(num_id))
    numPr.append(numId)

    ilvl = OxmlElement('w:ilvl')
    ilvl.set(qn('w:val'), '0')  # top-level list
    numPr.append(ilvl)

    pPr.append(numPr)

This solves the numbering issue: image

There is still an issue with the nested lists, because one of them is wrapped inside <li> it adds the new number (second list #3), then begins the new <ol>. I can fix this also by checking if the list only contains a child list and not creating a new list item, but that's getting a bit fiddly. Technically the HTML is valid, but it's a bit odd.

<ol>
  <li>first list, first level, item 1</li>
  <li>first list, first level, item 2</li>
  <li>first list, first level, item 3</li>
</ol>

<ol>
  <li>second list, first level, item 1</li>
  <li>
    second list, first level, item 2
    <ol>
      <li>second list, second level, item 1</li>
      <li>second list, second level, item 2</li>
      <li>
        <ol>
          <li>second list, third level, item 1</li>
          <li>second list, third level, item 2</li>
        </ol>
      </li>
      <li>
        <ul>
          <li>third level, unsorted list, item 1</li>
          <li>third level, unsorted list, item 2</li>
        </ul>
      </li>
    </ol>
  </li>
</ol>

If I change to the below, it works as expected:

<ol>
  <li>first list, first level, item 1</li>
  <li>first list, first level, item 2</li>
  <li>first list, first level, item 3</li>
</ol>

<ol>
  <li>second list, first level, item 1</li>
  <li>
    second list, first level, item 2
    <ol>
      <li>second list, second level, item 1</li>
      <li>second list, second level, item 2</li>

      <ol>
        <li>second list, third level, item 1</li>
        <li>second list, third level, item 2</li>
      </ol>

      <ul>
        <li>third level, unsorted list, item 1</li>
        <li>third level, unsorted list, item 2</li>
      </ul>
    </ol>
  </li>
</ol>

What do you think?

Ok, so I made some tests here to make sure the issues I have found.

First, I still have problems on order count... but maybe because I'm using LibreOffice for my tests, but I don't have a Word here to validate this.
Screenshot 2025-06-06 at 16 04 45

I also tested on Google Docs, and was even worst than before
Screenshot 2025-06-06 at 16 04 59

I believe there is something we could fix by understanding the counting order and try to "hack fix" this context, what you think? Try to reproduce on other editors like Google Docs to see what happens.

@taylorn-ai
Copy link
Contributor Author

That’s odd, however, python-docx was designed to work specifically with Microsoft Word documents, making it work with all other applications will be difficult.

The numbered list order seems to be working fine in my Word documents, as it uses OXmlElement, I don’t know if LibreOffice and Google Docs work 100% with OpenXML based documents.

I can have a look into this a bit further, and I’ll see what I can do. But ultimately, as I said, it was designed to work with MS Word only.

@dfop02
Copy link
Owner

dfop02 commented Jun 10, 2025

That’s odd, however, python-docx was designed to work specifically with Microsoft Word documents, making it work with all other applications will be difficult.

The numbered list order seems to be working fine in my Word documents, as it uses OXmlElement, I don’t know if LibreOffice and Google Docs work 100% with OpenXML based documents.

I can have a look into this a bit further, and I’ll see what I can do. But ultimately, as I said, it was designed to work with MS Word only.

Yes, I agree that python-docx was designed specifically for Microsoft Word, but currently, there's good harmony between the applications, with most features working quite similarly. So, I’d like to keep it that way if possible, especially since I'm using LibreOffice for my own tests. If it’s not possible, then we can proceed with a Word-focused solution.

On that note, I ran a few new tests using your branch as inspiration and managed to get it working on LibreOffice as well, aside from one small issue (which you might be able to help me with).

Screenshot 2025-06-10 at 11 34 27

As you can see, in my version it's working fine in most cases. However, when there's a restart on a nested list, I'm still struggling to get it to reset properly. The restart counter works correctly only at the base level. I tried fixing it, but so far without success. You might have some new ideas that could help me find a solution. I feel like we’re close to solving it completely.

I created a new branch called bug/fix-list-formatting-v2, check the commit here, take a look on my approach, that basically is a merge using your ideas and this StackOverflow Answer as reference.

I would appreciate it if you could test my approach on Microsoft Word to confirm that it’s still working well, and let me know if you have any ideas to fix the remaining issue.

@taylorn-ai
Copy link
Contributor Author

Just to confirm, you're expecting 3. second list, third level, item 1 and 2 to restart? So instead of 1, 2, *, *, 3, 4, it should be 1, 2, *, *, 1, 2? I would imagine we just need to restart the numbering again if we detect a new ol? I will have a look.

@taylorn-ai
Copy link
Contributor Author

OK I have tried so many things, from what i can tell, it's some issue with the abstractNumId. I did get the numbering working, but then all the indentation was broken. Honestly, I think we just leave it here as "good enough" :)

@dfop02
Copy link
Owner

dfop02 commented Jun 11, 2025

OK I have tried so many things, from what i can tell, it's some issue with the abstractNumId. I did get the numbering working, but then all the indentation was broken. Honestly, I think we just leave it here as "good enough" :)

Okay, I agree! I may include this as a known issue at least, before release new version I'll update docs with it.
Could you please update your branch with v2 version and include some unit test for this on test/test.py?
I'll take a look again and merge if it's all fine! :)

Taylor added 2 commits June 13, 2025 07:49
- 🧪 Implement test for ordered list to verify item presence
- 🧪 Implement test for unordered list to verify item presence
- 🐛 Ensure 'href' is checked for None before processing links
- 🔗 Update logic to handle cases where 'href' is not provided
@taylorn-ai
Copy link
Contributor Author

I've added a couple of tests, and also a small fix for anchor/hrefs that I noticed was causing issues for me.

@dfop02
Copy link
Owner

dfop02 commented Jun 13, 2025

OK I have tried so many things, from what i can tell, it's some issue with the abstractNumId. I did get the numbering working, but then all the indentation was broken. Honestly, I think we just leave it here as "good enough" :)

@TaylorN15 I may have misunderstood. Do you have tested my branch and validate that works fine on Word, but you couldn't fix the issue I explained about, correct?

If so, do you mind updating your branch with my branch bug/fix-formatting-list-v2 because your branch currently only works for Word and mine works on Word, LibrreOffice, and Google Docs. That's what I meant with "update with v2", let me know if it's fine for you.

Also, thanks for find and fix about anchor/link, I appreciate it 🙏

@taylorn-ai
Copy link
Contributor Author

I pulled your branch and added the unit tests but didn't merge back into mine. My bad.

@dfop02
Copy link
Owner

dfop02 commented Jun 16, 2025

@TaylorN15 today I checked your PR, just few adjustments I suggested to you, let me know when you update it!

- 🐛 Remove orphaned <li> tag handling to streamline list processing
- ✅ Updated tests for ordered and unordered list scenarios to ensure proper conversion
- ✨ Enhance test cases for better coverage of list items and styles
@taylorn-ai
Copy link
Contributor Author

@TaylorN15 today I checked your PR, just few adjustments I suggested to you, let me know when you update it!

All done :)

@dfop02
Copy link
Owner

dfop02 commented Jun 16, 2025

Nice @TaylorN15 ! I'll merge it now and release the official version until the end of this week, because I still need update the docs, thanks for your help and support! If you find something else to fix/update let me know :)

@dfop02 dfop02 merged commit f5213c2 into dfop02:release/1.0.7 Jun 16, 2025
5 checks passed
@taylorn-ai
Copy link
Contributor Author

Just realised I forgot to run flake8 afterward and there are some unused imports! Normally I use isort & black, but if i ran that over this codebase it would change too much :/

@dfop02
Copy link
Owner

dfop02 commented Jun 17, 2025

No worries, I'll update and let everything working fine before launch the new version, thx for let me know :)

@dfop02 dfop02 mentioned this pull request Jun 17, 2025
@taylorn-ai taylorn-ai deleted the bug/fix-list-formatting branch June 17, 2025 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working good first issue Good for newcomers New Feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants