Fix for insufficient nested lists markdown indentation #289

Merged
merged 1 commit into from
Apr 8, 2025

@pcraig3 pcraig3 commented Apr 8, 2025

Since I last updated markdownify, our nested lists have not been interpreted correctly in markdown.

The issue appears to be that instead of using tabs as the indentation for nested lists, the new update has been using 2 or 3 spaces, depending on whether the list was a UL or an OL.

So for this UL as HTML:

```
<ul>
  <li>
    Unordered list level 1
    <ul>
      <li>
        Unordered list level 2
        <ul>
          <li>
            Unordered list level 3
            <ul>
              <li>Unordered list level 4</li>
            </ul>
          </li>
          <li>Unordered list level 3</li>
        </ul>
      </li>
      <li>Unordered list level 2</li>
    </ul>
  </li>
  <li>Unordered list level 1</li>
</ul>
```

this is what we would get as the markdown string:

```
* Unordered list level 1
  + Unordered list level 2
    - Unordered list level 3
      * Unordered list level 4
    - Unordered list level 3
  + Unordered list level 2
* Unordered list level 1
```

This looks basically right, except that when we re-interpreted it using the Markdown library, the library only recognized 4 spaces as a nested list, not the 2 spaces we have here.

So once we re-converted this to HTML, it would look like this:

```
* Unordered list level 1
* Unordered list level 2
    - Unordered list level 3
    - Unordered list level 4
    - Unordered list level 3
* Unordered list level 2
* Unordered list level 1
```

That is, the first and second levels both became root-level lists, and the 3rd and 4th levels were both collapsed into a single nested level.

We saw the same problem with our ordered lists. For this OL as HTML:

```
<ol>
  <li>
    Ordered list level 1
    <ol>
      <li>
        Ordered list level 2
        <ol>
          <li>
            Ordered list level 3
            <ol>
              <li>Ordered list level 4</li>
            </ol>
          </li>
          <li>Ordered list level 3</li>
        </ol>
      </li>
      <li>Ordered list level 2</li>
    </ol>
  </li>
  <li>Ordered list level 1</li>
</ol>
```

as markdown:

```
1. Ordered list level 1
   1. Ordered list level 2
      1. Ordered list level 3
         1. Ordered list level 4
      2. Ordered list level 3
   2. Ordered list level 2
2. Ordered list level 1
```
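
The 2-versus-3 difference lines up with the marker widths in the two outputs above: `* ` is 2 characters wide and `1. ` is 3. Here is a minimal sketch, assuming each nested level is indented by the width of its parent's marker. That assumption is an inference from the output shown here, not a description of markdownify's internals, and `render_list` and the tuple-based tree are hypothetical names used only for illustration:

```python
BULLETS = "*+-"  # bullet characters cycle by depth, as in the UL output above


def render_list(items, ordered, indent="", depth=0):
    """Render (text, children) tuples, indenting children by the parent marker width."""
    lines = []
    for number, (text, children) in enumerate(items, start=1):
        marker = f"{number}. " if ordered else BULLETS[depth % len(BULLETS)] + " "
        lines.append(indent + marker + text)
        # Assumption: children align with the parent's text, i.e. marker width.
        child_indent = indent + " " * len(marker)
        lines.extend(render_list(children, ordered, child_indent, depth + 1))
    return lines


# Same shape as the HTML examples: four levels, with siblings at levels 1 to 3.
tree = [
    ("level 1", [
        ("level 2", [
            ("level 3", [("level 4", [])]),
            ("level 3", []),
        ]),
        ("level 2", []),
    ]),
    ("level 1", []),
]

ul_lines = render_list(tree, ordered=False)  # 2-space steps
ol_lines = render_list(tree, ordered=True)   # 3-space steps
print("\n".join(ul_lines))
print("\n".join(ol_lines))
```

Under this assumption the unordered list steps by 2 spaces per level and the ordered list by 3, which matches the two markdown strings above.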

Once this was re-interpreted by the Markdown library, we were getting a result like this:

```
1. Ordered list level 1
2. Ordered list level 2
     1. Ordered list level 3
         1. Ordered list level 4
     2. Ordered list level 3
3. Ordered list level 2
4. Ordered list level 1
```

Again, not what we want.

So this update forces every nested-list indent to be 4 spaces, not 2 or 3 as before.
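
To illustrate the intended result (not how this PR implements it), here is a rough post-processing sketch that rewrites list-item indentation to 4 spaces per level. The function name `reindent_lists` and the regex are hypothetical, and the sketch only handles simple one-line list items:

```python
import re

# A list item: optional leading spaces, then a bullet or "N." marker and a space.
LIST_ITEM = re.compile(r"^( *)([*+-]|\d+\.) ")


def reindent_lists(markdown_text: str, width: int = 4) -> str:
    """Re-indent nested list items so each level is `width` spaces deep."""
    out = []
    stack = []  # original indents of the list levels that are currently open
    for line in markdown_text.splitlines():
        match = LIST_ITEM.match(line)
        if match is None:
            stack.clear()  # any non-list line ends the current list (simplification)
            out.append(line)
            continue
        indent = len(match.group(1))
        while stack and stack[-1] > indent:
            stack.pop()           # close levels deeper than this item
        if not stack or stack[-1] < indent:
            stack.append(indent)  # open a new, deeper level
        level = len(stack) - 1
        out.append(" " * (width * level) + line.lstrip(" "))
    return "\n".join(out)


ul = "* a\n  + b\n    - c\n      * d\n    - c\n  + b\n* a"
ol = "1. a\n   1. b\n      1. c\n         1. d\n      2. c\n   2. b\n2. a"
print(reindent_lists(ul))
print(reindent_lists(ol))
```

Both the 2-space and the 3-space inputs come out with 0, 4, 8, and 12 spaces for levels 1 to 4.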

Here is the commit that changed this behaviour (which we are partially undoing here): matthewwithanm/python-markdownify@c13bdd5

Screenshots

This is what it looked like to import a Document with the following HTML before and after this change:

example as unstyled HTML

Screenshot 2025-04-08 at 6 27 47 PM

rendered in the app

| UL before | UL after |
| --- | --- |
| Screenshot 2025-04-08 at 6 25 17 PM | Screenshot 2025-04-08 at 6 26 17 PM |

| OL before | OL after |
| --- | --- |
| Screenshot 2025-04-08 at 6 25 31 PM | Screenshot 2025-04-08 at 6 26 29 PM |

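Note that the first OL (before) still shows 3 levels because the indent was 3 spaces per level, so the 4th level was indented 3 * 3 = 9 spaces. Nested lists are only recognized at intervals of 4 spaces, and 9 > 8, so that item counted as a second nested level.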
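
The arithmetic can be checked with a small model. This is a simplification that assumes the recognized nesting depth is just the indent divided by 4, rounded down. The helper `recognized_depth` is hypothetical and is not the Markdown library's actual parser:

```python
def recognized_depth(indent_spaces: int, step: int = 4) -> int:
    """Nesting depth under the assumed 'every 4 spaces' rule."""
    return indent_spaces // step


# Indents for levels 1 to 4 before the fix: 2 spaces per level (UL), 3 per level (OL).
ul_depths = [recognized_depth(2 * level) for level in range(4)]
ol_depths = [recognized_depth(3 * level) for level in range(4)]

print(ul_depths)  # [0, 0, 1, 1] -> only 2 distinct levels
print(ol_depths)  # [0, 0, 1, 2] -> 3 distinct levels, because 9 // 4 == 2

# With 4 spaces per level, all four levels are distinct.
fixed_depths = [recognized_depth(4 * level) for level in range(4)]
print(fixed_depths)  # [0, 1, 2, 3]
```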

PR: #267

@pcraig3 pcraig3 merged commit 7989c34 into main Apr 8, 2025
4 checks passed
@pcraig3 pcraig3 deleted the fix-ul-ol-indent branch April 8, 2025 22:51