List items
==========
Problems
--------
We are interested in the indentation rules for sublists in all types of lists,
and in integer overflows in ordered lists.
For ordered lists, we are not concerned about using ``0`` or negative numbers
as list markers, so these cases will not be considered. In fact, ordered lists
generated by md-toc always start from ``1``.
Regarding indentation rules, I need to mention that the user is responsible
for generating a correct markdown list according to the parser's rules. Let's
see this example:
.. code-block:: markdown
# foo
## bar
### baz
There is no problem here because md-toc, using ``github`` as the parser,
renders this as:
.. code-block:: markdown
- [foo](#foo)
- [bar](#bar)
- [baz](#baz)
Now, let's take the previous example and reverse the order of the lines:
.. code-block:: markdown
### baz
## bar
# foo
and this is what md-toc renders using ``github``:
.. code-block:: markdown
- [baz](#baz)
- [foo](#foo)
- [bar](#bar)
while the user might expect this:
.. code-block:: markdown
- [baz](#baz)
- [foo](#foo)
- [bar](#bar)
Indentation
-----------
``cmark``, ``github``, ``gitlab``
`````````````````````````````````
With these parsers, sublist indentation is based on the
previous state, as stated in section 5.2 of the CommonMark
spec:
"The most important thing to notice is that the position of the text after the
list marker determines how much indentation is needed in subsequent blocks in
the list item. If the list marker takes up two spaces of indentation,
and there are three spaces between the list marker and the next character
other than a space or tab, then blocks must be indented five spaces in order
to fall under the list item."
- https://spec.commonmark.org/0.30/#list-items
This also holds in the opposite case: if the new list element needs less
indentation than the one currently being processed, we have to reuse the
number of indentation spaces that was used somewhere earlier in the list.
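
The rule quoted above can be illustrated with a minimal Python sketch. It is
not md-toc's implementation: ``content_offset`` is a hypothetical helper that
only handles spaces, and it ignores the edge cases of the specification (tabs,
empty items, five or more spaces after the marker).

.. code-block:: python

    def content_offset(line: str) -> int:
        """Return the column where the content of a list item starts.

        Returns 0 if the line does not look like a simple list item.
        Hypothetical helper: only the basic case of the rule is covered.
        """
        stripped = line.lstrip(' ')
        indent = len(line) - len(stripped)

        if stripped[:1] in ('-', '+', '*'):
            # Unordered list marker: a single character.
            marker_width = 1
        else:
            # Ordered list marker: 1 to 9 digits followed by '.' or ')'.
            digits = 0
            while digits < len(stripped) and stripped[digits] in '0123456789':
                digits += 1
            if not 1 <= digits <= 9:
                return 0
            if stripped[digits:digits + 1] not in ('.', ')'):
                return 0
            marker_width = digits + 1

        rest = stripped[marker_width:]
        spaces = len(rest) - len(rest.lstrip(' '))
        if spaces == 0 or spaces == len(rest):
            # No space after the marker, or an empty item: out of scope.
            return 0

        # Blocks that belong to this item must be indented by this amount.
        return indent + marker_width + spaces

With a marker that takes two columns followed by three spaces, such as
``1.   foo``, the function returns ``5``, which matches the example given in
the quote.
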
``redcarpet``
`````````````
- https://github.com/vmg/redcarpet/blob/6270d6b4ab6b46ee6bb57a6c0e4b2377c01780ae/ext/redcarpet/markdown.c#L1553
- https://github.com/vmg/redcarpet/blob/6270d6b4ab6b46ee6bb57a6c0e4b2377c01780ae/ext/redcarpet/markdown.c#L1528
The following C function returns the offset of the first character
after the list marker and the space that follows it. The value ``0`` is
returned if the input line is not a list element. List item rules are
explained in the single line comments.
.. code-block:: c
:linenos:
/* prefix_uli • returns unordered list item prefix */
static size_t
prefix_uli(uint8_t *data, size_t size)
{
size_t i = 0;
// There can be up to 3 whitespaces before the list marker.
if (i < size && data[i] == ' ') i++;
if (i < size && data[i] == ' ') i++;
if (i < size && data[i] == ' ') i++;
// The next non-whitespace character must be a list marker and
// the character after the list marker must be a whitespace.
if (i + 1 >= size ||
(data[i] != '*' && data[i] != '+' && data[i] != '-') ||
data[i + 1] != ' ')
return 0;
// Check that the next line is not a header
// that uses the `-` or `=` characters as markers.
if (is_next_headerline(data + i, size - i))
return 0;
// Return the offset of the first character after the list marker and its trailing space.
return i + 2;
}
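
To make the rules easier to try out, here is a rough Python port of
``prefix_uli``. It is a sketch under one stated assumption: the
``is_next_headerline`` check is left out, so a line followed by a setext
header underline is not detected.

.. code-block:: python

    def prefix_uli(line: str) -> int:
        """Return the offset after an unordered list marker, or 0.

        Rough port of the C function above. Assumption: the check for a
        following header line (is_next_headerline) is omitted.
        """
        size = len(line)
        i = 0

        # There can be up to 3 spaces before the list marker.
        while i < 3 and i < size and line[i] == ' ':
            i += 1

        # The next character must be a list marker followed by a space.
        if i + 1 >= size or line[i] not in '*+-' or line[i + 1] != ' ':
            return 0

        # Offset of the first character after the marker and its space.
        return i + 2

For example ``prefix_uli('   - foo')`` gives ``5``, while a line with 4
leading spaces gives ``0`` because it is not treated as a list item at this
level.
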
As far as I can tell from the previous and other functions, on a new list
block the 4 spaces indentation rule applies:
- https://github.com/vmg/redcarpet/blob/6270d6b4ab6b46ee6bb57a6c0e4b2377c01780ae/ext/redcarpet/markdown.c#L1822
- https://github.com/vmg/redcarpet/blob/6270d6b4ab6b46ee6bb57a6c0e4b2377c01780ae/ext/redcarpet/markdown.c#L1873
This means that anything indented by more than 3 whitespaces is considered a
sublist. The only exception seems to be the first sublist in a list
block, in which case even a single whitespace counts as a sublist.
The 4 spaces indentation rule applies nonetheless, so to keep things simple
md-toc will always use 4 whitespaces for sublists. Apparently, ordered and
unordered lists share the same properties.
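
The fixed 4 whitespaces choice can be sketched as follows. This is only an
illustration, not md-toc's actual code: ``redcarpet_indentation`` and
``toc_line`` are hypothetical names, and ``level`` is assumed to be the
nesting level of the list item, starting from ``1``.

.. code-block:: python

    def redcarpet_indentation(level: int, spaces_per_level: int = 4) -> str:
        """Return the leading whitespace for a list item at a nesting level.

        Hypothetical helper: level 1 is the top level of the list.
        """
        if level < 1:
            raise ValueError('level must be >= 1')
        return ' ' * (spaces_per_level * (level - 1))


    def toc_line(level: int, text: str, anchor: str) -> str:
        """Build one unordered TOC line using the fixed indentation."""
        return redcarpet_indentation(level) + '- [' + text + '](#' + anchor + ')'

Each nesting level adds exactly 4 whitespaces, so a level 3 item is preceded
by 8 whitespaces regardless of the indentation of the items before it.
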
Let's see this example:
::
- I
- am
- foo
stop
- I
- am
- foo
This is how redcarpet renders it once you run ``$ redcarpet``:
.. code-block:: html
stop
What follows is an extract of a C function in redcarpet that parses list
items. I have added all the single line comments.
.. code-block:: c
:linenos:
/* parse_listitem • parsing of a single list item */
/* assuming initial prefix is already removed */
static size_t
parse_listitem(struct buf *ob, struct sd_markdown *rndr, uint8_t *data,
size_t size, int *flags)
{
struct buf *work = 0, *inter = 0;
size_t beg = 0, end, pre, sublist = 0, orgpre = 0, i;
int in_empty = 0, has_inside_empty = 0, in_fence = 0;
// This is the base case, usually of indentation 0 but it can be
// from 0 to 3 spaces. If it was 4 spaces it would be a code
// block.
/* keeping track of the first indentation prefix */
while (orgpre < 3 && orgpre < size && data[orgpre] == ' ')
orgpre++;
// Get the first index of string after the list marker. Try both
// ordered and unordered lists
beg = prefix_uli(data, size);
if (!beg)
beg = prefix_oli(data, size);
if (!beg)
return 0;
/* skipping to the beginning of the following line */
end = beg;
while (end < size && data[end - 1] != '\n')
end++;
// Iterate line by line using the '\n' character as delimiter.
/* process the following lines */
while (beg < size) {
size_t has_next_uli = 0, has_next_oli = 0;
// Go to the next line.
end++;
// Find the end of the line.
while (end < size && data[end - 1] != '\n')
end++;
// Skip the next line if it is empty.
/* process an empty line */
if (is_empty(data + beg, end - beg)) {
in_empty = 1;
beg = end;
continue;
}
// Count up to 4 characters of indentation.
// If we have 4 characters then it might be a sublist.
// Note that this is an offset and does not point to an
// index in the actual line string.
/* calculating the indentation */
i = 0;
while (i < 4 && beg + i < end && data[beg + i] == ' ')
i++;
pre = i;
/* Only check for new list items if we are **not** inside
* a fenced code block */
if (!in_fence) {
has_next_uli = prefix_uli(data + beg + i, end - beg - i);
has_next_oli = prefix_oli(data + beg + i, end - beg - i);
}
/* checking for ul/ol switch */
if (in_empty && (
((*flags & MKD_LIST_ORDERED) && has_next_uli) ||
(!(*flags & MKD_LIST_ORDERED) && has_next_oli))){
*flags |= MKD_LI_END;
break; /* the following item must have same list type */
}
// Determine if we are dealing with:
// - an empty line
// - a new list item
// - a sublist
/* checking for a new item */
if ((has_next_uli && !is_hrule(data + beg + i, end - beg - i)) || has_next_oli) {
if (in_empty)
has_inside_empty = 1;
// The next list item's indentation (pre) must be the same as
// the previous one (orgpre), otherwise it might be a
// sublist.
if (pre == orgpre) /* the following item must have */
break; /* the same indentation */
// If the indentation does not match the previous one then
// assume that it is a sublist. Check later whether it is
// or not.
if (!sublist)
sublist = work->size;
}
/* joining only indented stuff after empty lines */
else if (in_empty && i < 4 && data[beg] != '\t') {
*flags |= MKD_LI_END;
break;
}
else if (in_empty) {
// Add a line delimiter to the next line if it is missing.
bufputc(work, '\n');
has_inside_empty = 1;
}
in_empty = 0;
beg = end;
}
if (*flags & MKD_LI_BLOCK) {
/* intermediate render of block li */
if (sublist && sublist < work->size) {
parse_block(inter, rndr, work->data, sublist);
parse_block(inter, rndr, work->data + sublist, work->size - sublist);
}
else
parse_block(inter, rndr, work->data, work->size);
}
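
The indentation comparison in the extract above (``pre`` against ``orgpre``)
can be sketched in Python. This is a simplification based on my reading of
the C code: the helper names are hypothetical and the line passed in is
assumed to be already recognized as a list item.

.. code-block:: python

    def leading_spaces(line: str, limit: int = 4) -> int:
        """Count leading spaces, stopping at ``limit`` like the C loop."""
        i = 0
        while i < limit and i < len(line) and line[i] == ' ':
            i += 1
        return i


    def classify_item(orgpre: int, line: str) -> str:
        """Classify a following list item relative to the first one.

        ``orgpre`` is the indentation (0 to 3) of the first list item.
        Assumption: ``line`` is known to be a list item.
        """
        pre = leading_spaces(line)
        if pre == orgpre:
            # Same indentation: the current item ends here.
            return 'sibling'
        # Different indentation: treated as a sublist candidate.
        return 'sublist'

With ``orgpre`` equal to ``0``, both ``' - am'`` and ``'    - am'`` are
classified as sublist candidates, which is consistent with the single
whitespace exception mentioned earlier.
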
According to the code, ``parse_listitem`` is called indirectly by
``parse_block`` (via ``parse_list``), while ``parse_block`` is called directly
by ``parse_listitem``. Because of this mutual recursion the code analysis
is not trivial, so I might be mistaken about the 4 spaces
indentation rule.
- https://github.com/vmg/redcarpet/blob/6270d6b4ab6b46ee6bb57a6c0e4b2377c01780ae/ext/redcarpet/markdown.c#L2418
- https://github.com/vmg/redcarpet/blob/6270d6b4ab6b46ee6bb57a6c0e4b2377c01780ae/ext/redcarpet/markdown.c#L1958
Here is an extract of the ``parse_block`` function with the calls to
``parse_list``:
.. code-block:: c
:linenos:
/* parse_block • parsing of one block, returning next uint8_t to parse */
static void
parse_block(struct buf *ob, struct sd_markdown *rndr,
uint8_t *data, size_t size)
{
while (beg < size) {
else if (prefix_uli(txt_data, end))
beg += parse_list(ob, rndr, txt_data, end, 0);
else if (prefix_oli(txt_data, end))
beg += parse_list(ob, rndr, txt_data, end, MKD_LIST_ORDERED);
}
}
Overflows
---------
``cmark``, ``github``, ``gitlab``
`````````````````````````````````
Ordered list markers are limited to 9 digits, so they cannot exceed
``999999999`` according to the following. If a marker exceeds that value, a
``GithubOverflowOrderedListMarker`` exception is raised:
- https://spec.commonmark.org/0.30/#ordered-list-marker
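
A minimal sketch of such a check, assuming the limit of 1 to 9 digits given
by the CommonMark specification. The exception class and the function are
hypothetical stand-ins defined here for illustration; they are not md-toc's
own objects.

.. code-block:: python

    # The CommonMark spec allows ordered list markers of 1 to 9 digits.
    MAX_MARKER_DIGITS = 9
    MAX_MARKER_NUMBER = 10 ** MAX_MARKER_DIGITS - 1


    class OrderedListMarkerOverflow(Exception):
        """Hypothetical stand-in for the exception raised by md-toc."""


    def check_marker(number: int) -> int:
        """Return ``number`` if it fits in a marker, raise otherwise."""
        if number > MAX_MARKER_NUMBER:
            raise OrderedListMarkerOverflow(
                'marker ' + str(number) + ' has more than '
                + str(MAX_MARKER_DIGITS) + ' digits'
            )
        return number

A caller would typically run the check before emitting each list item, so
that an oversized marker is reported instead of producing a list that the
parser no longer recognizes.
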
``redcarpet``
`````````````
Apparently there are no cases of ordered list marker
overflows:
- https://github.com/vmg/redcarpet/blob/6270d6b4ab6b46ee6bb57a6c0e4b2377c01780ae/ext/redcarpet/markdown.c#L1529
Notes on ordered lists
----------------------
``cmark``, ``github``, ``gitlab``
`````````````````````````````````
Ordered list markers may start with any integer (except special cases).
Any following number is ignored and subsequent numbering is progressive:
- https://spec.commonmark.org/0.30/#start-number
However, in practice this is not always true: nested lists
do not follow the specification. See:
- https://github.com/frnmst/md-toc/issues/23
Markers cannot be negative:
- https://spec.commonmark.org/0.30/#example-239
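
The start number rule of the specification can be sketched as follows. This
covers only the behaviour described by the spec, not the nested list
deviation reported in the issue above, and ``rendered_numbers`` is a
hypothetical helper.

.. code-block:: python

    from typing import List


    def rendered_numbers(markers: List[int]) -> List[int]:
        """Return the numbers shown for a list written with ``markers``.

        Only the first marker matters: the following ones are ignored
        and the numbering continues progressively from the first.
        """
        if not markers:
            return []
        start = markers[0]
        return [start + i for i in range(len(markers))]

A list written with the markers ``7``, ``1``, ``1`` is therefore shown as
``7``, ``8``, ``9``.
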
``redcarpet``
`````````````
Ordered lists do not use the ``start`` HTML attribute:
any number is ignored and lists always start from 1. See:
- https://github.com/vmg/redcarpet/blob/6270d6b4ab6b46ee6bb57a6c0e4b2377c01780ae/test/MarkdownTest_1.0/Tests/Markdown%20Documentation%20-%20Syntax.html#L323