md-toc
8.1.3

Anchor link types and behaviours

Generic

cmark, github

md-toc uses a translated version of the Ruby algorithm. The original is reported here:

  • https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline/toc_filter.rb

I could not find the code directly responsible for the anchor link generation. See also:

  • https://github.github.com/gfm/

  • https://githubengineering.com/a-formal-spec-for-github-markdown/

  • https://github.com/github/cmark/issues/65#issuecomment-343433978

Apparently GitHub (and possibly others) filters HTML tags out of the anchor links. This is an undocumented feature (?), so the remove_html_tags function was added to handle it. Regular expressions are used to detect HTML tags instead of a dedicated algorithm. All the rules in https://spec.commonmark.org/0.28/#raw-html have been followed to the letter. The regular expressions are split by type and composed at the end by concatenating the strings. For example:

# Comment start.
COS = '<!--'
# Comment text.
COT = '((?!>|->)(?:(?!--).))+(?!-).?'
# Comment end.
COE = '-->'
# Comment.
CO = COS + COT + COE

HTML tags are stripped using the re.sub replace function, for example:

line = re.sub(CO, str(), line, flags=re.DOTALL)
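The two snippets above can be combined into a small runnable sketch. It reuses the COS, COT and COE strings shown above. The function name strip_html_comments is hypothetical and only covers HTML comments, so it is not md-toc's remove_html_tags function.

```python
import re

# Pattern pieces copied from the example above.
COS = '<!--'                            # Comment start.
COT = '((?!>|->)(?:(?!--).))+(?!-).?'   # Comment text.
COE = '-->'                             # Comment end.
CO = COS + COT + COE                    # Whole comment.


def strip_html_comments(line: str) -> str:
    """Remove HTML comments from line (hypothetical helper, comments only)."""
    # re.DOTALL lets '.' match newlines, so multi-line comments are removed too.
    return re.sub(CO, '', line, flags=re.DOTALL)


print(strip_html_comments('Title <!-- a comment --> here'))
```

Only the comment itself is removed: the spaces on either side of it stay in the line.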

GitHub added an extension in GFM to ignore certain HTML tags, valid at least from versions 0.27.1.gfm.3 to 0.29.0.gfm.0:

  • https://github.github.com/gfm/#disallowed-raw-html-extension-

  • https://github.com/github/cmark-gfm/blob/fca380ca85c046233c39523717073153e2458c1e/extensions/tagfilter.c
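As a rough illustration of that extension, the sketch below escapes the leading "<" of the tag names listed in the GFM "Disallowed Raw HTML" section. The function name and the regular expression are assumptions made for this example. They are not taken from md-toc or ported from tagfilter.c.

```python
import re

# Tag names listed in the GFM "Disallowed Raw HTML" extension.
DISALLOWED_TAGS = (
    'title', 'textarea', 'style', 'xmp', 'iframe',
    'noembed', 'noframes', 'script', 'plaintext',
)

# A '<' that opens (or closes) one of the tags above. The tag name must be
# followed by whitespace, '>' or '/>' so that e.g. '<titles>' is left alone.
_DISALLOWED_TAG_START = re.compile(
    r'<(?=/?(?:' + '|'.join(DISALLOWED_TAGS) + r')(?:\s|/?>))',
    re.IGNORECASE,
)


def filter_disallowed_tags(html: str) -> str:
    """Replace the leading '<' of disallowed tags with '&lt;' (hypothetical helper)."""
    return _DISALLOWED_TAG_START.sub('&lt;', html)


print(filter_disallowed_tags('<title>x</title> <b>y</b>'))
```

Tags outside the list, such as <b> in the usage line, pass through unchanged.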

gitlab

New rules have been written for GitLab, based on:

  • https://docs.gitlab.com/ee/user/markdown.html#header-ids-and-links
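A minimal sketch of such rules follows. It assumes the behaviour described on the linked GitLab page: lowercase the text, drop non-word characters, turn spaces into hyphens, collapse repeated hyphens, and append an incrementing number to duplicates. The function name and the seen dictionary are hypothetical, and the code is not md-toc's implementation.

```python
import re


def gitlab_style_anchor(text: str, seen: dict) -> str:
    """Build a GitLab-style header ID (illustrative sketch only).

    seen maps already generated IDs to the number of times they occurred
    and is updated in place.
    """
    anchor = text.lower()
    # Drop everything except word characters, hyphens and spaces.
    anchor = re.sub(r'[^\w\- ]', '', anchor)
    # Spaces become hyphens, then runs of hyphens collapse into one.
    anchor = anchor.replace(' ', '-')
    anchor = re.sub(r'-{2,}', '-', anchor)
    # Duplicates get an incrementing suffix, starting at 1.
    count = seen.get(anchor, 0)
    seen[anchor] = count + 1
    return anchor if count == 0 else f'{anchor}-{count}'


seen_ids = {}
print(gitlab_style_anchor('This header has spaces in it', seen_ids))
print(gitlab_style_anchor('This header has spaces in it', seen_ids))
```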

redcarpet

Redcarpet treats consecutive dash characters by transforming them into a single dash character. md-toc uses a translated version of the C algorithm. The original version is here:

  • https://github.com/vmg/redcarpet/blob/6270d6b4ab6b46ee6bb57a6c0e4b2377c01780ae/ext/redcarpet/html.c#L274

See also:

  • https://github.com/vmg/redcarpet/issues/618#issuecomment-306476184

  • https://github.com/vmg/redcarpet/issues/307#issuecomment-261793668
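The dash handling described above can be illustrated with the loop below. It is a deliberately simplified sketch, not a port of the linked C function: it covers only the collapsing of consecutive dashes, and the function name is hypothetical.

```python
def collapse_dashes(text: str) -> str:
    """Collapse each run of '-' characters into a single '-' (sketch only)."""
    output = []
    previous_was_dash = False
    for character in text:
        if character == '-':
            # Keep only the first dash of a run.
            if not previous_was_dash:
                output.append(character)
            previous_was_dash = True
        else:
            output.append(character)
            previous_was_dash = False
    return ''.join(output)


print(collapse_dashes('a---b--c-d'))
```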

Emphasis

To be able to have working anchor links, emphasis must also be removed from the link destination.

cmark, github, gitlab

At the moment the implementation of emphasis removal is incomplete because of its complexity. See:

  • https://spec.commonmark.org/0.30/#emphasis-and-strong-emphasis

The core functions for this feature have been ported directly from the original cmark source with some differences:

  1. things such as string manipulation, mallocs, etc are different in Python

  2. the cmark_utf8proc_charlen uses length = 1 instead of length = utf8proc_utf8class[ord(line[0])] (causes list overflow).

    The cmark_utf8proc_charlen function is related to the cmark_utf8proc_encode_char function. Have a look at that function to know character lengths in cmark.

    In Python 3, strings are sequences of Unicode code points rather than UTF-8 bytes, so each character has length 1 regardless of how many bytes its UTF-8 encoding needs. See:

    • https://rosettacode.org/wiki/String_length#Python

    • https://docs.python.org/3/howto/unicode.html#comparing-strings
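The difference between the two notions of length can be checked directly in Python. In this short sketch the helper name utf8_byte_length is hypothetical. It returns the number of bytes a character occupies once encoded as UTF-8, which is the kind of per-character length a C implementation working on UTF-8 bytes has to track.

```python
def utf8_byte_length(character: str) -> int:
    """Return the number of bytes used by the UTF-8 encoding of character."""
    return len(character.encode('utf-8'))


# len() on a Python 3 string counts code points, so each of these is 1,
# while the UTF-8 encodings need 1, 2, 3 and 4 bytes respectively.
for character in ('a', '\u00e8', '\u20ac', '\U0001F600'):
    print(len(character), utf8_byte_length(character))
```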

As of the md-toc 8.1.2 release, cmark-gfm is still at version 0.29. Moreover, certain code sections used in emphasis processing differ from those in cmark 0.29. Compare these two, for example:

  • https://github.com/github/cmark-gfm/blob/0.29.0.gfm.3/src/inlines.c#L639-L654

  • https://github.com/commonmark/cmark/blob/0.29.0/src/inlines.c#L615-L621

For the moment md-toc uses the original cmark source only as reference for emphasis processing.


© Copyright 2017-2022, Franco Masotti. Last updated on 2022-04-20 12:17:52 +0200.
