Anchor link types and behaviours ================================ Generic ------- ``cmark``, ``github`` ````````````````````` A translated version of the Ruby algorithm is used in md-toc. The original one is repored here: - https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline/toc_filter.rb I could not find the code directly responsable for the anchor link generation. See also: - https://github.github.com/gfm/ - https://githubengineering.com/a-formal-spec-for-github-markdown/ - https://github.com/github/cmark/issues/65#issuecomment-343433978 Apparently GitHub (and possibly others) filter HTML tags in the anchor links. This is an undocumented feature (?) so the ``remove_html_tags`` function was added to address this problem. Instead of designing an algorithm to detect HTML tags, regular expressions came in handy. All the rules present in https://spec.commonmark.org/0.28/#raw-html have been followed by the letter. Regular expressions are divided by type and are composed at the end by concatenating all the strings. For example: .. code-block:: python :linenos: # Comment start. COS = '' # Comment. CO = COS + COT + COE HTML tags are stripped using the ``re.sub`` replace function, for example: .. code-block:: ruby line = re.sub(CO, str(), line, flags=re.DOTALL) GitHub added an extension in GFM to ignore certain HTML tags, valid at least from versions `0.27.1.gfm.3` to `0.29.0.gfm.0`: - https://github.github.com/gfm/#disallowed-raw-html-extension- - https://github.com/github/cmark-gfm/blob/fca380ca85c046233c39523717073153e2458c1e/extensions/tagfilter.c ``gitlab`` `````````` New rules have been written: - https://docs.gitlab.com/ee/user/markdown.html#header-ids-and-links ``redcarpet`` ````````````` Treats consecutive dash characters by tranforming them into a single dash character. A translated version of the C algorithm is used in md-toc. The original version is here: - https://github.com/vmg/redcarpet/blob/6270d6b4ab6b46ee6bb57a6c0e4b2377c01780ae/ext/redcarpet/html.c#L274 See also: - https://github.com/vmg/redcarpet/issues/618#issuecomment-306476184 - https://github.com/vmg/redcarpet/issues/307#issuecomment-261793668 Emphasis -------- To be able to have working anchor links, emphasis must also be removed from the link destination. ``cmark``, ``github``, ``gitlab`` `````````````````````````````````` At the moment the implementation of emnphasis removal is incomplete because of its complexity. See: - https://spec.commonmark.org/0.30/#emphasis-and-strong-emphasis The core functions for this feature have been ported directly from the original cmark source with some differences: #. things such as string manipulation, mallocs, etc are different in Python #. the ``cmark_utf8proc_charlen`` uses ``length = 1`` instead of ``length = utf8proc_utf8class[ord(line[0])]`` (causes list overflow). The ``cmark_utf8proc_charlen`` function is related to the ``cmark_utf8proc_encode_char`` function. Have a look at that function to know character lengths in cmark. In Python 3, since all characters are UTF-8 by default, they are all represented with length 1. See: - https://rosettacode.org/wiki/String_length#Python - https://docs.python.org/3/howto/unicode.html#comparing-strings As of the release md-toc 8.1.2, cmark-gfm is still at version 0.29. Moreover, certain code sections used in the emphasis processing are not the same of cmark 0.29. See this one for example: - https://github.com/github/cmark-gfm/blob/0.29.0.gfm.3/src/inlines.c#L639-L654 - https://github.com/commonmark/cmark/blob/0.29.0/src/inlines.c#L615-L621 For the moment md-toc uses the original cmark source only as reference for emphasis processing.