Anchor link types and behaviours#
Generic#
cmark, github#
A translated version of the Ruby algorithm is used in md-toc. The original one is reported here:
I could not find the code directly responsible for the anchor link generation. See also:
Apparently GitHub (and possibly others) filters HTML tags out of the anchor links.
This appears to be an undocumented feature, so the remove_html_tags function was added to address this problem. Instead of designing an algorithm to detect HTML tags, regular expressions are used. All the rules present in https://spec.commonmark.org/0.28/#raw-html have been followed to the letter. Regular expressions are divided by type and are composed at the end by concatenating all the strings. For example:
# Comment start.
COS = '<!--'
# Comment text.
COT = '((?!>|->)(?:(?!--).))+(?!-).?'
# Comment end.
COE = '-->'
# Comment.
CO = COS + COT + COE
HTML tags are stripped using the re.sub replace function, for example:
line = re.sub(CO, str(), line, flags=re.DOTALL)
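The two snippets above can be combined into a small runnable sketch. It covers HTML comments only, whereas remove_html_tags handles the other raw HTML types as well. The function name strip_html_comments is hypothetical and used only for illustration.

```python
import re

# Comment start.
COS = '<!--'
# Comment text.
COT = '((?!>|->)(?:(?!--).))+(?!-).?'
# Comment end.
COE = '-->'
# Comment: the full pattern is the concatenation of the three parts.
CO = COS + COT + COE


def strip_html_comments(line: str) -> str:
    """Remove HTML comments from a line (hypothetical helper, comments only)."""
    # re.DOTALL lets '.' match newlines, so multi-line comments are removed too.
    return re.sub(CO, str(), line, flags=re.DOTALL)


print(strip_html_comments('foo <!-- a comment --> bar'))
```

The text surrounding the comment is left untouched, so the spaces on both sides of the removed comment remain in the result.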
GitHub added an extension in GFM to ignore certain HTML tags, valid at least from versions 0.27.1.gfm.3 to 0.29.0.gfm.0:
gitlab#
New rules have been written:
redcarpet#
Treats consecutive dash characters by transforming them into a single dash character. A translated version of the C algorithm is used in md-toc. The original version is here:
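The dash handling described above can be illustrated in isolation with a minimal sketch. This is not the translated C algorithm used in md-toc, which performs other transformations as well; it shows only the collapsing step, and the function name collapse_dashes is hypothetical.

```python
import re


def collapse_dashes(text: str) -> str:
    """Replace every run of consecutive dashes with a single dash."""
    # '-+' matches one or more dashes; each run is replaced by exactly one.
    return re.sub(r'-+', '-', text)


print(collapse_dashes('some--heading---text'))
```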
See also:
Emphasis#
To be able to have working anchor links, emphasis must also be removed from the link destination.
cmark, github, gitlab#
At the moment the implementation of emphasis removal is incomplete because of its complexity. See:
The core functions for this feature have been ported directly from the original cmark source with some differences:
things such as string manipulation, mallocs, etc. are different in Python
the cmark_utf8proc_charlen function uses length = 1 instead of length = utf8proc_utf8class[ord(line[0])] (the latter causes a list overflow). The cmark_utf8proc_charlen function is related to the cmark_utf8proc_encode_char function. Have a look at that function to know character lengths in cmark. In Python 3, strings are sequences of Unicode code points rather than UTF-8 bytes, so every character is represented with length 1. See:
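The difference between the two length notions can be checked with a short sketch. It compares the Python 3 length of a single character with the number of bytes in its UTF-8 encoding, which is the quantity a byte-oriented C implementation has to deal with. The helper name utf8_encoded_length is hypothetical and is not part of md-toc or cmark.

```python
def utf8_encoded_length(char: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(char.encode('utf-8'))


# 'a', e with acute accent, euro sign, grinning face emoji.
for char in ('a', '\u00e9', '\u20ac', '\U0001F600'):
    # len(char) is always 1 in Python 3; the UTF-8 byte length varies.
    print(repr(char), len(char), utf8_encoded_length(char))
```

Because Python indexes strings by code point, a fixed length of 1 is enough when stepping through a line one character at a time.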
As of the md-toc 8.1.2 release, cmark-gfm is still at version 0.29. Moreover, certain code sections used in the emphasis processing are not the same as in cmark 0.29. See this one for example:
For the moment md-toc uses the original cmark source only as reference for emphasis processing.