Embedding HTML in Markdown
discussing the requirements, and challenges, of using Redcarpet's markdown engine to mix html snippets with markdown content
Arit has been adding a "unified embed" tag to replace the numerous named liquid tags in Forem. Rather than {% youtube 123123123 %}
, users will use {% embed youtube.com/watch/123123123 %}
or similar. In either case, the desired outcome is a frame or embedding container is added to the rendered html from the tag.
When testing some of these, I noticed we would see html comment tags being rendered as text (instead of <!-- comment -->
you would see <!-- comment --%gt;
on the page.
Arit also sees this, but says she cannot replicate it in production, that it only happens locally. I'm as confused as she is and up for a sleuthing session.
Isolating the source of the issue
I looked at our articles controller's preview action, since I know that when you type markdown, you can see the rendered html either by posting/saving (create) or previewing (the preview tab has an async request to send the body markdown, and receive rendered html back).
Let's check what's there - it's a fairly straightforward setup of a few processes - "fix for preview" (presumably this sanitizes or cleans up input in a way that's similar to normal fixup, in my simple case it didn't change the input so it's safe for me to ignore that line)
The FrontMatterParser#call
is to handle the yaml front matter (if you use the Forem v1 Editor rather than the v2 Editor, the front matter is between the ----
lines at the top of the body, and lets you set the title, cover image, tags, while that's handled by UI elements in the v2 editor. It's also safe to ignore that, and assume that parsed == fixed_body_markdown == params[:article_body]
- i.e. the first two lines made no changes to the input.
Since testing code in controllers is a headache, I leveraged a helpful fact about Forem's Article class, it persists both the body_markdown
(almost raw markdown input from the user) and the processed_html
(the result of passing the markdown input to the rendering stage). So the fastest way to test this on real input from the console was to save an article with an embed tag, and then use it's body_markdown
attribute to get the parsed_markdown
and finalize it.
Here's a comment embed that I was working with (this comment belongs to a faker generated user's faker generated comment on my local dev environment, you probably don't have this url).
The problem ensues when we embed the comment and generate the processed_html for the article, we get something we didn't want, a code block with the html escaped closing div tag, and a p tag with the html escaped comment about the end of the comment liquid tag view template.
Everything here was fine until the closing </p>
after gluten free, when a div is created to house a <pre><code>
block (it's larger then it might be because controls for the code block are added during processing, but the core is <code></div>\n</code></pre>
).
After that code block another line is added with what was intended to be an html comment, now showing in the article body as text:
So something is going wrong. But what?
Tracing the Code along its path
ArticlesController#preview
Let's go back to the controller snippet and try to simulate this.
So, one thing I would check when the html for a rendered template comes out malformed is the template. It's here https://github.com/forem/forem/blob/main/app/views/comments/_liquid.html.erb (that comment sure makes finding it easy!) - and it looks right (there aren't any mismatched tags, the parts of the template inside the <% if comment %>
block seem to have been added to the rendered view, and the parts that are in the else block aren't rendered. So far so good (we could write a unit test for the view and pass a comment in the context, but there's an existing test in https://github.com/forem/forem/blob/main/spec/liquid_tags/comment_tag_spec.rb that covers the core idea of what should and shouldn't be showing here).
MarkdownProcessor::Parse#finalize
So let's take a look at this MarkdownProcessor::Parser#finalize
method and see what it does. We'll be passing it a string (it's the body markdown from the comment), a source and user keyword.
If you're not used to pry, there are a few convenience utilities we can use. In my case, I'll use "show-source" and "cd". cd
changes the current context so methods/names are looked up inside an object (instead of in Kernel or whatever object is the top-level when you start pry). I don't know that irb
has the same facilities, but Forem's rails console launches a pry session for you, so this is accessible:
Note, we render twice, once before interpreting liquid tags, render(escaped_content)
, and once after render(parsed_liquid.render).
We pass a source and user to new (https://github.com/forem/forem/blob/main/app/services/markdown_processor/parser.rb#L12) and since preview didn't pass the link_attributes keyword arg, we'll use a default empty hash. @content
, @source,
and @user
are set (I've cd'd into the existing parser), so we can call finalize at will.
Here's the method we're looking at (the result is rendered html, and has the unwanted content we're investigating):
So we encounter a few new co-conspirators in our crime story, Redcarpet::Markdown and Redcarpet::Render::HTMLRouge, Redcarpet comes from a gem, the Rouge code syntax highlighting gem has a plugin module for Redcarpet that's included in the custom class (Forem's code, linked) as well as noticing we have a config constant defined for Redcarpet options. Like the preview code, we see a few "pre-wash" methods called to sanitize or transform input to the state we need before rendering, and the main-even here is markdown.render(escaped_content)
. We'll be passing the renderer to the Markdown instance, presumable it will use the renderer to render, and will concern itself with parsing rather than presentation. Let's just assume those two always travel together, and we can see we set hard wrap true, filter html false, and no link attributes in the renderer. The renderer appears to mainly be concerned with rules for showing links, formatting headings so they have linkable anchors, and formatting code blocks to permit highlighting using a capitalized hint language like "C" or "Ruby" rather than "c" or "ruby". It includes Rouge and inherits from Redcarpet::Render::HTML - which is where initialization is defined (we call new and pass it args, but don't define initialize, so we're only extending behavior). It's almost correct to call this a plain HTML object (we could swap out the renderer to experiment, but that's a bit further off course than I'm going here).
Since the input @content
string is so short, I'll skip over the catch_xss_attempts
, convert_code_tags_to_triple_backticks
and escape_liquid_tags_in_codeblock
methods, which leave the content unchanged. It's safe, for this limited situation, to assume this holds, and jump into the render method.
This means html = markdown.render(escaped_content)
is the thing to look at.
Now is about the time I do a little searching, and I come up with this great article from Vaidehi on how to use Redcarpet http://vaidehijoshi.github.io/blog/2015/08/11/rolling-out-the-redcarpet-for-rendering-markdown/ - which almost looks like it was a dry run for writing this code. There's also a "cheat sheet" https://www.lukeko.com/11/using-rouge-with-redcarpet here - most of this is good background for what we're looking at and whether we're doing it right, but ultimately doesn't help get closer to our mystery - which is "why do we malform comment embeds"?
Markdown#render
It's about to get a little hairy.
First pass (markdown render escaped content) just does link->link expansion - given an unadorned link, it generates an A tag with the link target as the content)
sanitize_rendered_markdown
does nothing important here, since there's nothing to sanitize, so build up the liquid template parser, and render it (basically, change the embed to the rendered comment template in the views/ directory)
We see the console output that the comment, it's author/user were loaded from the db, and two templates were rendered (the date partial is embedded in the comment partial).
This stage's output looks right - after substituting the embed we get a liquid comment div, more or less what we expect from the liquid parser. At this point, nothing has been changed that we didn't want.
This is the input to the second pass of render:
So we're at the last line of our code before we change what we think we want into what we don't want...
Redcarpet::Markdown#render
The Markdown class is defined in lib/redcarpet.rb - as stub to add an attr_reader for the renderer. Most of the functionality is defined in the extension
So we're at a limit of what ruby introspection can give us. But, where does render get defined? While we ponder that we can try to build a smaller test case. Since the problem behavior is "we get a code block when we didn't want one" or "we left the html processing too early", we can try to make a string that should give all html, but instead gives a mix of code and html.
I'm of course cheating a little here, I know what the end result will be so I can isolate this a little better. In markdown, lines that start with four spaces indentation (or more) are assumed to be code snippets (this is in distinction to the other convention of fenced code blocks using ``` or ~~~ characters to start code blocks).
So we're looking for a situation where what might be html gets interpreted instead as a code block (and sanitized for display).
Redcarpet internals
So, where is the class defined? Let's go to the source in the rc_markdown.c file, we see VALUE (ruby object's C represenation, everything C passes back or receives from ruby will be of type VALUE).
The convention here is _
is a :: scope resolution, a c prefix marks a class, an m prefix marks a module... The markdown extensions (in our Constants::Redcarpet::CONFIG hash) are a map of symbol keys to booleans, the code loads those in rb_redcarpet_md_flags
the rb_redcarpet_md__new
method returns a VALUE (an instance of Redcarpet::Markdown, or a subclass?), and wraps the underlying Sundown library that redcarpet builds on, you see names like sd_markdown
which grew out of the original library (same author). rc_markdown.c
has rb_redcarpet_md_render
which is the ruby facing Redcarpet::Markdown#render
method I was looking for.
This looks like normal glue code (hint, this is not the interesting part, it's the bridge between ruby and the C guts where the parsing actually happens), there's some type coercion, precondition checks (input text was a T_STRING
value), pulls the @renderer
instance variable into the rb_rndr
variable, sets up the output buffer (where are we sending the data, in our case a string like object to be returned later), then, following the comment "render the magic" we call sd_markdown_render
(the actual work starts here), and then we'll build a return value from the output buffer by encoding as a ruby string type, release the allocated output buffer, check if the renderer responds to "postprocess", pass the pre-return string text
to rb_render.postprocess(text)
(the rb_funcall
line) and finally answer text
back to the caller (in ruby). If you've written C extensions in Ruby, or Python, very little should be surprising in this file, and sd_markdown_render
is our real entry point.
sd_markdown_render
This was lifted from the prior "sundown" markdown library.
The important call (most of this is scanning the document for special cases, references, headers and footnotes) is parse_block(ob, md, text->data, text->size)
which has the dispatch table to determine what to do with the input. The "return" value from this void function is parsed markdown in the ob buffer.
parse_block
This is the dispatch table, so it importantly includes the tests to determine which case we're in
The function acts as a large loop, handling the next processable unit of text, until the input has been fully consumed. Each of the dispatched functions will advance the offset pointer beg
- which marks the beginning of the unprocessed text.
Per pass, in order, one of these will be performed:
is_atxheader
- true when the line starts with some number (1-6) of#
characters - it recognizes from h1 to h6, followed by a space (so seven hash marks in a row should not be marked as<h6>#</h6>
, but would fall through to the next case.Line starts with a '<' character, an html rendering callback exists, conditionally
parse_html
, treating as success if an html block was parsed.is_empty
- if the line is empty, skip it.is_hrule
- may consist of up to three initial spaces, then either * or - or _ characters, all the way to the end of the line, at least three (but four or more are recognized).attempt to parse a fenced code block in
parse_fencecode
like parse_htmlblock, treating as success if one was found and handledif
prefix_quote
, callparse_blockquote
- prefix quote checks up to three space characters, followed by a '>' character, followed by a space.if
prefix_code
,parse_blockcode
prefix code checks if there are 4 spaces (or more) at the beginning of the line. This expansion is controlled by setting theMKDEXT_DISABLE_INDENTED_CODE
flag in the options.if
prefix_uli
, thenparse_list
(is this an unorderd list item (ul)?)if
prefix_oli
, then parse_list, settingMKD_LIST_ORDERED
option (was this an ordered list item?)otherwise, this is just plain text, call
parse_paragraph
So, while a careful reading might allow us to guess what's happening, long before I got this far I had changed to using GDB. https://gist.github.com/djuber/ef6123d3cf7d1527b5ba92bb6fb2f428 are my original notes while I hunted this down from the outside (just find names that looked like that had to do with html or code blocks, set breakpoints, and print the struct members to see what was happening). Informatively, I did see the behavior described in parse block by setting a watch on the output buffer size (if the output buffer changed, I know we'd just parsed some chunk).
Reasoning about the rules
Having described the parse_block loop, we can discuss the meaning (to us, for this use case) of the rules.
if we want to embed html, it must begin on the first character of the line. Unlike prefix quote or a few other recognizers, we're going to assume it's html only when the first character on the line is a
<
open angle bracket.it seems like we also need to be aware of when html blocks end, as well as when the parser observes they begin.
the behavior of the passes for the input (after the liquid tag has been parsed and replaced with the preview content) is something like "recognize the p tag at the beginning of the first line of
parsed_liquid.render
, including the html comment, scanning forward until we find the closing tag. Unfortuntely, the closing tag is<
/p>
which happens in the comment. The next line after the/p
starts with four spaces indentation, so we start a code tag, and the line after that is interpreted as a paragraph (and sanitized for html).
I'll ignore the consequences of generating nested paragraph tags (you shouldn't mix "flow" and "content" tags, a p tag puts you in phrasing content mode) https://html.spec.whatwg.org/#phrasing-content-2 The practical consequences are we probably don't want to expand *indented* html in the liquid tags, before the next stage.
Solution?
I was tracking down where the BEGIN and END template marker comments get added.
https://github.com/rails/rails/blob/v6.1.4.4/actionview/lib/action_view/template/handlers/erb.rb#L53-L56 this conditional in action view's template handler for erb says "if annotate rendered view with filenames and this is html, then add html comment". That's exactly what's creating the <p> tag in the markdown -- the liquid tag processor adds this as the first line, the markdown parser prepends a p tag to the comment since it's a special case, the closing p tag in the comment body causes the parser to exit html block mode and go back to the decision process in parse_block() which finds 4 characters indentation
To test, setting the variable false, and calling Liquid::Template.parse gives no comment, and MarkdownProcessor::Parser#finalize doesn't insert a code block or render the comment as text.
Coming full circle, we enable this option locally in development and test, but it's off by default, and not enabled in production, which is why Arit could not replicate this in canary or other production environments.
Assuming we like the template comments (I find them useful) - can we toggle it off in the MarkdownProcessor? Something like this to execute the rendering in an annotation free environment?
Last updated
Was this helpful?