Adding Macros to my Site Generation System

Reducing the verbosity involved in authoring HTML.

Maria Nicolae

As discussed in my previous post about this website's site generation system, I author the website's content as handwritten HTML, and the site generator handles templating/boilerplate. I do this, instead of using something like Markdown, first and foremost to have full control rather than being locked into any particular set of abstractions. Additionally, this makes CSS styling straightforward, because I know exactly which HTML elements are being used for what, rather than having to figure out which ones the text compiler generates.

However, there are some cases in which HTML is quite cumbersome. The first is its mechanism for escaping special characters. In Markdown, escaping a character is as simple as preceding it with \. In HTML, by contrast, you need to replace these characters with lengthy codes, like &lt; for <. This was especially a problem in my previous post, where I included HTML code snippets, which have a lot of such characters to escape. Second, HTML's native means of representing maths notation, MathML, is extremely verbose. For example, $a^2 + b^2 = c^2$ is represented by

<math>
    <mrow>
        <msup>
            <mi>a</mi>
            <mn>2</mn>
        </msup>
        <mo>+</mo>
        <msup>
            <mi>b</mi>
            <mn>2</mn>
        </msup>
        <mo>=</mo>
        <msup>
            <mi>c</mi>
            <mn>2</mn>
        </msup>
    </mrow>
</math>

I'm planning on authoring some maths-heavy pages for this website in the future, and having to write it like this simply won't do; I need a better approach.

Concept

The idea I had was to add a text processing step to the site generation system that would take codes in the authored text and convert them into the correct HTML; essentially, a macro system. At first, I thought about using custom tags, e.g. <latex>a^2 + b^2 = c^2</latex> for the MathML above, but this is problematic if I want to, for example, insert literal text that looks like an HTML tag. I decided instead that it would be best for the macro codes to look as different to the surrounding HTML code as possible.

What I settled on, then, was to have macro codes that are all prefixed by \. First of all, there is the escape sequence \\ for a literal \, as well as some convenience escape sequences \&, \<, and \> for the HTML special characters &, <, and > respectively. Second, there are the commands \verb (also available as \verbatim), \math, and \mathblock, each described below.

When implementing this in the site generation system, I inserted it as the last step before HTML output, after all of the templating and programmatic content generation, so that I could use these macros in template/boilerplate HTML.

Implementation

(The full implementation of this macro system, in a Python script, is available here.)

The macro processor is implemented as a state machine that makes a single pass through the input text. In its initial state, it echoes its input until it encounters a \. At this point, if the following character is one of \, &, <, or >, it outputs that character, HTML-escaped where necessary, and resumes normal operation. Otherwise, it has encountered a macro command, so it reads the command name from the input and calls the handler for that macro, resuming normal operation when the handler returns. The code for this is

import html
import re
...
_macro_handlers = {'verbatim': _verbatim,
                   'verb': _verbatim,
                   'math': _math,
                   'mathblock': _mathblock}

def process_macros(text):
    output = ''
    while len(text) > 0:

        # scan to backslash
        i = text.find('\\')
        if i == -1:
            return output + text
        
        text_before = text[:i]
        text_after = text[i+1:]
        output += text_before
        
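        # handle escaped characters (the emptiness check guards against
        # an IndexError when the input ends with a lone backslash)
        if text_after and text_after[0] in '\\<&>':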
            output += html.escape(text_after[0])
            text = text_after[1:]
            continue

        # get macro identifier
        match = re.match(r'\w+', text_after)
        if match is None:
            raise ValueError('Invalid macro')

        macro_name = match.group(0)
        if macro_name not in _macro_handlers:
            raise ValueError(f'Unrecognised macro `{macro_name}`')
        i = match.end()

        # run macro handler function
        handler = _macro_handlers[macro_name]
        result, remaining_text = handler(text_after[i:])
        
        # output macro text and set up for next loop
        output += result
        text = remaining_text

    return output

Here, I build up the string output, while the string text is iteratively sliced down to only the text that has not been processed yet. Also, instead of iterating character by character, I use str.find and regex to skip ahead to where state changes happen. This moves the scanning into the built-in string and regex routines, minimising the number of loop iterations executed by the Python interpreter.
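To illustrate the skip-ahead idea in isolation, here is a minimal sketch reduced to the four escape sequences only. It is not the code from the site generator: the function name expand_escapes is hypothetical, it tracks a position index instead of slicing, and, unlike process_macros, it passes unrecognised backslashes through rather than raising an error.

```python
import html

def expand_escapes(text):
    # Hypothetical helper: expands only the four escape sequences and
    # leaves any other backslash untouched.
    pieces = []
    pos = 0
    while True:
        i = text.find('\\', pos)      # jump straight to the next backslash
        if i == -1:
            pieces.append(text[pos:])
            break
        pieces.append(text[pos:i])
        nxt = text[i + 1:i + 2]       # '' if the backslash is the last character
        if nxt and nxt in '\\&<>':
            pieces.append(html.escape(nxt))
            pos = i + 2
        else:
            pieces.append('\\')       # not an escape sequence: keep the backslash
            pos = i + 1
    return ''.join(pieces)

print(expand_escapes(r'a \< b \& c'))  # a &lt; b &amp; c
```

Collecting the pieces in a list and joining them once at the end is a design choice of this sketch; the loop body still only runs once per backslash, not once per character.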

Macro Handlers

When a macro handler is called, it receives the entire remaining input text as an argument, beginning right after the command name. The handler function then produces an output (result), and the remaining unprocessed input text. That is to say, determining where the macro ends and normal text resumes is the responsibility of the handler. This gives the system more flexibility for how macro commands are delimited, but it also means that macros cannot simply be nested, because there is no actual parsing going on here.

\verb

This macro is modelled after the equivalent command in LaTeX, used to escape a string of text. For example, \verb|<html>| becomes &lt;html&gt;. The first character after the macro name, in this case |, becomes the delimiter of the macro's argument; the text between it and the next instance of | is the text to be escaped. This delimiter can be any character at all, so as to avoid conflicts between it and the argument text. The handler function for this macro is

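def _verbatim(text):
    # the first character after the macro name is the delimiter
    if not text:
        raise ValueError('Missing delimiter after verbatim macro')
    delimiter = text[0]
    text = text[1:]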

    # find end of verbatim text
    end = text.find(delimiter)
    if end == -1:
        verbatim_text = text
        remaining_text = ''
    else:
        verbatim_text = text[:end]
        remaining_text = text[end+1:]

    # escape and return verbatim text
    escaped_text = html.escape(verbatim_text)
    return escaped_text, remaining_text

Once again, I use str.find to find the delimiter, and the text enclosed between it, without having to iterate over individual characters.
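For comparison, the same delimiter search can be written with str.partition, which returns the text before the delimiter, the delimiter itself, and the text after it. The following is a sketch of that alternative, not the code used on this site; the function name verbatim_with_partition is made up for illustration.

```python
import html

def verbatim_with_partition(text):
    # Hypothetical variant of the \verb handler, written for illustration.
    if not text:
        raise ValueError('Missing delimiter')
    delimiter, rest = text[0], text[1:]
    # If the delimiter never reappears, partition returns (rest, '', ''),
    # so an unterminated argument runs to the end of the input.
    verbatim_text, _, remaining_text = rest.partition(delimiter)
    return html.escape(verbatim_text), remaining_text

print(verbatim_with_partition('|<html>| becomes'))
# ('&lt;html&gt;', ' becomes')
```

Choosing a different delimiter, as in \verb!a|b!, lets the argument itself contain a | character.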

\math and \mathblock

These macros take as an argument LaTeX maths code inside braces, e.g. \math{x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}}, and translate it into MathML code for inline and block display respectively. Any whitespace between the command and the opening brace is allowed. Because these macros are so similar to each other, their handler functions

def _math(text):
    return _math_core(text, 'inline')

def _mathblock(text):
    return _math_core(text, 'block')

are thin wrappers around a shared function

import latex2mathml.converter
...
def _math_core(text, display):
    latex, text_after = _get_text_in_braces(text)
    mathml = latex2mathml.converter.convert(latex, display=display)

    latex_escaped = html.escape(latex, quote=True)
    i = len('<math ')
    mathml = mathml[:i] + f'alttext="{latex_escaped}" ' + mathml[i:]
    return mathml, text_after

Here, I use the library latex2mathml to convert the LaTeX code in the macro argument into MathML code. Unfortunately, this library does not currently provide any mechanism for setting the alttext attribute, so I use a string manipulation hack to achieve this. I did, however, submit a pull request to add this feature; if and when that is merged, this function can be shortened to

def _math_core(text, display):
    latex, text_after = _get_text_in_braces(text)
    mathml = latex2mathml.converter.convert(latex, display=display,
                                            alt_latex=True)
    return mathml, text_after
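To show the effect of the string manipulation in the first version of _math_core without depending on the library, here is a small sketch that applies the same insertion to a hand-written MathML string. The helper name add_alttext and the sample string are assumptions made for illustration; real converter output will carry different attributes.

```python
import html

def add_alttext(mathml, latex):
    # Hypothetical helper: assumes the MathML string begins with '<math '
    # followed by at least one other attribute.
    prefix = '<math '
    if not mathml.startswith(prefix):
        raise ValueError('Expected MathML starting with "<math "')
    # quote=True also escapes " and ', which matters inside an attribute value
    latex_escaped = html.escape(latex, quote=True)
    return f'{prefix}alttext="{latex_escaped}" {mathml[len(prefix):]}'

# Hand-written stand-in for converter output (not real library output).
sample = '<math display="inline"><mi>a</mi><mo>&lt;</mo><mi>b</mi></math>'
print(add_alttext(sample, 'a < b'))
```

The explicit startswith check is an addition in this sketch: it turns a silent mis-insertion into an error if the output ever stops beginning with '<math '.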

The logic for finding the LaTeX code between the braces is

def _get_text_in_braces(text):
    # get text after the opening brace
    text = text.lstrip()
    if not text.startswith('{'):
        raise ValueError('Macro must begin with {')
    text = text[1:]

    # find balanced closing brace
    depth = 1
    for match in re.finditer('[{}]', text):
        c = match.group(0)
        depth += 1 if c=='{' else -1
        if depth == 0:
            i = match.start()
            text_in_braces = text[:i]
            text_after = text[i+1:]
            break
    else:
        # balanced closing brace not found, return entire text
        text_in_braces = text
        text_after = ''

    return text_in_braces, text_after

Here, the regex [{}] is used to find both opening and closing braces. This function looks for the closing brace ending the LaTeX code by iterating over all braces and keeping track of the nesting depth.
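The depth tracking can be made visible with a small tracing sketch. The function brace_depths below is hypothetical and exists only to print the intermediate state; it assumes, like the loop above, that its input starts just after the opening brace, so the depth begins at 1.

```python
import re

def brace_depths(text):
    # Hypothetical tracing helper: records (index, brace, depth after it)
    # for each brace, stopping at the one that balances the opening brace.
    depth = 1
    trace = []
    for match in re.finditer(r'[{}]', text):
        brace = match.group(0)
        depth += 1 if brace == '{' else -1
        trace.append((match.start(), brace, depth))
        if depth == 0:
            break
    return trace

for index, brace, depth in brace_depths(r'\frac{a}{b}} and more'):
    print(index, brace, depth)
```

For this input, the depth alternates between 2 and 1 inside \frac{a}{b} and reaches 0 at index 11, which is the brace that closes the macro argument; everything after it is left for normal processing.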

Outlook

This macro system that I have built solves two significant problems for me in authoring this website: verbatim text like code blocks, and maths notation. The general approach is applicable to many simple text transformation tasks of this sort. While it is limited by not being a true parser, this does not practically restrict the use cases that I am interested in.

In the future, as new needs arise, I am likely to expand this system with new macros.