Adding Macros to my Site Generation System
Reducing the verbosity involved in authoring HTML.
Maria Nicolae,
As discussed in my previous post about this website's site generation system, I author the website's content as handwritten HTML, and the site generator handles templating/boilerplate. I do this, instead of using something like Markdown, first and foremost to have full control rather than being locked into any particular set of abstractions. Additionally, this makes CSS styling straightforward, because I know exactly which HTML elements are being used for what, rather than having to figure out which ones the text compiler generates.
However, there are some cases in which HTML is quite cumbersome. First of all is its mechanism for escaping special characters. In Markdown, escaping a character is as simple as preceding it with \. However, in HTML, you need to replace these characters with lengthy codes, like &lt; for <. This was especially a problem in my previous post, where I included HTML code snippets, which have a lot of such characters to escape. Second, HTML's native means of representing maths notation, MathML, is extremely verbose. For example, a² + b² = c² is represented by
<math>
  <mrow>
    <msup>
      <mi>a</mi>
      <mn>2</mn>
    </msup>
    <mo>+</mo>
    <msup>
      <mi>b</mi>
      <mn>2</mn>
    </msup>
    <mo>=</mo>
    <msup>
      <mi>c</mi>
      <mn>2</mn>
    </msup>
  </mrow>
</math>
I'm planning on authoring some maths-heavy pages for this website in the future, and having to write it like this simply won't do; I need a better approach.
Concept
The idea I had was to add a text processing step to the site generation system that would take codes in the authored text and convert them into the correct HTML; essentially, a macro system. At first, I thought about using custom tags, e.g. <latex>a^2 + b^2 = c^2</latex> for the MathML above, but this is problematic if I want to, for example, insert literal text that looks like an HTML tag. I decided instead that it would be best for the macro codes to look as different from the surrounding HTML code as possible.
What I settled on, then, was to have macro codes that are all prefixed by \. First of all, there is the escape sequence \\ for a literal \, as well as some convenience escape sequences \&, \<, and \> for the HTML special characters &, <, and > respectively. Second, there are the commands
\verb, which escapes an extended string of literal (verbatim) text,
\math, which converts LaTeX maths code into inline-display MathML, and
\mathblock, which generates block-display MathML.
When implementing this in the site generation system, I inserted it as the last step before HTML output, after all of the templating and programmatic content generation, so that I could use these macros in template/boilerplate HTML.
Implementation
(The full implementation of this macro system, in a Python script, is available here.)
The macro processor is implemented as a state machine that makes a single pass through the input text. In its initial state, it echoes its input until it encounters a \. At this point, if the following character is one of \, &, <, or >, it outputs that character (HTML-escaped where necessary) and resumes normal operation. Otherwise, it has encountered a macro command, and so it reads the command name from the input, and then calls a handler for that macro, resuming normal operation when the handler returns. The code for this is
import html
import re
...
_macro_handlers = {'verbatim': _verbatim,
                   'verb': _verbatim,
                   'math': _math,
                   'mathblock': _mathblock}

def process_macros(text):
    output = ''
    while len(text) > 0:
        # scan to backslash
        i = text.find('\\')
        if i == -1:
            return output + text
        text_before = text[:i]
        text_after = text[i+1:]
        output += text_before
        # handle escaped characters
        if text_after[0] in '\\<&>':
            output += html.escape(text_after[0])
            text = text_after[1:]
            continue
        # get macro identifier
        match = re.match('\\w+', text_after)
        if match is None:
            raise ValueError('Invalid macro')
        macro_name = match.group(0)
        if macro_name not in _macro_handlers:
            raise ValueError(f'Unrecognised macro `{macro_name}`')
        i = match.end()
        # run macro handler function
        handler = _macro_handlers[macro_name]
        result, remaining_text = handler(text_after[i:])
        # output macro text and set up for next loop
        output += result
        text = remaining_text
    return output
Here, I build up the string output, while the string text is iteratively sliced down to only the text that has not been processed yet. Also, instead of iterating character by character, I use str.find and regex to skip ahead to where state changes happen. This optimises the code by minimising overhead from the Python interpreter.
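To make the scanning loop easier to follow in isolation, here is a minimal sketch of a cut-down scanner that handles only the four escape sequences and rejects everything else. The name process_escapes is made up for this illustration and is not part of the actual script; unlike the full function, it also raises a ValueError if the input ends in a lone backslash.

```python
import html


def process_escapes(text):
    """Cut-down scanner (hypothetical): handles only the four escape sequences."""
    output = ''
    while text:
        # jump straight to the next backslash instead of stepping per character
        i = text.find('\\')
        if i == -1:
            return output + text
        output += text[:i]
        # character after the backslash; empty if the backslash ends the input
        escaped = text[i + 1:i + 2]
        if escaped == '' or escaped not in '\\<&>':
            raise ValueError('Only escape sequences are supported in this sketch')
        output += html.escape(escaped)
        text = text[i + 2:]
    return output


print(process_escapes(r'if a \< b \&\& b \> c: print("\\n")'))
# prints: if a &lt; b &amp;&amp; b &gt; c: print("\n")
```

Only the escaped characters pass through html.escape; everything else is copied as-is, which is why the double quotes in the example come out unchanged.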
Macro Handlers
When a macro handler is called, it receives the entire remaining input text as an argument, beginning right after the command name. The handler function then produces an output (result), and the remaining unprocessed input text. That is to say, determining where the macro ends and normal text resumes is the responsibility of the handler. This gives the system more flexibility for how macro commands are delimited, but it also means that macros cannot simply be nested, because there is no actual parsing going on here.
\verb
This macro is modelled after the equivalent command in LaTeX, used to escape a string of text. For example, \verb|<html>| becomes &lt;html&gt;. The first character after the macro name, in this case |, becomes the delimiter of the macro's argument; the text between it and the next instance of | is the text to be escaped. This delimiter can be any character at all, so as to avoid conflicts between it and the argument text. The handler function for this macro is
def _verbatim(text):
    delimiter = text[0]
    text = text[1:]
    # find end of verbatim text
    end = text.find(delimiter)
    if end == -1:
        verbatim_text = text
        remaining_text = ''
    else:
        verbatim_text = text[:end]
        remaining_text = text[end+1:]
    # escape and return verbatim text
    escaped_text = html.escape(verbatim_text)
    return escaped_text, remaining_text
Once again, I use str.find to find the delimiter, and the text enclosed between it, without having to iterate over individual characters.
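As a small usage sketch, the snippet below shows how the choice of delimiter plays out. The function verbatim here is a compact copy of the handler above, renamed so that the example runs on its own; the input strings are invented for illustration.

```python
import html


def verbatim(text):
    # compact copy of the handler above so this example is self-contained
    delimiter, text = text[0], text[1:]
    end = text.find(delimiter)
    if end == -1:
        # no closing delimiter: everything left is treated as verbatim text
        return html.escape(text), ''
    return html.escape(text[:end]), text[end + 1:]


print(verbatim('|<html>| is the root element'))
# prints: ('&lt;html&gt;', ' is the root element')

# a different delimiter lets the argument itself contain a pipe
print(verbatim('!a | b! uses a pipe'))
# prints: ('a | b', ' uses a pipe')
```

One consequence of the fallback branch is that a forgotten closing delimiter silently swallows the rest of the page as verbatim text rather than raising an error.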
\math and \mathblock
These macros take as an argument LaTeX maths code inside braces, e.g. \math{x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}}, and translate it into MathML code for inline and block display respectively. Any whitespace between the command and the opening brace is allowed. Because these macros are so similar to each other, their handler functions
def _math(text):
    return _math_core(text, 'inline')

def _mathblock(text):
    return _math_core(text, 'block')
are thin wrappers around a shared function
import latex2mathml.converter
...
def _math_core(text, display):
    latex, text_after = _get_text_in_braces(text)
    mathml = latex2mathml.converter.convert(latex, display=display)
    latex_escaped = html.escape(latex, quote=True)
    i = len('<math ')
    mathml = mathml[:i] + f'alttext="{latex_escaped}" ' + mathml[i:]
    return mathml, text_after
Here, I use the library latex2mathml to convert the LaTeX code in the macro argument into MathML code. Unfortunately, this library does not currently provide any mechanism for setting the alttext attribute, so I use a string manipulation hack to achieve this. I did, however, submit a pull request to add this feature; if and when that is merged, this function will be able to be shortened to
def _math_core(text, display):
    latex, text_after = _get_text_in_braces(text)
    mathml = latex2mathml.converter.convert(latex, display=display,
                                            alt_latex=True)
    return mathml, text_after
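To show what the string splice in the current version does in isolation, here is a minimal sketch that applies it to a hand-written MathML string. The helper name add_alttext and the sample string are assumptions made for this illustration; the sample is not actual latex2mathml output, and the sketch adds a startswith check that the original hack does not have.

```python
import html


def add_alttext(mathml, latex):
    """Hypothetical helper: splice an alttext attribute in after '<math '."""
    prefix = '<math '
    if not mathml.startswith(prefix):
        raise ValueError('Expected MathML starting with "<math "')
    # quote=True also escapes quotation marks, so the attribute value stays valid
    latex_escaped = html.escape(latex, quote=True)
    return prefix + f'alttext="{latex_escaped}" ' + mathml[len(prefix):]


# hand-written sample, standing in for the converter's output
sample = '<math display="inline"><mi>x</mi><mo>&lt;</mo><mn>1</mn></math>'
print(add_alttext(sample, 'x < 1'))
# prints: <math alttext="x &lt; 1" display="inline"><mi>x</mi><mo>&lt;</mo><mn>1</mn></math>
```

The splice depends on the converted string beginning with an opening math tag that already carries at least one attribute, which is what makes it a hack rather than a robust solution.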
The logic for finding the LaTeX code between the braces is
def _get_text_in_braces(text):
    # get text after the opening brace
    text = text.lstrip()
    if text[0] != '{':
        raise ValueError('Macro must begin with {')
    text = text[1:]
    # find balanced closing brace
    depth = 1
    for match in re.finditer('[{}]', text):
        c = match.group(0)
        depth += 1 if c=='{' else -1
        if depth == 0:
            i = match.start()
            text_in_braces = text[:i]
            text_after = text[i+1:]
            break
    else:
        # balanced closing brace not found, return entire text
        text_in_braces = text
        text_after = ''
    return text_in_braces, text_after
Here, the regex [{}] is used to find both opening and closing braces. This function looks for the closing brace ending the LaTeX code by iterating over all braces and keeping track of the nesting depth.
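The depth-counting idea can be tried out on a nested argument with the sketch below. The function text_in_braces is a condensed, renamed version of the function above, written so that the example is self-contained; the input string is made up.

```python
import re


def text_in_braces(text):
    # condensed version of the function above, same depth-counting idea
    text = text.lstrip()
    if not text.startswith('{'):
        raise ValueError('Macro must begin with {')
    text = text[1:]
    depth = 1
    for match in re.finditer('[{}]', text):
        # every opening brace nests one level deeper, every closing brace one level out
        depth += 1 if match.group(0) == '{' else -1
        if depth == 0:
            return text[:match.start()], text[match.end():]
    # balanced closing brace not found: treat everything as the argument
    return text, ''


inside, after = text_in_braces(r' {\frac{a}{b} + c} and so on')
print(repr(inside), repr(after))
# prints: '\\frac{a}{b} + c' ' and so on'
```

Note that this approach counts every brace character, so LaTeX's escaped braces \{ and \} affect the depth like any other brace; an argument containing an unmatched escaped brace would be split in the wrong place.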
Outlook
The macro system that I have built solves two significant problems for me in authoring this website: verbatim text like code blocks, and maths notation. The general approach is applicable to many simple text transformation tasks of this sort. While it is limited by not being a true parser, this does not practically restrict the use-cases that I am interested in.
In the future, as new needs arise, I am likely to expand this system with new macros.