I’ve been having a lot of fun playing around with Haskell’s monadic parser libraries. As part of learning, I have:
- Completed a few of 2023’s advent of code challenges, where in parsing inputs
attoparsec
proved super handy. - Made a partial PNG parser (mostly out of spite), using
megaparsec
. Will hopefully finish this project and talk about it in a separate post. - Started building an arithmetic expression parser (also with
megaparsec
).
I thought the arithmetic expression parser turned out to be a really fun exercise, and wanted to share the process.
I’m going to assume a basic
familiarity with the Parsec
family of parsers, but I’m hoping that the ideas
are clear enough that anyone can follow along.
modelling arithmetic expressions
It’s worth spelling out explicitly what the end goal is. Given a arithmetic expression in a string, e.g., “1+2*(3-4)”, we would like to build up some unambiguous representation of the expression in a data structure. Before we can commence, then, we have to decide on a suitable data structure.
Let’s begin with our atoms: numbers. These will form the building blocks for our
data structure. Because we want to support arbitrary precision float
arithmetic, we will represent them by Rational
(which we do incur a runtime
computational penalty for):
data Expr = Number Rational
The next simplest thing to tackle is negation. Notice that in general we want to be able to negate any expression, which leads naturally to a recursive definition:
data Expr = Number Rational | Negate Expr
Where negation acts on a single expression, addition, subtraction, multiplication, and division all act on two. In this sense, we classify the them as binary operators, as opposed to negation as a unary operator:
data BinOp = Add | Subtract | Multiply | Divide
data Expr = Number Rational | Negate Expr | BinOp BinOp Expr Expr
In programming language theory, this class of data structure is called an abstract syntax tree (AST). It is the canonical representation of expressions (also called terms) in most contexts, not just arithmetic. On a high level, we represent the expression “1*2*(3-4)” in a structure like:
┌───┐
┌─────┤ * ├─────┐
│ └───┘ │
┌─┴─┐ ┌─┴─┐
┌─┤ * ├─┐ ┌─┤ - ├─┐
│ └───┘ │ │ └───┘ │
┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
│ 1 │ │ 2 │ │ 3 │ │ 4 │
└───┘ └───┘ └───┘ └───┘
parsing
numbers
As we did with modelling the data, it’ll be easiest to approach parsing in the order of numbers, negation, and finally binary operators.
We want to parse both doubles and integers, and megaparsec
’s
Text.Parsec.Char.Lexer
provides nice facilities for accomplishing that:
number :: Parser Expr
= Number . toRational <$> (try float <|> decimal) number
Notice that we try
parsing a float because we want to take back any input it
consumes before attempting to parse a decimal.
negation
As we defined it before, negation is recursive. Specifically, in a string, negation can act on:
- A number, e.g.
"-1.5"
; - Another negative, e.g.
"--3"
; - A term surrounded by brackets, e.g.
"-(3*4)"
Note that a pair of brackets can act on exactly the same items in a context with only numbers, minus signs, and brackets.
Let’s call this collection of terms negatable:
negatable :: Parser Expr
= choice [negative, bracketed, number]
negatable
negative :: Parser Expr
= Negate <$> (char '-' *> negatable)
negative
bracketed :: Parser Expr
= between (char '(') (char ')') (try add <|> negatable) bracketed
At this stage, it’s a good idea to check that things are behaving the way we expect:
> parseTest negative (T.pack "-(-2)")
ghciNegate (Negate (Number (2 % 1)))
> parseTest negative (T.pack "---2)")
ghciNegate (Negate (Negate (Number (2 % 1))))
Looks good!
addition
Let’s deal with addition before moving on to multiplication. You might reasonably define addition in terms of two expressions delimited by a ‘+’. At this point, an expression can be an addition or a negatable, so we’ll define bracketed expressions in terms of that too:
add :: Parser Expr
=
add let side = add <|> negatable
= (char '+' $> Add)
op in do
<- side
lhs BinOp <$> op <*> pure lhs <*> side
bracketed :: Parser Expr
= between (char '(') (char ')') (try add <|> negatable) bracketed
That try
before the add
is critical, because we need to roll back any input
add
consumed before attempting to parse a negatable inside the brackets.
Notice, though, that the first thing add
does here is call itself. That’s
going to throw the whole thing into a loop! So in order to make progress, let’s
deal with chains of addition via the some
combinator, and then roll them into
a tree with a foldr
:
add :: Parser Expr
=
add let side = negatable
= char '+' $> Add
op in do
<- side
lhs <- some ((,) <$> op <*> side)
rhs let (op1, rhs1) = head rhs
= BinOp op1 lhs rhs1
start pure
$ foldr (\(newOp, newRhs) t -> BinOp newOp t newRhs) start (tail rhs)
Recall that some
parses 1 or more matches into a list, where many
can match
none and return an empty list. Calling head
on the result should therefore
never error.
Another sanity check:
> parseTest add (T.pack "1+2+3")
ghciBinOp Add (BinOp Add (Number (1 % 1)) (Number (2 % 1))) (Number (3 % 1))
> parseTest add (T.pack "1+(2+3)")
ghciBinOp Add (Number (1 % 1)) (BinOp Add (Number (2 % 1)) (Number (3 % 1)))
> parseTest add (T.pack "1+(2+-3)")
ghciBinOp Add (Number (1 % 1)) (BinOp Add (Number (2 % 1)) (Negate (Number (3 % 1))))
Now, since in regular arithmetic, addition and subtraction have the same
precedence, all we have to do in order to parse subtraction is modify the op
variable:
= Add <$ char '+' <|> Subtract <$ char '-' op
multiplication
As a binary operator, parsing multiplication and division is going to look very similar to addition. We can abstract the logic that will overlap into its own function:
binOp :: Parser BinOp -> Parser Expr -> Parser Expr
= do
binOp op side <- side
lhs <- some ((,) <$> op <*> side)
rhs let (op1, rhs1) = head rhs
= BinOp op1 lhs rhs1
start pure
$ foldr (\(newOp, newRhs) t -> BinOp newOp t newRhs) start (tail rhs)
The question, then, is what side
should be for each of addition/subtraction
and multiplication/division. Let’s see what happens if both are negatable
:
add :: Parser Expr
= binOp (Add <$ char '+' <|> Subtract <$ char '-') negatable
add
mult :: Parser Expr
= binOp (Multiply <$ char '*' <|> Divide <$ char '/') negatable mult
> parseTest mult (T.pack "1+2*3")
ghci1:2:
|
1 | 1+2*3
| ^
'+'
unexpected '*', '/', or digit
expecting > parseTest add (T.pack "1+2*3")
ghciBinOp Add (Number (1 % 1)) (Number (2 % 1))
Clearly, this is not what we want. We can solve this again by thinking about precedence: if an addition involves terms which multiply, those terms should be parsed first:
add :: Parser Expr
= binOp (Add <$ char '+' <|> Subtract <$ char '-') (try mult <|> negatable) add
The fact that mult
fails on the example above is actually desired behaviour,
in the sense that the overall expression is an addition, namely of
the terms “1” and “2*3”:
> parseTest add (T.pack "1+2*3")
ghciBinOp Add (Number (1 % 1)) (BinOp Multiply (Number (2 % 1)) (Number (3 % 1)))
> parseTest mult (T.pack "1+2*3")
ghci1:2:
|
1 | 1+2*3
| ^
'+'
unexpected '*', '/', or digit expecting
expressions
At this stage, let’s establish a parser for an entire expr
. We have three
high-level parsers to work with, namely add
, mult
, and negatable
. Which
order should we attempt parsing in?
The key insight is that we want to work upwards in terms of precedence. If we
began parsing of “1*2+3” by trying multiplication, we would successfully parse
1*2 and then fail to recognise the addition. Thus, we try parsing add
, then
mult
, then negatable
:
expr :: Parser Expr
= try add <|> try mult <|> negatable
expr
-- ensures we consume all input
full :: Parser Expr
= expr <* eof full
We also need to update bracketed
to allow arbitrary expressions:
bracketed :: Parser Expr
= between (char '(') (char ')') expr bracketed
Let’s do one final sanity check:
> parseTest full (T.pack "1*2+3*(4--5)")
ghciBinOp Add (BinOp Multiply (Number (1 % 1)) (Number (2 % 1))) (BinOp Multiply (Number (3 % 1)) (BinOp Subtract (Number (4 % 1)) (Negate (Number (5 % 1)))))
conclusions
Obviously, more testing should be performed before concluding that the parser is sound. Here’s all of the parsing logic together:
binOp :: Parser BinOp -> Parser Expr -> Parser Expr
= do
binOp op side <- side
lhs <- some ((,) <$> op <*> side)
rhs let (op1, rhs1) = head rhs
= BinOp op1 lhs rhs1
start pure
$ foldr (\(newOp, newRhs) t -> BinOp newOp t newRhs) start (tail rhs)
add :: Parser Expr
= binOp (Add <$ char '+' <|> Subtract <$ char '-') (try mult <|> negatable)
add
mult :: Parser Expr
= binOp (Multiply <$ char '*' <|> Divide <$ char '/') negatable
mult
negative :: Parser Expr
= Negate <$> (char '-' *> negatable)
negative
number :: Parser Expr
= Number . toRational <$> (try float <|> decimal)
number
negatable :: Parser Expr
= choice [negative, bracketed, number]
negatable
bracketed :: Parser Expr
= between (char '(') (char ')') expr
bracketed
expr :: Parser Expr
= try add <|> try mult <|> negatable
expr
full :: Parser Expr
= expr <* eof full
expressivity
It’s pretty nuts how concise the entire parser is; the whole thing is just 31 lines, a majority of which is whitespace or type declarations! Equivalent code in other languages often takes many more lines.
I do think the expressiveness comes at a cost, though. In particular, debugging becomes a very involved process when the functions are co-recursive.
Also worth noting is the liberal use of try
. I haven’t done any rigorous
analysis, but there is definitely a class of expressions which coerces the
algorithm to do a lot of backtracking. For instance, expressions which contain
only negative signs, brackets, and a digit will actually compute the complete
answer at least 3 times: once as the potential left hand side of an addition;
once as the potential left hand side of a multiplication; and finally accepted
as a negatable
.
evaluation
A side effect of our choice of representation is that evaluating parsed expressions is trivial:
eval :: Expr -> Rational
Number x) = x
eval (Negate x) = negate (eval x)
eval (BinOp op x y) = case op of
eval (Add -> eval x + eval y
Subtract -> eval x - eval y
Multiply -> eval x * eval y
Divide -> eval x / eval y
Now making a simple calculator is also super easy:
calculate :: (Fractional a) => String -> Either String a
=
calculate
left errorBundlePretty. fmap (fromRational . eval)
. runParser full "input"
. T.pack
Evaluation is the most immediate consequence of the data structure, but other transformations of the AST are definitely worth exploring, too!
disclaimer
I am not a programming language expert! This is definitely my first attempt at this kind of problem, and I have no formal background in the area. I was essentially stumbling my way through the exercise, but that made it all the more fun.
If I’ve missed something completely obvious, or otherwise come off as supremely ignorant, feel free to let me know. I’m interested enough in the area that I think it’s time to go off and read about how expressions are parsed in more general contexts.