building an arithmetic expression parser

Posted on December 16, 2023

I’ve been having a lot of fun playing around with Haskell’s monadic parser libraries. As part of learning, I have:

  • Completed a few of 2023’s advent of code challenges, where in parsing inputs attoparsec proved super handy.
  • Made a partial PNG parser (mostly out of spite), using megaparsec. Will hopefully finish this project and talk about it in a separate post.
  • Started building an arithmetic expression parser (also with megaparsec).

I thought the arithmetic expression parser turned out to be a really fun exercise, and wanted to share the process.

I’m going to assume a basic familiarity with the Parsec family of parsers, but I’m hoping that the ideas are clear enough that anyone can follow along.

modelling arithmetic expressions

It’s worth spelling out explicitly what the end goal is. Given a arithmetic expression in a string, e.g., “1+2*(3-4)”, we would like to build up some unambiguous representation of the expression in a data structure. Before we can commence, then, we have to decide on a suitable data structure.

Let’s begin with our atoms: numbers. These will form the building blocks for our data structure. Because we want to support arbitrary precision float arithmetic, we will represent them by Rational (which we do incur a runtime computational penalty for):

data Expr = Number Rational

The next simplest thing to tackle is negation. Notice that in general we want to be able to negate any expression, which leads naturally to a recursive definition:

data Expr = Number Rational | Negate Expr

Where negation acts on a single expression, addition, subtraction, multiplication, and division all act on two. In this sense, we classify the them as binary operators, as opposed to negation as a unary operator:

data BinOp = Add | Subtract | Multiply | Divide
data Expr = Number Rational | Negate Expr | BinOp BinOp Expr Expr

In programming language theory, this class of data structure is called an abstract syntax tree (AST). It is the canonical representation of expressions (also called terms) in most contexts, not just arithmetic. On a high level, we represent the expression “1*2*(3-4)” in a structure like:

            ┌───┐
      ┌─────┤ * ├─────┐
      │     └───┘     │
    ┌─┴─┐           ┌─┴─┐
  ┌─┤ * ├─┐       ┌─┤ - ├─┐
  │ └───┘ │       │ └───┘ │
┌─┴─┐   ┌─┴─┐   ┌─┴─┐   ┌─┴─┐
│ 1 │   │ 2 │   │ 3 │   │ 4 │
└───┘   └───┘   └───┘   └───┘

parsing

numbers

As we did with modelling the data, it’ll be easiest to approach parsing in the order of numbers, negation, and finally binary operators.

We want to parse both doubles and integers, and megaparsec’s Text.Parsec.Char.Lexer provides nice facilities for accomplishing that:

number :: Parser Expr
number = Number . toRational <$> (try float <|> decimal)

Notice that we try parsing a float because we want to take back any input it consumes before attempting to parse a decimal.

negation

As we defined it before, negation is recursive. Specifically, in a string, negation can act on:

  1. A number, e.g. "-1.5";
  2. Another negative, e.g. "--3";
  3. A term surrounded by brackets, e.g. "-(3*4)"

Note that a pair of brackets can act on exactly the same items in a context with only numbers, minus signs, and brackets.

Let’s call this collection of terms negatable:

negatable :: Parser Expr
negatable = choice [negative, bracketed, number]

negative :: Parser Expr
negative = Negate <$> (char '-' *> negatable)

bracketed :: Parser Expr
bracketed = between (char '(') (char ')') (try add <|> negatable)

At this stage, it’s a good idea to check that things are behaving the way we expect:

ghci> parseTest negative (T.pack "-(-2)")
Negate (Negate (Number (2 % 1)))
ghci> parseTest negative (T.pack "---2)")
Negate (Negate (Negate (Number (2 % 1))))

Looks good!

addition

Let’s deal with addition before moving on to multiplication. You might reasonably define addition in terms of two expressions delimited by a ‘+’. At this point, an expression can be an addition or a negatable, so we’ll define bracketed expressions in terms of that too:

add :: Parser Expr
add =
    let side = add <|> negatable
        op = (char '+' $> Add)
        in do
            lhs <- side
            BinOp <$> op <*> pure lhs <*> side


bracketed :: Parser Expr
bracketed = between (char '(') (char ')') (try add <|> negatable)

That try before the add is critical, because we need to roll back any input add consumed before attempting to parse a negatable inside the brackets.

Notice, though, that the first thing add does here is call itself. That’s going to throw the whole thing into a loop! So in order to make progress, let’s deal with chains of addition via the some combinator, and then roll them into a tree with a foldr:

add :: Parser Expr
add =
    let side = negatable
        op = char '+' $> Add
     in do
            lhs <- side
            rhs <- some ((,) <$> op <*> side)
            let (op1, rhs1) = head rhs
                start = BinOp op1 lhs rhs1
            pure
                $ foldr (\(newOp, newRhs) t -> BinOp newOp t newRhs) start (tail rhs)

Recall that some parses 1 or more matches into a list, where many can match none and return an empty list. Calling head on the result should therefore never error.

Another sanity check:

ghci> parseTest add (T.pack "1+2+3")
BinOp Add (BinOp Add (Number (1 % 1)) (Number (2 % 1))) (Number (3 % 1))
ghci> parseTest add (T.pack "1+(2+3)")
BinOp Add (Number (1 % 1)) (BinOp Add (Number (2 % 1)) (Number (3 % 1)))
ghci> parseTest add (T.pack "1+(2+-3)")
BinOp Add (Number (1 % 1)) (BinOp Add (Number (2 % 1)) (Negate (Number (3 % 1))))

Now, since in regular arithmetic, addition and subtraction have the same precedence, all we have to do in order to parse subtraction is modify the op variable:

op = Add <$ char '+' <|> Subtract <$ char '-'

multiplication

As a binary operator, parsing multiplication and division is going to look very similar to addition. We can abstract the logic that will overlap into its own function:

binOp :: Parser BinOp -> Parser Expr -> Parser Expr
binOp op side = do
    lhs <- side
    rhs <- some ((,) <$> op <*> side)
    let (op1, rhs1) = head rhs
        start = BinOp op1 lhs rhs1
    pure
        $ foldr (\(newOp, newRhs) t -> BinOp newOp t newRhs) start (tail rhs)

The question, then, is what side should be for each of addition/subtraction and multiplication/division. Let’s see what happens if both are negatable:

add :: Parser Expr
add = binOp (Add <$ char '+' <|> Subtract <$ char '-') negatable

mult :: Parser Expr
mult = binOp (Multiply <$ char '*' <|> Divide <$ char '/') negatable
ghci> parseTest mult (T.pack "1+2*3")
1:2:
  |
1 | 1+2*3
  |  ^
unexpected '+'
expecting '*', '/', or digit
ghci> parseTest add (T.pack "1+2*3")
BinOp Add (Number (1 % 1)) (Number (2 % 1))

Clearly, this is not what we want. We can solve this again by thinking about precedence: if an addition involves terms which multiply, those terms should be parsed first:

add :: Parser Expr
add = binOp (Add <$ char '+' <|> Subtract <$ char '-') (try mult <|> negatable)

The fact that mult fails on the example above is actually desired behaviour, in the sense that the overall expression is an addition, namely of the terms “1” and “2*3”:

ghci> parseTest add (T.pack "1+2*3")
BinOp Add (Number (1 % 1)) (BinOp Multiply (Number (2 % 1)) (Number (3 % 1)))
ghci> parseTest mult (T.pack "1+2*3")
1:2:
  |
1 | 1+2*3
  |  ^
unexpected '+'
expecting '*', '/', or digit

expressions

At this stage, let’s establish a parser for an entire expr. We have three high-level parsers to work with, namely add, mult, and negatable. Which order should we attempt parsing in?

The key insight is that we want to work upwards in terms of precedence. If we began parsing of “1*2+3” by trying multiplication, we would successfully parse 1*2 and then fail to recognise the addition. Thus, we try parsing add, then mult, then negatable:

expr :: Parser Expr
expr = try add <|> try mult <|> negatable

-- ensures we consume all input
full :: Parser Expr
full = expr <* eof

We also need to update bracketed to allow arbitrary expressions:

bracketed :: Parser Expr
bracketed = between (char '(') (char ')') expr

Let’s do one final sanity check:

ghci> parseTest full (T.pack "1*2+3*(4--5)")
BinOp Add (BinOp Multiply (Number (1 % 1)) (Number (2 % 1))) (BinOp Multiply (Number (3 % 1)) (BinOp Subtract (Number (4 % 1)) (Negate (Number (5 % 1)))))

conclusions

Obviously, more testing should be performed before concluding that the parser is sound. Here’s all of the parsing logic together:

binOp :: Parser BinOp -> Parser Expr -> Parser Expr
binOp op side = do
    lhs <- side
    rhs <- some ((,) <$> op <*> side)
    let (op1, rhs1) = head rhs
        start = BinOp op1 lhs rhs1
    pure
        $ foldr (\(newOp, newRhs) t -> BinOp newOp t newRhs) start (tail rhs)

add :: Parser Expr
add = binOp (Add <$ char '+' <|> Subtract <$ char '-') (try mult <|> negatable)

mult :: Parser Expr
mult = binOp (Multiply <$ char '*' <|> Divide <$ char '/') negatable

negative :: Parser Expr
negative = Negate <$> (char '-' *> negatable)

number :: Parser Expr
number = Number . toRational <$> (try float <|> decimal)

negatable :: Parser Expr
negatable = choice [negative, bracketed, number]

bracketed :: Parser Expr
bracketed = between (char '(') (char ')') expr

expr :: Parser Expr
expr = try add <|> try mult <|> negatable

full :: Parser Expr
full = expr <* eof

expressivity

It’s pretty nuts how concise the entire parser is; the whole thing is just 31 lines, a majority of which is whitespace or type declarations! Equivalent code in other languages often takes many more lines.

I do think the expressiveness comes at a cost, though. In particular, debugging becomes a very involved process when the functions are co-recursive.

Also worth noting is the liberal use of try. I haven’t done any rigorous analysis, but there is definitely a class of expressions which coerces the algorithm to do a lot of backtracking. For instance, expressions which contain only negative signs, brackets, and a digit will actually compute the complete answer at least 3 times: once as the potential left hand side of an addition; once as the potential left hand side of a multiplication; and finally accepted as a negatable.

evaluation

A side effect of our choice of representation is that evaluating parsed expressions is trivial:

eval :: Expr -> Rational
eval (Number x) = x
eval (Negate x) = negate (eval x)
eval (BinOp op x y) = case op of
    Add -> eval x + eval y
    Subtract -> eval x - eval y
    Multiply -> eval x * eval y
    Divide -> eval x / eval y

Now making a simple calculator is also super easy:

calculate :: (Fractional a) => String -> Either String a
calculate =
    left errorBundlePretty
        . fmap (fromRational . eval)
        . runParser full "input"
        . T.pack

Evaluation is the most immediate consequence of the data structure, but other transformations of the AST are definitely worth exploring, too!

disclaimer

I am not a programming language expert! This is definitely my first attempt at this kind of problem, and I have no formal background in the area. I was essentially stumbling my way through the exercise, but that made it all the more fun.

If I’ve missed something completely obvious, or otherwise come off as supremely ignorant, feel free to let me know. I’m interested enough in the area that I think it’s time to go off and read about how expressions are parsed in more general contexts.