Omer Iqbal bio photo

Omer Iqbal

geek . lisper . full-stack dev . emacs padawan . coffee addict

Twitter Facebook Github

Monadic Parsers: Implementing a micro Parsec

So I read somewhere that one could truly understand the beauty of monads by implementing monadic parsers. Given holidays and a lazy weekend, I decided to put that idea to the test.

A large section of the remainder of this post is derived and inspired from this amazing paper by Graham Hutton and Eric Meijer. I recommend giving it a read. It gives this matter far better treatment than a rookie’s blog post.

This post does not assume any background knowledge on Monads. Knowing about parsers might be helpful. Some basic Haskell knowledge would help too.

Code for this post can be found here.

Enter a Parser

Lets start off by defining what is a parser.

data Parser a = Parser (String -> [(a, String)])

So a Parser is a function that takes a String and returns a list of pairs of results and remaining Strings. It consumes a String, and produces a result of type a, along with the rest of the String that was left unparsed. To those not used to Type Constructors, just view a as a generic type in Java, or C++.

Therefore an empty list would denote the failure of a parser, where as a non-empty list would imply success.

Returning a list allows this definition to handle ambigious grammars that return multiple results. This would imply a string can be parsed in multiple ways.

If we only wanted to worry about getting a single result, we could have defined the parser with an option type.

data Parser a = Parser (String -> Maybe (a, String))

In this scenario None would imply failure, while getting Just (a,String) would imply success with a single result pair.

However, we are going to stick with the first definition, as grammars can be ambigious.

Our first Parser!

So lets now implement the simplest Parser possible:

item :: Parser Char
item = Parser (\s -> case s of
                     "" -> []
                     (c:cs) -> [(c,cs)])

item is a parser that consumes the first character in a string, and returns that character as a result. It fails by returning an empty List, if the String is empty.

Enter bind

So far you’re probably wondering that the hell Monads have to do with anything yet. We’re going to approach Monads in a bottom up way.

Firstly lets define a function to apply a Parser

parse (Parser p) = p

I.e. extract the function from the Parser, and apply it.

So now with the interesting bit.

Lets say we want the ability to “bind” two parsers together. That is, chain the parsers such that result of one parser is fed into a function that takes that result and produces a new parser.

Another way to think about this is that you are composing two parsers together. This will be more obvious later on.

So lets implement bind

bind :: Parser a -> (a -> Parser b) -> Parser b
bind p f = Parser (\s -> concat $
                         map (\ (a, s') -> parse (f a) s')
                             $ parse p s)

So if you read it backwards, first we apply the Parser p on the String s, which would return a list of tuples. Next we map over this list, and apply the function f to the result a. And then we parse the remaning string s’ with the new Parser we got from (f a). In the end we simply concat to flatten the list.

Perhaps this bit of refactoring would make this more readable:

bind :: Parser a -> (a -> Parser b) -> Parser b
bind p f = Parser (\s -> parse p s |>
                         map (\ (a, s') -> parse (f a) s') |>
                         concat)

Sidenote. I personally prefer the OCaml/F# style of using the pipe operator |> for left to right application, which for me is more readable.

(|>) :: b -> (b -> c) -> c
(|>) = flip ($)

Enter unit

The next building block we are going to work on is the function unit. unit takes any value and produces a Parser from it which consumes nothing and returns that value.

unit :: a -> Parser a
unit a = Parser (\s -> [(a,s)])

This may not seem very useful, but bear with me.

Why bind and unit?

So why did we implemenent bind and unit. Well they act as a building blocks for making parser combinators. Parser combinators, to those who haven’t heard of them before, are compositions of parsers.

A good example here would be a combinator called satisfies. This takes a predicate and returns a Parser that consumes a character if that predicate is satisfied, or fails otherwise.

Using bind and unit, we can implement satisfies:

satisfies :: (Char -> Bool) -> Parser Char
satisfies p = item `bind` \c ->
              if p c then unit c else (Parser (\cs -> []))

item from before is a parser that consumes the first character and returns it. This character is passed to a lambda that takes applies the predicat p on that character and returns Parser constructed from ‘unit’ denoting success, or a Parser returning an empty list, denoting failure.

Enter Monads

So, we have actually implemented our Parser monad already.

Monads are defined as the type class :

class Monad m where
      return :: a -> m a 
      (>>=) :: m a -> (a -> m b) -> m b

If you look hard at the type signatures of return and >>= they are exactly the same as our unit and bind respectively, if you replace the m type variable with Parser.

As it happens, this pattern is quite common in many other areas, which is why the Monads are so ubiquitous in haskell.

So we could refactor our implementations of bind and unit by making them instances of the Monad typeclass:

instance Monad Parser where
   return a = Parser (\s -> [(a,s)])
   p >>= f = Parser (\s -> parse p s |>
                         map (\ (a, s') -> parse (f a) s') |>
                         concat)

Et Voila! We have implemented a Parser Monad.

For a better treatment of Monads, I strongly recommend this tutorial

do Notation

Typically parsers build using >>= would look like this:

p1 >>= \a1 ->
p2 >>= \a2 ->
...
pn >>= \an ->
f a1 a2 ... an

Such a parser reads: Apply p1, and call its result a1. Then apply p2 and call it’s result a2. And so on. In the end perform some semantic action with f using the parser results. This could mean returning an AST node, or whatever you’re trying to parse.

Haskell has this sexy syntactic sugar which makes this more readable:

do a1 <- p1
   a2 <- p2
   ...
   an <- pn
   f a1 a2 ... an

Essentially both representations mean the same thing.

The do notation can be written in a single line:

do {a1 <- p1; a2 <- p2; ...; an <- pn; f a1 a2 ... an}

Just to clarify, I’ll implement a parser that consumes three characters, throws the second character away, and returns a tuple of the other two.

p :: Parser (Char,Char)
p = do {c <- item; item; d <- item; return (c,d)}

The thing to note here is how I don’t have a <- subexpression for the second item. That’s because I don’t care about it’s result.

More combinators

Now it’s time to implement some common combinators found in most Parser Combinator libraries.

class Monad m => MonadPlus m where
      mzero :: m a
      mplus :: m a -> m a -> m a


-- mplus for the Parser is like an choice operator.
instance MonadPlus Parser where
         mzero = Parser (\cs -> [])
         mplus p q = Parser (\s -> (parse p) s ++ (parse q) s)

option :: Parser a -> Parser a -> Parser a
option  p q = Parser (\s -> case parse (mplus p q) s of
                                [] -> []
                                (x:xs) -> [x])

The MonadPlus type class is another common pattern, which applies to our Parser implementation. The mzero method basically denotes failure, while mplus which in it’s implementation concatenates the results from two Parsers, acts as a sort of choice operator. This is because the failure is [], so if the first parser fails, [] is simply concatenated with the result of the second parser.

Because in most cases we only care about the first result, we have another combinator option which just applies mplus and returns the first result on success, or the [] on failure.

We also introduce two new combinators, char and string. char takes a character and consumes and returns it if present, otherwise fails . string takes a string and consumes that string if found in the input.

char :: Char -> Parser Char
char c = satisfies (c ==)

string :: String -> Parser String
string "" = return ""
string (c:cs) = do { char c; string cs; return (c:cs)}

As the implementation of string shows, parsers can be recursive.

many :: Parser a -> Parser [a]
many p = many1 p `option` return []

many1 :: Parser a -> Parser [a]
many1 p = do { a <- p; as <- many p; return (a:as)}

sepBy :: Parser a -> Parser b -> Parser [a]
p `sepBy` sep = (p `sepBy1` sep) `option` return []

sepBy1 :: Parser a -> Parser b -> Parser [a]
p `sepBy1` sep = do a <- p
                    as <- many (do {sep; p})
                    return (a:as)

many parses repeated (0 or more) applications of Parser p. many1 parsers 1 or more repeated applications of p.

sepBy parses repeated applications (0 or more) of p separated by applications of Parser sep whose values are thrown away. sepBy1 imples 1 or more application as before.

Lexical Parsing

Parsing is usually preceded by a lexical phase which tokenises the input. With parser combinators we can define parses to perform tokenising for us.

space :: Parser String
space = many (satisfies isSpace)
       where isSpace ' ' = True
             isSpace '\n' = True
             isSpace '\t' = True
             isSpace _ = False

token :: Parser a -> Parser a
token p = do { a <- p; space ; return a}

symb :: String -> Parser String
symb s = token (string s)

digit :: Parser Char
digit = satisfies C.isDigit

number :: Parser Int
number = do cs <- many1 digit
            return $ read cs

The space combinator consumes any whitespace. The token combinator returns a token ignoring any trailing space. The digit combinator parses a single digit, while the number combinator parses integers.

Example Grammars!

Armed with combinators, we can now implement some grammars, putting this little library to action.

The first example is of a simple math expression language with infix operators.

expr :: Parser Int
addop :: Parser (Int -> Int -> Int)
mulop :: Parser (Int -> Int -> Int)

expr = term `chainl1` addop
term = factor `chainl1` mulop
factor = number `option` do { symb "("; n <- expr; symb ")"; return n}

addop = do {symb "+"; return (+)} `option` do {symb "-"; return (-)}
mulop = do {symb "*"; return (*)} `option` do {symb "/"; return (div)}

run :: String -> Int
run s = case parse expr s of
             [(num, _)] -> num

Trying this out:

λ> run "1"
1
λ> run "3+4"
7
λ> run "3+5*3"
18
λ> run "10/5+15"
17

It works!

The only new combinator here is the chainl1 combinator which parses repated applications of a parser p separated by applications of a parser op whose result value is used to combine the result from the p parsers.

chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a
chainl p op a = (p `chainl1` op) `option` return a

chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
p `chainl1` op = do {a <- p; rest a}
                 where rest a = (do f <- op
                                    b <- p
                                    rest (f a b))
                                `option` return a

Parsing JSON

Here is a simple JSON parser, I implemented in a few minutes using this little library.

module Json where
import Parser

data JSONVal = JBool Bool |
               JString String |
               JNumber Int |
               JNull |
               JArray [JSONVal] |
               JPair (String, JSONVal) |
               JObject [(String, JSONVal)]
               deriving (Show)


parseBool :: Parser JSONVal
parseBool = do { symb "true"; return $ JBool True} `option`
            do { symb "false"; return $ JBool False }

parseString :: Parser JSONVal
parseString = do { s <- quotedString; return $ JString s}

parseNumber :: Parser JSONVal
parseNumber = do { n <- number; return $ JNumber n}

parseNull :: Parser JSONVal
parseNull = do { symb "null"; return JNull}

parseArray :: Parser JSONVal
parseArray = do { symb "["; l <- parseJson `sepBy` (symb ","); symb "]"; return $ JArray l}

parsePair :: Parser JSONVal
parsePair = do { k <- quotedString; symb ":"; v <- parseJson; return $ JPair (k,v)}

parseObject :: Parser JSONVal
parseObject = do { symb "{"; ps <- many parsePair ; symb "}";
                   return $ JObject $ map (\ (JPair p) -> p) ps}

parseJson :: Parser JSONVal
parseJson = parseBool `option` parseString `option` parseNumber `option`
            parseNull `option` parseArray `option` parseObject

Conclusion

After this, one should be able to recognize the beauty of Monads. Personally I was completely mind blown after realising that implementing Parser Combinators is this easy.

Having said that, I wouldn’t recommend using this little library for anything serious. Parsec exists, and does this and far more, including proper error handling which I don’t care about for now.