Monadic Parsers: Implementing a micro Parsec
So I read somewhere that one could truly understand the beauty of monads by implementing monadic parsers. Given holidays and a lazy weekend, I decided to put that idea to the test.
A large section of the remainder of this post is derived and inspired from this amazing paper by Graham Hutton and Eric Meijer. I recommend giving it a read. It gives this matter far better treatment than a rookie’s blog post.
This post does not assume any background knowledge on Monads. Knowing about parsers might be helpful. Some basic Haskell knowledge would help too.
Code for this post can be found here.
Enter a Parser
Lets start off by defining what is a parser.
data Parser a = Parser (String -> [(a, String)])
So a Parser is a function that takes a String and returns a list of pairs of results and remaining Strings. It consumes a String, and produces a result of type a, along with the rest of the String that was left unparsed. To those not used to Type Constructors, just view a as a generic type in Java, or C++.
Therefore an empty list would denote the failure of a parser, where as a non-empty list would imply success.
Returning a list allows this definition to handle ambigious grammars that return multiple results. This would imply a string can be parsed in multiple ways.
If we only wanted to worry about getting a single result, we could have defined the parser with an option type.
data Parser a = Parser (String -> Maybe (a, String))
In this scenario None would imply failure, while getting Just (a,String) would imply success with a single result pair.
However, we are going to stick with the first definition, as grammars can be ambigious.
Our first Parser!
So lets now implement the simplest Parser possible:
item :: Parser Char
item = Parser (\s -> case s of
"" -> []
(c:cs) -> [(c,cs)])
item is a parser that consumes the first character in a string, and returns that character as a result. It fails by returning an empty List, if the String is empty.
Enter bind
So far you’re probably wondering that the hell Monads have to do with anything yet. We’re going to approach Monads in a bottom up way.
Firstly lets define a function to apply a Parser
parse (Parser p) = p
I.e. extract the function from the Parser, and apply it.
So now with the interesting bit.
Lets say we want the ability to “bind” two parsers together. That is, chain the parsers such that result of one parser is fed into a function that takes that result and produces a new parser.
Another way to think about this is that you are composing two parsers together. This will be more obvious later on.
So lets implement bind
bind :: Parser a -> (a -> Parser b) -> Parser b
bind p f = Parser (\s -> concat $
map (\ (a, s') -> parse (f a) s')
$ parse p s)
So if you read it backwards, first we apply the Parser p on the String s, which would return a list of tuples. Next we map over this list, and apply the function f to the result a. And then we parse the remaning string s’ with the new Parser we got from (f a). In the end we simply concat to flatten the list.
Perhaps this bit of refactoring would make this more readable:
bind :: Parser a -> (a -> Parser b) -> Parser b
bind p f = Parser (\s -> parse p s |>
map (\ (a, s') -> parse (f a) s') |>
concat)
Sidenote. I personally prefer the OCaml/F# style of using the pipe operator |>
for left to right application, which for me is more readable.
(|>) :: b -> (b -> c) -> c
(|>) = flip ($)
Enter unit
The next building block we are going to work on is the function unit
. unit
takes any value and produces a Parser from it which consumes nothing and returns that value.
unit :: a -> Parser a
unit a = Parser (\s -> [(a,s)])
This may not seem very useful, but bear with me.
Why bind and unit?
So why did we implemenent bind
and unit
. Well they act as a building blocks for making parser combinators. Parser combinators, to those who haven’t heard of them before, are compositions of parsers.
A good example here would be a combinator called satisfies
. This takes a predicate and returns a Parser that consumes a character if that predicate is satisfied, or fails otherwise.
Using bind
and unit
, we can implement satisfies
:
satisfies :: (Char -> Bool) -> Parser Char
satisfies p = item `bind` \c ->
if p c then unit c else (Parser (\cs -> []))
item
from before is a parser that consumes the first character and returns it. This character is passed to a lambda that takes applies the predicat p
on that character and returns Parser constructed from ‘unit’ denoting success, or a Parser returning an empty list, denoting failure.
Enter Monads
So, we have actually implemented our Parser monad already.
Monads are defined as the type class :
class Monad m where
return :: a -> m a
(>>=) :: m a -> (a -> m b) -> m b
If you look hard at the type signatures of return
and >>=
they are exactly the same as our unit
and bind
respectively, if you replace the m type variable with Parser
.
As it happens, this pattern is quite common in many other areas, which is why the Monads are so ubiquitous in haskell.
So we could refactor our implementations of bind
and unit
by making them instances of the Monad typeclass:
instance Monad Parser where
return a = Parser (\s -> [(a,s)])
p >>= f = Parser (\s -> parse p s |>
map (\ (a, s') -> parse (f a) s') |>
concat)
Et Voila! We have implemented a Parser Monad.
For a better treatment of Monads, I strongly recommend this tutorial
do Notation
Typically parsers build using >>=
would look like this:
p1 >>= \a1 ->
p2 >>= \a2 ->
...
pn >>= \an ->
f a1 a2 ... an
Such a parser reads: Apply p1, and call its result a1. Then apply p2 and call it’s result a2. And so on. In the end perform some semantic action with f using the parser results. This could mean returning an AST node, or whatever you’re trying to parse.
Haskell has this sexy syntactic sugar which makes this more readable:
do a1 <- p1
a2 <- p2
...
an <- pn
f a1 a2 ... an
Essentially both representations mean the same thing.
The do notation can be written in a single line:
do {a1 <- p1; a2 <- p2; ...; an <- pn; f a1 a2 ... an}
Just to clarify, I’ll implement a parser that consumes three characters, throws the second character away, and returns a tuple of the other two.
p :: Parser (Char,Char)
p = do {c <- item; item; d <- item; return (c,d)}
The thing to note here is how I don’t have a <-
subexpression for the second item
. That’s because I don’t care about it’s result.
More combinators
Now it’s time to implement some common combinators found in most Parser Combinator libraries.
class Monad m => MonadPlus m where
mzero :: m a
mplus :: m a -> m a -> m a
-- mplus for the Parser is like an choice operator.
instance MonadPlus Parser where
mzero = Parser (\cs -> [])
mplus p q = Parser (\s -> (parse p) s ++ (parse q) s)
option :: Parser a -> Parser a -> Parser a
option p q = Parser (\s -> case parse (mplus p q) s of
[] -> []
(x:xs) -> [x])
The MonadPlus type class is another common pattern, which applies to our Parser implementation. The mzero
method basically denotes failure, while mplus
which in it’s implementation concatenates the results from two Parsers, acts as a sort of choice operator. This is because the failure is []
, so if the first parser fails, []
is simply concatenated with the result of the second parser.
Because in most cases we only care about the first result, we have another combinator option
which just applies mplus
and returns the first result on success, or the []
on failure.
We also introduce two new combinators, char
and string
. char
takes a character and consumes and returns it if present, otherwise fails . string
takes a string and consumes that string if found in the input.
char :: Char -> Parser Char
char c = satisfies (c ==)
string :: String -> Parser String
string "" = return ""
string (c:cs) = do { char c; string cs; return (c:cs)}
As the implementation of string
shows, parsers can be recursive.
many :: Parser a -> Parser [a]
many p = many1 p `option` return []
many1 :: Parser a -> Parser [a]
many1 p = do { a <- p; as <- many p; return (a:as)}
sepBy :: Parser a -> Parser b -> Parser [a]
p `sepBy` sep = (p `sepBy1` sep) `option` return []
sepBy1 :: Parser a -> Parser b -> Parser [a]
p `sepBy1` sep = do a <- p
as <- many (do {sep; p})
return (a:as)
many
parses repeated (0 or more) applications of Parser p
. many1
parsers 1 or more repeated applications of p
.
sepBy
parses repeated applications (0 or more) of p
separated by applications of Parser sep
whose values are thrown away. sepBy1
imples 1 or more application as before.
Lexical Parsing
Parsing is usually preceded by a lexical phase which tokenises the input. With parser combinators we can define parses to perform tokenising for us.
space :: Parser String
space = many (satisfies isSpace)
where isSpace ' ' = True
isSpace '\n' = True
isSpace '\t' = True
isSpace _ = False
token :: Parser a -> Parser a
token p = do { a <- p; space ; return a}
symb :: String -> Parser String
symb s = token (string s)
digit :: Parser Char
digit = satisfies C.isDigit
number :: Parser Int
number = do cs <- many1 digit
return $ read cs
The space
combinator consumes any whitespace. The token
combinator returns a token ignoring any trailing space. The digit
combinator parses a single digit, while the number
combinator parses integers.
Example Grammars!
Armed with combinators, we can now implement some grammars, putting this little library to action.
The first example is of a simple math expression language with infix operators.
expr :: Parser Int
addop :: Parser (Int -> Int -> Int)
mulop :: Parser (Int -> Int -> Int)
expr = term `chainl1` addop
term = factor `chainl1` mulop
factor = number `option` do { symb "("; n <- expr; symb ")"; return n}
addop = do {symb "+"; return (+)} `option` do {symb "-"; return (-)}
mulop = do {symb "*"; return (*)} `option` do {symb "/"; return (div)}
run :: String -> Int
run s = case parse expr s of
[(num, _)] -> num
Trying this out:
λ> run "1"
1
λ> run "3+4"
7
λ> run "3+5*3"
18
λ> run "10/5+15"
17
It works!
The only new combinator here is the chainl1
combinator which parses repated applications of a parser p
separated by applications of a parser op
whose result value is used to combine the result from the p
parsers.
chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a
chainl p op a = (p `chainl1` op) `option` return a
chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
p `chainl1` op = do {a <- p; rest a}
where rest a = (do f <- op
b <- p
rest (f a b))
`option` return a
Parsing JSON
Here is a simple JSON parser, I implemented in a few minutes using this little library.
module Json where
import Parser
data JSONVal = JBool Bool |
JString String |
JNumber Int |
JNull |
JArray [JSONVal] |
JPair (String, JSONVal) |
JObject [(String, JSONVal)]
deriving (Show)
parseBool :: Parser JSONVal
parseBool = do { symb "true"; return $ JBool True} `option`
do { symb "false"; return $ JBool False }
parseString :: Parser JSONVal
parseString = do { s <- quotedString; return $ JString s}
parseNumber :: Parser JSONVal
parseNumber = do { n <- number; return $ JNumber n}
parseNull :: Parser JSONVal
parseNull = do { symb "null"; return JNull}
parseArray :: Parser JSONVal
parseArray = do { symb "["; l <- parseJson `sepBy` (symb ","); symb "]"; return $ JArray l}
parsePair :: Parser JSONVal
parsePair = do { k <- quotedString; symb ":"; v <- parseJson; return $ JPair (k,v)}
parseObject :: Parser JSONVal
parseObject = do { symb "{"; ps <- many parsePair ; symb "}";
return $ JObject $ map (\ (JPair p) -> p) ps}
parseJson :: Parser JSONVal
parseJson = parseBool `option` parseString `option` parseNumber `option`
parseNull `option` parseArray `option` parseObject
Conclusion
After this, one should be able to recognize the beauty of Monads. Personally I was completely mind blown after realising that implementing Parser Combinators is this easy.
Having said that, I wouldn’t recommend using this little library for anything serious. Parsec exists, and does this and far more, including proper error handling which I don’t care about for now.