Parsing

Idris 2 comes with lexing and parsing libraries built into the contrib package.

For this cookbook, we will write a very simple lambda calculus parser that accepts the following language:

let name = world in (\x.hello x) name

Once we have written the lambda calculus parser, we will also see how to take advantage of the powerful built-in expression parser in Idris 2 to write a small calculator that is smart enough to parse the following expression:

1 + 2 - 3 * 4 / 5

Lexer

The main lexer module is Text.Lexer. This module contains toTokenMap, a function that converts a List (Lexer, k) into a TokenMap (Token k), where k is a token kind. This function can be used for simple lexer-to-token mappings. The module also includes higher-level lexers for specifying quantity, as well as lexers for common programming primitives like alphas, intLit, lineComment and blockComment.
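For example, a minimal mapping could look like this (a sketch; the kind MyTokenKind and its constructors are hypothetical names for illustration):

```idris
-- Hypothetical token kinds, for illustration only.
data MyTokenKind = MTNum | MTIgnore

-- toTokenMap pairs each lexer with the kind of token it produces.
myTokenMap : TokenMap (Token MyTokenKind)
myTokenMap = toTokenMap [(digits, MTNum), (spaces, MTIgnore)]
```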

The Text.Lexer module also reexports Text.Lexer.Core, Text.Quantity and Text.Token.

Text.Lexer.Core provides the building blocks of the lexer, including a type called Recognise, which is the underlying data type of the lexer. The other important function this module provides is lex, which takes a token map and an input string and returns the recognised tokens along with any remaining unlexed input.

Text.Quantity provides a data type Quantity which can be used with certain lexers to specify how many times something is expected to appear.
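For instance, combining count from Text.Lexer with exactly from Text.Quantity gives a lexer for a fixed number of repetitions (a sketch; the name year is hypothetical):

```idris
-- A lexer that accepts exactly four digits, e.g. "2021".
year : Lexer
year = count (exactly 4) digit
```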

Text.Token provides a data type Token that represents a lexed token: its kind and its text. This module also provides an important interface called TokenKind, which specifies how to map each token kind to an Idris 2 type and how to convert that kind's text into a value of that type.
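A minimal implementation of this interface might look like the following sketch (the kind NumKind and its constructors are hypothetical):

```idris
-- Hypothetical kind: number tokens carry a Double, everything else carries ().
data NumKind = NKNum | NKOther

TokenKind NumKind where
  TokType NKNum = Double
  TokType _ = ()

  tokValue NKNum s = cast s
  tokValue NKOther _ = ()
```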

Parser

The main parser module is Text.Parser. This module contains various grammar combinators, the main one for our purposes being match, which takes a token kind and produces a value of the type defined for that kind in the TokenKind interface. There are other combinators as well, but for our example we will only need match.
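For example, with the LambdaTokenKind we define below, where TokType LTIdentifier = String, matching an identifier token yields its text directly (the name ident is hypothetical):

```idris
-- match returns the value given by TokType for that kind, here a String.
ident : Grammar state LambdaToken True String
ident = match LTIdentifier
```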

The Text.Parser module reexports Text.Parser.Core, Text.Quantity and Text.Token.

Text.Parser.Core provides the building blocks of the parser, including a type called Grammar, which is the underlying data type of the parser. The other important function this module provides is parse, which runs a Grammar over a list of tokens and returns either the parse errors or the parsed result together with any unconsumed tokens.

We covered Text.Quantity and Text.Token in the Lexer section so we’re not going to repeat what they do here.

Lambda Calculus Lexer & Parser

LambdaCalculus.idr
import Data.List
import Data.List1
import Text.Lexer
import Text.Parser

%default total

data Expr = App Expr Expr | Abs String Expr | Var String | Let String Expr Expr

Show Expr where
  showPrec d (App e1 e2) = showParens (d == App) (showPrec (User 0) e1 ++ " " ++ showPrec App e2)
  showPrec d (Abs v e) = showParens (d > Open) ("\\" ++ v ++ "." ++ show e)
  showPrec d (Var v) = v
  showPrec d (Let v e1 e2) = showParens (d > Open) ("let " ++ v ++ " = " ++ show e1 ++ " in " ++ show e2)

data LambdaTokenKind
  = LTLambda
  | LTIdentifier
  | LTDot
  | LTOParen
  | LTCParen
  | LTIgnore
  | LTLet
  | LTEqual
  | LTIn

Eq LambdaTokenKind where
  (==) LTLambda LTLambda = True
  (==) LTDot LTDot = True
  (==) LTIdentifier LTIdentifier = True
  (==) LTOParen LTOParen = True
  (==) LTCParen LTCParen = True
  (==) LTLet LTLet = True
  (==) LTEqual LTEqual = True
  (==) LTIn LTIn = True
  (==) _ _ = False

Show LambdaTokenKind where
  show LTLambda = "LTLambda"
  show LTDot = "LTDot"
  show LTIdentifier = "LTIdentifier"
  show LTOParen = "LTOParen"
  show LTCParen = "LTCParen"
  show LTIgnore = "LTIgnore"
  show LTLet = "LTLet"
  show LTEqual = "LTEqual"
  show LTIn = "LTIn"

LambdaToken : Type
LambdaToken = Token LambdaTokenKind

Show LambdaToken where
  show (Tok kind text) = "Tok kind: " ++ show kind ++ " text: " ++ text

TokenKind LambdaTokenKind where
  TokType LTIdentifier = String
  TokType _ = ()

  tokValue LTLambda _ = ()
  tokValue LTIdentifier s = s
  tokValue LTDot _ = ()
  tokValue LTOParen _ = ()
  tokValue LTCParen _ = ()
  tokValue LTIgnore _ = ()
  tokValue LTLet _ = ()
  tokValue LTEqual _ = ()
  tokValue LTIn _ = ()

ignored : WithBounds LambdaToken -> Bool
ignored (MkBounded (Tok LTIgnore _) _ _) = True
ignored _ = False

identifier : Lexer
identifier = alpha <+> many alphaNum

keywords : List (String, LambdaTokenKind)
keywords = [
  ("let", LTLet),
  ("in", LTIn)
]

lambdaTokenMap : TokenMap LambdaToken
lambdaTokenMap = toTokenMap [(spaces, LTIgnore)] ++
  [(identifier, \s =>
      case lookup s keywords of
        (Just kind) => Tok kind s
        Nothing => Tok LTIdentifier s
    )
  ] ++ toTokenMap [
    (exact "\\", LTLambda),
    (exact ".", LTDot),
    (exact "(", LTOParen),
    (exact ")", LTCParen),
    (exact "=", LTEqual)
  ]

lexLambda : String -> Maybe (List (WithBounds LambdaToken))
lexLambda str =
  case lex lambdaTokenMap str of
    (tokens, _, _, "") => Just tokens
    _ => Nothing

mutual
  expr : Grammar state LambdaToken True Expr
  expr = do
    t <- term
    app t <|> pure t

  term : Grammar state LambdaToken True Expr
  term = abs
    <|> var
    <|> paren
    <|> letE

  app : Expr -> Grammar state LambdaToken True Expr
  app e1 = do
    e2 <- term
    app1 $ App e1 e2

  app1 : Expr -> Grammar state LambdaToken False Expr
  app1 e = app e <|> pure e

  abs : Grammar state LambdaToken True Expr
  abs = do
    match LTLambda
    commit
    argument <- match LTIdentifier
    match LTDot
    e <- expr
    pure $ Abs argument e

  var : Grammar state LambdaToken True Expr
  var = map Var $ match LTIdentifier

  paren : Grammar state LambdaToken True Expr
  paren = do
    match LTOParen
    e <- expr
    match LTCParen
    pure e

  letE : Grammar state LambdaToken True Expr
  letE = do
    match LTLet
    commit
    argument <- match LTIdentifier
    match LTEqual
    e1 <- expr
    match LTIn
    e2 <- expr
    pure $ Let argument e1 e2

parseLambda : List (WithBounds LambdaToken) -> Either String Expr
parseLambda toks =
  case parse expr $ filter (not . ignored) toks of
    Right (l, []) => Right l
    Right e => Left "contains tokens that were not consumed"
    Left e => Left (show e)

parse : String -> Either String Expr
parse x =
  case lexLambda x of
    Just toks => parseLambda toks
    Nothing => Left "Failed to lex."

Testing out our parser gives us back the following output:

$ idris2 -p contrib LambdaCalculus.idr
Main> :exec printLn $ parse "let name = world in (\\x.hello x) name"
Right (let name = world in (\x.hello x) name)

Expression Parser

Idris 2 also comes with a very convenient expression parser that is aware of precedence and associativity in Text.Parser.Expression.

The main function, buildExpressionParser, takes an OperatorTable and a Grammar that represents the terms, and returns a parsed expression. The magic comes from the OperatorTable, since this table defines all the operators, the grammars for those operators, their precedence and their associativity.

An OperatorTable is a list of lists of the Op type. The Op type allows you to specify Prefix, Postfix and Infix operators along with their grammars. Infix operators additionally carry an associativity, Assoc, which can be AssocLeft for left-associative, AssocRight for right-associative, or AssocNone for non-associative operators.

An example of an operator table we’ll be using for the calculator is:

[
  [ Infix (match CTMultiply >> pure (*)) AssocLeft
  , Infix (match CTDivide >> pure (/)) AssocLeft
  ],
  [ Infix (match CTPlus >> pure (+)) AssocLeft
  , Infix (match CTMinus >> pure (-)) AssocLeft
  ]
]

This table defines 4 operators: multiplication, division, addition and subtraction. Multiplication and division appear in the first row because they have higher precedence than addition and subtraction, which appear in the second row. All four are defined as infix operators, each with its own grammar, and all are left-associative via AssocLeft.
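Prefix and Postfix operators follow the same pattern. As a sketch (not part of the calculator below), a highest-precedence unary minus could be added as an extra row at the front of the table, reusing the CTMinus token defined in the calculator:

```idris
-- Sketch: rows earlier in the table bind tighter, so the prefix
-- negation here takes precedence over all the infix operators.
negTable : OperatorTable state CalculatorToken Double
negTable =
  [ [ Prefix (negate <$ match CTMinus) ]
  , [ Infix ((*) <$ match CTMultiply) AssocLeft
    , Infix ((/) <$ match CTDivide) AssocLeft
    ]
  , [ Infix ((+) <$ match CTPlus) AssocLeft
    , Infix ((-) <$ match CTMinus) AssocLeft
    ]
  ]
```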

Building a Calculator

Calculator.idr
import Data.List1
import Text.Lexer
import Text.Parser
import Text.Parser.Expression

%default total

data CalculatorTokenKind
  = CTNum
  | CTPlus
  | CTMinus
  | CTMultiply
  | CTDivide
  | CTOParen
  | CTCParen
  | CTIgnore

Eq CalculatorTokenKind where
  (==) CTNum CTNum = True
  (==) CTPlus CTPlus = True
  (==) CTMinus CTMinus = True
  (==) CTMultiply CTMultiply = True
  (==) CTDivide CTDivide = True
  (==) CTOParen CTOParen = True
  (==) CTCParen CTCParen = True
  (==) _ _ = False

Show CalculatorTokenKind where
  show CTNum = "CTNum"
  show CTPlus = "CTPlus"
  show CTMinus = "CTMinus"
  show CTMultiply = "CTMultiply"
  show CTDivide = "CTDivide"
  show CTOParen = "CTOParen"
  show CTCParen = "CTCParen"
  show CTIgnore = "CTIgnore"

CalculatorToken : Type
CalculatorToken = Token CalculatorTokenKind

Show CalculatorToken where
  show (Tok kind text) = "Tok kind: " ++ show kind ++ " text: " ++ text

TokenKind CalculatorTokenKind where
  TokType CTNum = Double
  TokType _ = ()

  tokValue CTNum s = cast s
  tokValue CTPlus _ = ()
  tokValue CTMinus _ = ()
  tokValue CTMultiply _ = ()
  tokValue CTDivide _ = ()
  tokValue CTOParen _ = ()
  tokValue CTCParen _ = ()
  tokValue CTIgnore _ = ()

ignored : WithBounds CalculatorToken -> Bool
ignored (MkBounded (Tok CTIgnore _) _ _) = True
ignored _ = False

number : Lexer
number = digits

calculatorTokenMap : TokenMap CalculatorToken
calculatorTokenMap = toTokenMap [
  (spaces, CTIgnore),
  (number, CTNum),
  (exact "+", CTPlus),
  (exact "-", CTMinus),
  (exact "*", CTMultiply),
  (exact "/", CTDivide)
]

lexCalculator : String -> Maybe (List (WithBounds CalculatorToken))
lexCalculator str =
  case lex calculatorTokenMap str of
    (tokens, _, _, "") => Just tokens
    _ => Nothing

mutual
  term : Grammar state CalculatorToken True Double
  term = do
    num <- match CTNum
    pure num

  expr : Grammar state CalculatorToken True Double
  expr = buildExpressionParser [
    [ Infix ((*) <$ match CTMultiply) AssocLeft
    , Infix ((/) <$ match CTDivide) AssocLeft
    ],
    [ Infix ((+) <$ match CTPlus) AssocLeft
    , Infix ((-) <$ match CTMinus) AssocLeft
    ]
  ] term

parseCalculator : List (WithBounds CalculatorToken) -> Either String Double
parseCalculator toks =
  case parse expr $ filter (not . ignored) toks of
    Right (l, []) => Right l
    Right e => Left "contains tokens that were not consumed"
    Left e => Left (show e)

parse1 : String -> Either String Double
parse1 x =
  case lexCalculator x of
    Just toks => parseCalculator toks
    Nothing => Left "Failed to lex."

Testing out our calculator gives us back the following output:

$ idris2 -p contrib Calculator.idr
Main> :exec printLn $ parse1 "1 + 2 - 3 * 4 / 5"
Right 0.6000000000000001