More Loco examples

Following up from Loco, which introduced Loco.php, here are some more examples built using this parser combinator library.

Each of these examples enables you to parse a grammar and return a Grammar object capable of recognising that grammar.

Examples and demonstrations can be found in the unit testing sections at the bottom of each script. Alternatively, read on to see what each notation gives you:

Backus-Naur Form

BNF is generally pretty low-tech and lacks a lot of features.

Sample grammar in Backus-Naur Form

This appears on Wikipedia. This is a pretty clunky example because it doesn't handle whitespace and doesn't define a whole lot of variables which I had to do myself. However, it gets the point across.

<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part>      ::= <personal-part> <name-part> | <personal-part> <last-name> <opt-jr-part> <EOL>
<personal-part>  ::= <initial> "." | <first-name>
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part>       ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-jr-part>    ::= "Sr." | "Jr." | <roman-numeral> | ""

<last-name>     ::= 'MacLaurin '
<EOL>           ::= '\n'
<initial>       ::= 'b'
<first-name>    ::= 'Steve '
<house-num>     ::= '173 '
<street-name>   ::= 'Acacia Avenue '
<opt-apt-num>   ::= '7A'
<town-name>     ::= 'Stevenage'
<state-code>    ::= ' KY '
<ZIP-code>      ::= '33445'
<roman-numeral> ::= 'g'

String in the sample grammar

Steve MacLaurin \n173 Acacia Avenue 7A\nStevenage, KY 33445\n

Wirth syntax notation

Wirth syntax notation is okay, but I don't like the use of . (which in my mind usually means "any character" (when used in a regex), or the string concatenation operator) as a line ending (which I usually think of as a semicolon or an actual \n). I also dislike the use of square brackets for optional terms, and braces for Kleene star closure. Neither of these are unambiguous enough in their meaning.

Sample grammar in Wirth syntax notation

SYNTAX     = { PRODUCTION } .
PRODUCTION = IDENTIFIER "=" EXPRESSION "." .
EXPRESSION = TERM { "|" TERM } .
TERM       = FACTOR { FACTOR } .
FACTOR     = IDENTIFIER
           | LITERAL
           | "[" EXPRESSION "]"
           | "(" EXPRESSION ")"
           | "{" EXPRESSION "}" .
IDENTIFIER = letter { letter } .
LITERAL    = """" character { character } """" .
digit      = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" .
upper      = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" 
           | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" 
           | "U" | "V" | "W" | "X" | "Y" | "Z" .
lower      = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" 
           | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" 
           | "u" | "v" | "w" | "x" | "y" | "z" .
letter     = upper | lower .
character  = letter | digit | "=" | "." | """""" .

This happens to be the grammar which describes Wirth syntax notation - if it had all whitespace removed from it. Observe:

String in the sample grammar

SYNTAX={PRODUCTION}.

Extended Backus-Naur Form example

This is a big improvement (comments are a must!) but the need for commas between tokens is irritating and again, braces and square brackets aren't ideal in my mind.

$ebnfGrammar can't handle "specials" (strings contained between two question marks), since these have no clear definition. It also can't handle "exceptions" (when a - is used to discard certain possibilities), because these are not permissible in context-free grammars or possible with naive MonoParsers, and so would require special modification to Loco.php to handle.

Sample grammar in Extended Backus-Naur Form

(* a simple program syntax in EBNF - Wikipedia *)
program = 'PROGRAM' , white space , identifier , white space ,
           'BEGIN' , white space ,
           { assignment , ";" , white space } ,
           'END.' ;
identifier = alphabetic character , { alphabetic character | digit } ;
number = [ "-" ] , digit , { digit } ;
string = '"' , { all characters } , '"' ;
assignment = identifier , ":=" , ( number | identifier | string ) ;
alphabetic character = "A" | "B" | "C" | "D" | "E" | "F" | "G"
                     | "H" | "I" | "J" | "K" | "L" | "M" | "N"
                     | "O" | "P" | "Q" | "R" | "S" | "T" | "U"
                     | "V" | "W" | "X" | "Y" | "Z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
white space = ( " " | "\n" ) , { " " | "\n" } ;
all characters = "H" | "e" | "l" | "o" | " " | "w" | "r" | "d" | "!" ;

String in the sample grammar

PROGRAM DEMO1
BEGIN
  A0:=3;
  B:=45;
  H:=-100023;
  C:=A;
  D123:=B34A;
  BABOON:=GIRAFFE;
  TEXT:=\"Hello world!\";
END."

Loco notation

Loco notation (for lack of a better name) is an extension of Backus-Naur Form which gives access to all the MonoParsers that Loco.php makes available. The following parsers are effectively available in most grammar notations:

  • EmptyParser - Just have an empty string or an empty right-hand side to a rule. Some notations also permit an explicit "epsilon" symbol.
  • StringParser - Invariably requires a simple string literal in single or double quotes.
  • ConcParser - Usually you put multiple tokens in a row and they will be matched consecutively. In EBNF, commas must be used as separators.
  • LazyAltParser - Alternation is achieved using a pipe, |, between possibilities.
  • GreedyMultiParser - Most notations provide some ability to make a match optional (typically square brackets), and/or to match an unlimited number of times (typically an asterisk or braces).

I had to invent new notation for the following:

  • RegexParser - Put your regex between slashes, just like in Perl.
  • Utf8Parser - To match any single UTF-8 character, put a full stop, .. To blacklist some characters, put the blacklisted characters between [^ and ].

In both cases I borrowed notation from the standard regular expression syntax, because why not stay with the familiar?

In all cases where a "literal" is provided (strings, regexes, UTF-8 exceptions), you can put the corresponding closing delimiter (i.e. ", ', / or ]) inside the "literal" by escaping it with a backslash. E.g.: "\"", '\'', /\//, [^\]]. You can also put a backslash itself, if you escape it with a second backslash. E.g.: "\\", '\\', /\\/, [^\\].

Sample grammar in Loco Backus-Naur Form

This can be used to recognise valid HTML comments here on qntm.org.

comment    ::= whitespace block*
block      ::= h5 whitespace | p whitespace
p          ::= '<p'      whitespace '>' text '</p'      whitespace '>'
h5         ::= '<h5'     whitespace '>' text '</h5'     whitespace '>'
strong     ::= '<strong' whitespace '>' text '</strong' whitespace '>'
em         ::= '<em'     whitespace '>' text '</em'     whitespace '>'
br         ::= '<br'     whitespace '/>'
text       ::= atom*
atom       ::= [^<>&] | '&' entity ';' | strong | em | br
entity     ::= 'gt' | 'lt' | 'amp'
whitespace ::= /[ \n\r\t]*/

See how I've put /[ \n\r\t]*/ to match an unlimited sequence of whitespace. This could be achieved using more rules and StringParsers, but RegexParsers are more powerful and more elegant.

Also see how I've put [^<>&] to match "any UTF-8 character except a <, a > or a &".

String in the sample grammar

<h5>  Title<br /><em\n><strong\n></strong>&amp;</em></h5>
   \r\n\t 
<p  >&lt;</p  >

Back to Code
Back to Things Of Interest

Facebook Twitter Reddit Email Hacker News StumbleUpon

Discussion (8)

2011-07-11 12:54:57 by Michael:

I'm going to have to sit down one day and seriously work out what the hell you are talking about here and in your other posts about parsers. I really like the idea, and I've got a project on the backburner because I don't know how to write a parser, or use someone elses...

A mess of regular expressions just isn't the same...

Thanks for putting your parsers, and comments up, I hope to make good use of them, when I can figure them out.


(Also, off topic, have you considered adding something like "tripcodes" so that commentators with the same name can be distinguished from each other?)

2011-07-13 23:26:16 by Sam:

I'll look into it. I have a bunch of projects to complete first but "make qntm.org better" is on there and some sort of user management system is on my list of things to consider.

2011-07-27 23:19:48 by OvermindDL:

Gravatars would help for that a lot, just have the user need to put their email as well as their name (which still needs to support numbers I say :) ), then it is just a simple <img> link to the gravatar.

2011-07-27 23:23:35 by Sam:

Noooo, no avatars for you

2011-07-28 22:53:49 by OvermindDL:

OpenID then? That is quite unique and still requires no database.

P.S. Please numbers in name? OvermindDL is not how I am searched, I go by OvermindDL1 everywhere, from my website address, emails, usernames, everything... Even my friends call me that in life (mostly because my given name is far too common to differentiate from everyone else).

2011-12-20 19:40:26 by carlo:

interesting work. But I can't recover from locoNotation the names of the matches parsers. After a successful match, I just see arrays always indexed starting from 0. Any clue?

2011-12-21 09:47:47 by Sam:

Yes, at the moment the "parse tree" is a simple associative array with indices beginning at 0. Actually using parser names to retrieve parts of the tree is a little more sophisticated because some parsers (i.e. the string literals and regex literals) don't have a name associated with them, some parsers appear more than once in the same output object (e.g. "whitespace" appears twice in any "p"), and some parsers are multiplied up (e.g. "block" appears starred inside "comment"). A parse tree object that would support the more advanced addressing needed to retrieve things like that would be more advanced than what I have currently. I suspect there is a standard notation for such things but I shall have to look into it.

2012-11-20 16:02:52 by jj:

I realize this is from a year ago (sorry) but re ^ What came of this? Is there any way to use the parse tree once you have it? Or do you just want it to be used for just validation not parsing/transformation/etc? Am I understanding wrong how it should be used (I have used Antlr in another life, never a parser combinator)