Chapter 2

Parsing Numbers


Let's start with an example. What follows is the smallest code that parses a list of numbers.

use strict;
use Marpa::R2;
use Data::Dumper;

my $g = Marpa::R2::Scanless::G->new({
        default_action => '::array',
        source         => \(<<'END_OF_SOURCE'),

:start        ::= numbers
numbers       ::= number+
number          ~ [\d]+
:discard        ~ ws
ws              ~ [\s]+


my $re = Marpa::R2::Scanless::R->new({ grammar => $g });
my $input = "1 3 5 8 13 21 34 55";

print "Trying to parse:\n$input\n\n";
my $value = ${$re->value};
print "Output:\n".Dumper($value);


In this example we parse a few numbers separated by spaces. Let's run the code first. You can run the code by calling it with perl.

$ perl
Trying to parse:
1 3 5 8 13 21 34 55

$VAR1 = [

The example shows which text it tries to parse followed by the array that it found.

In the next sections we'll just show the notation used by Marpa. The examples contain the full code.

:start        ::= numbers
numbers       ::= number+            action => ::array
number          ~ [\d]+
:discard        ~ ws
ws              ~ [\s]+


In the specification language that Marpa uses, the :start rule specifies which rule is the top most rule that should match. The declaration operator ::= separates the name of the rule, on left side, from the specification, on the right side.

The symbol on the left side of the declaration operator is the name of the rule. The symbols on the right side of the operator specify what it matches to.

According to the :start declaration Marpa will start to parse from the numbers rule. The numbers rule contains number followed by a + operator. The + operator lets Marpa know that we expect one or more number lexemes.

The next line specifies what a number looks like. The ~ operator separates the name of the token from the character class on the right. We specify number to be one or more digits. Marpa uses the same character classes as Perl does internally.

Then we specify a :discard rule. With this we can specify what tokens Marpa can discard. The input language contains whitespace between the numbers. With the ws rule we say what this looks like.

This grammar will parse input strings like the following.

123    9 45 83 1000 1001          39201

The input string with numbers could be much longer than this as long as each number is separated by one or more whitespace characters. This includes spaces, tabs and newlines.

Lexical rules

The rules ws and number are examples of lexical rules. A lexical rule specifies which characters can be matched in the input string. A lexical rule can contain character classes and quoted strings.

A character class specifies each character or group of characters that can be matched. These character classes are evaluated by Perl. This means that everything that would work in Perl itself, can be used in a Marpa character class. We already saw how to match a number. This is how you could match a variable name.

identifier      ~ [_a-zA-Z] id_rest
id_rest         ~ [_0-9a-zA-Z]*


The rule for matching an identifier (or name) is split into two rules. The first rule specifies the structure of the token. The first part of it matches the first character of an identifier. It can be an underscore, a lowercase character or an uppercase character. It can't be a number however. The second part id_rest references the second rule, which specifies the rest of the identifier. This can include numbers as well. The asterisk * at the end says that this character class can match zero or more times.

The asterisk is also the reason why we split the identifier rule into two parts. The asterisk can only be used on rules with a single item on the right hand side.

A lexical rule can also use a quoted string. A quoted string is surrounded with single quotes '. The text between the quotes will match as literal text. You could use this for single characters or keywords.

kw_for    ~ 'for'
kw_if     ~ 'if'

In the next chapter we look some more at how to use lexical rules in a bigger picture.

Next step

Chapter 3: A small language