Chapter 2
Let's start with an example. What follows is the smallest code that parses a list of numbers.
use strict;
use warnings;
use Marpa::R2;
use Data::Dumper;

# Build the grammar from the specification in the here-document.
my $g = Marpa::R2::Scanless::G->new({
    default_action => '::array',
    source         => \(<<'END_OF_SOURCE'),
:start ::= numbers
numbers ::= number+
number ~ [\d]+
:discard ~ ws
ws ~ [\s]+
END_OF_SOURCE
});

# Create a recognizer for the grammar and feed it the input.
my $re = Marpa::R2::Scanless::R->new({ grammar => $g });
my $input = "1 3 5 8 13 21 34 55";
print "Trying to parse:\n$input\n\n";
$re->read(\$input);

# value() returns a reference to the parse result.
my $value = ${$re->value};
print "Output:\n".Dumper($value);
In this example we parse a few numbers separated by spaces. Let's run the code first.
You can run the code by calling it with perl.
$ perl number.pl
Trying to parse:
1 3 5 8 13 21 34 55

Output:
$VAR1 = [
          '1',
          '3',
          '5',
          '8',
          '13',
          '21',
          '34',
          '55'
        ];
The example shows which text it tries to parse followed by the array that it found.
In the next sections we'll just show the notation used by Marpa. The examples contain the full code.
:start ::= numbers
numbers ::= number+ action => ::array
number ~ [\d]+
:discard ~ ws
ws ~ [\s]+
In the specification language that Marpa uses, the :start rule specifies
which rule is the topmost rule that should match. The declaration operator ::=
separates the name of the rule, on the left side, from its specification, on
the right side. The symbols on the right side specify what the rule matches.
According to the :start declaration, Marpa will start to parse from the
numbers rule. The numbers rule contains number followed by a + operator. The +
operator lets Marpa know that we expect one or more number lexemes.
The next line specifies what a number looks like. The ~ operator separates the
name of the token from the character class on the right. We specify number to
be one or more digits. Marpa's character classes use the same syntax as
Perl's.
Then we specify a :discard rule. With this we can specify which tokens Marpa
can discard. The input language contains whitespace between the numbers. With
the ws rule we say what this whitespace looks like.
This grammar will parse input strings like the following.
123 9 45 83 1000 1001 39201
The input string could be much longer than this, as long as the numbers are separated by one or more whitespace characters. Whitespace includes spaces, tabs and newlines.
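To see the whitespace handling in action, here is a minimal sketch that feeds the grammar above an input mixing spaces, a tab and newlines. The helper name parse_numbers and the test input are assumptions made for illustration; the grammar itself is the one shown earlier.

```perl
use strict;
use warnings;
use Marpa::R2;

# Same grammar as above; the ::array action collects the number lexemes.
my $grammar = Marpa::R2::Scanless::G->new({
    source => \(<<'END_OF_SOURCE'),
:start ::= numbers
numbers ::= number+ action => ::array
number ~ [\d]+
:discard ~ ws
ws ~ [\s]+
END_OF_SOURCE
});

# Hypothetical helper: parse one input string and return an array reference.
sub parse_numbers {
    my ($input) = @_;
    my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
    $recce->read(\$input);
    my $value_ref = $recce->value;
    die "No parse found\n" if not defined $value_ref;
    return ${$value_ref};
}

# Spaces, a tab and newlines all count as discardable whitespace.
my $numbers = parse_numbers("123 9\t45\n83   1000 1001\n39201");
print join(', ', @{$numbers}), "\n";
```

Each call creates a fresh recognizer, so the same grammar object can be reused for any number of inputs.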
The rules ws and number are examples of lexical rules. A lexical rule
specifies which characters can be matched in the input string. A lexical rule
can contain character classes and quoted strings.
A character class specifies each character or group of characters that can be matched. These character classes are evaluated by Perl. This means that anything that works in a Perl character class can be used in a Marpa character class. We already saw how to match a number. This is how you could match a variable name.
identifier ~ [_a-zA-Z] id_rest
id_rest ~ [_0-9a-zA-Z]*
The rule for matching an identifier (or name) is split into two rules. The first
rule specifies the structure of the token. The first part of it matches the
first character of an identifier. It can be an underscore, a lowercase
letter or an uppercase letter. It can't be a digit, however.
The second part, id_rest, references the second rule, which specifies the
rest of the identifier. This can include digits as well. The asterisk * at the
end says that this character class can match zero or more times.
The asterisk is also the reason why we split the identifier rule into two parts. A quantifier such as the asterisk can only be used in a rule with a single item on the right-hand side.
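The two identifier rules can be tried out with a small sketch like the one below. The surrounding names rule, the whitespace rules, the helper parse_names and the sample inputs are assumptions added only to make the fragment runnable; they are not part of the identifier rules themselves.

```perl
use strict;
use warnings;
use Marpa::R2;

# The identifier rules from the text, wrapped in a rule that
# collects a whitespace-separated list of identifiers.
my $grammar = Marpa::R2::Scanless::G->new({
    source => \(<<'END_OF_SOURCE'),
:start ::= names
names ::= identifier+ action => ::array
identifier ~ [_a-zA-Z] id_rest
id_rest ~ [_0-9a-zA-Z]*
:discard ~ ws
ws ~ [\s]+
END_OF_SOURCE
});

# Hypothetical helper: returns an array reference on success and
# undef when the input cannot be read or has no parse.
sub parse_names {
    my ($input) = @_;
    my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
    my $value_ref = eval { $recce->read(\$input); $recce->value };
    return defined $value_ref ? ${$value_ref} : undef;
}

# The second input starts with a digit, which the first character
# class of the identifier rule does not allow.
for my $input ('foo _bar baz42 x', '9lives') {
    my $names = parse_names($input);
    my $result = defined $names ? join(', ', @{$names}) : 'rejected';
    print "$input => $result\n";
}
```

The eval is there because a failed read is reported as an exception; catching it lets the helper signal a rejected input with undef instead of ending the program.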
A lexical rule can also use a quoted string. A quoted string is surrounded
with single quotes ('). The text between the quotes will match as literal text.
You could use this for single characters or keywords.
kw_for ~ 'for'
kw_if ~ 'if'
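As a rough sketch of how such keyword rules behave, the grammar below accepts a whitespace-separated list of the two keywords and nothing else. The keywords and keyword rules, the whitespace rules, the helper parse_keywords and the sample inputs are assumptions made for this illustration.

```perl
use strict;
use warnings;
use Marpa::R2;

# The two quoted-string rules from the text, wrapped in rules that
# collect a list of keywords.
my $grammar = Marpa::R2::Scanless::G->new({
    source => \(<<'END_OF_SOURCE'),
:start ::= keywords
keywords ::= keyword+ action => ::array
keyword ::= kw_for action => ::first
keyword ::= kw_if action => ::first
kw_for ~ 'for'
kw_if ~ 'if'
:discard ~ ws
ws ~ [\s]+
END_OF_SOURCE
});

# Hypothetical helper: returns an array reference on success and
# undef when the input cannot be read or has no parse.
sub parse_keywords {
    my ($input) = @_;
    my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
    my $value_ref = eval { $recce->read(\$input); $recce->value };
    return defined $value_ref ? ${$value_ref} : undef;
}

# Only the literal texts 'for' and 'if' are known to this grammar.
for my $input ('if for if', 'while') {
    my $found = parse_keywords($input);
    my $result = defined $found ? join(', ', @{$found}) : 'rejected';
    print "$input => $result\n";
}
```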
In the next chapter we take a closer look at how lexical rules fit into the bigger picture.