Marpa is a library written in C, with a Perl wrapper. It is dedicated to parsing. That is, people who wish to write parsers incorporate Marpa into their programs.
You start using Marpa by providing it with 2 inputs, a grammar and an input stream which may or may not conform to the given grammar.
This means Marpa parses the grammar and stores it in RAM in some convenient form. Such a grammar specifies both what tokens are acceptable, and the order (syntax) in which they may appear.
Next, Marpa parses your input stream and tests each token it finds there against the grammar you provided. This includes determining if the token appears at an acceptable position within the grammr, and hence at an acceptable position within the input stream.
It is in fact doing exactly the same as you are doing reading this text. For us humans, the grammar (e.g. English) is pre-loaded in our minds, and the input stream is any text we are reading or hearing (for instance, text, music or noise).
We test the latter against the grammar, and declare that it either matches what we expect, or fails to match.
Upon failure, we could switch the internal grammar we're using, from say English to Spanish or Chinese. If that fails we may well give up. That is, we conclude the input stream is grammatically incorrect, i.e. it contains one of more syntax errors, or, at the worst, it is visual (textual) or audible noise.
But if it matches, we conclude the input stream was grammatically correct, or roughly so, since humans don't always speak in grammatically perfect speech of course. Indeed, in songs and advertisments, proper grammar is often noticably distorted or even absent.
Likewise, Marpa attempts to verify whether or not the input stream does in fact match the grammar.
This is described as: The input was 'recognized' (or not) by Marpa. Hence the (Perl) variable $recce in the docs.
We can go further and say that recognizing the input can be be described as: The input stream conforms to the given grammar.
If it doesn't, and we're dealing with a fixed text which we want to match, then we modify the grammar step-by-step, until it Just Works. This tells us that the point of the test data is to test the grammar.
Deciding that work on the grammar is finished is equivalent to saying:
Every grammatically correct input stream will result in Marpa returning some sort of valid result, and
Every grammatically incorrect input stream will result in Marpa returning some sort of error result.
With the advent of Marpa's Scanless Interface, or SLIF, we need to feel comfortable with both Marpa's approach and the fact that, classically, lexing and parsing were often such visibly distinct phases, perhaps even requiring 2 programs.
Marpa's SLIF follows the classical separation of lexing and parsing, but it introduces a notation which is friendly to those programmers more accustomed to regular expressions and recursive descent.
So instead of two completely separate modules, lexer and parser, each with its own interface, you have a single source file for your domain-specific language (DSL), and the lexing and parsing statements look similar, with only the different operators ("::=" versus "~") to tell them apart.
Underneath however, in the implementation, the distinction is still there, as sharp as ever.
Marpa has 3 phases: Grammar pre-processing; Recognition; and Evaluation. That is, Marpa:
Precomputes a grammar.
Recognizes an input, creating a table.
Turns that table into a tree, and evaluates it.
Events occur during recognition. Actions are executed during the evaluation. In other words, by the time the first action is executed, the last event is ancient history.
And all this takes place in a single pass of the input stream, i.e. very efficiently indeed.
Consequently, Marpa's new design completely eliminates the necessity for you to write code to perform your own lexing.
The previous section is more of a terse, descriptive explanation of Marpa's internal processes. It's reasonable to ask: Is there another way to view this?
Well, take the first of the 3 points above: (Marpa) Precomputes a grammar. What does that mean?
Consider this fragment of a grammar:
:start ::= graph_grammar
graph_grammar ::= graph_definition action => graph
graph_definition ::= node_definition
node_definition ::= node_statement
| node_statement <graph definition>
node_statement ::= node_name
| node_name attribute_definition
| node_statement (',') node_statement
This set of rules can be represented graphically as a tree, rooted at the start rule. See this SVG version, as produced by an unfinished Perl module (MarpaX::Grammar::GraphViz2) which graphs grammars. This grammar comes from the module MarpaX::Demo::StringParser.
So, Marpa creates various data structures from such a grammar during the phase called 'Precomputes a grammar'. A simplistic view of this is that a grammar expressed as a multi-line string is converted into the same grammar expressed as a tree.
For more on this, see Wikipedia's article on Abstract Syntax Trees.
And these structures combined with the input stream, become the basis for the recognition and evaluation phases.
Briefly, the parsing process converted a grammar as a tree, plus input, into another structure, representable as a
tree, which is evaluated (analyzed) as per instructions embedded in the grammar. An example of such an instruction is
given above as
action => graph, which exact effect will be discussed later.
Regular expressions are a brilliant invention, and perfectly suited for many applications.
But the requirement to process structured grammars, and solutions such as Marpa, are what distinguises parsing from lexing and regular expressions.
Lexers and regular expressions recognize strings, but do not assign any structure to them.
When captures are added to regular expressions, you begin to see bits and pieces of a structure, but regular expressions and captures have a hard time with a structure of any complexity, or when the structure is recursive, they break down completely.
You can see such a case above, in the last rule, where 'node_statement' is defined to be a series of 'node_statement's separated by commas.
Theoretical computer science tells us that grammars like Marpa model a "context-free"
class of languages (CFGs), and cover the next level of expressivity beyond regular languages
(which is what "regular expressions" got their name after). While context-free languages are still limited enough
to guarantee efficient parsing algorithms, they are expressive enough to parse e.g. all programming languages and a
rather large subset of natural languages. As a hallmark example of a context-free structure, think of balancing
opening and closing parantheses. While a CFG grammar, which can be understood by Marpa, can directly model
that is not possible when using a single regular expression.
The easiest approach is to copy a suitable grammar. If that's not possible, we must write it, bit by bit.
And this leads to developing a set of test files, which purpose is to test the grammar also bit by bit. This includes tests that are deliberately faulty. That is, some tests must cause Marpa to report a syntax error in the input stream. Whether or not the grammar or the input stream is the real culprit is of course dependent on the user's expertise.
Marpa's grammars are written in what we call a SLIF-DSL. Here, SLIF stands for Scanless Interface, and DSL is Domain-specific Language.
Many programmers will have heard of BNF. Well, Marpa's SLIF-DSL is an extended BNF. That is, it includes special tokens which only make sense within the context of a Marpa grammar. Hence the 'Domain Specific' part of the name.
In practice, this means you express your grammar in a string, and Marpa treats that as a set of rules as to how you want Marpa to treat your input stream.
Marpa's docs for this are here.
Marpa is a Perl library. We will use Marpa::R2 for the examples in this guide.
If you have Perl already installed you can install Marpa with
$ cpan Marpa::R2
$ cpanm Marpa::R2
This will install Marpa::R2.
In the rest of this guide we'll assume you have installed Marpa::R2. The examples
that end in '.pl' can be executed using