Sophie

Sophie

distrib > Mandriva > 2010.0 > i586 > media > contrib-release > by-pkgid > cd14cddf3b3ceaf1193157472227757a > files > 26

parrot-doc-1.6.0-1mdv2010.0.i586.rpm

=pod

=head1 Grammar Engine

X<Parrot Grammar Engine>
X<PGE (Parrot Grammar Engine)>
The Parrot Grammar Engine (PGE) is a parser generator, one of the key
components of the Parrot Compiler Toolkit. It reads grammar files written in
the PGE rules format and generates parser modules written in PIR code. PGE
rules provide the full power of I<recursive descent parsing> and I<operator
precedence parsing>. Fortunately, you don't need to know what those terms
mean in order to make good use of PGE. We'll introduce the necessary
concepts as we talk about various features in this chapter.

=head2 Grammars

The ultimate goal of a parser is to match patterns in a source language and
convert them to an internal data structure for later manipulations. As a
programmer, you're probably already familiar with some of these types of
patterns: function declarations, function calls, statements, and assignments.
Each of these different concepts have a particular form called a I<syntax>.
In C for example, the syntax to define a function looks something like this:

  <return_type> <function_name> ( <arguments> ) { <function_body> }

Things that fit this pattern, so long as all the sub-patterns use the proper
syntax also, are valid subroutines in C. Similarly, we can use a slightly
different pattern to create a subroutine:

  sub <function_name> { <function_body> }

A grammar is a collection of rules like the ones above that specify all the
acceptable patterns in a language. Grammars group together these rules in
much the same way that a groups together related data fields and methods
N<In languages like Perl 6 for instance, a grammar is just a special kind
of class and a rule is just a special kind of method.>. Each rule defines
a pattern for matching one unit of text, and can be made up of various other
rules which are called recursively to make a complete match.

A rule can contain regular expressions to match patterns of characters:

  rule id { \d+ }

A rule can also contain patterns of references to other rules:

  rule record { <id> <name> <phone> }

A grammar contains a group of rules that work together to match the entire
language:

  grammar Contacts;

  rule name { 'John' | 'Bob ' | 'Fred' }

  rule id   { \d+ }

  rule record { <id> <name> }

  ...

=head3 Rules and Tokens

X<rule>
X<token>
There are two different kinds of rules: C<rule>, which we saw above, and
C<token>. A C<rule> performs smart whitespace matching between the various
pieces of the pattern. The C<record> rule given previously would match
"6355 John" or "6355      John" but not "6355John".

A C<token> matches whitespace only if you specifically request it. To get the
same effect with a token, add the C<\s> (match a space character) and C<+>
(match the preceding atom -- the space character, in this case -- one or more
times) pattern to the rule:

  token record { <id> \s+ <name> }

=head3 The Start Rule

X<top>
X<top-down parser>
A recursive descent parser is what's called a I<top-down parser>. It starts
at the highest-level rule, called C<TOP>, and works its way down through
individual rules to match an entire string or file. Real Perl 6 allows any
name for the top-level rule, but PCT expects a rule called C<TOP>. If PCT
was as fully-featured as Perl 6, people would use it instead! Here's an
example of a TOP rule:

  rule TOP { <record> }

This rule matches a single C<record> pattern in a string or file. Once the
parser has succeeded in matching the entire string or file passed to the
start rule, it returns a parse tree. If it cannot match the entire input
with the rules provided, it can either return a partial match, or it can
throw a parse error.

=head3 Testing a Grammar

Let's do a small example grammar. Save this example to a file called
F<Contacts.pg>:

  grammar Contacts is PGE::Grammar;

  rule  TOP    { <record> }
  rule  record { <id> <name> }
  token name   { 'John' | 'Bob ' | 'Fred' }
  token id     { \d+ }

Then compile the grammar:

  $ B<parrot Perl6Grammar.pbc --output=Contacts.pir Contacts.pg>

=for author

Assume an installed Parrot for all examples?  Anyone working from the source
tree should be able to mangle paths appropriately.

=end for

The path to F<parrot> and to the F<Perl6Grammar.pbc> file will vary on
different systems. If you compiled Parrot from source, it will be:

  $ B<./parrot runtime/parrot/library/PGE/Perl6Grammar.pbc \>
        B<--output=Contacts.pir Contacts.pg>

Next, create a small PIR script to run your grammar. Save it as
F<grammar_test.pir>:

=begin PIR

  .sub main :main
      load_bytecode 'PGE.pbc'        # load some required modules
      load_bytecode 'dumper.pbc'
      load_bytecode 'PGE/Dumper.pbc'

      load_bytecode 'Contacts.pir'   # load your grammar

      .local string source
      source  = "3 John"

      .local pmc top, grammar, match
      top     = get_hll_global ['Contacts'], 'TOP'
      grammar = get_class 'Contacts'
      match   = top(source, 'grammar' => grammar)

      _dumper(match, "match")
  .end

=end PIR

Run the test script:

  $ B<parrot grammar_test.pir>

It will print out a text representation of the raw parse tree stored in the
C<match> variable:

  "match" => PMC 'Contacts' => "3 John" @ 0 {
      <record> => PMC 'Contacts' => "3 John" @ 0 {
          <id> => PMC 'Contacts' => "3" @ 0
          <name> => PMC 'Contacts' => "John" @ 2
      }
  }

Each node in the tree corresponds to a rule in the grammar.  The top-level
match variable contains one child named C<record>, which contains two children
named C<id> and C<name>.  C<id> contains the number 3, and C<name> contains the
string "John". This is exactly what the simple grammar should have matched.

=head2 Rule Syntax

Every language has a set of basic components (words or parts of words) and
syntax conventions for combining them. The "words" in rules are literal
characters or symbols, some X<metacharacters> metacharacters (or metasymbols),
and X<rules;escape sequences>X<escape sequences, rules> escape sequences, while
the combining syntax includes other metacharacters, X<quantifiers, rules>
X<rules;quantifiers> quantifiers, bracketing characters, and assertions.

=head3 Metacharacters

The C<.> metacharacter matches any single character, even a newline character.
The C<^> and C<$> metacharacters are zero-width matches which represent the
beginning and end of a string. They each have doubled alternates C<^^> and
C<$$> that match at the beginning and end of every (newline-delimited) line
within a string.

The C<|>, C<&>, C<\>, C<#>, and C<:=> metacharacters are all syntax structure
elements. C<|> alternates between two options. C<&> matches two patterns
simultaneously (the patterns must be the same length). C<\> turns literal
characters into metacharacters (producing escape sequences). C<#> starts a
comment which proceeds until the end of the line. You can start a comment at
any point on any line in a rule. C<:=> binds a hypothetical variable to the
result of a subrule or grouped pattern (see L<Hypothetical Variables>).

The metacharacters C<()>, C<[]>, C<{}> and C<E<lt>E<gt>> are bracketing pairs.
Bracketing pairs must always be balanced within the rule; to use a literal
character, escape it with a C<\>.  The C<()> and C<[]> pairs group patterns as
a single atom. They often capture a result, mark the boundaries of an
alternation, or mark a group of patterns with a quantifier. Parentheses C<()>
capture, but square brackets C<[]> do not. The C<{}> brackets define a section
of code (a closure) within a rule. These closures are always a successful
zero-width match. The C<E<lt>...E<gt>> brackets mark assertions, which handle a
variety of constructs including character classes and user-defined quantifiers
(see L<Assertions>).

Table 7-2 summarizes the basic metacharacters.

=begin table picture Metacharacters

Z<CHP-7-TABLE-2>

=headrow

=row

=cell Symbol

=cell Meaning

=bodyrows

=row

=cell C<.>

=cell Match any single character, including a newline.
X<. (dot);. match single character (rules)>

=row

=cell C<^>

=cell Match the beginning of a string.
X<^ (caret);^ beginning of string (rules)>

=row

=cell C<$>

=cell Match the end of a string.
X<$ (dollar sign);$ end of string (rules)>

=row

=cell C<^^>

=cell Match the beginning of a line within the string.
X<^ (caret);^^ beginning of line (rules)>

=row

=cell C<$$>

=cell Match the end of a line within the string.
X<$ (dollar sign);$$ end of line (rules)>

=row

=cell C<|>

=cell Match alternate patterns (OR).

=row

=cell C<&>

=cell Match multiple patterns (AND).

=row

=cell C<\>

=cell Escape a metacharacter to get a literal character, or escape a
literal character to get a metacharacter.
X<\ (backslash);\ escape sequences (rules)>
X<\ (backslash);\ to escape metacharacters (rules)>

=row

=cell C<#>

=cell Mark a comment (to the end of the line).

=row

=cell C<:=>

=cell Bind the result of a match to a hypothetical variable.
X<: (colon);:= (binding);in rules>

=row

=cell C<(...)>

=cell Group patterns and capture the result.

=row

=cell C<[...]>

=cell Group patterns without capturing.

=row

=cell C<{...}>

=cell Execute a closure (Perl 6 code) within a rule.

=row

=cell C<E<lt>...E<gt>>

=cell Match an assertion.

=end table

=head3 Escape Sequences

Z<CHP-7-SECT-2.2>

X<escape sequences, rules>
X<rules;escape sequences>
X<\ (backslash);\ escape sequences (rules)>

Escape sequences are literal characters acting as metacharacters.  A preceding
backslash (C<\>) identifies them as escapes. Some escape sequences represent
single characters that are difficult to represent literally, such as C<\t> for
tab, or C<\x[...]> to specify a character by its hexadecimal number.  Some
represent limited character classes, such as C<\d> for digits or C<\w> for word
characters. Some represent zero-width positions in a match, such as C<\b> for a
word boundary.

X<variable interpolation in rules>
X<rules;variable interpolation>
If you've used Perl 5 regexps, you may remember the C<\Q> escape sequence which
treats everything until the following C<\E> sequence as literal text,
containing no escape sequences.  Because ordinary variables now interpolate as
literal strings by default, the C<\Q> escape sequence is rarely needed.

A<CHP-7-TABLE-3>Table 7-3 shows the escape sequences for rules.

=begin table picture Escape sequences

Z<CHP-7-TABLE-3>

=headrow

=row

=cell Escape

=cell Meaning

=bodyrows

=row

=cell C<\0[...]>

=cell Match a character given in octal (brackets optional).

=row

=cell C<\b>

=cell Match a word boundary.

=row

=cell C<\B>

=cell Match when not on a word boundary.

=row

=cell C<\c[...]>

=cell Match a named character or control character.

=row

=cell C<\C[...]>

=cell Match any character except the bracketed named or control character.

=row

=cell C<\d>

=cell Match a digit.

=row

=cell C<\D>

=cell Match a non-digit.

=row

=cell C<\e>

=cell Match an escape character.

=row

=cell C<\E>

=cell Match anything but an escape character.

=row

=cell C<\f>

=cell Match the form feed character.

=row

=cell C<\F>

=cell Match anything but a form feed.

=row

=cell C<\n>

=cell Match a (logical) newline.

=row

=cell C<\N>

=cell Match anything but a (logical) newline.

=row

=cell C<\h>

=cell Match horizontal whitespace.

=row

=cell C<\H>

=cell Match anything but horizontal whitespace.

=row

=cell C<\L[...]>

=cell Everything within the brackets is lowercase.

=row

=cell C<\Q[...]>

=cell All metacharacters within the brackets match as literal characters.

=row

=cell C<\r>

=cell Match a return.

=row

=cell C<\R>

=cell Match anything but a return.

=row

=cell C<\s>

=cell Match any whitespace character.

=row

=cell C<\S>

=cell Match anything but whitespace.

=row

=cell C<\t>

=cell Match a tab.

=row

=cell C<\T>

=cell Match anything but a tab.

=row

=cell C<\U[...]>

=cell Everything within the brackets is uppercase.

=row

=cell C<\v>

=cell Match vertical whitespace.

=row

=cell C<\V>

=cell Match anything but vertical whitespace.

=row

=cell C<\w>

=cell Match a word character (Unicode alphanumeric characters plus the
underscore C<_>).

=row

=cell C<\W>

=cell Match anything but a word character.

=row

=cell C<\x[...]>

=cell Match a character given in hexadecimal (brackets optional).

=row

=cell C<\X[...]>

=cell Match anything but the character given in hexadecimal (brackets
optional).

=end table

=head3 Quantifiers

Z<CHP-7-SECT-2.3>

Quantifiers specify the number of times an atom (a single character,
metacharacter, escape sequence, grouped pattern, assertion, etc) will match.

X<. (dot);.. (range);quantifier (rules)>
X<. (dot);... (infinite range);quantifier (rules)>
The numeric quantifiers use assertion syntax. A single number (C<E<lt>3E<gt>>)
requires exactly that many matches. A numeric range quantifier
(C<E<lt>3C<..>5E<gt>>) succeeds if the number of matches is between the minimum
and maximum numbers, inclusive. A range with three trailing dots
(C<E<lt>2...E<gt>>) is shorthand for C<E<lt>R<n>..InfE<gt>>; it matches as many
times as possible.

Each quantifier has a minimal alternate form -- marked with a trailing C<?> --
which matches the shortest possible sequence first.  That is, given the string
C<aaaaaa>, C<aE<lt>3C<..>5E<gt>> will match C<aaaaa> and C<aE<lt>3C<..>5E<gt>?>
will match C<aaa>.

A<CHP-7-TABLE-4>Table 7-4 shows the built-in
X<quantifiers, rules> X<rules;quantifiers> quantifiers.

=begin table picture Quantifiers

Z<CHP-7-TABLE-4>

=headrow

=row

=cell Maximal

=cell Minimal

=cell Meaning

=bodyrows

=row

=cell C<*>

=cell C<*?>

=cell Match 0 or more times.

=row

=cell C<+>

=cell C<+?>

=cell Match 1 or more times.

=row

=cell C<?>

=cell C<??>

=cell Match 0 or 1 times.

=row

=cell C<E<lt>>R<n>C<E<gt>>

=cell C<E<lt>>R<n>C<E<gt>?>

=cell Match exactly R<n> times.

=row

=cell C<E<lt>>R<n>C<..>R<m>C<E<gt>>

=cell C<E<lt>>R<n>C<..>R<m>C<E<gt>?>

=cell Match at least R<n> and no more than R<m> times.

=row

=cell C<E<lt>>R<n>C<...E<gt>>

=cell C<E<lt>>R<n>C<...E<gt>?>

=cell Match at least R<n> times.

=end table

=head3 Assertions

Z<CHP-7-SECT-2.4>

X<assertions, rules>
X<rules;assertions>
An assertion states that some condition or state is true. The match fails when
that assertion is false.

X<variable interpolation in rules>
X<rules;variable interpolation>

Assertions match named and anonymous rules, arrays or hashes containing
anonymous rules, and subroutines or closures that return anonymous rules.

To interpolate a variable in assertion rules, enclose it in assertion
delimiters.
A bare scalar in a pattern
interpolates as a literal string, while a scalar variable in assertion
brackets interpolates as an anonymous rule. A bare array in a pattern
matches as a series of alternate literal strings, while an array in
assertion brackets interpolates as a series of alternate anonymous
rules.

A bare hash in a pattern matches a word (C<\w+>) if and only if that word is
one of its keysN<The effect is similar to matching the keys as a series of
alternates, but it prefers to match the longest possible key, instead of the
first potential match.>, while a hash in assertion brackets also matches the
associated value as an anonymous rule.

X<fail keyword>
A bare closure in a pattern always matches (unless it calls C<fail>), but a
closure in assertion brackets C<E<lt>{...}E<gt>> must return an anonymous rule
to match.

An assertion with parentheses C<E<lt>(...)E<gt>> resembles a bare closure in a
pattern in that it allows you to include Perl code within a rule.
C<E<lt>(...)E<gt>> evaluates the return value of the closure in boolean
context. The match succeeds or fails based on that return value.

Assertions match character classes, both named and enumerated. A named rule
character class is often more accurate than an enumerated character class. The
common C<E<lt>[a-zA-Z]E<gt>> idiom matches ASCII alphabetic characters, but the
more comprehensive built-in rule C<E<lt>alphaE<gt>> matches the full set of
Unicode alphabetic characters.

A<CHP-7-TABLE-5>Table 7-5 shows the syntax of assertions.

=begin table picture Assertions

Z<CHP-7-TABLE-5>

=headrow

=row

=cell Syntax

=cell Meaning

=bodyrows

=row

=cell C<E<lt>...E<gt>>

=cell Generic assertion delimiter.

=row

=cell C<E<lt>!...E<gt>>

=cell Negate any assertion.

=row

=cell C<E<lt>>R<name>C<E<gt>>

=cell Match a named rule or character class.

=row

=cell C<E<lt>[...]E<gt>>

=cell Match an enumerated character class.

=row

=cell C<E<lt>-...E<gt>>

=cell Complement a character class (named or enumerated).

=row

=cell C<E<lt>"..."E<gt>>

=cell Match a literal string (interpolated at match time).

=row

=cell C<E<lt>'...'E<gt>>

=cell Match a literal string (not interpolated).

=row

=cell C<E<lt>(...)E<gt>>

=cell Boolean assertion. Execute a closure and match if it returns a true
result.

=row

=cell C<E<lt>$scalarE<gt>>

=cell Match an anonymous rule.

=row

=cell C<E<lt>@arrayE<gt>>

=cell Match a series of anonymous rules as alternates.

=row

=cell C<E<lt>%hashE<gt>>

=cell Match a key from the hash, then its value (as an anonymous rule).

=row

=cell C<E<lt>E<amp>sub()E<gt>>

=cell Match an anonymous rule returned by a sub.

=row

=cell C<E<lt>{>R<code>C<}E<gt>>

=cell Match an anonymous rule returned by a closure.

=row

=cell C<E<lt>.E<gt>>

=cell Match any logical grapheme, including combining character sequences.

=end table

=head3 Modifiers

Z<CHP-7-SECT-2.5>

X<modifiers>
X<: (colon);: modifier delimiter in rules>
Modifiers alter the meaning of a pattern. The standard position for modifiers
is at the beginning of the rule, right after the C<m>, C<s>, or C<rx>, or after
the name in a named rule. Modifiers cannot attach to the outside of a bare
C</.../>. For example:

  m:i/marvin/ # case insensitive
  rule names :i { marvin | ford | arthur }

You may group single-character modifiers, but you must separate longer
modifiers by colons:

  m:wig/ zaphod /                        # OK
  m:words:ignorecase:globally / zaphod / # OK
  m:wordsignorecaseglobally / zaphod /   # Not OK

Most modifiers can also appear inside the rule when attached to rule or
grouping delimiters. Internal modifiers are lexically scoped to their enclosing
delimiters, so can alter subpatterns:

  m/:w I saw [:i zaphod] / # only 'zaphod' is case insensitive

The repetition modifiers (C<:R<N>x>, C<:R<N>th>, C<:once>, C<:globally>, and
C<:exhaustive>) and the continue modifier (C<:cont>) alter the return value of
the rule as a whole, so you cannot use them lexically inside a rule.

The C<:R<N>x> modifier matches the rule a specific number of times. If the
modifier expects more matches than the string has, the match fails.  Its
alternate form (C<:x(R<N>)>) can take a variable in place of the number.

The C<:once> modifier on a rule only allows it to match once. The rule will not
match again until the you call the C<.reset> method on the rule object.

The C<:globally> modifier matches as many times as possible. The C<:exhaustive>
modifier also matches as many times as possible, in as many different ways as
possible.

The C<:R<N>th> modifier preserves one result from a particular counted match.
If the rule matches fewer times than the modifier expects, the match fails. It
has several alternate forms. One form, C<:th(R<N>)>, takes a variable in place
of the number. The other forms -- C<:R<N>st>, C<:R<N>nd>, and C<:R<N>rd> --
allow you to write more naturally C<:1st>, C<:2nd>, C<:3rd>.  The other way is
valid as well; choose whichever is most comfortable.

By default, rules ignore literal whitespace within the pattern.  The C<:w>
modifier makes rules sensitive to literal whitespace, but in an intelligent
way. Any cluster of literal whitespace acts like an explicit C<\s+> when it
separates two identifiers and C<\s*> everywhere else.

I<No> modifiers exist to treat the matched string as a single line or multiple
lines.  Instead, use the "beginning of string" and "end of string" or
"beginning of line" and "end of line" metacharacters.

A<CHP-7-TABLE-6>Table 7-6 lists the available modifiers.

=begin table picture Modifiers

Z<CHP-7-TABLE-6>

=headrow

=row

=cell Short

=cell Long

=cell Meaning

=bodyrows

=row

=cell C<:i>

=cell C<:ignorecase>

=cell Case-insensitive match.

=row

=cell C<:I>

=cell

=cell Case-sensitive match (on by default).

=row

=cell C<:c>

=cell C<:cont>

=cell Continue where the previous match on the string left off.

=row

=cell C<:w>

=cell C<:words>

=cell Literal whitespace in the pattern matches as C<\s+>
or C<\s*>.

=row

=cell C<:W>

=cell

=cell Turn off intelligent whitespace matching (return to default).

=row

=cell

=cell :R<N>C<x>/C<:x(>R<N>C<)>

=cell Match the pattern R<N> times.

=row

=cell

=cell C<:>R<N>C<th>/C<:nth(>R<N>C<)>

=cell Match the R<N>th occurrence of a pattern.

=row

=cell

=cell C<:once>

=cell Match the pattern once and only once.

=row

=cell C<:g>

=cell C<:globally>

=cell Match the pattern as many times as possible without overlapping
possibilities.

=row

=cell C<:e>

=cell C<:exhaustive>

=cell Match every possible occurrence of a pattern, including overlapping
possibilities.

=row

=cell

=cell C<:u0>

=cell . is a byte.

=row

=cell

=cell C<:u1>

=cell . is a Unicode codepoint.

=row

=cell

=cell C<:u2>

=cell . is a Unicode grapheme.

=row

=cell

=cell C<:u3>

=cell . is language dependent.

=row

=cell

=cell C<:p5>

=cell The pattern uses Perl 5 regex syntax.

=end table

=head3 Built-in Rules

Z<CHP-7-SECT-3>

X<rules;built-in>
PGE provides several named rules, including a complete set of X<POSIX-style
classes> POSIX-style classes, and X<Unicode property classes> Unicode property
classes. The list isn't fully defined yet, but A<CHP-7-TABLE-7>Table 7-7 shows
a few you're likely to see.

The C<E<lt>nullE<gt>> rule matches a zero-width string (it always matches) and
C<E<lt>priorE<gt>> matches whatever the most recent successful rule matched.
These replace the two behaviors of X</ (slash);// invalid null pattern>
X<invalid null pattern //> the Perl 5 null pattern C<//>, which is no longer
valid syntax for rules.

=begin table picture Built-in rules

Z<CHP-7-TABLE-7>

=headrow

=row

=cell Rule

=cell Meaning

=bodyrows

=row

=cell C<E<lt>alphaE<gt>>

=cell Match a Unicode alphabetic character.

=row

=cell C<E<lt>digitE<gt>>

=cell Match a Unicode digit.

=row

=cell C<E<lt>spE<gt>>

=cell Match a single space character (the same as C<\s>).

=row

=cell C<E<lt>wsE<gt>>

=cell Match any whitespace (the same as C<\s+>).

=row

=cell C<E<lt>nullE<gt>>

=cell Match the null string.

=row

=cell C<E<lt>priorE<gt>>

=cell Match the same thing as the previous match.

=row

=cell C<E<lt>before ...E<gt>>

=cell Zero-width lookahead. Assert that the current position I<precedes> a
pattern.

=row

=cell C<E<lt>after ...E<gt>>

=cell Zero-width lookbehind. Assert that the current position I<follows> a
pattern.

=row

=cell C<E<lt>prop ...E<gt>>

=cell Match any character with the named property.

=row

=cell C<E<lt>replace(...)E<gt>>

=cell Replace everything matched so far in the rule or subrule with the
given string (under consideration).

=end table

=head3 Backtracking Control

Z<CHP-7-SECT-4>

X<backtracking controls>
X<fail keyword>
Whenever part of the pattern fails to match, PGE performs backtracking --
backing up to the previous point at which the match could succeed and trying
again.  You can explicitly trigger backtracking by calling the C<fail> function
within a closure. A<CHP-7-TABLE-8>Table 7-8 displays metacharacters and
built-in rules relevant to backtracking.

=for author

This could use an example.

=end for

=begin table picture Backtracking controls

Z<CHP-7-TABLE-8>

=headrow

=row

=cell Operator

=cell Meaning

=bodyrows

=row

=cell C<:>

=cell Don't retry the previous atom.  Instead, fail to the next earlier atom.
X<: (colon);: fail to atom before last (rules)>
X<backtracking controls;: fail to atom before last>

=row

=cell C<::>

=cell Don't backtrack over this point. Instead fail out of the closest
enclosing group (C<(...)>, C<[...]>, or the rule delimiters).
X<: (colon);:: fail out of group (rules)>
X<backtracking controls;: fail out of group>

=row

=cell C<:::>

=cell Don't backtrack over this point.  Instead, fail out of the current rule
or subrule.
X<: (colon);::: fail out of rule (rules)>
X<backtracking controls;: fail out of rule>

=row

=cell C<E<lt>commitE<gt>>

=cell Don't backtrack over this point. Instead, fail out of the entire match
(even from within a subrule).

=row

=cell C<E<lt>cutE<gt>>

=cell Like C<E<lt>commitE<gt>>, but also cuts the string matched. The current
matching position at this point becomes the new beginning of the string.

=end table

=head3 Calling Actions

Once the parser has matched the entire input N<a source code file, or a line of
input at the terminal in interactive mode> the parse has succeeded.  The
generated AST is now available to the code generator for conversion into PIR.

=for author

Please review.  The forward declaration is awkward here, but a little bit of
explanation might ameliorate this.

=end for

This AST gets built up by actions -- code snippets attached to rules and
tokens.  To call an action, insert the C<{*}> token into the rule. When PGE
encounters C<{*}>, it will call the associated action method with the current
match object as an argument.

The best way to demonstrate this is by example.  Sprinkle the C<persons_name>
rule liberally with action calls:

 rule persons_name {
    {*} <first_name> {*} <last_name> {*}
 }

The first call to the action method contains an empty match object because the
parser hasn't matched anything yet.  The second call contains only the first
name of the match. The third and final call contains both the matched first and
last name.

If the match fails halfway through, PGE will still call the actions that have
succeeded; it will not call the actions after the failure.  If you try to match
the string "Leia", PGE will call the first two action methods.  When the rule
tries to match the last name, it fails, and PGE will not call the third action
method.

=head3 Alternations and Keys

In addition to sub-rules, groups, and quantifiers, you can also express
either-or alternations between options. The vertical bar token (C<|>)
distinguishes between options where only one may match:

 rule hero {
    ['Luke' | 'Leia'] 'Skywalker'
 }

This rule will match either "Luke Skywalker" or "Leia Skywalker" but won't
match "Luke Leia Skywalker"N<nor anything else.>.  Given alternations and
action methods, it's often important to distinguish which alternation matched:

 rule hero {
    [
      'Luke' {*}    #= Luke
    | 'Leia' {*}    #= Leia
    ]
    'Skywalker'
 }

This is the same rule, except now it passes two arguments to its action method:
the match object and the name of the person who matched.

=head3 Warning: Left Recursion

If you've worked with parsers before, you may have seen this coming.  If not,
don't fear.  Like functions in ordinary procedural or functional languages, the
methods in the PGE parser grammar can call themselves recursively.  Consider
some rules derived in part from the grammar for the C programming language:

 rule if_statement {
    'if' <condition> '{' <statement>* '}' <else_block>?
 }

 rule statement {
    <if_statement> | <expression>
 }

 rule else_block {
    'else' '{' <statements>* '}'
 }

An C<if_statement> can contain a list of C<statement>s, and that each statement
may itself be an C<if_statement>.  This is I<recursion> X<Recursion>; it's one
of the reasons PGE is a "Recursive descent" parser.

Consider the more direct example of a comma-separated list of integer digits
which form a list.  A recursive definition might be:

 rule list {
     <list> ',' <digit> | <digit>
 }

If there is only one digit, the second option in the alternation matches.  If
there are multiple digits, recursion will match them through the first
alternation.

That's the intention.  The results are insidious.

The recursive descent parser enters the C<list> rule. Its first option is to
enter the list rule again, so it does.  Recursive descent is a X<depth-first
algorithm> depth-first algorithm; PGE will continue to descend down a
particular path until it finds a successful match or a match failure. In this
case, it matches C<list>, then it matches C<list> again, then it matches
C<list> again, and so on.  This rule forms an infinite loop -- a pattern called
X<left recursion> I<left recursion>.  The problem is that the left-most item of
the left-most alternation is itself a recursion.

The rule above does not recurse infinitely when rewritten as:

 rule list {
    <digit> | <list> ',' <digit>
 }

... or even:

 rule list {
    <digit> ',' <list> | <digit>
 }

Both options ensure that the left-most item in the rule is recursive.

Left recursion may be trickier.  It's not immediately obvious in this grammar:

 rule term {
    <expression> '*' <term> | <digit>
 }

 rule expression {
    <term> '+' <expression> | <term>
 }

Even this common, limited subset of mathematical equations has the same
problem.  To match a C<term>, the parser first tries to match an C<expression>,
which in turn matches a C<term> and then an C<expression> ....

Again, the solution is simple.  Rewrite at least one of the rules so that the
first condition it tries to match is not itself a recursive situation.

=head3 Operator Precedence Parser

Recursive descent parsing can be inefficient where statements have lots of
little tokens and many possible options to match.  For example, mathematical
expressions are very open-ended, with many valid forms which are difficult to
anticipate.  Consider the expression:

 a + b * c + d

A recursive descent parser will undergo significant trial and error to parse
this statement.  Recursive descent parsing is not ideal for these situations.
Instead, a type of bottom-up parser called an I<operator precedence> X<Parser,
Operator precedence> parser is much better.

=for author

Is this a categorization of all opps or just PGE's opp?

=end for

Operator precedence parsers work similarly to more versatile bottom-up parsers
such as Lex or Yacc, but are optimized for use with expressions and equations.
Equations have two subtypes, I<terms> and I<operators>. Operators themselves
have several subtypes, including prefix (C<-a>), postfix (C<i++>), infix (C<x +
y>), circumfix (C<[z]>), postcircumfix (C<a[b]>), and list (C<1, 2, 3>). Each
operator gets its own precedence number that specifies how closely it binds to
the terms. The previous example should parse as:

 a + (b * c) + d

... because the C<*> operator has a higher precedence -- binding more tightly
to its terms -- than the C<+> operator.

Within a grammar, switch from the top-down recursive descent parser to the
bottom-up operator precedence parser with an C<optable> X<Parser, optable>
rule:

 rule expression is optable { ... }

The C<...> ellipsis isn't an editorial shortcut, it's the Perl 6 operator to to
define a function signature. The C<...> indicates that this is just a
signature; the actual implementation is elsewhere.  In this case, that location
in the definition of the optable.

=head3 Protofunction Definitions

X<Protofunctions>

Protofunctions define operators in the optable in the same way that rules and
tokens make up the grammar. A proto declares a rule, defined elsewhere, which
other code may override dynamically.  In this case, PCT takes information from
the proto declaration and fills in the details. The "dynamic overriding"
implies that a high-level language itself itself can modify its own grammar at
run time, by overriding the proto definitions for its operator table. Some
languages call this process X<operator overloading> I<operator overloading>.

A proto definition resembles:

 'proto' <proto_name> [ 'is' <property> ] '{' '...' '}'

The name of the operator, noted as C<< <proto_name> >>, contains both a
location part and an identifier part. The location is the type of the operator,
such as infix, postfix, prefix, circumfix, and postcircumfix. The name of the
operator is the symbol used for the operator in any of the quotes that Perl 6
understands:

 proto infix:<+>                  # a + b
 proto postfix:'--'               # i--
 proto circumfix:«<>»             # <x>

The C<is> X<Parser, is> keyword defines a property of the rule. Examples
include:

 is precedence(1)     # Specifies an exact precedence
 is equiv('+')        # Has the same precedence as the "+" operator
 is assoc('right')    # Right associative. May also be "left" or "list"
 is pirop('add')      # Operands are passed to the PIR operator "and"
 is subname('mySub')  # Operands are passed to the function "mySub"
 is pasttype('if')    # Operands are passed as children to an "if" PAST node in
                      # the parse tree
 is parsed(&myRule)   # The token is parsed and identified using the rule
                      # "myRule" from the top-down parser

=for author

Please review.

=end for

Protofunction definitions are function signatures; you can override them with
multimethod dispatch. This means that you can write functions I<with the same
name> as the rule to implement the behavior of the operator.  Here's a proto:

 rule infix:"+" { ... }

... and its corresponding PIR rule:

=begin PIR

 .sub 'infix:+'
    .param pmc a
    .param pmc b
    .local pmc c
    c = a + b
    .return(c)
 .end

=end PIR

You may ask "Why have an C<is subname()> property, if you can define all
operators as subroutines?" Using the C<is subname()> property allows PCT to
call a subroutine of a different name then the operator.  This is a good idea
if there is already a built-in function in the language that duplicates the
functionality of the operator.  There is no sense in duplicating behavior.

The great thing about protos being overloadable is that you can specify
different functions to call with different signatures:

=begin PIR

 .sub 'infix:+' :multi('Integer', 'Integer')
    #...
 .end

 .sub 'infix:+' :multi('CLispRatio', 'Number')
    #...
 .end

 .sub 'infix:+' :multi('Perl6Double', 'PythonInteger')
    #...
 .end

=end PIR

This list can be a bit intimidating, and it's hard to imagine that it would be
necessary to write up a new function to handle addition between every
conceivable pair of operands. Fortunately, this is rarely the case in Parrot,
because all these data types support common the VTABLE interface. For most data
types Parrot already has basic arithmetic operations built in, and it's only
necessary to override for those data types with special needs.

=head3 Hypothetical Variables

Z<CHP-7-SECT-5>

X<variables;hypothetical>
X<hypothetical variables>
X<rules;captures>
Hypothetical variables are a powerful way of building up data structures from
within a match. Ordinary captures with C<()> store the result of the captures
in C<$1>, C<$2>, etc. PGE stores values in these variables if the match is
successful, but throws them away if the match fails.  The numbered capture
variables are accessible outside the match, but only within the immediate
surrounding lexical scope:

  "Zaphod Beeblebrox" ~~ m:w/ (\w+) (\w+) /;

  print $1; # prints Zaphod

You can also capture into any user-defined variable with the binding operator
C<:=> -- I<if> you have declared these variables in a lexical scope enclosing
the rule:

  my $person;
  "Zaphod's just this guy." ~~ / ^ $person := (\w+) /;
  print $person; # prints Zaphod

You may capture repeated matches into an array:

  my @words;
  "feefifofum" ~~ / @words := (f<-[f]>+)* /;
  # @words contains ("fee", "fi", "fo", "fum")

You may capture pairs of repeated matches into a hash:

  my %customers;
  $records ~~ m:w/ %customers := [ E<lt>idE<gt> = E<lt>nameE<gt> \n]* /;

If you don't need the captured value outside the rule, use a C<$?> variable
instead. These are only directly accessible within the rule:

  "Zaphod saw Zaphod" ~~ m:w/ $?name := (\w+) \w+ $?name/;

A match of a named rule stores the result in a C<$?> variable with the same
name as the rule. These variables are also accessible only within the rule:

  "Zaphod saw Zaphod" ~~ m:w/ E<lt>nameE<gt> \w+ $?name /;

=for author

This next paragraph feels out of place; is there more?

=end for

When a rule matches a sequence of input tokens, PCT calls an associated method
within NQP to convert that match into an AST node, which it inserts into the
I<parse tree>.

=head3 Basic Rules

Consider the simple example rule:

 rule persons_name {
    <first_name> <last_name>
 }

... and two example tokens:

 token first_name { <alpha>+ }
 token last_name  { <alpha>+ }

The special token C<< <alpha> >> is a built-in construct that only accepts
upper case and lower case letters. The C<+> after the C<< <alpha> >> tag is a
short way of saying "one or more". The rule will match names like C<Darth
Vader>N<It also matches many strings that I<aren't> real names>, but won't
match something like C<C 3P0>.

This rule I<will> match C<Jar Jar Binks>, but not as you might expect: way you
would expect: It would match the first "Jar" as C<< <first_name> >>, the second
"Jar" as C<< <last_name> >>, and ignore "Binks"N<You should ignore the whole
thing.>.

=for author

The rest seems vestigial.  An example like this should precede the rest of the
chapter.  There are forward references, but it's a decent overview for people
who haven't used similar systems before -- if you avoid going out in the weeds.

=end for

this example shows another new construct, the square brackets. Square
brackets are ways to group things together. The star at the end means
that we take all the things inside the brackets zero or more times.
This is similar to the plus, except the plus matches one or more times.
Notice, however, that the above rule always matches a comma at the end,
so we would need to have something like:

 Darth Vader, Luke Skywalker,

Instead of something more natural like:

 Darth Vader, Luke Skywalker

We can modify the rule a little bit so that it always ends with a name
instead of a comma:

 rule TOP {
    [ <persons_name> ',' ]* <persons_name>
 }

Now we don't need a trailing comma, but at the same time we can't match
an empty file because it always expects to have at least one name at the
end. If we still want to match empty files successfully, we need to make
the whole rule optional:

 rule TOP {
    [ [ <persons_name> ',' ]* <persons_name> ]?
 }

We've grouped the whole rule together in another set of brackets, and
put a "?" question mark at the end. The question mark means zero or
one of the prior item.

The symbols "*" (zero or more), "+" (one or more) and "?" are called
I<quantifiers>, and allow an item in the rule to match a variable
number of times. These aren't the only quantifiers, but they are the
most common. We will talk about other quantifiers later on.

=cut

# Local variables:
#   c-file-style: "parrot"
# End:
# vim: expandtab shiftwidth=4: