In the following, the different concepts and syntactic constructs of the grammar language are explained.
The first line in every grammar ...
grammar my.pack.SecretCompartments with org.eclipse.xtext.common.Terminals
declares the name of the grammar. Xtext leverages Java’s classpath mechanism. This means that the name can be any valid Java qualified name. The file name needs to correspond to the grammar name and have the file extension '.xtext'. This means that the file has to be named SecretCompartments.xtext and must be placed in a package my.pack somewhere on your project’s classpath.
The first line is also used to declare any used grammars (for details on this mechanism see Grammar Mixins).
Xtext parsers create in-memory object graphs while parsing text. Such object-graphs are instances of EMF Ecore models. An Ecore model basically consists of an EPackage containing EClasses, EDataTypes and EEnums (See the section on EMF for more details). Xtext can infer Ecore models from a grammar (see Ecore model inference) but it is also possible to reuse existing Ecore models. You can even mix this and use multiple existing Ecore models and infer some others from one grammar.
The easiest way to get started is to let Xtext infer the Ecore model from your grammar. This is what is done in the secret compartment example. To do so just state:
generate secretcompartment 'http://www.eclipse.org/secretcompartment'
That statement means: generate an EPackage with the name “secretcompartment” and the nsURI “http://www.eclipse.org/secretcompartment”. These are in fact the only properties required to create an EPackage. Xtext will then add EClasses with properties (EAttributes and EReferences) for the different rules, as described in Ecore model inference.
If you already have an existing EPackage, you can import it using either a namespace URI or a resource URI. A URI (Uniform Resource Identifier) provides a simple and extensible means for identifying an abstract or physical resource. For all the nifty details about EMF URIs please see its documentation.
In order to import an existing Ecore model, you need the *.ecore file describing the EPackage you want to use somewhere in your workspace. To refer to that file you make use of the platform:/resource scheme. Platform URIs are a special EMF concept that allows referencing elements in the workspace independently of the workspace’s location.
An import statement referring to an Ecore file by a platform:/resource URI looks like this:
import 'platform:/resource/my.project/src/my/pack/SecretCompartments.ecore'
If you want to mix generated and imported Ecore models you’ll have to configure the generator fragment in your MWE file responsible for generating the Ecore classes (EcoreGeneratorFragment) with resource URIs that point to the generator models of the referenced Ecore models.
The *.genmodel provides all kinds of generator configuration used by EMF’s code generator. Xtext will automatically create a *.genmodel for generated Ecore models, but if such a model references an existing, imported Ecore model, the code generator needs to know how that model’s code was generated in order to generate valid Java references for the new Ecore model.
Example:
fragment = org.eclipse.xtext.generator.ecore.EcoreGeneratorFragment {
    genModels =
        "platform:/resource/my.project/src/my/pack/SecretCompartments.genmodel"
}
We like to leverage Java’s classpath mechanism because, besides being well understood and well designed, it allows us to package libraries as JARs. If you want to reference an *.ecore file that is contained in a JAR, you can make use of the classpath scheme we’ve introduced. For instance, if you want to reference Java elements, you can use the JvmTypes Ecore model which is shipped as part of Xtext.
Example:
import 'classpath:/model/JvmTypes.ecore' as types
As with platform resource URIs you’ll also have to tell the generator where the corresponding *.genmodel can be found:
fragment = org.eclipse.xtext.generator.ecore.EcoreGeneratorFragment {
    genModels =
        "classpath:/model/JvmTypes.genmodel"
}
See the section on Referring Java Types for a full explanation of this useful feature.
You can also use the nsURI in order to import an existing EPackage. Note that this is generally not preferable, because you need to have the corresponding EPackage installed in the workbench. There is essentially one exception, namely Ecore itself. So if you refer to Ecore, it is best to use its nsURI:
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
If you want to use multiple EPackages you need to specify aliases in the following way:
generate secretcompartment 'http://www.eclipse.org/secretcompartment'
import 'http://www.eclipse.org/anotherPackage' as another
When referring to a type somewhere in the grammar you need to qualify the reference using that alias (e.g. another::CoolType). We’ll see later where such type references occur.
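As a sketch (CoolType is a made-up EClass assumed to exist in the imported package, and MyRule is a made-up rule), a rule using an alias-qualified type reference in its return type could look like this:

```
MyRule returns another::CoolType :
    'cool' name=ID
;
```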
It is also supported to put multiple EPackage imports behind one alias. This is no problem as long as no two EClassifiers have the same name; otherwise neither of them can be referenced. It is even possible to import multiple Ecore models and generate another one, all declared under the same alias. If you do so, a reference to an EClassifier is first looked up in the imported EPackages before it is assumed that a type needs to be generated into the to-be-generated package.
Note that using this feature is not recommended, because it might cause problems which are hard to track down. For instance, a reference to classA would be linked to a newly created EClass rather than to the existing one, because the corresponding type in http://www.eclipse.org/packContainingClassA is spelled with a capital letter (ClassA).
Basically, parsing can be separated into the following phases:
lexing
parsing
linking
validation
In the first stage, called lexing, a sequence of characters (the text input) is transformed into a sequence of so-called tokens. In this context, a token is a strongly typed part of the input sequence. It consists of one or more characters and is matched by a particular terminal rule or keyword, and therefore represents an atomic symbol. Terminal rules are also referred to as token rules or lexer rules. There is an informal naming convention that names of terminal rules are all upper-case.
In the secret compartments example there are no explicitly defined terminal rules, since it only uses the ID rule which is inherited from the grammar org.eclipse.xtext.common.Terminals (cf. Grammar Mixins). Therein the ID rule is defined as follows:
terminal ID :
('^')?('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
It says that an ID token starts with an optional '^' character (caret), followed by a letter ('a'..'z'|'A'..'Z') or underscore ('_'), followed by any number of letters, underscores and digits ('0'..'9').
The caret is used to escape an identifier for cases where there are conflicts with keywords. It is removed by the ID rule’s ValueConverter.
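For example, in the secret compartments language an identifier that collides with the keyword state can still be used as a state name by escaping it (a made-up snippet):

```
state ^state
end
```

The value converter strips the caret, so the resulting State object is simply named state.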
This is the formal definition of terminal rules:
TerminalRule :
'terminal' name=ID ('returns' type=TypeRef)? ':'
alternatives=TerminalAlternatives ';'
;
Note that the order of terminal rules is crucial for your grammar, as they may shadow each other. This is especially important for newly introduced rules in combination with rules inherited from used grammars.
If you for instance want to add a rule to allow fully qualified names in addition to simple IDs, you should implement it as a data type rule, instead of adding another terminal rule.
A terminal rule returns a value, which is a string (type ecore::EString) by default. However, if you want to have a different type you can specify it. For instance, the rule INT is defined as:
terminal INT returns ecore::EInt :
('0'..'9')+;
This means that the terminal rule INT returns instances of ecore::EInt. It is possible to define any kind of data type here, which just needs to be an instance of ecore::EDataType. In order to tell the parser how to convert the parsed string to a value of the declared data type, you need to provide your own implementation of IValueConverterService (cf. value converters). The value converter is also the point where you can remove things like quotes from string literals or the caret (‘^’) from identifiers. Its implementation needs to be registered as a service (cf. Service Framework).
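To make the conversion concrete, here is a minimal, self-contained sketch of such a converter’s logic in plain Java. It deliberately does not use Xtext’s actual IValueConverterService interface; the class name, the sample keyword set and both method names are made up for this illustration:

```java
public class IdValueConverter {

    // Sample keyword set; a real converter would ask the grammar
    // for the language's actual keywords.
    private static final java.util.Set<String> KEYWORDS =
            java.util.Set.of("state", "end", "actions");

    // Parsed token text -> model value: strip the escape caret
    // that allows keywords to be used as identifiers.
    public static String toValue(String tokenText) {
        if (tokenText != null && tokenText.startsWith("^")) {
            return tokenText.substring(1);
        }
        return tokenText;
    }

    // Model value -> concrete syntax: re-add the caret only if
    // the value would otherwise clash with a keyword.
    public static String toText(String value) {
        return KEYWORDS.contains(value) ? "^" + value : value;
    }
}
```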
Terminal rules are described using “Extended Backus-Naur Form”-like (EBNF) expressions. The different expressions are described in the following. The one thing all of these expressions have in common is the cardinality operator. There are four different possible cardinalities:
exactly one (the default, no operator)
one or none (operator ?)
any (zero or more, operator *)
one or more (operator +)
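As a hedged illustration (this DECIMAL rule is made up and not part of org.eclipse.xtext.common.Terminals), a single terminal rule can combine all three operators: an optional minus sign (?), one or more digits (+), and an optional fraction part with any number of digits (*):

```
terminal DECIMAL :
    '-'? ('0'..'9')+ ('.' ('0'..'9')*)?
;
```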
Keywords are a kind of terminal rule literal. The ID rule in org.eclipse.xtext.common.Terminals, for instance, starts with a keyword:
terminal ID : '^'? .... ;
The question mark sets the cardinality to “none or one” (i.e. optional), as explained above.
Note that a keyword can have any length and contain arbitrary characters.
The following standard Java notations for special characters are allowed: \n, \r, \t, \b, and \f. We currently do not support quoted unicode character notation, such as \u0123.
A character range can be declared using the ‘..’ operator.
Example:
terminal INT returns ecore::EInt: ('0'..'9')+;
In this case an INT consists of one or more (note the + operator) characters between (and including) '0' and '9'.
If you want to allow any character you can simply write the wildcard operator '.' (dot). Example:
terminal FOO : 'f' . 'o';
The rule above would allow expressions like ‘foo’, ‘f0o’ or even ‘f\no’.
With the until token it is possible to state that everything should be consumed until a certain token occurs. The multi-line comment is implemented this way:
terminal ML_COMMENT : '/*' -> '*/';
This is the rule for Java-style comments that begin with '/*' and end with '*/'.
All the tokens explained above can be inverted using a preceding exclamation mark:
terminal BETWEEN_HASHES : '#' (!'#')* '#';
Rules can refer to other rules. This is done by writing the name of the rule to be called. We refer to this as rule calls. Rule calls in terminal rules can only point to terminal rules.
Example:
terminal DOUBLE : INT '.' INT;
Alternatives are used to state several different valid options. For instance, the whitespace rule uses alternatives like this:
terminal WS : (' '|'\t'|'\r'|'\n')+;
That is, a WS can be made of one or more whitespace characters (including ' ', '\t', '\r', '\n').
The parser reads a sequence of terminals and walks through the parser rules. Hence a parser rule (contrary to a terminal rule) does not produce a single terminal token but a tree of non-terminal and terminal tokens. This leads to a so-called parse tree (in Xtext also referred to as the node model). Furthermore, parser rules act as a kind of building plan for the creation of the EObjects that form the semantic model (the linked abstract syntax graph, or AST). Due to this fact, parser rules are also called production rules. The different constructs like actions and assignments are used to derive types and initialize the semantic objects accordingly.
Not all the expressions that are available in terminal rules can be used in parser rules. Character ranges, wildcards, the until token and the negation are only available for terminal rules.
The elements that are available in parser rules as well as in terminal rules are keywords and rule calls. In addition to these elements, there are some expressions used to direct how the AST is constructed, which are listed and explained in the following.
Assignments are used to assign the parsed information to a feature of the current object. The type of the current object, its EClass, is specified by the return type of the parser rule. If it is not explicitly stated it is implied that the type’s name equals the rule’s name. The type of the feature is inferred from the right hand side of the assignment.
Example:
State :
'state' name=ID
('actions' '{' (actions+=[Command])+ '}')?
(transitions+=Transition)*
'end'
;
The syntactic declaration for states in the state machine example starts with the keyword state, followed by an assignment:
name=ID
The left hand side refers to a feature 'name' of the current object (which has the EClass 'State' in this case). The right hand side can be a rule call, a keyword, a cross-reference (explained later) or even an alternative comprised by the former. The type of the feature needs to be compatible with the type of the expression on the right. As ID returns an EString in this case, the feature 'name' needs to be of type EString as well.
Assignment Operators
There are three different assignment operators, each with different semantics.
The simple equals sign '=' is the straightforward assignment, used for features which take only one element.
The '+=' sign (the add operator) expects a multi-valued (list) feature and adds the value on the right-hand side to it.
The '?=' sign (the boolean assignment operator) expects a feature of type EBoolean and sets it to true if the right-hand side was consumed, regardless of its concrete value.
The assignment operator used does not affect the cardinality of the expected symbols on the right-hand side.
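As a combined, made-up example using all three operators (the rule Feature is assumed to be defined elsewhere):

```
Entity :
    abstract?='abstract'? 'entity' name=ID '{'
        (features+=Feature)*
    '}'
;
```

Here abstract is a boolean feature set to true if the optional keyword 'abstract' was consumed, name is a single-valued string feature, and features is a list feature collecting any number of Feature instances.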
A unique feature of Xtext is the ability to declare crosslinks in the grammar. In traditional compiler construction, crosslinks are not established during parsing but in a later linking phase. This is the same in Xtext, but Xtext additionally allows you to specify crosslink information in the grammar. This information is used by the linker. The syntax for crosslinks is:
CrossReference :
'[' type=TypeRef ('|' ^terminal=CrossReferenceableTerminal )? ']'
;
For example, the transition is made up of two cross-references, pointing to an event and a state:
Transition :
event=[Event] '=>' state=[State]
;
It is important to understand that the text between the square brackets does not refer to another rule, but to a type! This is sometimes confusing, because one usually uses the same name for the rules and the returned types. That is, if we had named the type for events differently, as in the following, the cross-reference would need to be adapted as well:
Transition :
event=[MyEvent] '=>' state=[State]
;
Event returns MyEvent : ....;
Looking at the syntax definition of cross-references, there is an optional part starting with a vertical bar (pipe), followed by CrossReferenceableTerminal. This part describes the concrete syntax from which the crosslink will later be established. If the terminal is omitted, ID is assumed.
You may even use alternatives as the referenceable terminal. This way either an ID or a STRING may be used to refer to the target, as is possible in many SQL dialects:
TableRef: table=[Table|(ID|STRING)];
Have a look at the linking section in order to understand how linking is done.
The elements of an unordered group can occur in any order but each element must appear once. Unordered groups are separated with '&', e.g.
Modifier:
static?='static'? & final?='final'? & visibility=Visibility;
enum Visibility:
PUBLIC='public' | PRIVATE='private' | PROTECTED='protected';
allows
public static final
static protected
final private static
public
but not
static final static // ERROR: static appears twice
public static final private // ERROR: visibility appears twice
final // ERROR: visibility is missing
Note that if you want an element of an unordered group to appear once or not at all, you have to choose a cardinality of '?'. In the example, the visibility is mandatory, while 'static' and 'final' are optional. Elements with a cardinality of '*' or '+' have to appear continuously, without interruption, i.e.
Rule:
values+=INT* & name=ID;
will parse
0 8 15 x
x 0 8 15
but not
0 x 8 15 // wrong, as values cannot be interrupted.
By default, the object to be returned by a parser rule is created lazily on the first assignment. The type of the EObject to be created is determined from the specified return type, or from the rule name if no explicit return type is specified. With actions, however, the creation of the returned EObject can be made explicit. Xtext supports two kinds of actions:
simple actions, and
assigned actions.
If at some point you want to enforce the creation of a specific type you can use alternatives or simple actions. In the following example TypeB must be a subtype of TypeA. An expression A ident should create an instance of TypeA, whereas B ident should instantiate TypeB.
Example with alternatives:
MyRule returns TypeA :
"A" name=ID |
MyOtherRule
;
MyOtherRule returns TypeB :
"B" name = ID
;
Example with simple actions:
MyRule returns TypeA :
"A" name=ID |
"B" {TypeB} name=ID
;
Generally speaking, the instance is created as soon as the parser hits the first assignment. However, actions allow you to explicitly instantiate any EObject. The notation {TypeB} creates an instance of TypeB and assigns it to the result of the parser rule. This allows parser rules without any assignment, and object creation without the need to introduce unnecessary rules.
We previously explained that the EObject to be returned is created lazily when the first assignment occurs or when a simple action is evaluated. There is another way to set the EObject to be returned, which we call an “unassigned rule call”.
Unassigned rule calls are (as the name suggests) calls to other parser rules which are not used within an assignment. Since there is no feature the returned value could be assigned to, the value is assigned to the “to-be-returned” result of the calling rule.
With unassigned rule calls one can, for instance, create rules which just dispatch between several other rules:
AbstractToken :
TokenA |
TokenB |
TokenC
;
As AbstractToken could possibly return an instance of TokenA, TokenB or TokenC, its type must be a supertype of these types. It is, for instance, also possible to further change the state of the returned AST element by means of additional assignments.
Example:
AbstractToken :
( TokenA |
TokenB |
TokenC ) (cardinality=('?'|'+'|'*'))?
;
This way the cardinality is optional (note the last question mark) and can be represented by a question mark, a plus, or an asterisk. It will be assigned to an EObject of type TokenA, TokenB or TokenC, which are all subtypes of AbstractToken. The rule in this example will never create an instance of AbstractToken directly, as long as the preceding TokenX rule call returns an element.
LL parsing has some significant advantages over LR algorithms. The most important ones for Xtext are that the generated code is much simpler to understand and debug, and that it is easier to recover from errors. ANTLR in particular has a very nice generic error-recovery mechanism, which allows constructing an AST even if there are syntactic errors in the text. Without error recovery, you wouldn’t get any of the nice IDE features as soon as there was a single small error.
However, LL also has some drawbacks. The most important one is that it does not allow left recursive grammars. For instance, the following is not allowed in LL-based grammars, because Expression ‘+’ Expression is left recursive:
Expression :
Expression '+' Expression |
'(' Expression ')' |
INT
;
Instead, one has to rewrite such rules by “left-factoring” them:
Expression :
TerminalExpression ('+' TerminalExpression)?
;
TerminalExpression :
'(' Expression ')' |
INT
;
In practice this is always the same pattern and therefore not that problematic. However, by simply applying the Xtext AST construction features we’ve covered so far, a grammar ...
Expression :
{Operation} left=TerminalExpression (op='+' right=TerminalExpression)?
;
TerminalExpression returns Expression:
'(' Expression ')' |
{IntLiteral} value=INT
;
... would result in unwanted elements in the AST. For instance the expression (42) would result in a tree like this:
Operation {
left=Operation {
left=IntLiteral {
value=42
}
}
}
Typically one would only want to have one instance of IntLiteral instead.
One can solve this problem using a combination of unassigned rule calls and assigned actions:
Expression :
TerminalExpression ({Operation.left=current}
op='+' right=Expression)?
;
TerminalExpression returns Expression:
'(' Expression ')' |
{IntLiteral} value=INT
;
In the example above, {Operation.left=current} is a so-called tree rewrite action, which creates a new instance of the stated EClass (Operation in this case) and assigns the element currently to be returned (the current variable) to a feature of the newly created object (in this case the feature left of the Operation instance). In Java, these semantics could be expressed as:
Operation temp = new Operation();
temp.setLeft(current);
current = temp;
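With the rewritten grammar, the expression (42) now yields just the single IntLiteral one would expect, and an expression like (42) + 1 produces exactly one Operation wrapping its two operands (sketched in the same tree notation as above):

```
Operation {
    left=IntLiteral { value=42 }
    op='+'
    right=IntLiteral { value=1 }
}
```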
Because parser rules describe not a single token but a sequence of patterns in the input, it is necessary to define which parts of the input are relevant. Xtext introduces the concept of hidden tokens to handle semantically unimportant things like whitespace, comments, etc. in the input sequence gracefully. It is possible to define a set of terminal symbols that are hidden from the parser rules and automatically skipped when they are recognized. They are nevertheless transparently woven into the node model, but are not relevant for the semantic model.
Hidden terminals may (or may not) appear between any other terminals in any cardinality. They can be described per rule or for the whole grammar. When reusing a single grammar its definition of hidden tokens is reused as well. The grammar org.eclipse.xtext.common.Terminals comes with a reasonable default and hides all comments and whitespace from the parser rules.
If a rule defines hidden symbols, you can think of a kind of scope that is automatically introduced. Any rule that is called from the declaring rule uses the same hidden terminals as the calling rule, unless it defines other hidden tokens itself.
Person hidden(WS, ML_COMMENT, SL_COMMENT):
name=Fullname age=INT ';'
;
Fullname:
(firstname=ID)? lastname=ID
;
The sample rule Person defines multi-line comments (ML_COMMENT), single-line comments (SL_COMMENT) and whitespace (WS) to be allowed between the Fullname and the age. Because the rule Fullname does not introduce another set of hidden terminals, it allows the same symbols to appear between firstname and lastname as the calling rule Person. Thus, the following input is perfectly valid for the given grammar snippet:
John /* comment */ Smith // line comment
/* comment */
42 ; // line comment
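Conversely, a called rule can redefine the hidden tokens for its own scope. In this hypothetical variant of Fullname, only whitespace (but no comments) is allowed between the parts of the name:

```
Fullname hidden(WS):
    (firstname=ID)? lastname=ID
;
```

With this definition, “John /* comment */ Smith” would no longer be valid, while comments around the age are still accepted by the calling rule Person.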
A list of all default terminals like WS can be found in section Grammar Mixins.
Data type rules are parsing-phase rules, which create instances of EDataType instead of EClass. Thinking about it, one may discover that they are quite similar to terminal rules. However, the nice thing about data type rules is that they are actually parser rules and are therefore
context sensitive and
allow for use of hidden tokens.
If you, for instance, want to define a rule to consume Java-like qualified names (e.g. “foo.bar.Baz”) you could write:
QualifiedName :
ID ('.' ID)*
;
In contrast to a terminal rule, this is only valid in certain contexts, i.e. it won’t conflict with the rule ID. If you had defined it as a terminal rule instead, it would possibly shadow the ID rule.
In addition, when defined as a data type rule, it is allowed to use hidden tokens (e.g. “/* comment */”) between the IDs and dots (e.g. foo /* comment */ . bar . Baz).
Return types can be specified in the same way as in terminal rules:
QualifiedName returns ecore::EString :
ID ('.' ID)*
;
Note that if a rule does not call another parser rule and contains neither actions nor assignments, it is considered a data type rule, and the data type EString is implied if none has been explicitly declared. In the explicit case, you have to import Ecore with the alias ecore.
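Putting the pieces together, the explicit variant requires the Ecore import by nsURI (as shown earlier) alongside the rule:

```
import "http://www.eclipse.org/emf/2002/Ecore" as ecore

QualifiedName returns ecore::EString :
    ID ('.' ID)*
;
```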
Converting the parsed text to the declared data type is again the responsibility of value converters (cf. value converters).
Enum rules return enumeration literals from strings. They can be seen as a shortcut for data type rules with specific value converters. The main advantages of enum rules are their simplicity, type safety and, consequently, nice validation. Furthermore, it is possible to infer enums and their respective literals during the Ecore model transformation.
If you want to define an enum like ChangeKind (from org.eclipse.emf.ecore.change/model/Change.ecore) with the literals ADD, MOVE and REMOVE, you could write:
enum ChangeKind :
ADD | MOVE | REMOVE
;
It is even possible to use alternative literals for your enums or reference an enum value twice:
enum ChangeKind :
ADD = 'add' | ADD = '+' |
MOVE = 'move' | MOVE = '->' |
REMOVE = 'remove' | REMOVE = '-'
;
Please note that Ecore does not support unset values for enums. If you formulate a grammar like
Element: "element" name=ID (value=SomeEnum)?;
with the input of
element Foo
the resulting element Foo will hold the enum value with the internal representation 0 (zero). When generating the EPackage from your grammar, this will be the first literal you define. As a workaround you could introduce a dedicated none-value, or order the enum literals accordingly. Note that it is not possible to define an enum literal with an empty textual representation.
enum Visibility:
package | private | protected | public
;
You can overcome this by modifying the inferred Ecore model through a model-to-model transformation.