
Overview (Lexer) - NetBeans API Javadoc (Current Development Version)

org.netbeans.modules.lexer/2 1.19.0

Lexer
Official

The lexer module defines the Lexer API, providing access to sequences of tokens for various input sources.


Lexer
org.netbeans.api.lexer: The entry point into the Lexer API is the TokenHierarchy class, whose static methods provide an instance for a given input source.
org.netbeans.spi.lexer: The main abstract class in the Lexer SPI that must be implemented is LanguageHierarchy, which mainly defines the set of token ids and token categories for the new language and its Lexer.

 

The lexer module defines the Lexer API, providing access to sequences of tokens for various input sources.
The API entry point is the TokenHierarchy class, whose static methods provide an instance for a given input source.

Input Sources

A TokenHierarchy can be created for immutable input sources ( CharSequence or java.io.Reader ) or for mutable input sources (typically javax.swing.text.Document ).
For a mutable input source the lexer framework updates the tokens in the token hierarchy automatically as the underlying text input changes. The tokens of the hierarchy always reflect the text of the input at the given time.
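For illustration, a minimal sketch of creating a token hierarchy over the two kinds of immutable inputs, reusing the CalcTokenId.language() description from the SPI example below and assuming the Reader-based create(...) variant that takes skip-ids and input attributes (both null here):

    String text = "3 + 4 * pi";
    TokenHierarchy hi = TokenHierarchy.create(text, CalcTokenId.language());

    Reader reader = new StringReader(text);
    TokenHierarchy hiFromReader = TokenHierarchy.create(
            reader, CalcTokenId.language(), null, null);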

TokenSequence and Token

TokenHierarchy.tokenSequence() allows iteration over a list of Token instances.
Each token carries a token identification TokenId (returned by Token.id() ) and a text (aka token body) represented as a CharSequence (returned by Token.text() ).
TokenUtilities contains many useful methods for operating on a token's text, such as TokenUtilities.equals(CharSequence text1, CharSequence text2), TokenUtilities.startsWith(CharSequence text, CharSequence prefix), etc.
It is also possible to obtain a debug representation of a token's text (with special characters replaced by escapes) via TokenUtilities.debugText(CharSequence text).
A typical token also carries the offset of its occurrence in the input text.
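For illustration, a short sketch of these helpers, assuming ts is a TokenSequence positioned on a token (see the iteration use case below):

    Token t = ts.token();
    if (TokenUtilities.equals(t.text(), "pi")) { ... }
    if (TokenUtilities.startsWith(t.text(), "//")) { ... }
    // Replace special characters by escapes for diagnostic output
    System.err.println(TokenUtilities.debugText(t.text()));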

Flyweight Tokens

As there are many token occurrences where the token text is the same for all or many occurrences (e.g. Java keywords, operators, or a single-space whitespace), memory consumption can be decreased considerably by allowing the creation of flyweight token instances, i.e. just one token instance is used for all of the token's occurrences in all inputs.
Flyweight tokens can be recognized by Token.isFlyweight().
Flyweight tokens do not carry a valid offset (their internal offset is -1).
Therefore TokenSequence is used for iterating through the tokens (instead of a regular iterator); it provides TokenSequence.offset(), which returns the proper offset even when positioned over a flyweight token.
When holding a reference to a token instance, its offset can also be determined by Token.offset(TokenHierarchy tokenHierarchy). The tokenHierarchy parameter is usually null, which gives the offset in the "live" token hierarchy (a snapshot token hierarchy may be passed as the parameter too).
For flyweight tokens Token.offset(TokenHierarchy tokenHierarchy) returns -1; for regular tokens it gives the same value as TokenSequence.offset().

There may be applications where the use of flyweight tokens could be problematic. For example, if a parser wants to store token instances in parse tree nodes to determine the nodes' boundaries, flyweight tokens would always return offset -1, so the positions of the parse tree nodes could not generally be determined from the tokens alone.
Therefore it is possible to de-flyweight a token by using TokenSequence.offsetToken(), which checks the current token and, if it is flyweight, replaces it with a non-flyweight token instance that has a valid offset and the same properties as the original flyweight token.
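For illustration, a hedged sketch of retaining de-flyweighted tokens, e.g. in parse tree nodes (ts is a TokenSequence; node.attachToken(...) is a hypothetical client method):

    while (ts.moveNext()) {
        // Replaces a flyweight token with an offset-carrying copy if necessary
        Token t = ts.offsetToken();
        // t.offset(null) now returns the same valid offset as ts.offset()
        node.attachToken(t);
    }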

TokenId and Language

A token is identified by its id, represented by the TokenId interface. Token ids for a language are typically implemented as Java enums (extensions of Enum ), but that is not mandatory.
All token ids for a given language are described by Language.
Each token id may belong to one or more token categories, which make it easier to operate on tokens of the same type (e.g. keywords or operators).
Each token id may define its primary category via TokenId.primaryCategory(), and LanguageHierarchy.createTokenCategories() may provide additional categories for the token ids of the given language.
Each language description has a mandatory mime-type specification, Language.mimeType().
Although this information is somewhat unrelated, it brings many benefits, because through the mime-type the language can be accompanied by an arbitrary sort of settings (e.g. syntax coloring information etc.).
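For illustration, a short sketch of querying a language description, reusing the CalcTokenId.language() example from the SPI use cases below (the tokenCategoryMembers(...) call is assumed to return the ids belonging to the named category):

    Language lang = CalcTokenId.language();
    String mimeType = lang.mimeType(); // "text/x-calc"
    Set operators = lang.tokenCategoryMembers("operator"); // PLUS, MINUS, STAR, SLASH
    String primary = CalcTokenId.PI.primaryCategory(); // "keyword"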

LanguageHierarchy, Lexer, LexerInput and TokenFactory

SPI providers wishing to provide a Language first need to define its SPI counterpart, LanguageHierarchy. It mainly needs to define the token ids in LanguageHierarchy.createTokenIds() and the lexer in LanguageHierarchy.createLexer(LexerRestartInfo info); the LexerRestartInfo gives the lexer access to its LexerInput, TokenFactory, restart state, LanguagePath and InputAttributes.
The Lexer reads characters from the LexerInput and breaks the text into tokens.
Tokens are produced by using the methods of TokenFactory.
As per-token memory consumption is critical, Token does not have any counterpart in the SPI. However, the framework prevents instantiation of any token classes other than those contained in the lexer module's implementation.

Language Embedding

With language embedding the flat list of tokens becomes in fact a tree-like hierarchy represented by the TokenHierarchy class. Each token can potentially be broken into a sequence of embedded tokens.
The TokenSequence.embedded() method can be called to obtain the embedded tokens (when positioned on the branch token).
There are two ways of specifying what language is embedded in a token. The language can either be specified explicitly (hardcoded) in the LanguageHierarchy.embedding() method or there can be a LanguageProvider registered in the default Lookup, which will create a Language for the embedded language.
There is no limit on the depth of a language hierarchy and there can be as many embedded languages as needed.
In SPI the language embedding is represented by LanguageEmbedding.
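For illustration, a hedged sketch of the dynamic variant: a LanguageProvider subclass registered in the default Lookup (e.g. via META-INF/services/org.netbeans.spi.lexer.LanguageProvider). The mime-type check and the returned language are illustrative only:

public final class MyLanguageProvider extends LanguageProvider {

    public Language<?> findLanguage(String mimeType) {
        // Provide a language for the given mime-type, or null if unknown
        return "text/x-calc".equals(mimeType) ? CalcTokenId.language() : null;
    }

    public LanguageEmbedding<?> findLanguageEmbedding(Token<?> token,
            LanguagePath languagePath, InputAttributes inputAttributes) {
        return null; // this provider supplies no dynamic embeddings
    }
}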

Token Hierarchy Snapshots

A token hierarchy allows creating a snapshot of itself at any given time by using TokenHierarchy.createSnapshot().
Subsequent modifications to the "live" token hierarchy will not affect tokens of the snapshot.
Snapshot creation is cheap because initially the snapshot shares all tokens with the live hierarchy. As modifications occur, the initial and ending areas (where tokens are still shared between the snapshot and the live hierarchy) are maintained by the snapshot; in the middle, the snapshot retains the tokens originally present in the live token hierarchy.


Use Cases

API Usecases

Obtaining a token hierarchy for various inputs.

The TokenHierarchy is the entry point into the Lexer API; it represents the given input in terms of tokens.
    String text = "public void m() { }";
    TokenHierarchy hi = TokenHierarchy.create(text, JavaLanguage.description());

A token hierarchy for a Swing document must be operated under the document's read (or write) lock.
    document.readLock();
    try {
        TokenHierarchy hi = TokenHierarchy.get(document);
        ... // explore tokens etc.
    } finally {
        document.readUnlock();
    }

Obtaining and iterating a token sequence over a particular Swing document from a given offset.

The tokens cover the whole document and it is possible to iterate either forward or backward.
Each token can contain a language embedding that can also be explored through the token sequence. The language embedding covers the whole text of the token (a few characters may be skipped at the beginning and end of the branch token).
    document.readLock();
    try {
        TokenHierarchy hi = TokenHierarchy.get(document);
        TokenSequence ts = hi.tokenSequence();
        // If necessary move ts to the requested offset
        ts.move(offset);
        while (ts.moveNext()) {
            Token t = ts.token();
            if (t.id() == ...) { ... }
            if (TokenUtilities.equals(t.text(), "mytext")) { ... }
            if (ts.offset() == ...) { ... }

            // Possibly retrieve embedded token sequence
            TokenSequence embedded = ts.embedded();
            if (embedded != null) { // Token has a valid language embedding
                ...
            }
        }
    } finally {
        document.readUnlock();
    }

Typical clients:
  • Editor's painting code doing syntax coloring, e.g. org.netbeans.modules.lexer.editorbridge.LexerLayer in the lexer/editorbridge module.
  • Brace matching code searching for matching brace in forward/backward direction.
  • Code completion's quick check whether the caret is located inside a comment token.
  • Parser constructing a parse tree by iterating through the tokens in the forward direction.

Using language path of the token sequence

For a given token sequence the client may check whether it is the top-level token sequence in the token hierarchy or whether it is embedded, at which level it is embedded, and what the parent languages are.
    TokenSequence ts = ...
    LanguagePath lp = ts.languagePath();
    if (lp.size() > 1) { ... } // This is embedded token sequence
    if (lp.topLanguage() == JavaLanguage.description()) { ... } // top-level language of the token hierarchy
    String mimePath = lp.mimePath();
    Object settingValue = someSettings.getSetting(mimePath, settingName); // illustrative settings lookup

Creating token hierarchy snapshot for token hierarchy over a mutable input

Token hierarchy snapshots allow creating a snapshot of the token hierarchy at a given time and guarantee that it will not be affected by subsequent modifications of the input text.
A token snapshot is represented as a TokenHierarchy instance, so creating a token sequence etc. works the same as for a regular token hierarchy.
    private TokenHierarchy snapshot;
    
    document.readLock();
    try {
        TokenHierarchy hi = TokenHierarchy.get(document);
        snapshot = hi.createSnapshot();
        // Possible later modifications will not affect tokens of the snapshot
    } finally {
        document.readUnlock();
    }

    ...

    document.readLock();
    try {
        TokenSequence ts = snapshot.tokenSequence();
        ...
    } finally {
        document.readUnlock();
    }

Typical clients:
  • Parser constructing a parse tree. The parser may retain "last good snapshot" for the edited file - a snapshot when it was possible to parse the document without errors.

Extra information about the input

The InputAttributes class may carry extra information about the text input on which the token hierarchy is being created. For example there can be information about the version of the language that the input represents and the lexer may be written to recognize multiple versions of the language. It should suffice to do the versioning through a simple integer:
public class MyLexer implements Lexer<MyTokenId> {
    
    private final int version;
    
    ...
    
    public MyLexer(LexerRestartInfo<MyTokenId> info) {
        ...
        
        InputAttributes inputAttributes = info.inputAttributes();
        Integer ver = (inputAttributes != null)
                ? (Integer)inputAttributes.getValue(info.languagePath(), "version")
                : null;
        this.version = (ver != null) ? ver.intValue() : 1; // Use version 1 if not specified explicitly
    }
    
    public Token<MyTokenId> nextToken() {
        ...
        if (recognizedAssertKeyword) { // hypothetical check for the "assert" keyword
            // "assert" is recognized as a keyword since version 4
            return (version >= 4)
                    ? keyword(MyTokenId.ASSERT)
                    : identifier();
        }
        ...
    }
    ...
}
The client will then use the following code:
    InputAttributes attrs = new InputAttributes();
    // The "true" means global value i.e. for any occurrence of the MyLanguage including embeddings
    attrs.setValue(MyLanguage.description(), "version", Integer.valueOf(3), true);
    TokenHierarchy hi = TokenHierarchy.create(text, false, MyLanguage.description(), null, attrs);
    ...

Filtering out unnecessary tokens

Filtering is only possible for immutable inputs (e.g. String or Reader).
    Set<MyTokenId> skipIds = EnumSet.of(MyTokenId.COMMENT, MyTokenId.WHITESPACE);
    TokenHierarchy tokenHierarchy = TokenHierarchy.create(inputText, false,
        MyLanguage.description(), skipIds, null);
    ...

Typical clients:
  • Parser constructing a parse tree. It is not interested in the comment and whitespace tokens so these tokens do not need to be constructed at all.

SPI Usecases

Providing language description and lexer.

Token ids should be defined as enums. For example, org.netbeans.lib.lexer.test.simple.SimpleTokenId can be copied, or see the following example of org.netbeans.modules.lexer.editorbridge.calc.lang.CalcTokenId.
The static language() method returns the language describing the token ids.
public enum CalcTokenId implements TokenId {

    WHITESPACE(null, "whitespace"),
    SL_COMMENT(null, "comment"),
    ML_COMMENT(null, "comment"),
    E("e", "keyword"),
    PI("pi", "keyword"),
    IDENTIFIER(null, null),
    INT_LITERAL(null, "number"),
    FLOAT_LITERAL(null, "number"),
    PLUS("+", "operator"),
    MINUS("-", "operator"),
    STAR("*", "operator"),
    SLASH("/", "operator"),
    LPAREN("(", "separator"),
    RPAREN(")", "separator"),
    ERROR(null, "error"),
    ML_COMMENT_INCOMPLETE(null, "comment");


    private final String fixedText;

    private final String primaryCategory;

    private CalcTokenId(String fixedText, String primaryCategory) {
        this.fixedText = fixedText;
        this.primaryCategory = primaryCategory;
    }
    
    public String fixedText() {
        return fixedText;
    }

    public String primaryCategory() {
        return primaryCategory;
    }

    private static final Language<CalcTokenId> language = new LanguageHierarchy<CalcTokenId>() {
        @Override
        protected Collection<CalcTokenId> createTokenIds() {
            return EnumSet.allOf(CalcTokenId.class);
        }
        
        @Override
        protected Map<String,Collection<CalcTokenId>> createTokenCategories() {
            Map<String,Collection<CalcTokenId>> cats = new HashMap<String,Collection<CalcTokenId>>();

            // Incomplete literals 
            cats.put("incomplete", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
            // Additional literals being a lexical error
            cats.put("error", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
            
            return cats;
        }

        @Override
        protected Lexer<CalcTokenId> createLexer(LexerRestartInfo<CalcTokenId> info) {
            return new CalcLexer(info);
        }

        @Override
        protected String mimeType() {
            return "text/x-calc";
        }
        
    }.language();

    public static final Language<CalcTokenId> language() {
        return language;
    }

}
Note that it is not necessary to publish the underlying LanguageHierarchy extension.
Lexer example:
public final class CalcLexer implements Lexer<CalcTokenId> {

    private static final int EOF = LexerInput.EOF;

    private static final Map<String,CalcTokenId> keywords = new HashMap<String,CalcTokenId>();
    static {
        keywords.put(CalcTokenId.E.fixedText(), CalcTokenId.E);
        keywords.put(CalcTokenId.PI.fixedText(), CalcTokenId.PI);
    }
    
    private LexerInput input;
    
    private TokenFactory<CalcTokenId> tokenFactory;

    CalcLexer(LexerRestartInfo<CalcTokenId> info) {
        this.input = info.input();
        this.tokenFactory = info.tokenFactory();
        assert (info.state() == null); // passed argument always null
    }
    
    public Token<CalcTokenId> nextToken() {
        while (true) {
            int ch = input.read();
            switch (ch) {
                case '+':
                    return token(CalcTokenId.PLUS);

                case '-':
                    return token(CalcTokenId.MINUS);

                case '*':
                    return token(CalcTokenId.STAR);

                case '/':
                    switch (input.read()) {
                        case '/': // in single-line comment
                            while (true)
                                switch (input.read()) {
                                    case '\r': input.consumeNewline();
                                    case '\n':
                                    case EOF:
                                        return token(CalcTokenId.SL_COMMENT);
                                }
                        case '*': // in multi-line comment
                            while (true) {
                                ch = input.read();
                                while (ch == '*') {
                                    ch = input.read();
                                    if (ch == '/')
                                        return token(CalcTokenId.ML_COMMENT);
                                    else if (ch == EOF)
                                        return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                                }
                                if (ch == EOF)
                                    return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                            }
                    }
                    input.backup(1);
                    return token(CalcTokenId.SLASH);

                case '(':
                    return token(CalcTokenId.LPAREN);

                case ')':
                    return token(CalcTokenId.RPAREN);

                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                case '.':
                    return finishIntOrFloatLiteral(ch);

                case EOF:
                    return null;

                default:
                    if (Character.isWhitespace((char)ch)) {
                        ch = input.read();
                        while (ch != EOF && Character.isWhitespace((char)ch)) {
                            ch = input.read();
                        }
                        input.backup(1);
                        return token(CalcTokenId.WHITESPACE);
                    }

                    if (Character.isLetter((char)ch)) { // identifier or keyword
                        while (true) {
                            if (ch == EOF || !Character.isLetter((char)ch)) {
                                input.backup(1); // backup the extra char (or EOF)
                                // Check for keywords
                                // Convert to String: the map keys are Strings and a plain
                                // CharSequence would not match in HashMap lookups
                                CalcTokenId id = keywords.get(input.readText().toString());
                                if (id == null) {
                                    id = CalcTokenId.IDENTIFIER;
                                }
                                return token(id);
                            }
                            ch = input.read(); // read next char
                        }
                    }

                    return token(CalcTokenId.ERROR);
            }
        }
    }

    public Object state() {
        return null;
    }

    private Token<CalcTokenId> finishIntOrFloatLiteral(int ch) {
        boolean floatLiteral = false;
        boolean inExponent = false;
        while (true) {
            switch (ch) {
                case '.':
                    if (floatLiteral) {
                        return token(CalcTokenId.FLOAT_LITERAL);
                    } else {
                        floatLiteral = true;
                    }
                    break;
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                    break;
                case 'e': case 'E': // exponent part
                    if (inExponent) {
                        return token(CalcTokenId.FLOAT_LITERAL);
                    } else {
                        floatLiteral = true;
                        inExponent = true;
                    }
                    break;
                default:
                    input.backup(1);
                    return token(floatLiteral ? CalcTokenId.FLOAT_LITERAL
                            : CalcTokenId.INT_LITERAL);
            }
            ch = input.read();
        }
    }
    
    private Token<CalcTokenId> token(CalcTokenId id) {
        return (id.fixedText() != null)
            ? tokenFactory.getFlyweightToken(id, id.fixedText())
            : tokenFactory.createToken(id);
    }

}

The classes containing token ids and the language description should be part of an API. The lexer should only be part of the implementation.

Providing language embedding.

The embedding may be provided statically by overriding LanguageHierarchy.embedding(); see e.g. org.netbeans.lib.lexer.test.simple.SimpleLanguage.
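For illustration, a hedged sketch of such a static override, written as if added to the anonymous LanguageHierarchy in the CalcTokenId example above. EmbeddedTokenId.language() is a hypothetical embedded language, and the LanguageEmbedding.create(...) arguments are assumed to be the embedded language, the start/end skip lengths and whether embedded sections should be joined:

    @Override
    protected LanguageEmbedding<?> embedding(Token<CalcTokenId> token,
            LanguagePath languagePath, InputAttributes inputAttributes) {
        if (token.id() == CalcTokenId.ML_COMMENT) {
            // Skip the leading "/*" and the trailing "*/" of the comment
            return LanguageEmbedding.create(EmbeddedTokenId.language(), 2, 2, false);
        }
        return null; // other tokens contain no embedded language
    }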

Or it may be provided dynamically through the XML layer, by using a file named after the token id with an ".instance" suffix, located in the "Editors/language-mime-type/embed" folder. The file should instantiate the language description of the embedded language.

Exported Interfaces

This table lists all of the module's exported APIs with their defined stability classifications. It is generated from the answers to the questions about the architecture of the module.
Group of java interfaces
Interface Name | In/Out | Stability | Specified in What Document?
LexerAPI | Exported | Official |

Group of logger interfaces
Interface Name | In/Out | Stability | Specified in What Document?
org.netbeans.lib.lexer.TokenHierarchyOperation | Exported | Friend |

The FINE level logs the lexer changes made to tokens, at both the root level and the embedded levels of the token hierarchy, after each document modification.
The FINER level will in addition check the whole token hierarchy for internal consistency after each modification.
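For example, the FINER diagnostics can be activated with standard java.util.logging calls (or the equivalent logging configuration file):

    Logger.getLogger("org.netbeans.lib.lexer.TokenHierarchyOperation")
            .setLevel(Level.FINER);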

Group of property interfaces
Interface Name | In/Out | Stability | Specified in What Document?
netbeans.debug.lexer.test | Exported | Private |

System property that enables deep comparison of token lists: the framework then generates and maintains lookahead and state information even for batch-lexed inputs. Additional checks verifying the correctness of the framework and of the SPI implementation classes are performed as well (for example, when a flyweight token is created, the text passed to the token factory is compared with the text in the lexer input).

Implementation Details

Where are the sources for the module?

The sources for the module are in the NetBeans CVS, in the lexer directory.

What do other modules need to do to declare a dependency on this one, in addition to or instead of a plain module dependency?
OpenIDE-Module-Module-Dependencies: org.netbeans.modules.lexer/2 > @SPECIFICATION-VERSION@

Read more about the implementation in the answers to architecture questions.

