Lexer - NetBeans Architecture Questions - NetBeans API Javadoc (Current Development Version)

NetBeans Architecture Answers for Lexer module


Interfaces table

Group of java interfaces

• LexerAPI — Exported, Official

• EditorUtilitiesAPI — Imported, Friend (.../overview-summary.html)

  The module is needed for compilation. The module is used during runtime. Specification version 1.15 is required.

• UtilitiesAPI — Imported, Official (../org-openide-util/overview-summary.html)

  The module is needed for compilation. The module is used during runtime. Specification version 6.4 is required.

Group of logger interfaces

• org.netbeans.lib.lexer.TokenHierarchyOperation — Exported, Friend

  The FINE level lists lexer changes made to tokens at both the root level and the embedded levels of the token hierarchy after each document modification.
  The FINER level will in addition check the whole token hierarchy for internal consistency after each modification.

Group of property interfaces

• netbeans.debug.lexer.test — Exported, Private

  System property that enables a deep compare of token lists: the framework generates and maintains lookahead and state information even for batch-lexed inputs. Additional checks verify the correctness of the framework and of the SPI implementation classes in use (for example, when a flyweight token is created the text passed to the token factory is compared with the text in the lexer input).


General Information

    Question (arch-what): What is this project good for?

    Answer: The Lexer module provides token lists for various text inputs. Token lists can either be flat, or they can form tree token hierarchies if any language embedding is present.

    Question (arch-overall): Describe the overall architecture.

    Answer: The lexer module defines LexerAPI, providing access to sequences of tokens for various input sources.
    The API entry point is the TokenHierarchy class, whose static methods provide an instance for a given input source.

    Input Sources

    TokenHierarchy can be created for immutable input sources (CharSequence or java.io.Reader) or for mutable input sources (typically javax.swing.text.Document).
    For mutable input sources the lexer framework updates the tokens in the token hierarchy automatically as the underlying text input changes. The tokens of the hierarchy always reflect the text of the input at the given time.

    TokenSequence and Token

    TokenHierarchy.tokenSequence() allows iteration over a list of Token instances.
    A token carries a token identification TokenId (returned by Token.id()) and a text (aka token body) represented as a CharSequence (returned by Token.text()).
    TokenUtilities contains many useful methods for operating on a token's text, such as TokenUtilities.equals(CharSequence text1, CharSequence text2), TokenUtilities.startsWith(CharSequence text, CharSequence prefix), etc.
    It is also possible to obtain a debug form of the token's text (with special characters replaced by escapes) via TokenUtilities.debugText(CharSequence text).
    A typical token also carries the offset of its occurrence in the input text.
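
    For illustration, a minimal sketch of iterating tokens and inspecting their text (the creation of the token hierarchy is shown in the use cases below):
        TokenSequence ts = tokenHierarchy.tokenSequence();
        while (ts.moveNext()) {
            Token t = ts.token();
            // Compare the token's text without creating String instances
            if (TokenUtilities.equals(t.text(), "while")) { ... }
            // Print the token's text with special characters escaped
            System.out.println(TokenUtilities.debugText(t.text()));
        }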

    Flyweight Tokens

    As there are many token occurrences where the token text is the same for all or many occurrences (e.g. java keywords, operators or a single-space whitespace), memory consumption can be decreased considerably by allowing the creation of flyweight token instances, i.e. just one token instance is used for all the token's occurrences in all inputs.
    Flyweight tokens can be recognized by Token.isFlyweight().
    Flyweight tokens do not carry a valid offset (their internal offset is -1).
    Therefore TokenSequence is used for iteration through the tokens (instead of a regular iterator), and it provides TokenSequence.offset(), which returns the proper offset even when positioned over a flyweight token.
    When holding a reference to a token instance, its offset can also be determined by Token.offset(TokenHierarchy tokenHierarchy). The tokenHierarchy parameter is usually null, which returns the offset for the "live" token hierarchy (a snapshot token hierarchy may be passed as well).
    For flyweight tokens Token.offset(TokenHierarchy tokenHierarchy) returns -1; for regular tokens it returns the same value as TokenSequence.offset().

    There may be applications where the use of flyweight tokens could be problematic. For example, if a parser wants to store token instances in parse tree nodes to determine the nodes' boundaries, the flyweight tokens would always return offset -1, so the positions of the parse tree nodes could not generally be determined from the tokens alone.
    Therefore a token can be de-flyweighted by using TokenSequence.offsetToken(), which checks the current token and, if it is flyweight, replaces it with a non-flyweight token instance with a valid offset and the same properties as the original flyweight token.
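
    For illustration, a minimal sketch of de-flyweighting during iteration (the parse-tree calls are hypothetical):
        while (ts.moveNext()) {
            // Replace a possibly flyweight token by an equivalent token with a valid offset
            Token t = ts.offsetToken();
            parseTreeNode.addToken(t); // hypothetical parser code retaining the token
        }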

    TokenId and Language

    A token is identified by its id, represented by the TokenId interface. Token ids for a language are typically implemented as java enums (extensions of Enum), but this is not mandatory.
    All token ids for the given language are described by a Language.
    Each token id may belong to one or more token categories, which make it easier to operate on tokens of the same type (e.g. keywords or operators).
    Each token id may define its primary category via TokenId.primaryCategory(), and LanguageHierarchy.createTokenCategories() may provide additional categories for the token ids of the given language.
    Each language description has a mandatory mime-type specification, Language.mimeType().
    Although somewhat unrelated information, the mime-type brings many benefits because it allows the language to be accompanied by an arbitrary sort of settings (e.g. syntax coloring information).
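
    For example, a client might query the categories of a language (a sketch; the category-query methods are those of the Language class, and JavaLanguage is the example language used elsewhere in this document):
        Language language = JavaLanguage.description();
        // Names of all token categories defined for the language
        Set categories = language.tokenCategories();
        // All token ids belonging to the "keyword" category
        Collection keywords = language.tokenCategoryMembers("keyword");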

    LanguageHierarchy, Lexer, LexerInput and TokenFactory

    SPI providers wishing to provide a Language first need to define its SPI counterpart, LanguageHierarchy. It mainly needs to define the token ids in LanguageHierarchy.createTokenIds() and the lexer in LanguageHierarchy.createLexer(LexerRestartInfo info), as shown in the SPI use cases below.
    Lexer reads characters from LexerInput and breaks the text into tokens.
    Tokens are produced by using the methods of TokenFactory.
    As per-token memory consumption is critical, Token does not have any counterpart in the SPI. The framework, however, prevents instantiation of any token classes other than those contained in the lexer module's implementation.

    Language Embedding

    With language embedding the flat list of tokens becomes in fact a tree-like hierarchy represented by the TokenHierarchy class. Each token can potentially be broken into a sequence of embedded tokens.
    The TokenSequence.embedded() method can be called to obtain the embedded tokens (when positioned on the branch token).
    There are two ways of specifying what language is embedded in a token. The language can either be specified explicitly (hardcoded) in the LanguageHierarchy.embedding() method or there can be a LanguageProvider registered in the default Lookup, which will create a Language for the embedded language.
    There is no limit on the depth of a language hierarchy and there can be as many embedded languages as needed.
    In SPI the language embedding is represented by LanguageEmbedding.
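
    A minimal sketch of a statically specified embedding (the token id and the embedded language are assumptions for illustration):
    @Override
    protected LanguageEmbedding<?> embedding(Token<MyTokenId> token,
    LanguagePath languagePath, InputAttributes inputAttributes) {
        if (token.id() == MyTokenId.STRING_LITERAL) { // hypothetical token id
            // Skip the opening and closing quote characters of the token's text;
            // "false" means embedded sections of neighboring tokens are not joined
            return LanguageEmbedding.create(StringLanguage.description(), 1, 1, false);
        }
        return null; // no embedding for other tokens
    }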

    Token Hierarchy Snapshots

    A token hierarchy allows a snapshot of itself to be created at any given time by using TokenHierarchy.createSnapshot().
    Subsequent modifications to the "live" token hierarchy will not affect the tokens of the snapshot.
    Snapshot creation is cheap because the snapshot initially shares all the tokens with the live hierarchy. As modifications occur, the snapshot maintains the initial and ending areas (where the tokens are still shared between the snapshot and the live hierarchy); in the middle it retains the tokens originally present in the live token hierarchy.

    Question (arch-usecases): Describe the main use cases of the new API. Who will use it under what circumstances? What kind of code would typically need to be written to use the module?

    Answer:

    API Usecases

    Obtaining a token hierarchy for various inputs.

    The TokenHierarchy is the entry point into the Lexer API; it represents the given input in terms of tokens.
        String text = "public void m() { }";
        TokenHierarchy hi = TokenHierarchy.create(text, JavaLanguage.description());
    

    A token hierarchy for a swing document must be accessed under the document's read (or write) lock.
        document.readLock();
        try {
            TokenHierarchy hi = TokenHierarchy.get(document);
            ... // explore tokens etc.
        } finally {
            document.readUnlock();
        }
    

    Obtaining and iterating a token sequence over a particular swing document from a given offset.

    The tokens cover the whole document, and it is possible to iterate either forward or backward.
    Each token can contain a language embedding that can also be explored by the token sequence. The language embedding covers the whole text of the token (a few characters may be skipped at the beginning and end of the branch token).
        document.readLock();
        try {
            TokenHierarchy hi = TokenHierarchy.get(document);
            TokenSequence ts = hi.tokenSequence();
            // If necessary move ts to the requested offset
            ts.move(offset);
            while (ts.moveNext()) {
                Token t = ts.token();
                if (t.id() == ...) { ... }
                if (TokenUtilities.equals(t.text(), "mytext")) { ... }
                if (ts.offset() == ...) { ... }
    
                // Possibly retrieve embedded token sequence
                TokenSequence embedded = ts.embedded();
                if (embedded != null) { // Token has a valid language embedding
                    ...
                }
            }
        } finally {
            document.readUnlock();
        }
    

    Typical clients:
    • Editor's painting code doing syntax coloring: org.netbeans.modules.lexer.editorbridge.LexerLayer in the lexer/editorbridge module.
    • Brace matching code searching for matching brace in forward/backward direction.
    • Code completion's quick check whether caret is located inside comment token.
    • Parser constructing a parse tree iterating through the tokens in forward direction.

    Using language path of the token sequence

    For the given token sequence the client may check whether it is a top-level token sequence in the token hierarchy or whether it is embedded, at which level it is embedded, and what the parent languages are.
        TokenSequence ts = ...
        LanguagePath lp = ts.languagePath();
        if (lp.size() > 1) { ... } // This is embedded token sequence
        if (lp.topLanguage() == JavaLanguage.description()) { ... } // top-level language of the token hierarchy
        String mimePath = lp.mimePath();
        Object settingValue = someSettings.getSetting(mimePath, settingName); // hypothetical settings lookup
    

    Creating a token hierarchy snapshot for a token hierarchy over a mutable input

    Token hierarchy snapshots allow a snapshot of the token hierarchy to be created at a given time, with a guarantee that it will not be affected by subsequent modifications of the input text.
    A token snapshot is represented as a TokenHierarchy instance, so creating a token sequence etc. works the same way as for a regular token hierarchy.
        private TokenHierarchy snapshot;
        
        document.readLock();
        try {
            TokenHierarchy hi = TokenHierarchy.get(document);
            snapshot = hi.createSnapshot();
            // Possible later modifications will not affect tokens of the snapshot
        } finally {
            document.readUnlock();
        }
    
        ...
    
        document.readLock();
        try {
            TokenSequence ts = snapshot.tokenSequence();
            ...
        } finally {
            document.readUnlock();
        }
    

    Typical clients:
    • Parser constructing a parse tree. The parser may retain "last good snapshot" for the edited file - a snapshot when it was possible to parse the document without errors.

    Extra information about the input

    The InputAttributes class may carry extra information about the text input on which the token hierarchy is being created. For example there can be information about the version of the language that the input represents and the lexer may be written to recognize multiple versions of the language. It should suffice to do the versioning through a simple integer:
    public class MyLexer implements Lexer<MyTokenId> {
        
        private final int version;
        
        ...
        
        public MyLexer(LexerInput input, TokenFactory<MyTokenId> tokenFactory, Object state,
        LanguagePath languagePath, InputAttributes inputAttributes) {
            ...
            
            Integer ver = (inputAttributes != null)
                    ? (Integer)inputAttributes.getValue(languagePath, "version")
                    : null;
            this.version = (ver != null) ? ver.intValue() : 1; // Use version 1 if not specified explicitly
        }
        
        public Token<MyTokenId> nextToken() {
            ...
            if (recognizedAssertKeyword) { // pseudocode: the "assert" keyword text was just read
                return (version >= 4) // "assert" recognized as a keyword since version 4
                        ? keyword(MyTokenId.ASSERT)
                        : identifier();
            }
            ...
        }
        ...
    }
    
    The client will then use the following code:
        InputAttributes attrs = new InputAttributes();
        // The "true" means global value i.e. for any occurrence of the MyLanguage including embeddings
        attrs.setValue(MyLanguage.description(), "version", Integer.valueOf(3), true);
        TokenHierarchy hi = TokenHierarchy.create(text, false, MyLanguage.description(), null, attrs);
        ...
    

    Filtering out unnecessary tokens

    Filtering is only possible for immutable inputs (e.g. String or Reader).
        Set<MyTokenId> skipIds = EnumSet.of(MyTokenId.COMMENT, MyTokenId.WHITESPACE);
        TokenHierarchy tokenHierarchy = TokenHierarchy.create(inputText, false,
            MyLanguage.description(), skipIds, null);
        ...
    

    Typical clients:
    • Parser constructing a parse tree. It is not interested in the comment and whitespace tokens so these tokens do not need to be constructed at all.

    SPI Usecases

    Providing language description and lexer.

    Token ids should be defined as enums. For example, org.netbeans.lib.lexer.test.simple.SimpleTokenId can be copied, or see the following example from org.netbeans.modules.lexer.editorbridge.calc.lang.CalcTokenId.
    The static language() method returns the Language describing the token ids.
    public enum CalcTokenId implements TokenId {
    
        WHITESPACE(null, "whitespace"),
        SL_COMMENT(null, "comment"),
        ML_COMMENT(null, "comment"),
        E("e", "keyword"),
        PI("pi", "keyword"),
        IDENTIFIER(null, null),
        INT_LITERAL(null, "number"),
        FLOAT_LITERAL(null, "number"),
        PLUS("+", "operator"),
        MINUS("-", "operator"),
        STAR("*", "operator"),
        SLASH("/", "operator"),
        LPAREN("(", "separator"),
        RPAREN(")", "separator"),
        ERROR(null, "error"),
        ML_COMMENT_INCOMPLETE(null, "comment");
    
    
        private final String fixedText;
    
        private final String primaryCategory;
    
        private CalcTokenId(String fixedText, String primaryCategory) {
            this.fixedText = fixedText;
            this.primaryCategory = primaryCategory;
        }
        
        public String fixedText() {
            return fixedText;
        }
    
        public String primaryCategory() {
            return primaryCategory;
        }
    
        private static final Language<CalcTokenId> language = new LanguageHierarchy<CalcTokenId>() {
            @Override
            protected Collection<CalcTokenId> createTokenIds() {
                return EnumSet.allOf(CalcTokenId.class);
            }
            
            @Override
            protected Map<String,Collection<CalcTokenId>> createTokenCategories() {
                Map<String,Collection<CalcTokenId>> cats = new HashMap<String,Collection<CalcTokenId>>();
    
                // Incomplete literals 
                cats.put("incomplete", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
                // Additional literals being a lexical error
                cats.put("error", EnumSet.of(CalcTokenId.ML_COMMENT_INCOMPLETE));
                
                return cats;
            }
    
            @Override
            protected Lexer<CalcTokenId> createLexer(LexerRestartInfo<CalcTokenId> info) {
                return new CalcLexer(info);
            }
    
            @Override
            protected String mimeType() {
                return "text/x-calc";
            }
            
        }.language();
    
        public static final Language<CalcTokenId> language() {
            return language;
        }
    
    }
    
    Note that the underlying LanguageHierarchy extension does not need to be published.
    Lexer example:
    public final class CalcLexer implements Lexer<CalcTokenId> {
    
        private static final int EOF = LexerInput.EOF;
    
        private static final Map<String,CalcTokenId> keywords = new HashMap<String,CalcTokenId>();
        static {
            keywords.put(CalcTokenId.E.fixedText(), CalcTokenId.E);
            keywords.put(CalcTokenId.PI.fixedText(), CalcTokenId.PI);
        }
        
        private LexerInput input;
        
        private TokenFactory<CalcTokenId> tokenFactory;
    
        CalcLexer(LexerRestartInfo<CalcTokenId> info) {
            this.input = info.input();
            this.tokenFactory = info.tokenFactory();
            assert (info.state() == null); // passed argument always null
        }
        
        public Token<CalcTokenId> nextToken() {
            while (true) {
                int ch = input.read();
                switch (ch) {
                    case '+':
                        return token(CalcTokenId.PLUS);
    
                    case '-':
                        return token(CalcTokenId.MINUS);
    
                    case '*':
                        return token(CalcTokenId.STAR);
    
                    case '/':
                        switch (input.read()) {
                            case '/': // in single-line comment
                                while (true)
                                    switch (input.read()) {
                                        case '\r': input.consumeNewline(); // fall through
                                        case '\n':
                                        case EOF:
                                            return token(CalcTokenId.SL_COMMENT);
                                    }
                            case '*': // in multi-line comment
                                while (true) {
                                    ch = input.read();
                                    while (ch == '*') {
                                        ch = input.read();
                                        if (ch == '/')
                                            return token(CalcTokenId.ML_COMMENT);
                                        else if (ch == EOF)
                                            return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                                    }
                                    if (ch == EOF)
                                        return token(CalcTokenId.ML_COMMENT_INCOMPLETE);
                                }
                        }
                        input.backup(1);
                        return token(CalcTokenId.SLASH);
    
                    case '(':
                        return token(CalcTokenId.LPAREN);
    
                    case ')':
                        return token(CalcTokenId.RPAREN);
    
                    case '0': case '1': case '2': case '3': case '4':
                    case '5': case '6': case '7': case '8': case '9':
                    case '.':
                        return finishIntOrFloatLiteral(ch);
    
                    case EOF:
                        return null;
    
                    default:
                        if (Character.isWhitespace((char)ch)) {
                            ch = input.read();
                            while (ch != EOF && Character.isWhitespace((char)ch)) {
                                ch = input.read();
                            }
                            input.backup(1);
                            return token(CalcTokenId.WHITESPACE);
                        }
    
                        if (Character.isLetter((char)ch)) { // identifier or keyword
                            while (true) {
                                if (ch == EOF || !Character.isLetter((char)ch)) {
                                    input.backup(1); // backup the extra char (or EOF)
                                    // Check for keywords
                                    CalcTokenId id = keywords.get(input.readText());
                                    if (id == null) {
                                        id = CalcTokenId.IDENTIFIER;
                                    }
                                    return token(id);
                                }
                                ch = input.read(); // read next char
                            }
                        }
    
                        return token(CalcTokenId.ERROR);
                }
            }
        }
    
        public Object state() {
            return null;
        }
    
        private Token<CalcTokenId> finishIntOrFloatLiteral(int ch) {
            boolean floatLiteral = false;
            boolean inExponent = false;
            while (true) {
                switch (ch) {
                    case '.':
                        if (floatLiteral) {
                            return token(CalcTokenId.FLOAT_LITERAL);
                        } else {
                            floatLiteral = true;
                        }
                        break;
                    case '0': case '1': case '2': case '3': case '4':
                    case '5': case '6': case '7': case '8': case '9':
                        break;
                    case 'e': case 'E': // exponent part
                        if (inExponent) {
                            return token(CalcTokenId.FLOAT_LITERAL);
                        } else {
                            floatLiteral = true;
                            inExponent = true;
                        }
                        break;
                    default:
                        input.backup(1);
                        return token(floatLiteral ? CalcTokenId.FLOAT_LITERAL
                                : CalcTokenId.INT_LITERAL);
                }
                ch = input.read();
            }
        }
        
        private Token<CalcTokenId> token(CalcTokenId id) {
            return (id.fixedText() != null)
                ? tokenFactory.getFlyweightToken(id, id.fixedText())
                : tokenFactory.createToken(id);
        }
    
    }
    

    The classes containing token ids and the language description should be part of an API. The lexer should only be part of the implementation.

    Providing language embedding.

    The embedding may be provided statically in LanguageHierarchy.embedding(); see e.g. org.netbeans.lib.lexer.test.simple.SimpleLanguage.

    Or it may be provided dynamically through the xml layer, by using a file named after the token id with an ".instance" suffix, located in the "Editors/language-mime-type/embed" folder. The file should instantiate the language description of the embedded language.
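
    A LanguageProvider found in the default Lookup may look as follows (a minimal sketch; the class and mime-type names are assumptions):
    public final class MyLanguageProvider extends LanguageProvider {

        public Language<?> findLanguage(String mimeType) {
            // Create the language for the mime type served by this provider
            return "text/x-my".equals(mimeType) ? MyLanguage.description() : null;
        }

        public LanguageEmbedding<?> findLanguageEmbedding(Token<?> token,
        LanguagePath languagePath, InputAttributes inputAttributes) {
            return null; // this provider does not create dynamic embeddings
        }
    }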

    Question (arch-time): What are the time estimates of the work?

    Answer: The present implementation is stable, but there are a few missing implementations and other things to be considered:
    • Dynamic language embedding binding through xml layer.
    • CharPreprocessor servicing and tests.
    • TokenHierarchy snapshots correctness (will provide random unit tests; one unit test is currently failing).
    • Token hierarchy for Reader.
    • TokenFactory.createBranchToken() impl.
    • Providing JavaCC and Antlr support.
    • Support for token positions (may add API).

    Question (arch-quality): How will the quality of your code be tested and how are future regressions going to be prevented?

    Answer: The lexer module is completely unit-testable.
    Besides tests of its own correctness, it also contains support for testing the correctness of lexers from SPI providers by using the org.netbeans.lib.lexer.test.TestRandomModify class.
    The main testing method for lexer correctness is a token-by-token comparison of the incrementally updated token sequence with a batch-lexed token sequence for the same input.

    Question (arch-where): Where one can find sources for your module?

    Answer:

    The sources for the module are in NetBeans CVS in lexer directory.


Project and platform dependencies

    Question (dep-nb): What other NetBeans projects and modules does this one depend on?

    Answer:

    These modules are required in project.xml:

    • EditorUtilitiesAPI - The module is needed for compilation. The module is used during runtime. Specification version 1.15 is required.
    • UtilitiesAPI - The module is needed for compilation. The module is used during runtime. Specification version 6.4 is required.

    Question (dep-non-nb): What other projects outside NetBeans does this one depend on?

    Answer: No other projects.

    Question (dep-platform): On which platforms does your module run? Does it run in the same way on each?

    Answer: All platforms.

    Question (dep-jre): Which version of JRE do you need (1.2, 1.3, 1.4, etc.)?

    Answer: JDK1.4 and higher can be used.

    Question (dep-jrejdk): Do you require the JDK or is the JRE enough?

    Answer: JRE is sufficient.

Deployment

    Question (deploy-jar): Do you deploy just module JAR file(s) or other files as well?

    Answer: No additional files.

    Question (deploy-nbm): Can you deploy an NBM via the Update Center?

    Answer: Yes.

    Question (deploy-shared): Do you need to be installed in the shared location only, or in the user directory only, or can your module be installed anywhere?

    Answer: Anywhere.

    Question (deploy-packages): Are packages of your module made inaccessible by not declaring them public?

    Answer: Yes, where appropriate.

    Question (deploy-dependencies): What do other modules need to do to declare a dependency on this one, in addition to or instead of the normal module dependency declaration (e.g. tokens to require)?

    Answer:
    OpenIDE-Module-Module-Dependencies: org.netbeans.modules.lexer/2 > @SPECIFICATION-VERSION@
    

Compatibility with environment

    Question (compat-i18n): Is your module correctly internationalized?

    Answer: Yes.

    Question (compat-standards): Does the module implement or define any standards? Is the implementation exact or does it deviate somehow?

    Answer: Compatible with standards.

    Question (compat-version): Can your module coexist with earlier and future versions of itself? Can you correctly read all old settings? Will future versions be able to read your current settings? Can you read or politely ignore settings stored by a future version?

    Answer: Yes.

    Question (compat-deprecation): How the introduction of your project influences functionality provided by previous version of the product?

    Answer:

    The current API completely replaces the original one; therefore the major version of the module was increased from 1 to 2.
    There are no plans to deprecate any part of the present API, and it should be evolved in a compatible way.


Access to resources

    Question (resources-file): Does your module use java.io.File directly?

    Answer: No.

    Question (resources-layer): Does your module provide own layer? Does it create any files or folders in it? What it is trying to communicate by that and with which components?

    Answer: No.

    Question (resources-read): Does your module read any resources from layers? For what purpose?

    Answer: No.

    Question (resources-mask): Does your module mask/hide/override any resources provided by other modules in their layers?

    Answer: No.

    Question (resources-preferences): Does your module use preferences via the Preferences API? Does your module use NbPreferences or regular JDK Preferences? Does it read, write, or both? Does it share preferences with other modules? If so, then why?

    Answer:

    No.


Lookup of components


Execution Environment

    Question (exec-property): Is execution of your code influenced by any environment or Java system (System.getProperty) property? On a similar note, is there something interesting that you pass to java.util.logging.Logger? Or do you observe what others log?

    Answer:
    • org.netbeans.lib.lexer.TokenHierarchyOperation — the FINE level lists lexer changes made to tokens at both the root level and the embedded levels of the token hierarchy after each document modification. The FINER level will in addition check the whole token hierarchy for internal consistency after each modification.
    • netbeans.debug.lexer.test — system property that enables a deep compare of token lists: the framework generates and maintains lookahead and state information even for batch-lexed inputs. Additional checks verify the correctness of the framework and of the SPI implementation classes in use (for example, when a flyweight token is created the text passed to the token factory is compared with the text in the lexer input).
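
    For instance, a test might enable these diagnostics as follows (a sketch using the standard JDK logging and system-property APIs):
        // Log lexer changes and check hierarchy consistency after each modification
        java.util.logging.Logger.getLogger("org.netbeans.lib.lexer.TokenHierarchyOperation")
                .setLevel(java.util.logging.Level.FINER);
        // Turn on the deep compare of token lists described above
        System.setProperty("netbeans.debug.lexer.test", "true");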

    Question (exec-component): Is execution of your code influenced by any (string) property of any of your components?

    Answer: No.

    Question (exec-ant-tasks): Do you define or register any ant tasks that other can use?

    Answer: No.

    Question (exec-classloader): Does your code create its own class loader(s)?

    Answer: No.

    Question (exec-reflection): Does your code use Java Reflection to execute other code?

    Answer: No.

    Question (exec-privateaccess): Are you aware of any other parts of the system calling some of your methods by reflection?

    Answer: No.

    Question (exec-process): Do you execute an external process from your module? How do you ensure that the result is the same on different platforms? Do you parse output? Do you depend on result code?

    Answer: No.

    Question (exec-introspection): Does your module use any kind of runtime type information (instanceof, work with java.lang.Class, etc.)?

    Answer: No.

    Question (exec-threading): What threading models, if any, does your module adhere to? How the project behaves with respect to threading?

    Answer: Use of token hierarchies for mutable input sources must adhere to the locking mechanisms of the input sources themselves.
    For example, accessing the token hierarchy of a swing document requires read/write locking of the document prior to accessing the token hierarchy.

    Question (security-policy): Does your functionality require modifications to the standard policy file?

    Answer: No.

    Question (security-grant): Does your code grant additional rights to some other code?

    Answer: No.

Format of files and protocols

    Question (format-types): Which protocols and file formats (if any) does your module read or write on disk, or transmit or receive over the network? Do you generate an ant build script? Can it be edited and modified?

    Answer: No files read or written to the disk.

    Question (format-dnd): Which protocols (if any) does your code understand during Drag & Drop?

    Answer: No D&D.

    Question (format-clipboard): Which data flavors (if any) does your code read from or insert into the clipboard (by accessing the clipboard or by calling methods on java.awt.datatransfer.Transferable)?

    Answer: No clipboard support.

Performance and Scalability

    Question (perf-startup): Does your module run any code on startup?

    Answer: No.

    Question (perf-exit): Does your module run any code on exit?

    Answer: No.

    Question (perf-scale): Which external criteria influence the performance of your program (size of file in editor, number of files in menu, in source directory, etc.) and how well your code scales?

    Answer: On a typical machine the framework is able to produce about 370,000 tokens of a text input with 1 million characters in less than 0.5 second.

    Question (perf-limit): Are there any hard-coded or practical limits in the number or size of elements your code can handle?

    Answer: No practical limits.

    Question (perf-mem): How much memory does your component consume? Estimate with a relation to the number of windows, etc.

    Answer: Memory consumption is critical for created tokens because there can be thousands of tokens per typical document. Thus there are several basic token types:
    • DefaultToken: 24 bytes
    • StringToken: 32 bytes (but only used for flyweight tokens)
    • PrepToken: 32 bytes plus text storage size (but only used for tokens where character preprocessing was necessary)

    Question (perf-wakeup): Does any piece of your code wake up periodically and do something even when the system is otherwise idle (no user interaction)?

    Answer: No.

    Question (perf-progress): Does your module execute any long-running tasks?

    Answer: All tasks should be granular. Both batch and incremental lexing are done lazily as clients ask for tokens.
    The only potentially long-running task is relexing a very long portion of a document, e.g. if someone typed '/*' at the beginning of a java document without any other comments, the whole document would turn into an unclosed comment.
    This typically is not a problem because the very long token does not need to be lexed repeatedly (the original support, without permanent tokens, had to lex the token upon each request).
    The lexer framework further improves the situation by introducing token validation, which attempts to validate the token by checking whether the typed character can really affect the token or whether it is only necessary to adjust the original token's length.

    Question (perf-huge_dialogs): Does your module contain any dialogs or wizards with a large number of GUI controls such as combo boxes, lists, trees, or text areas?

    Answer: No.

    Question (perf-menus): Does your module use dynamically updated context menus, or context-sensitive actions with complicated and slow enablement logic?

    Answer: No.

    Question (perf-spi): How the performance of the plugged in code will be enforced?

    Answer: Token change listener implementations should be written to execute quickly. For complex tasks they should reschedule their work into another thread.
