The Parsing Process
Rubberduck processes the code in all unprotected modules in a five-step process. First, in the parser state Pending, the projects and modules to parse are determined. Then, in the parser state LoadingReferences, the references currently used by the projects, e.g. the Excel object model, and some built-in declarations are loaded into Rubberduck. Following this, the actual processing of the code begins. Between the parser states Parsing and Parsed, the code gets parsed into parse trees with the help of Antlr4. Next, between the states ResolvingDeclarations and ResolvedDeclarations, the module, method and variable declarations are generated based on the parse trees. Finally, between the states ResolvingReferences and Ready, the parse trees are walked a second time to determine the references to the declarations within the code.
At each state change, an event is fired which can be handled by any feature subscribing to it, e.g. the CodeExplorer, which listens for the state change to ResolvedDeclarations.
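The event mechanism described above can be sketched as a plain observer pattern. This is only an illustration: the state names come from the text, while `StateNotifier`, `subscribe`, `set_state` and `code_explorer_handler` are hypothetical names and not Rubberduck's actual API.

```python
from enum import Enum, auto
from typing import Callable, List, Optional


class ParserState(Enum):
    """States of an error-free run, in the order given in the text."""
    PENDING = auto()
    LOADING_REFERENCES = auto()
    PARSING = auto()
    PARSED = auto()
    RESOLVING_DECLARATIONS = auto()
    RESOLVED_DECLARATIONS = auto()
    RESOLVING_REFERENCES = auto()
    READY = auto()


class StateNotifier:
    """Hypothetical stand-in for the component that announces state changes."""

    def __init__(self) -> None:
        self.state: Optional[ParserState] = None
        self._handlers: List[Callable[[ParserState], None]] = []

    def subscribe(self, handler: Callable[[ParserState], None]) -> None:
        self._handlers.append(handler)

    def set_state(self, new_state: ParserState) -> None:
        if new_state is self.state:
            return  # no change, so no event
        self.state = new_state
        for handler in self._handlers:
            handler(new_state)


refreshes = []


def code_explorer_handler(state: ParserState) -> None:
    # Like the CodeExplorer in the text, react only to ResolvedDeclarations.
    if state is ParserState.RESOLVED_DECLARATIONS:
        refreshes.append(state)


notifier = StateNotifier()
notifier.subscribe(code_explorer_handler)
for state in ParserState:  # enum members iterate in definition order
    notifier.set_state(state)
```

The handler sees every state change but acts on only one of them, which is the pattern a feature would follow when it needs the declarations but not the resolved references.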
The entry point for the parsing process is the ParseCoordinator inside the Rubberduck.Parsing assembly. It coordinates the parsing process and is responsible for triggering the appropriate state changes at the right time, for which it uses an IParserStateManager passed to it. To trigger the different stages of the parsing process, the ParseCoordinator uses an IParsingStageService. This is a facade passed to it that provides a unified interface for calling the individual stages, each of which is implemented in its own set of classes. Each stage has a concurrent version for production and a synchronous one for testing. The latter was needed because of concurrency problems with the mocking framework.
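A rough sketch of this arrangement, assuming each stage can be expressed as a function applied per module. `StageFacade`, `SynchronousRunner`, `ConcurrentRunner` and the fake stage functions are hypothetical names chosen for illustration; they are not the interfaces in Rubberduck.

```python
from concurrent.futures import ThreadPoolExecutor


class SynchronousRunner:
    """Processes modules one after another; convenient for deterministic tests."""

    def __init__(self, work):
        self._work = work

    def run(self, modules):
        return [self._work(module) for module in modules]


class ConcurrentRunner:
    """Processes modules on a thread pool, as a production variant might."""

    def __init__(self, work):
        self._work = work

    def run(self, modules):
        with ThreadPoolExecutor() as pool:
            # Executor.map returns results in the order of the inputs.
            return list(pool.map(self._work, modules))


class StageFacade:
    """Hypothetical facade: one entry point per stage, each backed by a runner."""

    def __init__(self, parse_runner, declaration_runner):
        self._parse_runner = parse_runner
        self._declaration_runner = declaration_runner

    def parse(self, modules):
        return self._parse_runner.run(modules)

    def resolve_declarations(self, modules):
        return self._declaration_runner.run(modules)


def fake_parse(module):
    return f"tree({module})"


def fake_resolve(module):
    return f"declarations({module})"


production = StageFacade(ConcurrentRunner(fake_parse), ConcurrentRunner(fake_resolve))
testing = StageFacade(SynchronousRunner(fake_parse), SynchronousRunner(fake_resolve))
```

Because the coordinator only talks to the facade, swapping the concurrent runners for synchronous ones does not change the calling code.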
Every parsing run gets executed in a fresh background task. Moreover, to always be in a consistent state, we allow only one parsing run to execute at a time. This is achieved by acquiring a lock in a top-level method. This top-level method is also the point at which any cancellation or unexpected exception is caught and logged.
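The locking and top-level error handling can be sketched as follows. This is a simplified illustration using threads: `CoordinatorSketch`, `OperationCancelled`, `begin_parse` and `fake_stages` are hypothetical names, and the real implementation may differ in how it schedules and cancels runs.

```python
import logging
import threading
import time

logger = logging.getLogger("parse_coordinator_sketch")


class OperationCancelled(Exception):
    """Hypothetical signal that a parsing run was cancelled."""


class CoordinatorSketch:
    def __init__(self, run_stages):
        self._run_stages = run_stages
        self._parse_lock = threading.Lock()
        self.outcomes = []

    def begin_parse(self):
        """Start a run on a fresh background thread and return the thread."""
        thread = threading.Thread(target=self._parse_top_level)
        thread.start()
        return thread

    def _parse_top_level(self):
        # Only one run can hold the lock, so runs never overlap.
        with self._parse_lock:
            try:
                self._run_stages()
                self.outcomes.append("completed")
            except OperationCancelled:
                logger.info("Parsing run was cancelled.")
                self.outcomes.append("cancelled")
            except Exception:
                logger.exception("Unexpected exception in parsing run.")
                self.outcomes.append("failed")


counters = {"active": 0, "max_active": 0}


def fake_stages():
    # Track how many runs are inside the stages at the same time.
    counters["active"] += 1
    counters["max_active"] = max(counters["max_active"], counters["active"])
    time.sleep(0.01)
    counters["active"] -= 1


coordinator = CoordinatorSketch(fake_stages)
threads = [coordinator.begin_parse() for _ in range(3)]
for thread in threads:
    thread.join()
```

Three runs are requested at once, but the counter never observes more than one of them inside the stages.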
The first step of the actual parsing process is to set the overall parser state to Pending. This signals to all components of Rubberduck that we have left a fully usable state. Afterwards, we refresh the projects cache on the RubberduckParserState by asking the VBE for the loaded projects and then acquire a collection of the modules currently present.
After setting the overall parser state to LoadingReferences, the declarations for the project references, i.e. the references selected in Tools --> References..., get loaded into Rubberduck. This is done using the ReferencedDeclarationsCollector in the Rubberduck.Parsing.ComReflection namespace, which reads the appropriate type libraries and generates the corresponding declarations.
Note that the order in the References dialog determines which procedure or field an identifier resolves to in VBA if two or more references define a procedure or field of the same name. This prioritization is taken into account when loading the references.
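To illustrate the priority rule, here is a minimal sketch in which the first reference in the list that defines a name wins. The function name and the library contents are made up for the example (`HypotheticalLib` in particular), and VBA's case-insensitive name matching is ignored for brevity.

```python
def resolve_identifier(name, references_in_priority_order):
    """Return (library, member) for the first reference that defines `name`."""
    for library, members in references_in_priority_order:
        if name in members:
            return library, name
    return None


# Order mirrors the order in the References dialog, highest priority first.
references = [
    ("VBA", {"Left", "Mid"}),
    ("Excel", {"Range", "Application"}),
    ("HypotheticalLib", {"Range", "Left"}),
]

range_owner = resolve_identifier("Range", references)
left_owner = resolve_identifier("Left", references)
```

Here both `Range` and `Left` are also defined by the lower-priority library, but the earlier entries shadow it.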
Unfortunately, we are currently not able to load all built-in declarations from the type libraries: there are some hidden members of the MSForms library, some special syntax declarations like LBound and everything related to Debug, and aliases for built-in functions like Left, where Left is the alias for the actual hidden function defined in the VBA type library. These get loaded as a set of hand-crafted declarations defined in the Rubberduck.Parsing.Symbols.DeclarationLoaders namespace.
At the start of the processing of the actual code, the parser state is set to Parsing. However, this time this is achieved by setting the individual module states of the modules to be parsed and then evaluating the overall state.
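The text does not spell out how the overall state is derived from the module states. The sketch below assumes one plausible rule purely for illustration: any error state wins, otherwise the overall state is that of the least advanced module. The function name and the rule itself are assumptions, not a description of the actual evaluation logic.

```python
# State names from the text, ordered by progress through an error-free run.
PROGRESS_ORDER = [
    "Pending",
    "LoadingReferences",
    "Parsing",
    "Parsed",
    "ResolvingDeclarations",
    "ResolvedDeclarations",
    "ResolvingReferences",
    "Ready",
]
ERROR_STATES = ["ParserError", "ResolverError"]


def evaluate_overall_state(module_states):
    """Assumed rule: errors dominate, otherwise take the least advanced state."""
    states = list(module_states)
    if not states:
        raise ValueError("at least one module state is required")
    for error in ERROR_STATES:
        if error in states:
            return error
    return min(states, key=PROGRESS_ORDER.index)
```

Under this assumed rule, setting every module to be parsed to Parsing and then evaluating yields an overall state of Parsing, and the overall state only advances once the slowest module has advanced.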
Each module gets parsed separately using an individual ComponentParseTask from the Rubberduck.Parsing.VBA namespace, which is powered by the Antlr4 parser generator. The end result is a pair of parse trees providing a structured representation of the code, once as seen in the VBE and once as exported to file.
The general process using Antlr is to provide the code to a lexer that turns the code into a stream of tokens based on lexer rules. (The lexer rules used in Rubberduck can be found in the file VBALexer.g4 in the Rubberduck.Parsing.Grammar namespace.) Then this token stream gets processed by a parser that generates a parse tree based on the stream and a set of parser rules describing the syntactic rules of the language. (The VBA parser rules used in Rubberduck can be found in the file VBAParser.g4 in the Rubberduck.Parsing.Grammar namespace. However, there are more specialised rules in the project.) The parse tree then consists of nodes of various types corresponding to the rules in the parser rules.
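The lexer-then-parser workflow can be shown on a toy scale. The following hand-written sketch handles only assignments of sums and has nothing to do with the generated Antlr classes or the real grammar files; the token names and the `letStmt` and `expression` rule names are chosen for the example only.

```python
import re

# Toy lexer rules: (token type, regular expression).
TOKEN_RULES = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("EQ", r"="),
    ("PLUS", r"\+"),
    ("WS", r"\s+"),
]
TOKEN_PATTERN = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_RULES))


def lex(code):
    """Turn the code into a list of (type, text) tokens, dropping whitespace."""
    tokens = []
    position = 0
    while position < len(code):
        match = TOKEN_PATTERN.match(code, position)
        if match is None:
            raise SyntaxError(f"unexpected character {code[position]!r}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


def parse_let_statement(tokens):
    """Toy parser rules:
    letStmt    : IDENTIFIER EQ expression
    expression : operand (PLUS operand)*
    """
    stream = list(tokens)

    def expect(kind):
        if not stream or stream[0][0] != kind:
            raise SyntaxError(f"expected {kind}")
        return stream.pop(0)

    def operand():
        if stream and stream[0][0] in ("NUMBER", "IDENTIFIER"):
            return ("operand", stream.pop(0)[1])
        raise SyntaxError("expected operand")

    target = expect("IDENTIFIER")[1]
    expect("EQ")
    operands = [operand()]
    while stream and stream[0][0] == "PLUS":
        stream.pop(0)
        operands.append(operand())
    if stream:
        raise SyntaxError("unexpected trailing tokens")
    return ("letStmt", target, ("expression", operands))


tokens = lex("total = 1 + x")
tree = parse_let_statement(tokens)
```

Each node in the resulting tree is labelled with the rule that produced it, which is the property the later tree walks rely on.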
Even when counting the Antlr workflow described above as one step, the actual parsing process in the ComponentParseTask is a multi-stage process in itself. There are two reasons for this: there are precompiler directives in VBA, and some information regarding modules is hidden from the user inside the VBE, namely attributes.
The precompiler directives in VBA make it possible to conditionally select which code is alive. This makes it possible to write code that is only legal VBA after the conditional compilation directives have been evaluated. Accordingly, the evaluation has to happen before the code reaches the parser. To achieve this, we first parse each module with a specialized grammar for the precompiler directives and then hide all tokens that are dead after the evaluation from the VBA parser, including the precompiler directives themselves, by sending them to a hidden channel in the token stream. Afterwards, the dead code is still part of the text representation of the token stream but is disregarded by the parser.
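The idea of hiding dead code on a separate channel can be sketched with whole lines standing in for tokens. This toy version only understands a single non-nested `#If NAME Then` / `#Else` / `#End If` block and is not the specialized grammar mentioned above; `Token`, `hide_dead_lines` and the channel constants are hypothetical names.

```python
from dataclasses import dataclass

DEFAULT_CHANNEL = 0  # seen by the parser
HIDDEN_CHANNEL = 1   # kept in the stream, ignored by the parser


@dataclass
class Token:
    text: str
    channel: int = DEFAULT_CHANNEL


def hide_dead_lines(lines, constants):
    """Assign each line to a channel based on a toy evaluation of the directives."""
    tokens = []
    alive = True
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#If ") and stripped.endswith(" Then"):
            name = stripped[len("#If "):-len(" Then")].strip()
            alive = bool(constants.get(name, False))
            tokens.append(Token(line, HIDDEN_CHANNEL))  # directive itself is hidden
        elif stripped == "#Else":
            alive = not alive
            tokens.append(Token(line, HIDDEN_CHANNEL))
        elif stripped == "#End If":
            alive = True
            tokens.append(Token(line, HIDDEN_CHANNEL))
        else:
            tokens.append(Token(line, DEFAULT_CHANNEL if alive else HIDDEN_CHANNEL))
    return tokens


source = [
    "#If VBA7 Then",
    "Dim handle As LongPtr",
    "#Else",
    "Dim handle As Long",
    "#End If",
]

tokens = hide_dead_lines(source, {"VBA7": True})
visible_to_parser = [t.text for t in tokens if t.channel == DEFAULT_CHANNEL]
full_text = "\n".join(t.text for t in tokens)
```

The parser only sees the live declaration, while the text reconstructed from the stream still contains every original line, so positions in the text are not shifted.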
To cover the attributes, which are only present in the exported modules, and still provide meaningful line numbers in inspection results, errors and the command bar, we parse the attributes and the code as seen in the VBE code pane into separate parse trees and save both on the ModuleState belonging to the module on the RubberduckParserState.
One thing of note is that Antlr provides two different kinds of parsers: the LL parser, which parses essentially all valid input for every grammar that is not indirectly left-recursive (our VBA grammar satisfies this), and the SLL parser, which is considerably faster but cannot necessarily parse all valid input for all such grammars. Both parsers are guaranteed to yield the same result whenever the parse succeeds at all. Since the SLL parser works for nearly all commonly encountered code, we first parse using it and fall back to the LL parser if there is a parser error.
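The two-stage strategy boils down to a try-then-fall-back pattern. The sketch below uses stub functions in place of real parsers: `ParseError`, `parse_with_fallback`, `sll_stub` and `ll_stub` are hypothetical, and the strings "tricky" and "invalid" merely simulate input the fast parser or both parsers reject.

```python
class ParseError(Exception):
    """Hypothetical error raised when a parser cannot parse the input."""


def parse_with_fallback(code, fast_parse, full_parse):
    """Try the fast parser first; fall back to the full one on a parse error."""
    try:
        return fast_parse(code), "SLL"
    except ParseError:
        # A failure of the fast parser does not prove the code is invalid,
        # so the slower but more general parser gets the final say.
        return full_parse(code), "LL"


def sll_stub(code):
    # Stand-in: pretend the fast parser fails on some valid but unusual input.
    if "tricky" in code or "invalid" in code:
        raise ParseError(code)
    return f"tree({code})"


def ll_stub(code):
    # Stand-in: the full parser only fails on genuinely invalid input.
    if "invalid" in code:
        raise ParseError(code)
    return f"tree({code})"


common_result = parse_with_fallback("plain code", sll_stub, ll_stub)
unusual_result = parse_with_fallback("tricky code", sll_stub, ll_stub)
```

Only when the second attempt fails as well does the error propagate, which corresponds to the module ending up in an error state.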
Following the parse, the state of the module is set to Parsed on a successful parse and to ParserError otherwise. After all modules have finished parsing, the overall parser state is evaluated. If there has been any parser error, the parsing process ends here.
After parsing the code into parse trees, it is time to generate the declarations for the procedures, functions, properties, variables and arguments in the code.
First, the state of all modules gets set to ResolvingDeclarations, analogous to the start of parsing the code. Then the tree walker and listener infrastructure of Antlr is used to traverse the parse trees and generate declarations whenever the appropriate grammar constructs are encountered. This is done inside the implementations of IDeclarationResolveRunner in the Rubberduck.Parsing.VBA namespace.
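The walker and listener approach can be illustrated with a hand-rolled tree. In this sketch the `Node` type, the `walk` function and the listener are simplified stand-ins for the Antlr infrastructure, and the rule names are used only as illustrative labels rather than as a statement about the real grammar.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    rule: str
    text: str = ""
    children: list = field(default_factory=list)


class DeclarationListener:
    """Creates a declaration whenever a declaring construct is entered."""

    DECLARING_RULES = {
        "subStmt": "Procedure",
        "functionStmt": "Function",
        "variableSubStmt": "Variable",
    }
    SCOPE_RULES = ("subStmt", "functionStmt")

    def __init__(self):
        self.declarations = []
        self._scope = []  # stack of enclosing procedure names

    def enter(self, node):
        kind = self.DECLARING_RULES.get(node.rule)
        if kind is not None:
            parent = self._scope[-1] if self._scope else "<module>"
            self.declarations.append((kind, node.text, parent))
        if node.rule in self.SCOPE_RULES:
            self._scope.append(node.text)

    def exit(self, node):
        if node.rule in self.SCOPE_RULES:
            self._scope.pop()


def walk(listener, node):
    """Depth-first traversal that notifies the listener on entry and exit."""
    listener.enter(node)
    for child in node.children:
        walk(listener, child)
    listener.exit(node)


module_tree = Node("module", "Module1", [
    Node("variableSubStmt", "counter"),
    Node("subStmt", "Main", [Node("variableSubStmt", "index")]),
])

listener = DeclarationListener()
walk(listener, module_tree)
```

The scope stack is what lets the listener record that `index` belongs to `Main` while `counter` belongs to the module.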
Note that some information is still missing on the declarations at this point because it cannot be determined in this first pass over the parse trees. For example, the supertypes of classes implementing the interface of another class are not known yet and, although the name of the type of each declaration is already known, the actual type might not be known yet. In both cases we first have to know all declarations.
After the parse trees of all modules have been walked, the overall parser state gets set to ResolvedDeclarations, unless there has been an error, which would result in the state ResolverError and an immediate stop of the parsing run.
After all declarations are known, it is possible to resolve all references to these declarations within the code, be it as types, supertypes or in expressions. This is done using the implementations of IReferenceResolveRunner in the Rubberduck.Parsing.VBA namespace.
First, the state of the modules for which to resolve the references gets set to ResolvingReferences and the overall state gets evaluated. Then the CompilationPasses run. In these, the type names found when resolving the declarations get resolved to the actual types. Moreover, the type hierarchy gets determined, i.e. super- and subtypes get added to the declarations based on the Implements statements in the code.
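Building the type hierarchy from Implements statements amounts to recording each relationship in both directions. The following is a minimal sketch with made-up class names; `build_type_hierarchy` is a hypothetical helper, not a function in Rubberduck.

```python
from collections import defaultdict


def build_type_hierarchy(implements_statements):
    """Map each class to its supertypes and each interface to its subtypes.

    `implements_statements` is a list of (class name, implemented interface)
    pairs, one per Implements statement found in the code.
    """
    supertypes = defaultdict(set)
    subtypes = defaultdict(set)
    for class_name, interface_name in implements_statements:
        supertypes[class_name].add(interface_name)
        subtypes[interface_name].add(class_name)
    return dict(supertypes), dict(subtypes)


statements = [
    ("FileLogger", "ILogger"),
    ("DebugLogger", "ILogger"),
    ("FileLogger", "IDisposable"),
]
supertypes, subtypes = build_type_hierarchy(statements)
```

This step needs all class declarations to exist first, which is why it cannot happen during the initial declaration pass.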
After that, the parse trees get walked again to find all references to the declarations. This is a slightly complicated process because of the various language constructs in VBA. As a side effect, the variables not resolving to any declaration get collected. Based on these, new declarations get created, which get marked as undeclared. These form the basis for the inspection for undeclared variables.
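The side effect described above, creating declarations for identifiers that resolve to nothing, can be sketched as follows. The data layout (dictionaries keyed by lower-cased name) and the function name are assumptions made for the example, and scoping is ignored entirely.

```python
def resolve_references(usages, declarations):
    """Attach each usage to its declaration, creating undeclared ones as needed.

    `usages` is a list of (identifier, line) pairs. `declarations` maps the
    lower-cased name to a declaration dictionary and is updated in place.
    Returns the declarations that had to be created.
    """
    for name, line in usages:
        key = name.lower()  # VBA identifiers are case-insensitive
        if key not in declarations:
            declarations[key] = {"name": name, "is_undeclared": True, "references": []}
        declarations[key]["references"].append(line)
    return [d for d in declarations.values() if d["is_undeclared"]]


declarations = {
    "counter": {"name": "counter", "is_undeclared": False, "references": []},
}
usages = [("counter", 3), ("Counter", 4), ("totl", 5)]
undeclared = resolve_references(usages, declarations)
```

An inspection for undeclared variables would then only need to report the declarations flagged this way.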
After all references in a module have been resolved, the module state gets set to Ready. If there is an error, the module state gets set to ResolverError. Finally, the overall state gets evaluated and the parsing run ends.
rubberduckvba.com
© 2014-2021 Rubberduck project contributors