\chapter{Implementation}
In this chapter, the implementation of the tool utilizing the \DSL and \DSLSH will be presented. It will describe the overall architecture of the tool, the flow of data throughout, and how the different stages of transforming user code are completed.
\section{Architecture of the solution}
The architecture of the solution described in this thesis is illustrated in \figFull[fig:architecture].
The tool supports two ways of defining a proposal. Each provides the same functionality; they differ only in syntax and method of writing. One can either write the definition in \DSL, or use the programmatic API with \DSLSH, which is more convenient for programs to interact with.
In the architecture diagram of Figure~\ref{fig:architecture}, ellipse nodes show data passed into the program sections, and rectangular nodes are specific sections of the program. The architecture is split into seven levels, where each level is a step of the program. The initial step is the proposal definition, which can take two different forms: either \DSL code, or a JavaScript object using the self-hosted \DSLSH. If \DSL is used, the first step is parsing it with Langium~\cite{Langium}, which turns the raw source code into an AST. If \DSLSH is used, we instead build the prelude, extracting the wildcard definitions from the JavaScript source code. At this point the two paths meet at the second step, wildcard extraction. In this step, if \DSL was used, the wildcards are extracted from the template; if \DSLSH was used, extraction is not needed. In both cases we parse the wildcard type expressions into an AST. The third step is parsing the raw source code with Babel~\cite{Babel}; it is also at this point we parse the user's source code into an AST. The fourth step is translating the Babel AST into our own custom tree structure for simpler traversal. Once all data is prepared, the fifth step is matching the user's AST against the \texttt{applicable to} template AST. Once all matches have been found, we transplant the wildcard matches into the \texttt{transform to} template and insert it back into the user's code in step six. The user's code has at this point been transformed, and the final, seventh step is generating it back into source code.
\iffalse
\begin{description}
\item[\DSL Code] is the raw text definition of proposals
\item[Self-Hosted Object] is the self-hosted version in \DSLSH format
\item[1. Langium Parser] takes raw \DSL source code, and parses it into a DSL
\item[2. Wildcard parsing] extracts the wildcards from the raw template definition in \DSL, and parse
\item[1. Prelude-builder] translates JavaScript prelude into array of wildcard strings
\item[3. Babel] parses the templates and the users source code into an AST
\item[4. Custom Tree Builder] translates the Babel AST structure into our tree structure
\item[5. Matcher] finds matches with \texttt{applicable to} template in user code
\item[6. Transformer] performs transformation defined in \texttt{transform to} template to each match of the users AST
\item[7. Generator] generates source code from the transformed user AST
\end{description}
\fi
\begin{figure}[H]
\begin{center}
\begin{tikzpicture}[
roundnode/.style={ellipse, draw=red!60, fill=red!5, very thick, minimum size=7mm},
squarednode/.style={rectangle, draw=red!60, fill=red!5, very thick, minimum size=5mm}
]
\node[squarednode] (preParser) {2. Wildcard Extraction};
\node[squarednode] (preludebuilder) [above right=of preParser] {1. Prelude Builder};
\node[roundnode] (selfhostedjsoninput) [above=of preludebuilder] {Self-Hosted Object};
\node[squarednode] (langium) [above left=of preParser] {1. Langium Parser};
\node[roundnode] (jstqlcode) [above=of langium] {JSTQL Code};
\node[squarednode] (babel) [below=of preParser] {3. Babel};
\node[roundnode] (usercode) [left=of babel] {User source code};
\node[squarednode] (treebuilder) [below=of babel] {4. Custom Tree builder};
\node[squarednode] (matcher) [below=of treebuilder] {5. Matcher};
\node[squarednode] (transformer) [below=of matcher] {6. Transformer};
\node[squarednode] (joiner) [below=of transformer] {7. Generator};
\draw[->] (jstqlcode.south) -- (langium.north);
\draw[->] (langium.south) |- (preParser.west);
\draw[->] (preParser.south) |- (babel.north);
\draw[->] (babel.south) -- (treebuilder.north);
\draw[->] (treebuilder.south) -- (matcher.north);
\draw[->] (matcher.south) -- (transformer.north);
\draw[->] (transformer.south) -- (joiner.north);
\draw[->] (selfhostedjsoninput.south) -- (preludebuilder.north);
\draw[->] (preludebuilder.south) |- (preParser.east);
\draw[->] (usercode.east) -- (babel.west);
\end{tikzpicture}
\end{center}
\caption[Tool architecture]{Overview of tool architecture}
\label{fig:architecture}
\end{figure}
\section{Parsing \DSL using Langium}
In this section, the implementation of the parser for \DSL will be described. It outlines Langium, the parser generator used to create the AST that the tool later uses to perform the transformations.
\subsection{Langium}
Langium~\cite{Langium} is a language workbench primarily used to create parsers and Integrated Development Environments for domain specific languages. These kinds of parsers produce Abstract Syntax Trees that are later used to create interpreters or other tooling. In this project, we use Langium to generate an AST definition in the form of TypeScript objects. These objects and their structure are used as definitions for the tool to do matching and transformation of user code.
In order to generate this parser, Langium requires a definition of a grammar. A grammar is a specification that describes a valid program. The \DSL grammar describes the structure of \DSL, such as \texttt{proposals}, \texttt{cases}, \texttt{applicable to}, and \texttt{transform to}. A grammar in Langium starts by describing the \texttt{Model}. The model is the top entry of the grammar; it is where all valid top-level statements are described.
Contained within the \texttt{Model} rule, is one or more proposals. Each proposal is defined with the rule \texttt{Proposals}, and starts with the keyword \texttt{proposal}, followed by a name, and a code block. This rule is designed to contain every definition of a transformation related to a specific proposal. To hold every transformation definition, a proposal definition contains one or more cases.
The \texttt{Case} rule is created to contain a single transformation. Each case starts with the keyword \texttt{case}, followed by a name for the current case, then a block for that case's fields. Cases are designed in this way to separate different transformation definitions within a proposal. Each case contains a single definition used to match against user code, and a definition used to transform a match.
The rule \texttt{ApplicableTo} is designed to hold a single template used for matching. It starts with the keywords \texttt{applicable} and \texttt{to}, followed by a block designed to hold the matching template definition. The template is defined as the terminal \texttt{STRING}, and is parsed as a raw string of characters by Langium~\cite{Langium}.
The rule \texttt{TransformTo} is created to contain a single template used for transforming a match. It starts with the keywords \texttt{transform} and \texttt{to}, followed by a block that holds the transformation definition. This transformation definition is declared with the terminal \texttt{STRING}, and is parsed as a string of characters, the same as the template in \texttt{applicable to}.
In order to define exactly which characters/tokens are legal in a specific position, Langium uses terminals defined with regular expressions. These allow a very specific character set to be legal in specific keys of the AST produced by the generated parser. In the definitions of \texttt{Proposal} and \texttt{Case} the terminal \texttt{ID} is used; this terminal only allows words, which must begin with a letter of the alphabet or an underscore. In \texttt{ApplicableTo} and \texttt{TransformTo} the terminal \texttt{STRING} is used; this terminal is meant to allow any valid JavaScript code together with the custom DSL language described in Section~\ref{sec:DSL_DEF}. These terminal definitions allow Langium to determine exactly which characters are legal in each location.
\begin{lstlisting}[caption={Definition of \DSL in Langium.}, label={def:JSTQLLangium}]
grammar Jstql

entry Model:
    (proposals+=Proposal)*;

Proposal:
    'proposal' name=ID "{"
        (case+=Case)+
    "}";

Case:
    "case" name=ID "{"
        aplTo=ApplicableTo
        traTo=TransformTo
    "}";

ApplicableTo:
    "applicable" "to" "{"
        apl_to_code=STRING
    "}";

TransformTo:
    "transform" "to" "{"
        transform_to_code=STRING
    "}";

hidden terminal WS: /\s+/;
terminal ID: /[_a-zA-Z][\w_]*/;
terminal STRING: /"[^"]*"|'[^']*'/;
\end{lstlisting}
In the case of \DSL, we are not implementing a programming language meant to be executed. We are using Langium to generate an AST that will be used as a markup language, similar to YAML, JSON or TOML~\cite{TOML}. The main reason for using Langium in such an unconventional way is that it provides support for Visual Studio Code integration, and it avoids having to parse the definition of each proposal manually. However, with the grammar alone we cannot verify that the wildcards placed in \texttt{apl\_to\_code} and \texttt{transform\_to\_code} are correctly written. This is done using a feature of Langium called a \texttt{Validator}.
\subsection*{Langium Validator}
A Langium validator allows for further checks on the DSL code; it enables implementing specific checks on specific parts of the grammar.
\DSL does not allow empty typed wildcard definitions in \texttt{applicable to} blocks; a wildcard cannot be untyped or allow any AST type to match against it. This is not possible to verify within the grammar, since there the code is simply defined as a \texttt{STRING} terminal, so further checks have to be implemented in code. To do this, we implement a specific \texttt{Validator} on the \texttt{Case} definition of the grammar. Every time anything contained within a \texttt{Case} is updated, the language server created with Langium performs the validation step and reports any errors.
The validator uses \texttt{Case} as its entry point, as it allows for a checking of wildcards in both \texttt{applicable to} and \texttt{transform to}, allowing for a check for whether a wildcard identifier used in \texttt{transform to} exists in the definition of \texttt{applicable to}.
\begin{lstlisting}[language={JavaScript}, caption={Validator checking wildcards in \texttt{applicable to} and \texttt{transform to}.}]
export class JstqlValidator {
    validateWildcardAplTo(pair: Pair, accept: ValidationAcceptor): void {
        try {
            // validationResultAplTo and validationResultTraTo come from
            // running the wildcard parser on both templates (elided here).
            if (validationResultAplTo.errors.length != 0) {
                accept("error", validationResultAplTo.errors.join("\n"), {
                    node: pair.aplTo,
                    property: "apl_to_code",
                });
            }
            if (validationResultTraTo.length != 0) {
                accept("error", validationResultTraTo.join("\n"), {
                    node: pair.traTo,
                    property: "transform_to_code",
                });
            }
        } catch (e) {}
    }
}
\end{lstlisting}
\subsection*{Using Langium as a parser}
Langium~\cite{Langium} is designed to automatically generate extensive tool support for the language specified in its grammar. However, in our case we have to parse the \DSL definition using Langium, and then extract the generated abstract syntax tree in order to use the information it contains.
To use the parser generated by Langium, we created a custom function \texttt{parseDSLtoAST}, which takes a string as input (the raw \DSL code) and outputs the pure AST, using the format described in the grammar (Listing~\ref{def:JSTQLLangium}). This function is exposed as a custom API for our tool to interface with. This also means our tool depends on the Langium parser to function with \DSL; the implementation of \DSLSH is entirely independent.
When interfacing with the Langium parser to get the Langium-generated AST, the exposed API function is imported into the tool. When this API is executed, the output is on the form of the Langium \texttt{Model}, which follows the same structure as the grammar. This is then transformed into an internal object structure used by the tool, called \texttt{TransformRecipe}, which is then passed on to perform the actual transformation.
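To make this step concrete, the following is a minimal, hypothetical sketch of flattening a parsed Langium \texttt{Model} into recipe objects. The interfaces mirror the grammar rules; the real generated types and the real \texttt{TransformRecipe} carry more information than shown here.

```typescript
// Hypothetical sketch: these interfaces mirror the Langium grammar
// (Model, Proposal, Case); the real generated types and the real
// TransformRecipe structure differ in detail.
interface ApplicableTo { apl_to_code: string; }
interface TransformTo { transform_to_code: string; }
interface Case { name: string; aplTo: ApplicableTo; traTo: TransformTo; }
interface Proposal { name: string; case: Case[]; }
interface Model { proposals: Proposal[]; }

interface TransformRecipe {
    applicableTo: string; // raw template used for matching
    transformTo: string;  // raw template used for transforming
}

// Flatten the Langium AST into one recipe per case, keyed by
// "<proposal>/<case>" for lookup.
function toRecipes(model: Model): Map<string, TransformRecipe> {
    const recipes = new Map<string, TransformRecipe>();
    for (const proposal of model.proposals) {
        for (const c of proposal.case) {
            recipes.set(`${proposal.name}/${c.name}`, {
                applicableTo: c.aplTo.apl_to_code,
                transformTo: c.traTo.transform_to_code,
            });
        }
    }
    return recipes;
}
```
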
\section{Wildcard extraction and parsing}
In order to refer to internal DSL variables defined in \texttt{applicable to} and \texttt{transform to} blocks of the transformation, we need to extract this information from the template definitions and pass that on to the matcher.
\subsection*{Why not use Langium for wildcard parsing?}
Langium has support for creating a generator that outputs an artifact, which is some transformation applied to the AST built by the Langium parser. This suits the needs of \DSL quite well and could be used to extract the wildcards from each \texttt{pair} and create the \texttt{TransformRecipe}. This is the official way the developers of Langium intend this kind of functionality to be implemented; however, the implementation would still be mostly the same, as the parsing of the wildcards still has to be done ``manually'' in code. Therefore, we decided to keep the parsing of the wildcards within the tool itself. If we were to use Langium generators to parse the wildcards, \DSLSH would no longer be entirely independent, and the entire tool would rely on Langium. This is not preferred, as it would mean both ways of defining a proposal are reliant on Langium. Using our own extractor allows for an independent way to define transformations with our tool.
\subsection*{Extracting wildcards from \DSL}
In order to allow the use of Babel~\cite{Babel}, the wildcards present in the \texttt{applicable to} blocks and \texttt{transform to} blocks have to be parsed and replaced with some valid JavaScript. This is done by using a pre-parser that extracts the information from the wildcards and inserts an \texttt{Identifier} in their place.
To extract the wildcards from the template, we look at each character in the template. If a start token of a wildcard is discovered, denoted by \texttt{<<}, everything after it until the closing token, denoted by \texttt{>>}, is treated as an internal DSL variable and stored by the tool. A variable \texttt{flag} is used (Listing~\ref{lst:extractWildcard}): when \texttt{flag} is false, we know we are currently not inside a wildcard block, and the character is passed through to the variable \texttt{cleanedJS}. When \texttt{flag} is true, we know we are currently inside a wildcard block, and we collect every character of the wildcard block into \texttt{temp}. Once we hit the end of the wildcard block and have consumed the entire wildcard, the contents of \texttt{temp} are passed to a tokenizer, and the tokens are then parsed by a recursive descent parser.
Once the wildcard is parsed, and we know it is a valid wildcard, we insert an identifier into the JavaScript template where the wildcard resided. This allows for easier identification of wildcards when performing matching/transformation, as we can check whether an identifier in the code is the same as the identifier of a wildcard. This, however, introduces the problem of collisions between the inserted wildcard identifiers and identifiers present in the user's code. To avoid this, the tool adds \texttt{\_\_} at the beginning of every identifier inserted in place of a wildcard. This makes it easier to identify whether an identifier is a wildcard, and avoids collisions where a variable in the user code has the same name as a wildcard inserted into the template. This can be seen in the call to \texttt{collisionAvoider} in Listing~\ref{lst:extractWildcard}.
\begin{lstlisting}[language={JavaScript}, caption={Extracting wildcard from template.}, label={lst:extractWildcard}]
export function parseInternal(code: string): InternalParseResult {
    // State for the scan: whether we are inside a wildcard block, the
    // characters collected so far, the cleaned template, and the parsed
    // wildcards (these declarations were elided in the original excerpt).
    let flag = false;
    let temp = "";
    let cleanedJS = "";
    const prelude: Wildcard[] = [];
    for (let i = 0; i < code.length; i++) {
        if (code[i] === "<" && code[i + 1] === "<") {
            // From now on we are inside of the DSL custom block
            flag = true;
            i += 1;
            continue;
        }
        if (flag && code[i] === ">" && code[i + 1] === ">") {
            // We encountered a closing tag
            flag = false;
            try {
                let wildcard = new WildcardParser(
                    new WildcardTokenizer(temp).tokenize()
                ).parse();
                cleanedJS += collisionAvoider(wildcard.identifier.name);
                prelude.push(wildcard);
                i += 1;
                temp = "";
                continue;
            } catch (e) {
                // We probably encountered a bitshift operator;
                // append temp to cleanedJS
            }
        }
        if (flag) {
            temp += code[i];
        } else {
            cleanedJS += code[i];
        }
    }
    return { prelude, cleanedJS };
}
\end{lstlisting}
\paragraph*{Parsing wildcard}
Once a wildcard has been extracted from a definition inside \DSL, it has to be parsed into a simple tree to be used when matching against the wildcard. This is accomplished using a simple tokenizer and a recursive descent parser~\cite{RecursiveDescent}.
Our tokenizer takes the raw stream of input characters extracted from the wildcard block within the template and determines which token each part belongs to. Due to the very simple nature of the type expressions, no ambiguity is present among the tokens, so determining which token comes at which time is quite trivial. We use a switch case on the current character: if the token is of length one, we accept it and move on to the next character; if the next character is an unexpected one, we produce an error. The tokenizer also groups tokens with a \textit{token type}, which allows for simpler parsing of the tokens later.
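A minimal sketch of such a tokenizer is shown below. The exact token set is an assumption here (identifiers, \texttt{:}, parentheses, \texttt{*}, and the Boolean operators \texttt{\&\&}, \texttt{||}, \texttt{!}); the real implementation's tokens may differ.

```typescript
// Assumed token set for wildcard type expressions; the real tokenizer
// in the tool may recognize a slightly different set.
type TokenType =
    | "Identifier" | "Colon" | "LParen" | "RParen"
    | "Star" | "And" | "Or" | "Not";

interface Token { type: TokenType; value: string; }

function tokenize(source: string): Token[] {
    const tokens: Token[] = [];
    let i = 0;
    while (i < source.length) {
        const c = source[i];
        if (/\s/.test(c)) { i++; continue; }
        // Single-character tokens are accepted directly.
        const single: Record<string, TokenType> = {
            ":": "Colon", "(": "LParen", ")": "RParen",
            "*": "Star", "!": "Not",
        };
        if (single[c]) { tokens.push({ type: single[c], value: c }); i++; continue; }
        // Two-character Boolean operators.
        if (source.startsWith("&&", i)) { tokens.push({ type: "And", value: "&&" }); i += 2; continue; }
        if (source.startsWith("||", i)) { tokens.push({ type: "Or", value: "||" }); i += 2; continue; }
        // Identifiers: the wildcard name and AST type names.
        const m = /^[A-Za-z_]\w*/.exec(source.slice(i));
        if (m) { tokens.push({ type: "Identifier", value: m[0] }); i += m[0].length; continue; }
        throw new Error(`Unexpected character '${c}' at position ${i}`);
    }
    return tokens;
}
```
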
A recursive descent parser is created to closely mimic the grammar of the language it parses: we define a function for handling each of the non-terminals, and ways to determine which non-terminal each of the token types results in. The type expression language is a very simple Boolean expression language, making parsing quite simple.
\begin{lstlisting}[caption={Grammar of type expressions}, label={ex:grammarTypeExpr}]
Wildcard:
    Identifier ":" MultipleMatch

MultipleMatch:
    GroupExpr "*"
    | TypeExpr

TypeExpr:
    BinaryExpr
    | UnaryExpr
    | PrimitiveExpr

BinaryExpr:
    TypeExpr { Operator TypeExpr }*

UnaryExpr:
    {UnaryOperator}? TypeExpr

PrimitiveExpr:
    GroupExpr | Identifier

GroupExpr:
    "(" TypeExpr ")"
\end{lstlisting}
The grammar of the type expressions used by the wildcards can be seen in \figFull[ex:grammarTypeExpr]. The grammar is written in a notation similar to Extended Backus-Naur Form, where we define the terminals and non-terminals in a way that makes the entire grammar \textit{solvable} by the recursive descent parser.
Our recursive descent parser produces an AST~\cite{AST1,AST2}, which is later used to determine when a wildcard can be matched against a specific AST node; the full definition of this AST can be seen in Appendix~\ref{ex:typeExpressionTypes}. We use this AST by traversing it with a visitor pattern~\cite{VisitorPattern}, comparing each \texttt{Identifier} against the specific AST node we are currently checking, and evaluating all subsequent expressions to produce a Boolean value. If this value is true, the node is matched against the wildcard; if not, we do not have a match.
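The evaluation step can be sketched as follows. The type-expression AST shape used here is a simplified assumption (the real definition is richer): leaves compare directly against the node type, and Boolean results bubble upward through the expression.

```typescript
// Simplified type-expression AST; the real structure has more node
// kinds (grouping, Kleene quantifiers), this sketch keeps only the
// Boolean core.
type TypeExpr =
    | { kind: "Identifier"; name: string }
    | { kind: "Unary"; op: "!"; expr: TypeExpr }
    | { kind: "Binary"; op: "&&" | "||"; left: TypeExpr; right: TypeExpr };

// Evaluate the expression against a concrete AST node type. Leaves
// (identifiers) are equality-checked; intermediate results bubble up.
function evaluate(expr: TypeExpr, nodeType: string): boolean {
    switch (expr.kind) {
        case "Identifier":
            return expr.name === nodeType;
        case "Unary":
            return !evaluate(expr.expr, nodeType);
        case "Binary": {
            const l = evaluate(expr.left, nodeType);
            const r = evaluate(expr.right, nodeType);
            return expr.op === "&&" ? l && r : l || r;
        }
    }
}
```
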
\subsection*{Extracting wildcards from \DSLSH}
The self-hosted version \DSLSH also requires some pre-parsing in order to prepare the internal DSL environment. Compared to \DSL, this step is relatively minor: the wildcards are parsed directly, and no identifiers have to be inserted.
In order to use JavaScript as the meta language, we define a \texttt{prelude} on the object used to define the transformation case. This prelude is required to consist of several variable declaration statements, where the variable names are used as the internal DSL variables, and the right-hand side expressions are strings that contain the type expression used to determine a match for that specific wildcard.
We use Babel to generate the AST of the \texttt{prelude} definition, which gives us a JavaScript object structure. Since the structure is strictly defined, we can expect every statement of the body to be a variable declaration, and otherwise we throw an error for an invalid prelude. The string value of each variable declaration is then passed to the same parser used for \DSL wildcards. This approach is preferred because it allows us to avoid extracting the wildcards and inserting identifiers.
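The prelude check can be sketched as below. The object shapes mimic the Babel AST (\texttt{VariableDeclaration}, \texttt{VariableDeclarator}, \texttt{StringLiteral}); in the real tool this structure comes from \texttt{@babel/parser}, and the extracted strings go on to the wildcard parser.

```typescript
// Sketch of prelude extraction. The plain objects below stand in for
// Babel AST nodes; the field names match Babel's node shapes.
interface PreludeEntry { name: string; typeExpression: string; }

function extractPrelude(programBody: any[]): PreludeEntry[] {
    const entries: PreludeEntry[] = [];
    for (const stmt of programBody) {
        // Every statement of the prelude must be a variable declaration.
        if (stmt.type !== "VariableDeclaration") {
            throw new Error("Invalid prelude: expected a variable declaration");
        }
        for (const decl of stmt.declarations) {
            // Each declarator must bind a name to a string literal that
            // holds the wildcard's type expression.
            if (decl.id.type !== "Identifier" || decl.init?.type !== "StringLiteral") {
                throw new Error("Invalid prelude: expected `let name = \"type expr\"`");
            }
            entries.push({ name: decl.id.name, typeExpression: decl.init.value });
        }
    }
    return entries;
}
```
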
\section{Using Babel to parse}
\label{sec:BabelParse}
Allowing the tool to perform transformations of code requires generating an abstract syntax tree from the user's code and from the \texttt{applicable to} and \texttt{transform to} templates. This means parsing JavaScript into an AST; to do this we use Babel~\cite{Babel}.
The most important reason for choosing Babel to generate the ASTs used for transformation is the JavaScript community surrounding it. As this tool deals with proposals before they are part of JavaScript, a parser that supports early proposals is required. Babel works closely with TC39 to support experimental syntax~\cite{BabelProposalSupport} through its plugin system, which allows parsing code that is not yet part of the language.
\subsection*{Custom Tree Structure}
To perform matching and transformation on each of the sections inside a \texttt{case} definition, they have to be parsed into an AST. To do this, the tool uses Babel~\cite{Babel} to generate an AST data structure. However, this structure does not suit traversing multiple trees at the same time, which is a requirement for matching and transforming. Therefore, we take the Babel AST and transform it into a simple custom tree structure that allows for simple traversal.
As can be seen in \figFull[def:TreeStructure], we use a recursive definition of a \texttt{TreeNode}, where a node's parent either exists or is null (the node is the top of the tree), and a node can have any number of children. This definition allows for simple traversal both up and down the tree, which means two trees can be traversed at the same time in the matcher and transformer sections of the tool.
\begin{lstlisting}[language={JavaScript}, label={def:TreeStructure}, caption={Simple definition of a Tree structure in TypeScript}]
export class TreeNode<T> {
    public parent: TreeNode<T> | null;
    public element: T;
    public children: TreeNode<T>[] = [];

    constructor(parent: TreeNode<T> | null, element: T) {
        this.parent = parent;
        this.element = element;
        if (this.parent) this.parent.children.push(this);
    }
}
\end{lstlisting}
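A small usage example makes the up-and-down traversal concrete (the class is repeated here so the example is self-contained; the string elements stand in for Babel AST nodes):

```typescript
// TreeNode as defined above, repeated for a self-contained example.
class TreeNode<T> {
    public parent: TreeNode<T> | null;
    public element: T;
    public children: TreeNode<T>[] = [];
    constructor(parent: TreeNode<T> | null, element: T) {
        this.parent = parent;
        this.element = element;
        // The constructor registers the node with its parent, so the
        // downward links are maintained automatically.
        if (this.parent) this.parent.children.push(this);
    }
}

// Build a tiny three-level tree; strings stand in for AST nodes.
const program = new TreeNode<string>(null, "Program");
const stmt = new TreeNode<string>(program, "ExpressionStatement");
const expr = new TreeNode<string>(stmt, "CallExpression");

// Downward traversal uses `children`; upward traversal uses `parent`,
// so any node can reach the root without extra bookkeeping.
```
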
Placing the AST generated by Babel into this structure is done with the library Babel Traverse~\cite{BabelTraverse}. Babel Traverse uses the visitor pattern~\cite{VisitorPattern} to traverse the AST. While this method does not suit traversing multiple trees at the same time, it allows for very simple traversal of a single tree in order to place it into our simple tree structure.
\texttt{@babel/traverse}~\cite{BabelTraverse} visits each node of the AST in a \textit{depth-first} manner. The idea of the visitor pattern~\cite{VisitorPattern} is that one implements a \textit{visitor} for each of the node types in the AST, and when a specific node is visited, that visitor is used. In the case of transferring the AST into our simple tree structure, we simply use the same visitor for all nodes and place each node into the tree.
Visiting a node via the \texttt{enter()} function means we went from the parent to that child node, and the child should be added as a child node of the parent; the constructor of \texttt{TreeNode} automatically adds the node to its parent's list of children. Whenever we leave a node, the \texttt{exit()} function is called; this means we are moving back up the tree, and we have to update which node was the \textit{last} one in order to generate the correct tree structure.
\begin{lstlisting}[language={JavaScript}, caption={Converting the Babel AST into the custom tree structure.}]
// first and last track the root and the most recently visited node
// (these declarations were elided in the original excerpt).
let first: TreeNode<t.Node> | null = null;
let last: TreeNode<t.Node> | null = null;
traverse(ast, {
    enter(path: any) {
        let node: TreeNode<t.Node> = new TreeNode<t.Node>(
            last,
            path.node as t.Node
        );
        if (last == null) {
            first = node;
        }
        last = node;
    },
    exit(path: any) {
        if (last && last?.element?.type != "Program") {
            last = last.parent;
        }
    },
});
if (first != null) {
    return first;
}
\end{lstlisting}
\section{Outline of transforming user code}
\begin{algorithm}[H]
\caption{Outline of steps of algorithm}\label{lst:outline}
\begin{algorithmic}[1]
\State $CA, CT, W \gets extractWildcards()$
\State $A,T \gets babel.parse(CA, CT)$ \Comment{Parse templates}
\State $C \gets babel.parse()$ \Comment{Parse user code}
\State $AT, TT, UT \gets Tree(A, T, C)$ \Comment{Build the tree structure from Babel AST}
\If{$AT.length > 1$} \Comment{Decide which matcher to use}
\State $M \gets multiMatcher(UT, AT, W)$
\Else
\State $M \gets singleMatcher(UT, AT, W)$
\EndIf
\State $TransformedTemplates \gets $ []
\For{\textbf{each} m \textbf{in} M} \Comment{Build transformation templates}
\State TransformedTemplates.insert $\gets$ buildTransform($m$, $TT$, $W$);
\EndFor
\For{\textbf{each} $t$ \textbf{in} TransformedTemplates} \Comment{Insert transformed templates}
\State traverse($C$)
\If{$t.node == C.node$}
\State $C$.replaceMany($t$);
\EndIf
\EndFor
\State \Return babel.generate($C$);
\end{algorithmic}
\end{algorithm}
Each line in Algorithm~\ref{lst:outline} is a step in the full algorithm for transforming user code based on a proposal specification in our tool. These steps work as follows:
\begin{description}
\item [Line 1:] Extract the wildcards from the template definitions and replace them with identifiers.
\item [Lines 2-3:] Parse the templates and the user's source code into Babel ASTs using \texttt{@babel/parser}~\cite{BabelParser}.
\item [Line 4:] Convert the Babel ASTs into our own tree structure for simpler traversal of multiple trees at the same time.
\item [Lines 5-9:] Based on the \texttt{applicable to} template, decide which matching function to use, and find all matching sections of the user code.
\item [Lines 10-13:] Move all matched wildcard nodes into an instance of the \texttt{transform to} template.
\item [Lines 14-19:] Insert all transformations from the previous step into the original user AST.
\item [Line 20:] Generate source code from the user AST using \texttt{@babel/generate}~\cite{BabelGenerate}.
\end{description}
\section{Matching}
This section discusses how we find matches in the user's code; this is the step described in lines 5-9 of Algorithm~\ref{lst:outline}. First, we discuss how individual nodes are compared, then how the two traversal algorithms are implemented and how matches are discovered using them.
\paragraph*{Determining if AST nodes match.}
The initial problem we have to overcome is a way of comparing AST nodes from the template to AST nodes from the user code. This step also has to take into account comparing against wildcards and pass that information back to the AST matching algorithms.
When comparing two AST nodes in this tool, we use the function \texttt{checkCodeNode}, which will give the following values based on what kind of match these two nodes produce.
\begin{description}
\item[NoMatch:] The nodes do not match.
\item[Matched:] The nodes are a match, and the node of \texttt{applicable to} is not a wildcard.
\item[MatchedWithWildcard:] The node of the user AST produced a match against a wildcard.
\item[MatchedWithPlussedWildcard:] The node of the user AST produced a match against a wildcard that can match one or more nodes against itself.
\end{description}
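These four outcomes can be sketched as a tagged union, together with a heavily simplified node comparison. The representation below is an assumption for illustration: node types are plain strings and the wildcard's type expression is a predicate function, whereas the real \texttt{checkCodeNode} compares full Babel AST nodes.

```typescript
// Assumed representation of the four comparison outcomes; wildcard
// matches also record which wildcard was hit.
type MatchResult =
    | { kind: "NoMatch" }
    | { kind: "Matched" }
    | { kind: "MatchedWithWildcard"; wildcard: string }
    | { kind: "MatchedWithPlussedWildcard"; wildcard: string };

interface Wildcard {
    name: string;
    matchesType: (nodeType: string) => boolean; // evaluated type expression
    plussed: boolean; // whether the type expression has a Kleene plus
}

// Simplified comparison: if the template identifier names a wildcard,
// evaluate its type expression; otherwise compare node types directly.
function checkCodeNode(
    codeType: string,
    templateIdentifier: string,
    templateType: string,
    wildcards: Map<string, Wildcard>
): MatchResult {
    const wildcard = wildcards.get(templateIdentifier);
    if (wildcard) {
        if (!wildcard.matchesType(codeType)) return { kind: "NoMatch" };
        return wildcard.plussed
            ? { kind: "MatchedWithPlussedWildcard", wildcard: wildcard.name }
            : { kind: "MatchedWithWildcard", wildcard: wildcard.name };
    }
    return codeType === templateType ? { kind: "Matched" } : { kind: "NoMatch" };
}
```
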
When we compare two AST nodes, we have to perform an equality check. Because this is a structural matching search, we can get away with performing only some preliminary checks, such as comparing the names of identifiers; otherwise it is sufficient to check that the types of the two nodes are equal. If the types are the same, the nodes can validly be matched against each other. This is sufficient because we are currently determining whether a single node can be a match, not whether the entire template structure is a match. False positives are therefore highly unlikely, as the entire structure would have to be a false positive match.
There is a special case when comparing two nodes, namely when encountering a wildcard. To know whether we have encountered a wildcard, the current AST node of \texttt{applicable to} will be either an \texttt{Identifier} or an \texttt{ExpressionStatement} whose expression is an \texttt{Identifier}. The reason it might be an \texttt{ExpressionStatement} is the wildcard extraction step, where we replace the wildcard with an identifier of the same name. Due to this replacement, we might place an identifier in statement position, and the identifier will then be wrapped inside an \texttt{ExpressionStatement} AST node. If the node of \texttt{applicable to} is of either of these types, we have to check whether the name of the identifier is the same as that of a wildcard. If it is, we have to compare the type of the user AST node against the type expression of the wildcard.
\begin{lstlisting}[language={JavaScript}]
if ((aplToNode.type === "ExpressionStatement" &&
     aplToNode.expression.type === "Identifier") ||
    aplToNode.type === "Identifier") {
    // Check if aplToNode is a wildcard
}
\end{lstlisting}
When comparing an AST node type against a wildcard type expression, we pass the node type into a function \texttt{WildcardEvaluator}. This evaluator traverses the AST of the wildcard type expression; every leaf of the tree is equality-checked against the type, and the resulting Boolean values bubble up through the visitor until the entire expression has been evaluated. If the result of the evaluation is \texttt{false}, we return \texttt{NoMatch}. If the result is \texttt{true}, we know we can match the user's AST node against the wildcard: if the wildcard type expression contains a Kleene plus, the comparison returns \texttt{MatchedWithPlussedWildcard}; if not, we return \texttt{MatchedWithWildcard}.
\subsection{Matching a single Expression/Statement template}
\label{sec:singleMatcher}
The larger and more complex the \texttt{applicable to} template is, the fewer matches it will produce; therefore, using a single expression/statement as the matching template is preferred. There is a higher probability of discovering applicable code with a template that is as generic and simple as possible, while a very complex matching template with many statements might result in a lower chance of finding matches in the user's code. Simple, single-root-node matching templates thus provide the highest probability of discovering a match within the user's code. This section covers line 8 of Algorithm~\ref{lst:outline}.
To determine whether we are matching with a template that is only a single expression/statement, we verify that the program body of the template has a length of one; if it does, we can use the single-length traversal algorithm.
There is a special case when the template is a single expression: the first node of the AST generated by \texttt{@babel/parser}~\cite{BabelParser} will be of type \texttt{ExpressionStatement}, because Babel treats free-floating expressions as statements. This would miss many applicable parts of the user's code, because expressions nested within other statements are not wrapped in an \texttt{ExpressionStatement}, giving a template that is incompatible with many otherwise applicable expressions. Therefore, the statement has to be removed, and the search has to be done with the expression as the top node of the template. If the node in the body of the template is a statement, no removal has to be done, as a statement can be used directly.
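The root selection can be sketched as follows; plain objects with a \texttt{type} field stand in for Babel AST nodes, and the function name \texttt{templateRoot} is hypothetical.

```typescript
// Minimal stand-in for a Babel AST node.
interface AstNode { type: string; expression?: AstNode; }

// Choose the template's root node: a single free-floating expression is
// unwrapped from its ExpressionStatement so the template can also match
// expressions nested inside other statements.
function templateRoot(programBody: AstNode[]): AstNode {
    if (programBody.length !== 1) {
        throw new Error("single matcher requires a single statement/expression");
    }
    const node = programBody[0];
    if (node.type === "ExpressionStatement" && node.expression) {
        return node.expression;
    }
    // A statement can be used directly as the root of the search.
    return node;
}
```
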
\paragraph{Discovering Matches Recursively}
The matcher used against single expression/statement templates is based on Depth-First Search to traverse the trees. The algorithm can be split into two steps: the first is to check whether we are currently at the root of the \texttt{applicable to} AST; the second is to try to match the current nodes and start a search on each of their child nodes.
It is important that we try to match against the template at all levels of the code AST. This is done by starting a new search on every child node of the code AST whenever the current node of the template AST is the root node, which ensures a match has been attempted at every level of the tree. It also ensures we store no partial matches, as a result is stored only if the search returns a match when called with the root node of \texttt{applicable to}.
\begin{lstlisting}[language={JavaScript}]
if(aplTo.element === this.aplToRoot){
// Start a search from root of aplTo on all child nodes
for(let codeChild of code.children){
let [childMatch, result] = singleMatcher(codeChild, aplTo);
// If it is a match, we know it is a full match and store it.
if(result !== NoMatch){
this.matches.push(childMatch);
}
}
}
\end{lstlisting}
We can now determine whether we are currently exploring a match, by checking the current code AST node against the current node of the \texttt{applicable to} AST. Depending on the result of the comparison between these two nodes, we have to perform different steps.
\begin{description}
\item[NoMatch:] If a comparison between the nodes returns a \texttt{NoMatch} result, we perform an early return of \texttt{undefined}, as no match was discovered. We can safely discard this search, because a search was started at every level of the code AST.
\item[Matched:] The current code node matches against the current node of the template, and we have to perform a search on each of the child nodes.
\item[MatchedWithWildcard:] When a comparison results in a wildcard match, we pair the current code node with the template wildcard and do an early return. We can do this because when a wildcard matches, the child nodes do not matter; they will be transplanted into the transformation as-is.
\item[MatchedWithPlussedWildcard:] This is a special case of a wildcard match. When the match is against a wildcard with a Kleene plus tied to it, we also perform an early return. This result means special traversal has to be done on the current node's siblings, as described below.
\end{description}
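The dispatch on these comparison results can be sketched as one step of the recursive matcher; \texttt{compare} and \texttt{matchChildren} are injected stand-ins for the tool's own comparison and child-matching routines:

```javascript
// Sketch of one step of the recursive matcher, assuming injected helpers.
function matchStep(codeNode, aplToNode, compare, matchChildren) {
  switch (compare(codeNode, aplToNode)) {
    case "NoMatch":
      // Safe to discard: a search was started at every level of the code AST
      return undefined;
    case "MatchedWithWildcard":
    case "MatchedWithPlussedWildcard":
      // Early return: the children of a wildcard match are kept as-is
      return { codeNode: [codeNode], aplToNode };
    case "Matched":
      // Not a wildcard: recurse into the child nodes in order
      return matchChildren(codeNode, aplToNode);
  }
}
```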
A comparison result of \texttt{Matched} means the two nodes match, but the \texttt{applicable to} node is not a wildcard. In this case, we perform a search on each child node of the \texttt{applicable to} AST and the user AST. This is performed in order, meaning the $n$-th child node of \texttt{applicable to} is checked against the $n$-th child node of the user AST.
When checking the child nodes, we have to check for a special case when the comparison of the child nodes results in \texttt{MatchedWithPlussedWildcard}. If this result is encountered, we have to continue matching the same \texttt{applicable to} node against each subsequent sibling of the code node, because a wildcard with a Kleene plus can match multiple sibling nodes. This behavior can be seen on lines 13-24 of Listing \ref{lst:pseudocodeChildSearch}.
If no child node gave the result \texttt{NoMatch}, we have successfully matched every node of the \texttt{applicable to} AST. This does not yet mean we have a match, as there might be remaining child nodes of the code AST. To check for this, we check whether \texttt{codeI} is equal to the length of \texttt{code.children}; if it is not, we have not matched all child nodes of the code AST and have to return \texttt{NoMatch}. This can be seen on lines 28-30 of Listing \ref{lst:pseudocodeChildSearch}.
\begin{lstlisting}[language={JavaScript}, caption={Pseudocode of child node matching}, label={lst:pseudocodeChildSearch}]
let codeI = 0;
let aplToI = 0;
while (aplToI < aplTo.children.length && codeI < code.children.length){
let [pairedChild, childResult] = singleMatcher(code.children[codeI], aplTo.children[aplToI]);
// If a child does not match, the entire match is discarded
if(childResult === NoMatch){
return [undefined, NoMatch];
}
// Add the match to the current Paired Tree structure
pairedChild.parent = currentPair;
currentPair.children.push(pairedChild);
// Special case for Kleene plus wildcard match
if(childResult === MatchedWithPlussedWildcard){
codeI += 1;
while(codeI < code.children.length){
let [nextChild, plusChildResult] = singleMatcher(code.children[codeI], aplTo.children[aplToI]);
if(plusChildResult !== MatchedWithPlussedWildcard){
codeI -= 1;
break;
}
pairedChild.element.codeNode.push(...nextChild.element.codeNode);
codeI += 1;
}
}
codeI += 1;
aplToI += 1;
}
if(codeI !== code.children.length){
return [undefined, NoMatch]
}
return [currentPair, Matched];
\end{lstlisting}
\subsection{Matching multiple Statements}
Using multiple statements in the \texttt{applicable to} template means the tree of \texttt{applicable to} has multiple root nodes. To perform a match with this kind of template, we use a sliding window~\cite{SlidingWindow} with size equal to the number of statements in the template. This window is applied at every \texttt{BlockStatement} and \texttt{Program} node of the code AST, as those are the only places statements can reside in JavaScript~\cite[14]{ecma262}.
The initial step of this algorithm is to search the AST for nodes that contain a list of \texttt{Statement}s. Searching the tree is done by Depth-First Search; at every level of the AST we check the type of the node, and once a node of type \texttt{BlockStatement} or \texttt{Program} is discovered, we start trying to match the statements.
\begin{lstlisting}[language={JavaScript}]
multiStatementMatcher(code, aplTo) {
if (
code.element.type === "Program" ||
code.element.type === "BlockStatement"
) {
matchMultiHead(code.children, aplTo.children);
}
for (let code_child of code.children) {
multiStatementMatcher(code_child, aplTo);
}
}
\end{lstlisting}
\texttt{matchMultiHead} uses a sliding window~\cite{SlidingWindow}. The sliding window tries to match every statement of the code AST against its corresponding statement in the \texttt{applicable to} AST. For every statement, a DFS recursion similar to the algorithm of Section \ref{sec:singleMatcher} is applied; however, this search is not applied to all levels, and a statement has to match fully and immediately. If a match is not found, the current iteration of the sliding window is discarded and we move on to the next iteration by moving the window one statement further.
One important case here is that we might not know the width of the sliding window in advance. This is due to wildcards using the Kleene plus, as they can match one or more nodes; such a wildcard might, for instance, match against \texttt{(Statement)+}. Therefore, we use a technique similar to the one described in Section \ref{sec:singleMatcher}, where we keep two pointers and advance each depending on the match result.
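The fixed-width case of this sliding window can be sketched as follows (the Kleene-plus widening is elided, and \texttt{matchStmt} stands in for the recursive single-statement matcher):

```javascript
// Sketch: slide a window of template width over a statement list.
function matchMultiHead(codeStmts, aplToStmts, matchStmt) {
  const matches = [];
  // One window position per possible starting statement
  for (let start = 0; start + aplToStmts.length <= codeStmts.length; start++) {
    const window = codeStmts.slice(start, start + aplToStmts.length);
    // Every statement in the window must match fully and immediately,
    // otherwise this window position is discarded
    if (window.every((stmt, i) => matchStmt(stmt, aplToStmts[i]))) {
      matches.push(window);
    }
  }
  return matches;
}
```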
\subsection*{Output of the matcher}
The matches discovered have to be stored such that we can easily find all the nodes that were matched against wildcards and transfer them into the transformation later. To make this simpler, we make use of an object \texttt{PairedNodes}. This object allows us to easily find exactly which nodes were matched against each other. The matcher places these objects into the same tree structure described in Section \ref{sec:BabelParse}. This means the result of running the matcher on the user code is a list of \texttt{TreeNode<PairedNodes>}.
\begin{lstlisting}[language={JavaScript}]
interface PairedNodes {
codeNode: t.Node[],
aplToNode: t.Node
}
\end{lstlisting}
Since a match might span multiple statements, we use an interface \texttt{Match}, which contains a separate tree structure of \texttt{PairedNodes} for each statement. This allows storage of a match with multiple root nodes.
\begin{lstlisting}[language={JavaScript}]
export interface Match {
// Every matching Statement in order with each pair
statements: TreeNode<PairedNodes>[];
}
\end{lstlisting}
\section{Transforming}
To perform the transformation and replacement on each of the matches, we take the resulting list of matches, the template from the \texttt{transform to} section of the current case of the proposal, and the Babel AST~\cite{BabelAST} of the original code. All the transformations are then applied to the code, and we use \texttt{@babel/generator}~\cite{BabelGenerate} to generate JavaScript code from the transformed AST.
An important detail is to ensure we transform the leaves of the AST first, because if the transformation were applied from top to bottom, it might overwrite transformations made with a previous match. In the case of the pipeline proposal, transforming from top to bottom might produce \texttt{a(b) |> c(\%)} instead of \texttt{b |> a(\%) |> c(\%)}. This is easily solved in our case: since the matcher looks for matches from the top of the tree to the bottom, the matches it discovers are always in that order. Therefore, all that has to be done when transforming is to reverse the list of matches, so the ones closest to the leaves of the tree come first.
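As a toy model of why the order matters, consider rewriting each call \texttt{f(x)} into \texttt{x |> f(\%)} on the expression \texttt{c(a(b))}; the helper \texttt{toPipeline} below is purely illustrative, not the tool's transformation code:

```javascript
// Toy model: each "match" rewrites a call f(arg) into arg |> f(%).
function toPipeline(callee, argText) {
  return `${argText} |> ${callee}(%)`;
}
// Matches for c(a(b)) are discovered top-down: c(...) first, then a(b).
// Reversing the match list applies the leaf match a(b) first:
const inner = toPipeline("a", "b");   // "b |> a(%)"
const outer = toPipeline("c", inner); // "b |> a(%) |> c(%)"
```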
\subsubsection{Building the transformation}
Before we can insert the \texttt{transform to} section into the user's code AST, we have to insert all nodes matched against a wildcard in \texttt{applicable to} into their referenced locations.
The first step to achieve this is to extract the wildcards from the match tree. This is done by recursively searching the match tree for an \texttt{Identifier} or an \texttt{ExpressionStatement} containing an \texttt{Identifier}. To do this, we have a function \texttt{extractWildcardPairs}, which takes a single match, extracts all wildcards, and places them into a \texttt{Map<string, t.Node[]>}, where the key is the identifier used for the wildcard, and the value is the AST nodes the wildcard was matched against in the user's code.
\begin{lstlisting}[language={JavaScript}, caption={Extracting wildcard from match}, label={lst:extractWildcardFromMatch}]
function extractWildcardPairs(match: Match): Map<string, t.Node[]> {
let map: Map<string, t.Node[]> = new Map();
function recursiveSearch(node: TreeNode<PairedNodes>) {
let name: null | string = null;
if (node.element.aplToNode.type === "Identifier") {
name = node.element.aplToNode.name;
} else if (
    // Node is an ExpressionStatement wrapping an Identifier
    node.element.aplToNode.type === "ExpressionStatement" &&
    node.element.aplToNode.expression.type === "Identifier"
) {
name = node.element.aplToNode.expression.name;
}
if (name) {
// Store in the map
map.set(name, node.element.codeNode);
}
// Recursively search the child nodes
for (let child of node.children) {
recursiveSearch(child);
}
}
// Start the initial search
for (let stmt of match.statements) {
recursiveSearch(stmt);
}
return map;
}
\end{lstlisting}
Once the full map of all wildcards has been built, we have to insert them into the Babel AST of the \texttt{transform to} template. To do this, we traverse the template and insert the matched nodes of the user's code. We use \texttt{@babel/traverse}~\cite{BabelTraverse} to traverse the AST, as this provides a powerful API for modifying it. \texttt{@babel/traverse} allows us to define visitors that are executed when traversing specific types of AST nodes. We define a visitor for \texttt{Identifier} and a visitor for \texttt{ExpressionStatement}. These visitors do exactly the same thing; however, for \texttt{ExpressionStatement} we first have to check that the expression is an identifier.
When we visit a node that might be a wildcard, we check whether that node's name is in the map of wildcards built in Listing \ref{lst:extractWildcardFromMatch}. If the name of the identifier is a key in the wildcard map, we get the value for that key and perform a node replacement, replacing the identifier with the nodes from the user's code that were matched against that wildcard. See Listing \ref{lst:traToTransform}.
\begin{lstlisting}[language={JavaScript}, caption={Traversing \texttt{transform to} AST and inserting user context}, label={lst:traToTransform}]
traverse(transformTo, {
Identifier: (path) => {
if (wildcardMatches.has(path.node.name)) {
let toReplaceWith = wildcardMatches.get(path.node.name);
if (toReplaceWith) {
path.replaceWithMultiple(toReplaceWith);
}
}
},
ExpressionStatement: (path) => {
if (path.node.expression.type === "Identifier") {
let name = path.node.expression.name;
if (wildcardMatches.has(name)) {
let toReplaceWith = wildcardMatches.get(name);
if (toReplaceWith) {
path.replaceWithMultiple(toReplaceWith);
}
}
}
},
});
\end{lstlisting}
Because some wildcards allow matching multiple sibling nodes, we have to use \texttt{replaceWithMultiple} when performing the replacement. This can be seen on lines 6 and 16 of Listing \ref{lst:traToTransform}.
\subsubsection*{Inserting the template into the AST}
We have now created the \texttt{transform to} template with the user's context. This has to be inserted into the full AST of the user's code. To do this, we have to locate exactly where in the user AST this match originated, which we can do with an equality check on the top node of the user code stored in the match. To make this check efficient, we use this top node as the key of a \texttt{Map}; if a node in the user AST exists in that map, we know it was matched.
\begin{lstlisting}[language={JavaScript}]
transformedTransformTo.set(
match.statements[0].element.codeNode[0],
[
transformMatchFaster(wildcardMatches, traToWithWildcards),
match,
]
);
\end{lstlisting}
To traverse the user AST, we use \texttt{@babel/traverse}~\cite{BabelTraverse}. In this case we cannot use a specific visitor, and therefore we use a generic visitor that applies to every node of the AST. If the current node we are visiting is a key to the map of transformations, we know we have to insert the transformed code. This is done similarly to before where we use \texttt{replaceWithMultiple}.
Some matches have multiple root nodes; this happens when matching was done with multiple statements as top nodes. In that case we have to remove the $n-1$ following sibling nodes. Removal of these sibling nodes can be seen on lines 12-15 of Listing \ref{lst:insertingIntoUserCode}.
\begin{lstlisting}[language={JavaScript}, caption={Inserting transformed matches into user code}, label={lst:insertingIntoUserCode}]
traverse(codeAST, {
enter(path) {
if (transformedTransformTo.has(path.node)) {
let [traToWithWildcards, match] =
transformedTransformTo.get(path.node) as [
t.File,
Match
];
path.replaceWithMultiple(
traToWithWildcards.program.body);
let siblings = path.getAllNextSiblings();
// For multi line applicable to
for (let i = 0; i < match.statements.length - 1; i++) {
siblings[i].remove();
}
// When we have matched top statements with +, we might have to remove more siblings
for (let matchStmt of match.statements) {
for (let codeStmt of matchStmt.element
.codeNode) {
let siblingnodes = siblings.map((a) => a.node);
if (siblingnodes.includes(codeStmt)) {
let index = siblingnodes.indexOf(codeStmt);
siblings[index].remove();
}
}
}
}
},
});
\end{lstlisting}
There is a special case when a wildcard with a Kleene plus allows the match of multiple siblings, meaning we might have more siblings to remove. In this case it is not as simple to know exactly how many have to be removed; therefore, we iterate over all statements of the match, and check whether each statement is still a sibling of the node currently being replaced. This behavior can be seen on lines 16-26 of Listing \ref{lst:insertingIntoUserCode}.
After one full traversal of the user AST, all matches found have been replaced with their respective transformations. All that remains is generating JavaScript from the transformed AST.
\subsubsection*{Generating source code from transformed AST}
To generate JavaScript from the transformed AST created by this tool, we use the library \texttt{@babel/generator}~\cite{BabelGenerate}. This library is specifically designed to generate JavaScript from a Babel AST. We pass the transformed AST of the user's code to the generator, while being careful to apply all Babel plugins the current proposal might require.
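To illustrate the idea (AST in, source text out) without depending on Babel itself, a toy recursive printer over a minimal expression AST might look as follows; the node shapes mirror Babel's, but the function is purely illustrative and not the tool's generator:

```javascript
// Toy stand-in for a code generator: walk the AST and print source text.
function generateToy(node) {
  switch (node.type) {
    case "BinaryExpression":
      return `${generateToy(node.left)} ${node.operator} ${generateToy(node.right)}`;
    case "Identifier":
      return node.name;
    case "NumericLiteral":
      return String(node.value);
    default:
      throw new Error(`unhandled node type: ${node.type}`);
  }
}
```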