Almost finished with ch4, re-wrote the entire Matching chapter

This commit is contained in:
Rolf Martin Glomsrud 2024-05-27 22:54:59 +02:00
parent 281893a474
commit 28e029a369
3 changed files with 179 additions and 47 deletions

View file

@ -1 +1,4 @@
In the chapter on parsing of the wildcards. What do you mean by the exclamation mark of repetition?
Combine them!
Kleene plus

Binary file not shown.

View file

@ -30,7 +30,7 @@ In the architecture diagram, circular nodes show data passed into the program se
roundnode/.style={ellipse, draw=red!60, fill=red!5, very thick, minimum size=7mm},
squarednode/.style={rectangle, draw=red!60, fill=red!5, very thick, minimum size=5mm}
]
\node[squarednode] (preParser) {2. Pre-parser};
\node[squarednode] (preParser) {2. Wildcard Extraction};
\node[squarednode] (preludebuilder) [above right=of preParser] {1. Prelude Builder};
\node[roundnode] (selfhostedjsoninput) [above=of preludebuilder] {Self-Hosted Object};
\node[squarednode] (langium) [above left=of preParser] {1. Langium Parser};
@ -267,7 +267,7 @@ The most important reason for choosing to use Babel for the purpose of generatin
\subsection*{Custom Tree Structure}
To allow for matching and transformations to be applied to each of the sections inside a \texttt{pair} definition, the sections have to be parsed into an AST. To do this the tool uses the library \cite[Babel]{Babel} to generate an AST data structure. However, this structure does not suit traversing multiple trees at the same time, which is a requirement for matching and transforming. Therefore we take this Babel AST and transform it into a simple custom tree structure that allows for simple traversal of the tree.
To perform matching and transformation on each of the sections inside a \texttt{case} definition, the sections have to be parsed into an AST. To do this the tool uses the library \cite[Babel]{Babel} to generate an AST data structure. However, this structure does not suit traversing multiple trees at the same time, which is a requirement for matching and transforming. Therefore we take this Babel AST and transform it into a simple custom tree structure that allows for simple traversal of the tree.
As can be seen in \figFull[def:TreeStructure], we use a recursive definition of a \texttt{TreeNode}, where a node's parent either exists or is null (the node is the top of the tree), and a node can have any number of child elements. This definition allows for simple traversal both up and down the tree, which means traversing two trees at the same time can be done in the matcher and transformer sections of the tool.
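As a rough illustration of why such a structure makes parallel traversal simple, the following is a minimal sketch of a recursive tree node and a lockstep walk over two trees. The class layout and the helper \texttt{walkTogether} are assumptions made for this example and are not taken from the tool's source.

```javascript
// Hypothetical sketch: a node knows its parent (null at the root) and its children.
class TreeNode {
    constructor(element, parent = null) {
        this.element = element;   // payload, for example a Babel AST node
        this.parent = parent;     // null means this node is the top of the tree
        this.children = [];
        if (parent !== null) {
            parent.children.push(this);
        }
    }
}

// Visit two trees in lockstep: the n-th child of `a` is paired with the n-th child of `b`.
function walkTogether(a, b, visit) {
    visit(a, b);
    const shared = Math.min(a.children.length, b.children.length);
    for (let i = 0; i < shared; i++) {
        walkTogether(a.children[i], b.children[i], visit);
    }
}

// Two tiny trees whose payloads are just node type names.
const left = new TreeNode("Program");
new TreeNode("ExpressionStatement", left);
const right = new TreeNode("Program");
new TreeNode("ReturnStatement", right);

const visited = [];
walkTogether(left, right, (a, b) => visited.push([a.element, b.element]));
```

Because every node carries both \texttt{parent} and \texttt{children}, the same structure can be walked upwards as well, which a plain Babel visitor does not offer for two trees at once.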
@ -317,60 +317,204 @@ traverse(ast, {
\end{lstlisting}
\section{Outline of transforming user code}
\begin{lstlisting}[language={JavaScript}, caption={Outline of the algorithm}, label={lst:outline}]
let wildcards = ExtractWildcards();
ParseWithBabel(); // Parse all JavaScript source code with Babel
MakeTrees(); // Build the tree structure from Babel AST
// Search the user code for matches to applicable to
let matches = applicableTo.program.length > 1
    // Several statements: use the sliding window matcher
    ? multiStatementMatcher(codeTree, applicableTo, wildcards)
    : singleMatcher(codeTree, applicableTo, wildcards);
// matches is declared at this level so the following steps can read it
// Build the transformation with code matched to wildcards
let transformed = matches.map(
    (match) => buildTransform(match, transformTo, wildcards)
);
// Replace original matched sections with transformed section
for(let transform of transformed){
traverse(codeAST){
if(transform.node === codeAST.node){
codeAST.replace(transform);
}
}
}
// Generate code from the AST
return babel.generate(codeAST);
\end{lstlisting}
Listing \ref{lst:outline} outlines the full algorithm for transforming user code based on a proposal specification in our tool. The steps work as follows:
\begin{description}
\item [Line 1:] Extract the wildcards from the template definitions and replace them with identifiers.
\item [Line 3:] Parse all source code into a Babel AST using \texttt{@babel/parser}\cite{BabelParser}.
\item [Line 5:] Convert the Babel AST into our own tree structure for simpler traversal of multiple trees at the same time.
\item [Lines 7-12:] Based on the \texttt{applicable to} template, decide which matching function to use, and find all matching sections of the user code.
\item [Lines 14-17:] Move all matched wildcard nodes into an instance of the \texttt{transform to} template.
\item [Lines 20-26:] Insert all transformations from the previous step into the original user AST.
\item [Line 29:] Generate source code from the user AST using \texttt{@babel/generator}\cite{BabelGenerate}.
\end{description}
\section{Matching}
Performing the match against the user's code is the most important step, because if no matching code is found the tool will do no transformations. Finding the matches depends entirely on how well the definition of the proposal is written, and how well the proposal can actually be defined within the confines of \DSL. In this chapter we will discuss how matching is performed based on the definition of \texttt{applicable to}.
\subsection*{Determining if AST nodes match}
The initial problem we have to overcome is a way of comparing AST nodes from the template to AST nodes from the user code. This step also has to take into account comparing against wildcards and pass that information back to the AST matching algorithms.
This section discusses how we find matches in the user's code. This is the step described in lines 7-12 of Listing \ref{lst:outline}. First, we discuss how individual nodes are compared, then how the two traversal algorithms are implemented, and finally how matches are discovered using these algorithms.
In the pre-parsing step of \DSL we are replacing each of the wildcards with an expression of type Identifier, this means we are inserting an Identifier at either a location where an expression resides, or a statement. In the case of the identifier being placed where a statement should reside, it will be wrapped in an ExpressionStatement. This has to be taken into account when comparing statement nodes from the template and user code, as if we encounter an ExpressionStatement, its corresponding expression has to be checked for if it is an Identifier.
Since a wildcard is replaced by an Identifier, when matching a node in the template, we have to check if it is the \textit{Identifier} or \textit{ExpressionStatement} with an identifier contained within, if there is an identifier, we have to check if that identifier is a registered wildcard. If an Identifier shares a name with a wildcard, we have to compare the node against the Type expression of that wildcard. When we do this, we traverse the entirety of the wildcard expression AST and compare each of the leaves against the type of the current code node. These resulting values are then passed through the type expression and the resulting value is whether or not that code node can be matched against the wildcard. We differentiate between if a node matched against a wildcard with the \texttt{+} notation, as if that is the case we have to keep using that wildcard until it returns false in the tree exploration algorithms.
\paragraph*{Determining if AST nodes match.}
When we are either matching against an Identifier that is not a registered wildcard, or any other AST node in the template, we have to perform an equality check, in the case of this template language, we can get away with just performing some preliminary checks, such as that names of Identifiers are the same. Otherwise it is sufficient to just perform an equality check of the types of the nodes we are currently trying to match. If the types are the same, they can be validly matched against each other. This is sufficient because we are currently trying to determine if a single node can be a match, and not the entire template structure is a match. Therefore false positives that are not equivalent are highly unlikely due to the fact the entire structure has to be a false positive match.
The initial problem we have to overcome is a way of comparing AST nodes from the template to AST nodes from the user code. This step also has to take into account comparing against wildcards and pass that information back to the AST matching algorithms.
The function used for matching singular nodes gives different return values based on how the nodes were matched. The results \texttt{NoMatch} and \texttt{Matched} are self-explanatory: they are used when either no match is found, or when the node types match and the template node is not a wildcard. When we are matching against a wildcard, if it is a simple wildcard that cannot match against multiple nodes of the code, the result will be \texttt{MatchedWithWildcard}. If the wildcard used to match is a one-or-many wildcard, the result will be \texttt{MatchedWithPlussedWildcard}, as this tells the recursive traversal algorithm that this node of the template has to be tried against the code node's siblings as well.
When comparing two AST nodes in this tool, we use the function \texttt{checkCodeNode}, which will give the following values based on what kind of match these two nodes produce.
\begin{description}
\item[NoMatch:] The nodes do not match.
\item[Matched:] The nodes are a match, and the node of \texttt{applicable to} is not a wildcard.
\item[MatchedWithWildcard:] The node of the user AST produced a match against a wildcard.
\item[MatchedWithPlussedWildcard:] The node of the user AST produced a match against a wildcard that can match one or more nodes against itself.
\end{description}
When we are comparing two AST nodes, we have to perform an equality check. Because this is a structural matching search, we can get away with performing only some preliminary checks, such as checking that the names of identifiers are the same. Otherwise it is sufficient to perform an equality check of the types of the nodes we are currently trying to match. If the types are the same, the nodes can be validly matched against each other. This is sufficient because we are only trying to determine whether a single node can be a match, not whether the entire template structure is a match. False positives are therefore highly unlikely, because the entire structure would have to be a false positive match.
There is a special case when comparing two nodes, namely when encountering a wildcard. If we have encountered a wildcard, the current AST node of \texttt{applicable to} will be either an \texttt{Identifier} or an \texttt{ExpressionStatement} where the expression is an \texttt{Identifier}. The reason it might be an \texttt{ExpressionStatement} is the wildcard extraction step, where we replace the wildcard with an identifier of the same name. Due to this replacement, we might place an identifier where a statement is expected, in which case the identifier is wrapped inside an \texttt{ExpressionStatement} AST node. If the node of \texttt{applicable to} is of either of these types, we have to check whether the name of the identifier is the same as that of a wildcard. If it is, we have to compare the type of the user AST node against the type expression of the wildcard.
\begin{lstlisting}
enum MatchResult {
MatchedWithWildcard,
MatchedWithPlussedWildcard,
Matched,
NoMatch,
if((aplToNode.type === "ExpressionStatement" &&
aplToNode.expression.type === "Identifier") ||
aplToNode.type === "Identifier"){
// Check if aplToNode is a wildcard
}
\end{lstlisting}
\subsection*{Matching a singular Expression/Statement template}
When comparing an AST node type against a wildcard type expression, we pass the node type into a function \texttt{WildcardEvaluator}. This evaluator will traverse through the AST of the wildcard type expression. Every leaf of the tree is equality checked against the type, and the resulting boolean value is returned. Then we solve the expression, bubbling the values through the visitor until we have traversed the entire expression, and have a result. If the result of the evaluator is \texttt{false}, we return \texttt{NoMatch}. If the result of the evaluation is \texttt{true}, we know we can match the user's AST node against the wildcard. If the wildcard type expression contains a Kleene plus, the comparison returns \texttt{MatchedWithPlussedWildcard}, if not, we return \texttt{MatchedWithWildcard}.
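The evaluation described above can be sketched as a small recursive function. This is a minimal sketch under stated assumptions: the shape of the type expression tree (\texttt{kind}, \texttt{operand}, \texttt{left}, \texttt{right}), the function names \texttt{evaluate} and \texttt{matchWildcard}, and the choice to place the Kleene plus at the root of the expression are all hypothetical and only serve the illustration.

```javascript
const MatchResult = Object.freeze({
    MatchedWithWildcard: "MatchedWithWildcard",
    MatchedWithPlussedWildcard: "MatchedWithPlussedWildcard",
    Matched: "Matched",
    NoMatch: "NoMatch",
});

// Hypothetical type expression tree:
//   { kind: "type", name }            leaf, compared against the node type
//   { kind: "not", operand }          negation
//   { kind: "and" | "or", left, right }
//   { kind: "plus", operand }         Kleene plus wrapping the whole expression
function evaluate(expr, nodeType) {
    switch (expr.kind) {
        case "type":
            return expr.name === nodeType;
        case "not":
            return !evaluate(expr.operand, nodeType);
        case "and":
            return evaluate(expr.left, nodeType) && evaluate(expr.right, nodeType);
        case "or":
            return evaluate(expr.left, nodeType) || evaluate(expr.right, nodeType);
        case "plus":
            // The plus only affects how many siblings may be consumed, not the type check.
            return evaluate(expr.operand, nodeType);
        default:
            throw new Error("unknown type expression kind: " + expr.kind);
    }
}

function matchWildcard(typeExpr, nodeType) {
    if (!evaluate(typeExpr, nodeType)) {
        return MatchResult.NoMatch;
    }
    return typeExpr.kind === "plus"
        ? MatchResult.MatchedWithPlussedWildcard
        : MatchResult.MatchedWithWildcard;
}

// (Identifier || CallExpression)+
const oneOrMore = {
    kind: "plus",
    operand: {
        kind: "or",
        left: { kind: "type", name: "Identifier" },
        right: { kind: "type", name: "CallExpression" },
    },
};
// !ReturnStatement && !IfStatement
const neither = {
    kind: "and",
    left: { kind: "not", operand: { kind: "type", name: "ReturnStatement" } },
    right: { kind: "not", operand: { kind: "type", name: "IfStatement" } },
};

const plussed = matchWildcard(oneOrMore, "Identifier");
const rejected = matchWildcard(oneOrMore, "NumericLiteral");
const single = matchWildcard(neither, "ExpressionStatement");
```

Here the leaves are compared by exact type name, as described in the text. Whether a leaf should also accept more specific node types is a design decision this sketch leaves out.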
The method of writing the \texttt{applicable to} section using a singular simple expression/statement is by far the most versatile way of defining matching template, this is because there will be a higher probability of discovering applicable code with a template that is as generic and simple as possible. A very complex matching template with many statements or an expression containing many AST nodes will result in a lower chance of finding a resulting match in the users code. Therefore using simple, single root node matching templates provide the highest possibility of discovering a match within the users code.
\subsection{Matching a single Expression/Statement template}
\label{sec:singleMatcher}
Determining if we are currently trying to match with a template that is only a single expression/statement, we have to verify that the program body of the template has the length of 1, if it does we can use the singular expression matcher, if not, we have to rely on the matcher that can handle multiple statements at the head of the tree.
The larger and more complex the \texttt{applicable to} template is, the fewer matches it will produce. Using a single expression/statement as the matching template is therefore preferred, because a template that is as generic and simple as possible has a higher probability of discovering applicable code. A very complex matching template with many statements might result in a lower chance of finding matches in the user's code. Simple matching templates with a single root node therefore provide the highest possibility of discovering a match within the user's code. This section covers line 11 of Listing \ref{lst:outline}.
When matching an expression the first statement in the program body of the AST generated when using \cite{BabelGenerate}{babel generate} will be of type \texttt{ExpressionStatement}, the reason for this is Babel will treat free floating expressions as a statement, and place them into an ExpressionStatement. This will miss many applicable sections in the case of trying to match against a users code because expressions within other statements are not inside an ExpressionStatement. This will give a template that is incompatible with a lot of otherwise applicable expressions. This means the statement ExpressionStatement has to be removed, and the search has to be done with the expression as the top node of the template.
To determine whether we are currently matching with a template that is only a single expression/statement, we verify that the program body of the template has a length of one. If it does, we can use the single length traversal algorithm.
In the case of the singular node in the body of the template program being a Statement, no removal has to be done, as a Statement can be used directly.
There is a special case if the template is a single expression, as the first node of the AST produced by \texttt{@babel/parser}\cite{BabelParser} will be of type \texttt{ExpressionStatement}. The reason for this is that Babel treats a free-floating expression as a statement. Searching with this node would miss many applicable parts of the user's code, because expressions within other statements are not wrapped in an \texttt{ExpressionStatement}. The template would therefore be incompatible with a lot of otherwise applicable expressions. This means the \texttt{ExpressionStatement} has to be removed, and the search has to be done with the expression as the top node of the template. If the node in the body of the template is a statement, no removal has to be done, as a statement can be used directly.
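The unwrapping step can be sketched on plain objects that mirror the shape of a Babel \texttt{Program} node. The helper name \texttt{templateRoot} is an assumption for this example, and the hand-written objects stand in for what a parser would produce for \texttt{a + b} and \texttt{return a}.

```javascript
// Hypothetical helper: pick the node the single expression/statement matcher starts from.
function templateRoot(program) {
    if (program.body.length !== 1) {
        throw new Error("expected a template with exactly one expression/statement");
    }
    const only = program.body[0];
    // A free-floating expression is wrapped in an ExpressionStatement; search with the
    // inner expression so it can also match expressions nested in other statements.
    return only.type === "ExpressionStatement" ? only.expression : only;
}

// Stand-in for the parsed template `a + b`.
const expressionTemplate = {
    type: "Program",
    body: [{
        type: "ExpressionStatement",
        expression: {
            type: "BinaryExpression",
            operator: "+",
            left: { type: "Identifier", name: "a" },
            right: { type: "Identifier", name: "b" },
        },
    }],
};

// Stand-in for the parsed template `return a`.
const statementTemplate = {
    type: "Program",
    body: [{ type: "ReturnStatement", argument: { type: "Identifier", name: "a" } }],
};

const exprRoot = templateRoot(expressionTemplate);
const stmtRoot = templateRoot(statementTemplate);
```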
\paragraph{Recursively discovering matches}
The matcher used against single Expression/Statement templates is based upon a Depth-First Search in order to perform matching, and searches for matches from the top of the code definition. It is important we try to match against the template at all levels of the code AST, this is done by starting a new search one every child node of the code AST if the current node of the template tree is the top node of the template. This ensures we have tried to perform a match at any level of the tree, this also means we do not get any partial matches, as we only store matches that are returned at the recursive call when we do the search from the first node of the template tree.
This is all done before ever checking the node we are currently on. The reason for this is to avoid missing matches that reside further down in the current branch, and also ensure matches further down are placed earlier in the full match array, which makes it easier to perform transformation when partial collisions exist.
\paragraph{Discovering Matches Recursively}
Once we have started a search on all the child nodes of the current one using the full definition of \texttt{applicable to}, we can verify if we are currently exploring a match. This means the current node is checked against the current top node of \texttt{applicable to}, if said node is a match, based on what kind of match it is several different parts of the algorithm are called. This is because there are different forms of matches depending on if it is a match against a wildcard, a wildcard with \texttt{+}, or simply a node type match.
The matcher used against single expression/statement templates is based on depth-first search to traverse the trees. The algorithm can be split into two steps. The first step is to check whether we are currently at the root of the \texttt{applicable to} AST. The second is to try to match the current nodes and start a search on each of their child nodes.
If the current node matches against a wildcard that does not use the \texttt{+} operator, we simply pair the current template node to the matched node from the users code and return. This is because whatever the current user node contains, it is being matched against a wildcard and that means no matter what is below it, it is meant to be placed directly into the transformation. Therefore we can determine that this is a match that is valid.
When the current node is matched against a wildcard that does use the \texttt{+} operator, we have to continue trying to match against that same wildcard with the sibling nodes of the current code node. This is performed in the recursive iteration above the current one, and therefore we also return the paired AST nodes of the template and the code, but we give the match result \texttt{MatchResult.MatchedWithPlussedWildcard} to the caller function. When the caller function gets this result, it will continue trying to match against the wildcard until it receives a different match result other than \texttt{MatchResult.MatchedWithPlussedWildcard}.
It is important that we try to match against the template at all levels of the code AST. This is done by starting a new search on every child node of the code AST if the current node of the template AST is the root node. This ensures we have tried to perform a match at every level of the tree. It also ensures we have no partial matches, as a match is only stored if it is returned by a call made with the root node of \texttt{applicable to}.
When the current node is matched based on the types of the current AST nodes, some parts have to hold. Namely, all child nodes of the template and the user code have to also return some form of match, this means if any of the child nodes currently return \texttt{MatchResult.NoMatch} the entire match is discarded. The number of child nodes of the current match also has to be equal. Due to wildcards this means we have to be able to match all child nodes of the user code to either a single node of the template, or a wildcard using the \texttt{+} operator.
\begin{lstlisting}[language={JavaScript}]
if(aplTo.element === this.aplToRoot){
// Start a search from root of aplTo on all child nodes
for(let codeChild of code.children){
let childMatch = singleMatcher(codeChild, aplTo);
If the current node does not match, we simply discard the current search, as we have already started a search from the start of the template at all levels of the user code AST, we can safely end the search and rely on these to find matches further down in the tree.
// If it is a match, we know it is a full match and store it.
if(childMatch){
this.matches.push(childMatch);
}
}
}
\end{lstlisting}
To allow for easier transformation, and storage of what exact part of \texttt{applicable to} was matched against the exact node of the code AST, we use a custom instance of the simple tree structure described in \ref*{sec:BabelParse}, we use an interface \texttt{PairedNode}, this allows us to hold what exact nodes were matched together, this allows for a simpler transforming algorithm. The exact definition of \texttt{PairedNode} can be seen below. The reason the codeNode is a list, is due to wildcards allowing for multiple AST nodes to match against, as they might match multiple nodes of the user code against a single node of the template.
We can now determine whether we are currently exploring a match. This means the current code AST node is checked against the current node of the \texttt{applicable to} AST. Based on what kind of result the comparison between these two nodes gives, we have to perform different steps.
\begin{description}
\item[NoMatch:] If a comparison between the nodes returns a \texttt{NoMatch} result, we perform an early return of undefined, as no match was discovered. We can safely discard this search, because we have started a search at all levels of the code AST.
\item[Matched:] The current code node matches against the current node of the template, and we have to perform a search on each of the child nodes.
\item[MatchedWithWildcard:] When a comparison results in a wildcard match, we pair the current code node and the template wildcard, and do an early return. We can do this because if a wildcard matches, the child nodes do not matter; they will be placed into the transformation as they are.
\item[MatchedWithPlussedWildcard:] This is a special case of a wildcard match. When a node matches against a wildcard that has the Kleene plus tied to it, we also perform an early return. This result means some special traversal has to be done on the current node's siblings, which is described below.
\end{description}
A comparison result of \texttt{Matched} means the two nodes match, but the \texttt{applicable to} node is not a wildcard. In this case, we perform a search on each child node of the \texttt{applicable to} AST and the user AST. This is performed in order, meaning the n-th child node of \texttt{applicable to} is checked against the n-th child node of the user AST.
When checking the child nodes, we have to handle a special case if the comparison of the child nodes results in \texttt{MatchedWithPlussedWildcard}. If this result is encountered, we have to continue matching the same \texttt{applicable to} node against each subsequent sibling node of the code AST. This is because a wildcard with a Kleene plus can match against multiple sibling nodes. This behavior can be seen in lines 17-31 of Listing \ref{lst:pseudocodeChildSearch}.
If no child node gave the result \texttt{NoMatch}, we have successfully matched every node of the \texttt{applicable to} AST. This does not yet mean we have a match, as there might be remaining child nodes of the code AST. To check for this, we check whether \texttt{codeI} is equal to the length of \texttt{code.children}. If it is not, we have not matched all child nodes of the code AST and have to return \texttt{NoMatch}. This can be seen on lines 37-39 of Listing \ref{lst:pseudocodeChildSearch}.
\begin{lstlisting}[language={JavaScript}, caption={Pseudocode of child node matching}, label={lst:pseudocodeChildSearch}]
let codeI = 0;
let aplToI = 0;
while (aplToI < aplTo.children.length && codeI < code.children.length){
    let [pairedChild, childResult] = singleMatcher(code.children[codeI], aplTo.children[aplToI]);
    // If a child does not match, the entire match is discarded
    if(childResult === NoMatch){
        return [undefined, NoMatch];
    }
    // Add the match to the current Paired Tree structure
    pairedChild.parent = currentPair;
    currentPair.children.push(pairedChild);
    // Special case for Kleene plus wildcard match
    if(childResult === MatchedWithPlussedWildcard){
        codeI += 1;
        while(codeI < code.children.length){
            let [nextChild, plusChildResult] = singleMatcher(code.children[codeI], aplTo.children[aplToI]);
            if(plusChildResult !== MatchedWithPlussedWildcard){
                break; // first sibling the wildcard cannot consume
            }
            pairedChild.element.codeNode.push(...nextChild.element.codeNode);
            codeI += 1;
        }
        codeI -= 1; // step back; the increment below moves to the first unconsumed sibling
    }
    codeI += 1;
    aplToI += 1;
}
if(codeI !== code.children.length){
    return [undefined, NoMatch];
}
return [currentPair, Matched];
\end{lstlisting}
\subsection{Matching multiple Statements}
Using multiple statements in the template of \texttt{applicable to} means the tree of \texttt{applicable to} has multiple root nodes. To perform a match with this kind of template, we use a sliding window\cite{SlidingWindow} with a size equal to the number of statements in the template. This window is applied at every \texttt{BlockStatement} and \texttt{Program} of the code AST, as those are the only places where a list of statements can reside in JavaScript\cite{ECMA262Statement}.
The initial step of this algorithm is to search through the AST for AST nodes that contain a list of statements. Searching the tree is done by depth-first search: at every level of the AST, we check the type of the node. Once a node of type \texttt{BlockStatement} or \texttt{Program} is discovered, we start trying to match the statements.
\begin{lstlisting}[language={JavaScript}]
multiStatementMatcher(code, aplTo) {
if (
code.element.type === "Program" ||
code.element.type === "BlockStatement"
) {
matchMultiHead(code.children, aplTo.children);
}
for (let code_child of code.children) {
multiStatementMatcher(code_child, aplTo);
}
}
\end{lstlisting}
\texttt{matchMultiHead} uses a sliding window \cite{SlidingWindow}. The sliding window tries to match every statement of the code AST against its corresponding statement in the \texttt{applicable to} AST. For every statement, a DFS algorithm similar to the one used in Section \ref{sec:singleMatcher} is applied. However, this search is not started at all levels of the tree: a statement has to match fully and immediately. If a match is not found, the current iteration of the sliding window is discarded and we move on to the next iteration by moving the window one statement further.
One important case is that we might not know the width of the sliding window in advance. This is due to wildcards using the Kleene plus, as they can match one or more nodes against the wildcard. Such a wildcard might, for example, have the type expression \texttt{(Statement)+}. Therefore, we use a technique similar to the one described in Section \ref{sec:singleMatcher}, where we keep two pointers, one into the code statements and one into the template statements, and match each statement depending on the position of each pointer.
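The sliding window with two pointers can be sketched as follows. This is a simplified illustration and not the tool's implementation: statements are reduced to objects with a \texttt{type}, the template entry shape (\texttt{type}, \texttt{plus}, \texttt{accepts}) and the helpers \texttt{entryMatches} and \texttt{matchWindow} are hypothetical, and a plain type comparison stands in for the recursive matching of each statement.

```javascript
// Hypothetical template entries:
//   { type: "X" }                       matches exactly one statement of type X
//   { plus: true, accepts: ["X", ...] } matches one or more consecutive statements
function entryMatches(entry, statement) {
    return entry.plus
        ? entry.accepts.includes(statement.type)
        : entry.type === statement.type;
}

// Try to match the whole template starting at `start`.
// Returns the exclusive end index of the window, or -1 if the window does not match.
function matchWindow(statements, start, template) {
    let codeI = start;
    for (let aplToI = 0; aplToI < template.length; aplToI++) {
        const entry = template[aplToI];
        if (codeI >= statements.length || !entryMatches(entry, statements[codeI])) {
            return -1;
        }
        codeI += 1;
        if (entry.plus) {
            // Keep the template pointer in place while the wildcard consumes siblings.
            while (codeI < statements.length && entryMatches(entry, statements[codeI])) {
                codeI += 1;
            }
        }
    }
    return codeI;
}

// Slide the window over every possible start position of one statement list.
function matchMultiHead(statements, template) {
    const matches = [];
    for (let start = 0; start < statements.length; start++) {
        const end = matchWindow(statements, start, template);
        if (end !== -1) {
            matches.push(statements.slice(start, end));
        }
    }
    return matches;
}

const block = [
    { type: "VariableDeclaration" },
    { type: "ExpressionStatement" },
    { type: "ExpressionStatement" },
    { type: "ReturnStatement" },
    { type: "IfStatement" },
];
const template = [
    { type: "VariableDeclaration" },
    { plus: true, accepts: ["ExpressionStatement"] },
    { type: "ReturnStatement" },
];
const found = matchMultiHead(block, template);
```

Note that the plus wildcard in this sketch is greedy: it consumes every sibling it accepts before the next template statement is tried, so a template whose next statement would also be accepted by the wildcard may fail to match.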
\subsection*{Output of the matcher}
The discovered matches have to be stored such that we can easily find all the nodes that were matched against wildcards and transfer them into the transformation later. To make this simpler, we make use of an object \texttt{PairedNode}. This object allows us to easily find exactly which nodes were matched against each other. The matcher places this object into the same tree structure described in \ref{sec:BabelParse}. This means the result of running the matcher on the user code is a list of \texttt{TreeNode<PairedNode>}.
\begin{lstlisting}[language={JavaScript}]
interface PairedNode{
codeNode: t.Node[],
@ -378,22 +522,7 @@ interface PairedNode{
}
\end{lstlisting}
\subsection*{Matching multiple Statements}
Using multiple statements in the template of \texttt{applicable to} will result in a much stricter matcher, that will only try to perform an exact match using a \cite{SlidingWindow}{sliding window} of the amount of statements at every \textit{BlockStatement}, as that is the only placement Statements can reside in JavaScript\cite{ECMA262Statement}.
The initial step of this algorithm is to search through the AST for ast nodes that contain a list of \textit{Statements}, this can be done by searching for the AST nodes \textit{Program} and \textit{BlockStatement}, as these are the only valid places for a list of Statements to reside \cite{ECMA262Statement}. Searching the tree is quite simple, as all that is required is checking the type of every node recursively, and once a node that can contain multiple Statements, we check it for matches.
Once a list of \textit{Statements} has been discovered, the function \texttt{matchMultiHead} can be executed with that block and the Statements of \texttt{applicable to}.
This function will use the technique \cite{SlidingWindow}{sliding window} to match multiple statements in order the same length as the list of statements are in \texttt{applicable to}. This sliding window will try to match every Statement against its corresponding Statement in the current \textit{BlockStatement}. When matching a singular Statements in the sliding window, a simple DFS recursion algorithm is applied, similar to algorithm used for matching a single expression/statement template, however the difference is that we do not search the entire AST tree, and if it matches it has to match fully and immediately. If a match is not found, the current iteration of the sliding window is discarded and we move on to the next iteration by moving the window one further.
One important case here is we might not know the width of the sliding window, this is due to wildcards using the \texttt{+}, as they can match one or more nodes against each other. These wildcards might match against \texttt{(Statement)+}. Therefore, we have to use a two point technique when iterating through the statements of the users code. As we might have to use the same statement from the template multiple times.
\subsection*{Output of the matcher}
The resulting output of the matcher after finding all available matches, is a two dimensional array of each match, where for every match there is a list of statements in AST form, where paired ASTs from \texttt{applicable to} and the users code can be found. This means that for every match, we might be transforming and replacing multiple statements in the transformation function.
Since a match might be multiple statements, we use an interface \texttt{Match}, that contains separate tree structures of \texttt{PairedNodes}. This allows storage of a match with multiple root nodes.
\begin{lstlisting}[language={JavaScript}]
export interface Match {