chapter 4 mostly finished, need A LOT of polishing

This commit is contained in:
Rolf Martin Glomsrud 2024-05-14 20:42:24 +02:00
parent 536a847e92
commit 454b175aba
5 changed files with 163 additions and 8 deletions

2
.vscode/meeting.txt vendored

@ -17,3 +17,5 @@ https://www.dropbox.com/scl/fi/q8bzwwlozn91qbrqnceen/ij-jetbrains-blog-post.pdf?
https://sourcegraph.com/blog/going-beyond-regular-expressions-with-structural-code-search
5 from this one ^
Write about More JS parsers in Related Work

Binary file not shown.


@ -223,6 +223,7 @@ for (let stmt of stmts) {
\end{lstlisting}
\section{Using Babel to parse}
\label{sec:BabelParse}
Allowing the tool to transform code requires generating an Abstract Syntax Tree from the user's code, from \texttt{applicable to}, and from \texttt{transform to}. This means parsing JavaScript into an AST, and to do this we use a tool \cite[Babel]{Babel}.
@ -287,17 +288,137 @@ traverse(ast, {
Performing the match against the user's code is the most important step, because if no matching code is found the tool will perform no transformations. Finding the matches depends entirely on how well the definition of the proposal is written, and on how well the proposal can actually be defined within the confines of \DSL. In this chapter we discuss how individual AST nodes are matched against each other, and how wildcard matching is performed.
\subsection*{Matching singular Expression}
\subsection*{Matching singular Expression/Statement}
Writing the \texttt{applicable to} section using a singular expression is by far the most versatile way of defining a proposal, simply because a template that is as generic as possible has a much higher chance of discovering matches. Matching against only a single expression ensures the matcher tries to perform a match at every level of the AST.
Writing the \texttt{applicable to} section using a singular simple expression/statement is by far the most versatile way of defining a proposal, simply because a template that is as generic and simple as possible has a much higher chance of discovering matches. Matching against only a single expression/statement ensures the matcher tries to perform a match at every level of the AST. This of course relies on the expression/statement used for matching being as simple as possible, so that matches are easily found in user code.
\subsection*{Matching Statements}
When the template of \texttt{applicable to} is a single expression, parsing it with \cite{Babel}{Babel} produces an AST whose \textit{Program} body contains a single node, namely an \textit{ExpressionStatement}. The reason is that Babel treats an expression not bound to a Statement as an ExpressionStatement. This is a problem, because we cannot use an ExpressionStatement to match against expressions that are already part of some other statement, for example a VariableDeclaration. To solve this we remove the ExpressionStatement AST node and match only against its sub-expression. This is not required when matching against a single statement, as the user code is then expected to be \textit{similar}.
To determine whether we are matching against an expression or a statement, we check the type of the first statement in the program body of the AST generated when parsing \texttt{applicable to} with \cite{BabelParser}{babel/parser}. If the first statement is of type \texttt{ExpressionStatement}, we know the matcher is supposed to match against an expression, and we have to remove the \texttt{ExpressionStatement} AST node from the tree used for matching. This is done by simply using \texttt{applicableTo.children[0].children[0].element} as the AST to match the user's code against.
\begin{lstlisting}[language={JavaScript}]
if (applicableTo.children[0].element.type === "ExpressionStatement") {
    let matcher = new Matcher(
        internals,
        applicableTo.children[0].children[0].element
    );
    matcher.singleExprMatcher(
        code,
        applicableTo.children[0].children[0]
    );
    return matcher.matches;
}
\end{lstlisting}
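As a rough illustration of why the wrapper has to be removed, the sketch below uses a hand-written, simplified stand-in for the AST of the template \texttt{a(b)} rather than real Babel output or the custom tree structure. The helper name \texttt{unwrapSingleExpression} is hypothetical and not part of the tool, and the additional check that the body holds exactly one statement is an assumption made for the example.

```javascript
// Simplified, hand-written stand-in for the AST of the template `a(b)`.
// Real Babel nodes carry more fields (source locations, comments, and so on).
const templateProgram = {
  type: "Program",
  body: [
    {
      type: "ExpressionStatement",
      expression: {
        type: "CallExpression",
        callee: { type: "Identifier", name: "a" },
        arguments: [{ type: "Identifier", name: "b" }],
      },
    },
  ],
};

// Hypothetical helper: if the template is a single expression, return the
// bare expression so it can be matched inside other statements; otherwise
// return the list of statements unchanged.
function unwrapSingleExpression(program) {
  const [first] = program.body;
  if (program.body.length === 1 && first.type === "ExpressionStatement") {
    return { kind: "expression", root: first.expression };
  }
  return { kind: "statements", root: program.body };
}

const unwrapped = unwrapSingleExpression(templateProgram);
```

With the wrapper removed, the root of the template is the \textit{CallExpression} itself, so it can be compared against a call that sits inside, for example, a VariableDeclaration in the user's code.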
\figFull[code:ExprMatcher] shows the full definition of the expression matcher. It is based on a depth-first search and starts from the top of the code AST. In the first part of the function, if we are currently at the first AST node of \texttt{applicable to}, we recursively try to match every child of the current code node against the full \texttt{applicable to} definition. This ensures no matches are left undiscovered, while avoiding unnecessary searches using a partial \texttt{applicable to} definition. If a child node returns one or more matches, they are placed into a temporary array; once all children have been searched, the partial matches are filtered out and all full matches are stored. This is all done before ever checking the node we are currently on. The reason is to avoid missing matches that reside further down in the current branch, and to ensure matches further down are placed earlier in the full match array, which makes it easier to perform the transformation when collisions exist.
From line 22 in \figFull[code:ExprMatcher] we evaluate whether the current node is a full match. First the current node is checked against the node of \texttt{applicable to}. If the node matches, and it contains enough children to be matched against \texttt{applicable to}, we can perform the search. Since the elements of the child arrays are ordered, we can search each pair of nodes by using the same index in the \texttt{node.children} arrays. If any of the children does not return a match, we know this is not a full match and perform an early return of \texttt{undefined}. Only if all the children of \texttt{applicable to} return a match can we conclude that this is a match. That match is then returned, and is stored by the caller of the current iteration of the recursion.
\begin{lstlisting}[language={JavaScript}, label={code:ExprMatcher}, caption={Recursive definition of expression matcher}]
singleExprMatcher(
    code: TreeNode<t.Node>,
    aplTo: TreeNode<t.Node>
): TreeNode<PairedNodes> | undefined {
    // If we are at start of ApplicableTo, start a new search on each of the child nodes
    if (aplTo.element === this.aplToFull) {
        // Perform a new search on all child nodes before trying to verify current node
        let temp = [];
        // If any matches bubble up from child nodes, we have to store it
        for (let code_child of code.children) {
            let maybeChildMatch = this.singleExprMatcher(code_child, aplTo);
            if (maybeChildMatch) {
                temp.push(maybeChildMatch);
            }
        }
        this.matches.push(...temp);
    }
    let curMatches = this.checkCodeNode(code.element, aplTo.element);
    curMatches =
        curMatches && code.children.length >= aplTo.children.length;
    if (!curMatches) {
        return;
    }
    // At this point current does match
    // Perform a search on each of the children of both AplTo and Code.
    let pairedCurrent: TreeNode<PairedNodes> = new TreeNode(null, {
        codeNode: code.element,
        aplToNode: aplTo.element,
    });
    for (let i = 0; i < aplTo.children.length; i++) {
        let childSearch = this.singleExprMatcher(
            code.children[i],
            aplTo.children[i]
        );
        if (childSearch === undefined) {
            // Failed to get a full match, so early return here
            return;
        }
        childSearch.parent = pairedCurrent;
        pairedCurrent.children.push(childSearch);
    }
    // If we are here, a full match has been found
    return pairedCurrent;
}
\end{lstlisting}
To make transformation easier, and to record exactly which part of \texttt{applicable to} was matched against which node of the code AST, we use a custom instance of the simple tree structure described in \ref*{sec:BabelParse} whose elements implement the interface \texttt{PairedNodes}. Each element holds the two nodes that were matched together, which allows for a simpler transformation algorithm. The exact definition of \texttt{PairedNodes} can be seen below.
\begin{lstlisting}[language={JavaScript}]
interface PairedNodes {
    codeNode: t.Node,
    aplToNode: t.Node
}
\end{lstlisting}
\subsection*{Matching multiple Statements}
Using multiple statements in the template of \texttt{applicable to} results in a much stricter matcher. It only attempts an exact match, using a \cite{SlidingWindow}{sliding window} whose size is the number of statements in the template, at every \textit{BlockStatement}, as that is the only place Statements can reside in JavaScript\cite{ECMA262Statement}.
The initial step of this algorithm is to search through the AST for nodes that contain a list of \textit{Statements}. This is done by searching for the AST nodes \textit{Program} and \textit{BlockStatement}, as these are the only valid places for a list of Statements to reside \cite{ECMA262Statement}. Searching the tree is quite simple: the type of every node is checked recursively, and once a node that can contain multiple Statements is found, we check it for matches.
\begin{lstlisting}[language={JavaScript}]
multiStatementMatcher(code: TreeNode<t.Node>, aplTo: TreeNode<t.Node>) {
    if (
        code.element.type === "Program" ||
        code.element.type === "BlockStatement"
    ) {
        this.matchMultiHead(code.children, aplTo.children);
    }
    // Recursively search the tree for Program || BlockStatement
    for (let code_child of code.children) {
        this.multiStatementMatcher(code_child, aplTo);
    }
}
\end{lstlisting}
Once a list of \textit{Statements} has been discovered, the function \texttt{matchMultiHead} is executed with that block and the Statements of \texttt{applicable to}.
This function uses the \cite{SlidingWindow}{sliding window} technique to match multiple statements in order, with a window of the same length as the list of statements in \texttt{applicable to}. The sliding window tries to match every Statement of \texttt{applicable to} against its corresponding Statement in the current \textit{BlockStatement}. When matching a single Statement in the sliding window, a simple DFS recursion is applied, similar to the second part of matching a single Statement/Expression. The main difference is that we do not search the entire AST: if the Statement matches, it has to match fully and immediately. If a match is not found, the current iteration of the sliding window is discarded and we move on to the next iteration.
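The sliding window described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the definition of \texttt{matchMultiHead}: the names \texttt{matchesFully}, \texttt{slidingWindowMatch} and \texttt{node} are hypothetical, the trees are toy objects with only a \texttt{type} and \texttt{children}, and the node comparison ignores wildcards and requires an equal number of children.

```javascript
// Toy structural comparison: same type and pairwise matching children.
// (No wildcard handling; the node check used by the tool is richer.)
function matchesFully(codeNode, templateNode) {
  if (codeNode.type !== templateNode.type) return false;
  if (codeNode.children.length !== templateNode.children.length) return false;
  return templateNode.children.every((child, i) =>
    matchesFully(codeNode.children[i], child)
  );
}

// Slide a window with the length of the template over the statements of a
// block. A window is kept only if every statement matches its counterpart.
function slidingWindowMatch(blockStatements, templateStatements) {
  const size = templateStatements.length;
  const found = [];
  if (size === 0) return found;
  for (let start = 0; start + size <= blockStatements.length; start++) {
    const candidate = blockStatements.slice(start, start + size);
    const isMatch = candidate.every((stmt, i) =>
      matchesFully(stmt, templateStatements[i])
    );
    if (isMatch) found.push({ start, statements: candidate });
  }
  return found;
}

// Hypothetical helper for building toy nodes.
const node = (type, ...children) => ({ type, children });

const block = [
  node("VariableDeclaration"),
  node("ExpressionStatement", node("CallExpression")),
  node("ReturnStatement"),
  node("VariableDeclaration"),
  node("ExpressionStatement", node("CallExpression")),
];
const template = [
  node("VariableDeclaration"),
  node("ExpressionStatement", node("CallExpression")),
];

const windows = slidingWindowMatch(block, template);
```

In this toy block the template fits at two positions, the first two statements and the last two, while every window containing the \textit{ReturnStatement} is discarded.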
\subsection*{Output of the matcher}
The output of the matcher, after finding all available matches, is a two-dimensional array: for every match there is a list of statements in AST form, in which the paired ASTs from \texttt{applicable to} and the user's code can be found. This means that for every match, we might be transforming and replacing multiple statements in the transformation function.
\begin{lstlisting}[language={JavaScript}]
export interface Match {
    // Every matching Statement in order with each pair
    statements: TreeNode<PairedNodes>[];
}
\end{lstlisting}
Using multiple statements in the template of \texttt{applicable to} will result in a much stricter matcher, that will only try to perform an exact match using a sliding window of the amount of statements at every \textit{BlockStatement}, as that is the only placement Statements can reside in JavaScript.
\section{Transforming}
\section{Generating}
To perform the transformation and replacement on each of the matches, the tool uses the function \texttt{transformer}. This function takes all the matches found by the matcher, the code from \texttt{transform to} parsed by \cite{BabelParser}{babel/parser} and built into our custom tree structure, the original user code AST, and the full direct output of parsing \texttt{transform to} with babel/parser.
To generate JavaScript from the transformed AST created by this tool, we use a JavaScript library titled \cite{BabelGenerate}{babel/generator}. This library is specifically designed for use with Babel to generate JavaScript from a Babel AST.
An important discovery is that the leaves of the AST must be transformed first. If the transformation were applied from top to bottom, it might remove transformations done in a previous iteration. In the case of the pipeline proposal, transforming from top to bottom might leave us with \texttt{a(b) |> c(\%)} instead of \texttt{b |> a(\%) |> c(\%)}. This is easily solved in our case: because the matcher looks for matches from the top of the tree to the bottom, the matches it discovers are always in that order. When transforming, all that has to be done is to reverse the list of matches, so that the ones closest to the leaves of the tree come first.
The first step of transforming is to take the wildcards used by the matcher and place them into the AST generated from \texttt{transform to}. In our case that means searching \texttt{transform to} and the paired output of the matcher for \textit{Identifiers} with the same name, and inserting the AST node matched against that specific \texttt{applicable to} node into the tree of \texttt{transform to} we are transforming. The version of \texttt{transform to} we apply the transformation to is not in the custom tree structure used by the matcher, so we use \cite{BabelTraverse}{babel/traverse} to traverse it with a custom \cite{VisitorPattern}{visitor} that only applies to AST nodes of type \textit{Identifier}. Once the correct identifier is found while traversing, the node is simply replaced with the node the wildcard was matched against in the matcher.
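The wildcard substitution can be sketched without Babel on plain objects. This is only an illustration of the idea: the tool itself performs this step with babel/traverse, the function name \texttt{substituteWildcards} and the \texttt{bindings} map are hypothetical, and the node shapes are hand-written simplifications rather than actual parser output.

```javascript
// Replace every Identifier whose name is a bound wildcard with the node the
// wildcard was matched against. Returns a new tree and leaves the template
// untouched. Inserted nodes are not traversed again.
function substituteWildcards(node, bindings) {
  if (Array.isArray(node)) {
    return node.map((item) => substituteWildcards(item, bindings));
  }
  if (node === null || typeof node !== "object") return node;
  if (
    node.type === "Identifier" &&
    Object.prototype.hasOwnProperty.call(bindings, node.name)
  ) {
    return bindings[node.name];
  }
  const copy = {};
  for (const [key, value] of Object.entries(node)) {
    copy[key] = substituteWildcards(value, bindings);
  }
  return copy;
}

// Simplified stand-in for a `transform to` template of the form
// `arg |> fn(%)`, where `arg` and `fn` are wildcards.
const transformTo = {
  type: "BinaryExpression",
  operator: "|>",
  left: { type: "Identifier", name: "arg" },
  right: {
    type: "CallExpression",
    callee: { type: "Identifier", name: "fn" },
    arguments: [{ type: "TopicReference" }],
  },
};

// Nodes the wildcards were paired with when matching the user code `a(b)`.
const bindings = {
  fn: { type: "Identifier", name: "a" },
  arg: { type: "Identifier", name: "b" },
};

const transformed = substituteWildcards(transformTo, bindings);
```

Returning a fresh tree instead of mutating the template is a choice made for this sketch, so that the same template can be reused for every match.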
Once we have a transformed version of the user's code, it has to be inserted into the full AST of the user's code. Again we use \cite{BabelTraverse}{babel/traverse} to traverse the entire AST using a visitor. This visitor is not restricted to a specific node type, as the matched section can be of any type. We therefore use a generic visitor, with an equality check to find the exact part of the code this specific match comes from. Once we find where in the user's code the match came from, we replace it with the transformed \texttt{transform to} nodes. These might be multiple Statements, so the function \texttt{replaceWithMultiple} is used to insert every Statement from the \texttt{transform to} body. Finally we remove the next n-1 Statements, where n is the length of the list of Statements in the current match.
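The net effect of replacing the first matched Statement with the body of \texttt{transform to} and then removing the following n-1 Statements can be shown on a plain array. This is a sketch of the bookkeeping only and does not use babel/traverse; the function name \texttt{replaceStatements} and the string placeholders standing in for Statement nodes are assumptions made for the example.

```javascript
// Return a new statement list in which the `matchedCount` statements starting
// at index `start` are replaced by the statements in `replacement`.
function replaceStatements(body, start, matchedCount, replacement) {
  return [
    ...body.slice(0, start),
    ...replacement,
    ...body.slice(start + matchedCount),
  ];
}

// Strings stand in for Statement nodes. The match covers s1 and s2 (n = 2)
// and the transformed template consists of three statements.
const originalBody = ["s0", "s1", "s2", "s3"];
const newBody = replaceStatements(originalBody, 1, 2, ["t0", "t1", "t2"]);
```

The number of inserted Statements does not have to equal the number of matched Statements, which is why the insertion and the removal of the remaining n-1 Statements are treated as separate steps.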
To generate JavaScript from the transformed AST created by this tool, we use a JavaScript library titled \cite{BabelGenerate}{babel/generator}. This library is specifically designed for use with Babel to generate JavaScript from a Babel AST. The transformed AST of the user's code is passed to the generator, while being careful to apply all Babel plugins the current proposal might require.


@ -25,7 +25,7 @@
\renewcommand{\labelenumii}{\theenumii}
\renewcommand{\theenumii}{\theenumi.\arabic{enumii}.}
\usepackage{hyperref}
\usepackage[hidelinks]{hyperref}
\tolerance=1000
\usepackage{amsmath}
\usepackage{titlesec}


@ -63,3 +63,35 @@
note = {[Online; accessed 12. May 2024]},
url = {https://babeljs.io/docs/babel-generator}
}
@incollection{SlidingWindow,
author = {Hirzel, Martin and Schneider, Scott and Tangwongsan, Kanat},
title = {{Sliding-Window Aggregation Algorithms: Tutorial}},
booktitle = {{DEBS '17: Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems}},
pages = {11--14},
year = {2017},
month = jun,
urldate = {2024-05-13},
isbn = {978-1-45035065-5},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://dl.acm.org/doi/abs/10.1145/3093742.3095107},
doi = {10.1145/3093742.3095107}
}
@misc{ECMA262Statement,
title = {{ECMAScript{\ifmmode\circledR\else\textregistered\fi} 2025 Language Specification}},
year = {2024},
month = apr,
urldate = {2024-05-13},
note = {[Online; accessed 13. May 2024]},
url = {https://tc39.es/ecma262/#sec-ecmascript-language-statements-and-declarations}
}
@misc{BabelParser,
title = {{@babel/parser {$\cdot$} Babel}},
year = {2024},
month = may,
urldate = {2024-05-14},
note = {[Online; accessed 14. May 2024]},
url = {https://babeljs.io/docs/babel-Parser}
}