traversing an abstract syntax tree in typescript

Posted on:October 27, 2023

Not to long ago, I set out on a task of creating a language server for the fish-shell. Before then I felt rather comfortable in typescript, but I thought this would be a great a project to help strengthen my overall understanding of the language. All of my experience in typescript previous to this project was building web and mobile applications. However, I choose to use it for this project because of the widespread popularity of other language servers that have been written in typescript. My closest previous experience to building an LSP was writing a compiler in C in my final year of university. Before beginning this monumentous task, I wondered why so many LSP implementations had been built using typescript, mostly because of the performance bottlenecks it suffers from (compared to languages like rust). Never the less, I enjoyed using typescript in my previous experiences, and the job market looked very interested in hiring typescript developers at the time, so gaining as much experience using it as could became my goal while beginning on this project. After extensively using typescript in this setting, one of the major insights I learned while working on this LSP was understanding how to write cleaner typescript. In my experience, the verbosity that is provided by the self-documenting type system provided by typescript can be a double edged sword (especially when working on larger projects). Without proper understanding of typescript’s type system, code-bases quickly can become spaghettified messes.

The major focus of this article, will be to generically write pure functions to manipulate collections of objects. I will discuss signs and indications of when this approach would be benefical for any project. While, the context of the code shared here uses packages relating to the language-server-protocol, no prior knowledge of LSP’s is expected to understand the concepts discussed.

In my own experience, learning how to fully take advantage of mixing functional and OOP design into a project, is one of the greatest strengths the language typescript offers to developers.

Table of contents

Open Table of contents

Setting up

Typescript and Javascript as a whole for that matter, rely heavily on importing external packages, allowing for developers to inject these dependencies into their projects. A major reason for the widespread adoption of Typescript, was because without it’s typing system, injecting external dependencies of dynamically typed javascript into codebases created confusion. While, the static typing provided by Typescript does of course directly solve this issue, it also helps clarify recurring properties among different types of objects throughout a code base. Good typescript developers, are able to use different packages with code containing objects of similar “shape” to their advantage. Through the use of generics, the overlapping properties of different types can be used to create clear functions reducing significant amounts of code.

In the context of a project relating to an LSP, most implementations use the following npm packages:

Due to the popularity of these packages, I find it fitting to use them as examples for how build, re-useable, maintaible, and testable code.

Brief introduction to an LSP’s structure

Without building an entire LSP in this article I will briefly describe the way the previously mentioned packages are used to get an LSP initially working. The vscode-languageserver package defines the protocol used between an editor/IDE/client and a language server that provides language features like auto complete, go to definition, find all references, and more.

For any of those feature’s to be provided, an LSP needs some form of a cache to store which files are currently relevant. This is typically refered to as a workspace in most code editors, and the LSP stores the collection of files by URI (the path to the file). If the LSP aims to provide a rich feature set, then it typically needs the ability to parse through every single file relevent to the current workspace. To handle the parsing and tokenization of languages, using the web-tree-sitter package makes this simple and fast. The text is parsed into an AST for the langauge which the LSP/parser are used on. The nodes from the AST will then be used in various parts of our LSP’s providers. The LSP’s providers are the handler’s which provide your client/editor, with feature’s previously mentioned. However, since the nodes are parsed from the web-tree-sitter package, they often need to be converted into data-structures that the vscode-languageserver protocol can understands and use.

At this step, we will have an AST of SyntaxNode[] from the parser which can be seen in the tree-sitter typing file HERE

Later in our LSP we will want to use this AST to implement different providers for the LSP. The providers are the features our coding editors use such as: diagnostics, completions, or hovering. There is tons of examples of tree-like objects in the vscode-languageserver package, but simplicity we’ll just focus on DocumentSymbol[] seen HERE.

If you’re unfamiliar with the DocumentSymbol data type that is recommended in the LSP’s documentation, it would typically be used as the data-structure to hierachally store the important pieces of code that our LSP uses (things like identifiers). For the purposes I am outlining here, you can assume our LSP’s DocumentSymbol[] array will just be some special subset of our SyntaxNode[] parsed tree.

At a very high level, we can generalize that our LSP is storing the context to both of these nested arrays for each file it indexes. It is also handling updates to the files indexed, for when changes have been made to them. Effectively, both of these arrays are need to analyze files, so that our LSP can provide any of the various cool features the actual protocal describes. To see an example of what this looks like in some actual typescript code, I’ll begin by defining a class named Analyzer.


/** Values to store in Analyzer.cache */
type CachedFile = {
    version: number;
    symbols: DocumentSymbol[];
    nodes: SyntaxNode[];
}

/**
 * Analyzer is an implementation of the facade pattern, which is typically used when:
 *  • an entry point is needed to each level of layered software, or
 *  • the abstractions and implementations of a subsystem are tightly coupled.
 *
 * The instatiated analyzer passed around throughout our LSP, will use this as
 * an entry point to different data-structures stored in the `cache` property
 * values.
 *
 * Different subsystems of our LSP are tightly coupled with specific methods
 * of the Analyzer. Furthermore, the LSP's providers each have a unique procedural
 * ordering to the methods they call.
 *
 * For developer's looking to work on existing LSP's, or ones trying to build
 * their own, I would begin with getting a foothold on how this classes'
 * functionality will be handled and tested.
 */
export class Analyzer {

    private cache: {[uri: string]: CachedFile}
    private parser: Parser = initializeParserForOurLanguage()

    constructor() {}

    analyze({uri: string, content: string, version: number}: LspDocument) {
        const { rootNode } = this.parser.parse(fileContent)
        const symbols = collectDocumentSymbols( rootNode );
        const nodes = [rootNode]
        const analyzedFile: CachedFile = {
            version,
            nodes,
            symbols,
        }
        this.cache[uri] = analyzedFile;
    }

    /**
     * Assume that analyze has already been called on the current uri
     */
    hasFile({uri: string, version?: number}: Partial<LspDocument>) {
        const current = this.cache[uri]
        if (!current) return false;
        if (current && version && current.version !== version) return false
        return true
    }

    getCachedFile({uri: string, content: string, version: number}: LspDocument): CachedFile | null {
        const current = this.cache[uri]
        return current ?? null;
    }

    getNodes({uri: string, content: string, version: number}: LspDocument): SyntaxNode[] {
        const file = this.getCachedFile({uri, content, version})
        return file.symbols ?? []
    }

    getSymbols({uri: string, content: string, version: number}: LspDocument): DocumentSymbol[] {
        const file = this.getCachedFile({uri, content, version})
        return file.symbols ?? []
    }
}

In an actual LSP implementation, I would recommend decoupling the LSP analysis from the actual LSP providers. As seen above, creating an Analyzer object which implements methods focusing on single-responsibility principles, allows for the LSP’s providers to use these methods as building blocks to achieve their expected behavior. When adding feature’s to my LSP’s, I like to treat the methods on my Analyzer class as a modular collection of abstract functionality retrieved from an LspDocument. Breaking the LSP’s providers down into step by step procedures, really helps me demystify what the LSP actually needs to do to achieve an entire feature described by the protocal. However, if the analysis of documents inside your LSP implementation are not very important to the overall functionality of the project, passing a default Map of the parsed AST and uri’s could be used instead. However the contextual data for your LSP is being handled, it is of upmost importance that we are able to extend and test it’s behaviors rigorously.

Manipulating our ‘Analyzer’ cache

A general high-level recap of what we have done thusfar, is create a project in typescript which imports multiple packages to use throughout a large scale project. While the context of this write up uses the premise of building an LSP to showcase, how it taught me to approach writing cleaner typescript in general

Writing Generics for ways to Manipulate our Nested Nodes

As seen in the Analyzer class above, there is many different interfaces provided by the vscode-langaugeserver package, that overlap with the web-tree-sitter’s SyntaxNode. In our class, we have the cache property storing multiple different nested arrays of composite objects. Using generics in typescript’s typing system, can be used to take advantage of the shape of these overlapping types, so we don’t have to implement. Below, we show various functions that implement this behavior, keeping our codebase as concise as possible:

export interface NodeWithChildren {
  children: this[];
}

export interface NodeWithParent {
  parent: this | null;
}


export function* nodesGen<T extends NodeWithChildren>(
  ...nodes: T[]
): Generator<T> {
  for (const node of nodes) {
    if (node) yield node
    if (node && node.children) yield* nodesGen(...node.children)
  }
}

export function flatten<T extends NodeWithChildren>(nodes: T[]): T[] {
  const stack: T[] = structuredClone(nodes)
  const result: T[] = []
  while (stack.length) {
    const top = stack.shift()
    if (top) result.push(top)
    if (top && top.children) stack.unshift(...top.children)
  }
  return result
}

export function flattenInOrder<T extends NodeWithChildren>(
  nodes: T[]
): T[] {
  const stack: T[] = structuredClone(nodes)
  const result: T[] = [];
  while (stack.length) {
    const top = stack.shift()
    if (top) result.push(top)
    if (top && top.children) stack.push(...top.children)
  }
  return result
}

export function filterChildren<T extends NodeWithChildren>(
  nodes: T[],
  predicate: (node: T) => boolean
): T[] {
  const stack: T[] = structuredClone(nodes)
  const result: T[] = []
  while (stack.length) {
    const top = stack.shift()
    if (top && predicate(top)) result.push(top)
    if (top && top.children) stack.unshift(...top.children)
  }
  return result
}

export function some<T extends NodeWithChildren>(
  nodes: T[],
  predicate: (node: T) => boolean
): boolean {
  // ...
}

export function every<T extends NodeWithChildren>(
  nodes: T[],
  predicate: (node: T) => boolean
) {
  // ...
}

export function forEach<T extends NodeWithChildren>(
  nodes: T[],
  callbackfn: (node: T) => void
): void {
  // ...
}

export function findParent<T extends NodeWithParent>(node: T, filter: (T) => boolean): T | null{
  let current = node;
  while (current.parent) {
    if (filter(current.parent)) return current.parent
    current = current.parent
  }
  return null
}

export function findAllParents<T extends NodeWithParent>(
    nodes T,
    filter: (T) => boolean
) : T[] {
  let current = node;
  const result: T[] = []
  while (current.parent) {
    if (filter(current.parent)) result.push(current.parent)
    current = current.parent
  }
  return result
}

Tip: wrapping the above functions in a namespace, helps provide clarity if you intend to keep the generic function declaration names. To see some of these functions in action, here’s a ts-playground link.

Exploring the Benefits of Using Generics

Using generics to manipulate packages that have ovelapping types, and similar use cases, produces’ a vast array of benefits. The first and most obvious benefit, is that it reduces the amount of code we have to write. Building off of this point, when using generics in functional setting, they encourage the creation of pure functions. Pure functions are functions that have no side effects, and always return the same output given the same. In the context of our LSP, we can use these functions to manipulate our nested arrays of nodes, without having to worry about mutating the original array. Furthermore, testing these functions becomes much simpler, because as long as we can provide a consistent input (extending the generic base type of the function), we can expect a consistent output.

Having a suite of utilities to handle nested “tree-like” structures in an implementation like an LSP, as we increase supported behavior of our server. For example, if we wanted to add a provider that handles folding code blocks to our server, most of this could be done by re-using the functions we have already defined. While the current language server protocol doesn’t require FoldingRange objects to be nested, having generic utility functions to manipulate nested objects, allows us to easily implement this feature. The necessary data-structure for creating a FoldingRange object, can be found HERE. In most programming languages, folded code blocks are allowed to contain other folded code blocks. In my opinion, this makes representing FoldingRange items as a nested structure the most logical approach. To further justify this design choice, I would not find it unlikely for future versions of the protocol to allow the client to use nested FoldingRange’s to create a hierachal view of the code (maybe as virtual texts, or other UI elements).

import { FoldingRange } from "vscode-languageserver";

export interface LspFoldingRange extends FoldingRange {
  startLine: number;
  endLine: number;
  kind?: FoldingRangeKind;
  children: LspFoldingRange[];
}

After defining the LspFoldingRange interface, we can add a new property to the Analyzer.cache, which will store the possible LspFoldingRange[] requested by the client. We could then build the possible folds from the SyntaxNode[] for a file, if the DocumentSymbol[] does not contain all symbol’s that are foldable. Lastly, we call the flatten function we defined earlier, to convert the LspFoldingRange[] into a flat array of FoldingRange objects, when the server needs to send the data to the client. Then the client handles the rest of the logic for toggling the folds, and displaying them in the UI.

Many other features of the LSP can be implemented in a similar fashion. The current protocol mostly relies on the client to handle structuring objects based on their Location and Range properties. However, the predecessor to the DocumentSymbol interface, used in code mentioned earlier, was the SymbolInformation interface. SymbolInformation was used to represent the same data as DocumentSymbol, but the it was not represent hierachal. It was stored as a flat array of objects whose Location/Range properties were handled by the client. As the LSP have evolved, the SymbolInformation interface was deprecated in favor of the DocumentSymbol interface, and it would not surprise me if other interfaces in the protocol follow suit.

Conclusion

Unlike strictly OOP languages (i.e., java), typescript allows for generics to be used in both OOP and functional programming styles. While I had previously used generics in OOP languages, I had never used them in a functional programming setting that also contained static typing. In my experience, being able to write functional code, remove’s a lot of the headache that comes with OOP design. Espcially when working on larger projects, where external dependencies are intertwined, writing generic pure functions that manipulate objects of similar shape makes the codebase much more maintainable. Not only does it reinforce the mindset to approach creating abstractions to decouple repetitive code. It also allows for simple testing, and the ability to extend the functionality of the codebase.

Lastly, I want to clarify that often when using new frameworks in collaboration on a project, it is very difficult to properly compose useful general solutions before you have encountered a comprehensive set of use cases. I often remind myself that practicing DRY, to early is not healthy. Premature optimization is a common pitfall for developers, and I tend to find more success when I wait to refactor code into a solution like I’ve described above. Even when I know I’m going to be refactoring a task later, waiting till I’ve written 2 or 3 implementations of the same task, helps me understand the problem better. Repeating procedure’s and then refactoring them into a more general solution, ensure’s that you’ve thought about what parameters are actually needed to account for all the different use cases. You also get the added benefit of understanding ways to test the refactored solution.

References