Writing a Parser in TS

As developers, we use programming languages every day, but have you ever considered how they’re actually created? In this post, we’ll explore the first step in language creation by building a parser for a simple demo language using parser combinators in TypeScript.

Understanding Parser Combinators

Let’s start by understanding what parser combinators are. Essentially, parser combinators are functions that allow us to combine smaller parsers to create more complex ones. This approach enables us to build parsers in a modular and composable way, which is particularly useful when creating parsers for complex languages.

For our project, we will be using the arcsecond library, a parser combinator library designed for JS/TS.

Project Setup

To begin, let’s set up our project. Create a new directory and initialize it with npm:

mkdir demo-language-parser
cd demo-language-parser
npm init -y
npm install arcsecond typescript @types/node
npx tsc --init

These commands set up a new TypeScript project and install the arcsecond library.

Our Demo Language

For this tutorial, we’ll create a straightforward language capable of defining variables and performing basic arithmetic. Here’s an example of what our language might look like:

let x = 5
let y = 10
print x + y

Our parser will need to handle variable declarations, arithmetic expressions, and print statements.

Building the Parser

Let’s start by importing the necessary functions from arcsecond:

import {
  char,
  letters,
  digits,
  sequenceOf,
  choice,
  many,
  many1,
  str,
  spaces,
  optionalWhitespace,
  between,
} from 'arcsecond';

These functions will serve as the building blocks for our parser. Each of these functions is a parser or a combinator that we’ll use to build our language parser. For example, char creates a parser that matches a single character, while sequenceOf combines multiple parsers that must match in sequence.

Now, let’s break down our parser into smaller, manageable components.

Parsing Identifiers

First, we’ll create a parser for identifiers (variable names):

const identifier = sequenceOf([
  letters,
  many(choice([letters, digits, char('_')])),
]).map(([first, rest]) => first + rest.join(''));

This parser is more complex than it might first appear:
– We use letters to match one or more letters at the start of the identifier.
– Then, we use many combined with choice to allow any number of letters, digits, or underscores to follow.
– The sequenceOf combinator ensures these two parts occur in order.
– Finally, we use map to transform the result into a single string.

This approach ensures our identifiers start with a letter and can contain letters, numbers, and underscores afterward, which is a common convention in many programming languages.

Parsing Numbers

Next, let’s create a parser for numbers:

const number = many1(digits).map(Number);

This parser is simpler:
many1 ensures we match one or more digits.
– We then use map to convert the string of digits into an actual number.

This parser will handle integer numbers in our language. If we wanted to extend it to handle floating-point numbers, we’d need to modify it to optionally include a decimal point and more digits.

Parsing Expressions

Now, let’s tackle parsing basic arithmetic expressions:

const expression = choice([
  number,
  identifier,
  between(char('('), char(')'))(lazy(() => expression)),
]).chain((left) =>
  many(
    sequenceOf([
      optionalWhitespace,
      choice([char('+'), char('-'), char('*'), char('/')]),
      optionalWhitespace,
      lazy(() => expression),
    ])
  ).map((rights) =>
    rights.reduce(
      (acc, [, operator, , right]) => ({ type: 'BinaryExpression', operator, left: acc, right }),
      left
    )
  )
);

This parser is the most complex part of our language:
– We start with choice, allowing an expression to be a number, an identifier, or a parenthesized expression.
– The between combinator handles parentheses, ensuring expressions can be nested.
– We use lazy to handle recursive expressions, allowing for arbitrary nesting depth.
– The chain method allows us to handle binary operations. It takes the left side of an expression and then looks for operators and right-hand expressions.
– We use many to allow multiple operations in a single expression, like 1 + 2 * 3.
– Finally, we use reduce to build up the AST for the expression, creating a tree of binary operations.

This parser can handle complex expressions like (x + 5) * (y - 3), building them into an AST that represents the structure and precedence of the operations.

Parsing Statements

Now, let’s create parsers for our statements:

const letStatement = sequenceOf([
  str('let'),
  spaces,
  identifier,
  optionalWhitespace,
  char('='),
  optionalWhitespace,
  expression,
]).map(([, , name, , , , value]) => ({ type: 'LetStatement', name, value }));

const printStatement = sequenceOf([
  str('print'),
  spaces,
  expression,
]).map(([, , value]) => ({ type: 'PrintStatement', value }));

const statement = choice([letStatement, printStatement]);

Here we define two types of statements:
letStatement parses variable declarations. It looks for the ‘let’ keyword, an identifier, an equals sign, and an expression.
printStatement parses print statements, looking for the ‘print’ keyword followed by an expression.
– We use sequenceOf to ensure these elements appear in the correct order, with appropriate spacing.
– The map function at the end of each parser transforms the parsed elements into an AST node.

Parsing the Program

Finally, let’s create a parser for the entire program:

const program = many(sequenceOf([optionalWhitespace, statement, optionalWhitespace])).map(
  (statements) => ({ type: 'Program', statements: statements.map(([, stmt]) => stmt) })
);

This parser brings everything together:
– We use many to allow multiple statements in a program.
– Each statement can be surrounded by optional whitespace.
– The map function transforms the result into a ‘Program’ node containing an array of statements.

Using Our Parser

Now that we’ve built our parser, let’s use it to parse a more complex program that utilizes all the features we’ve implemented:

const input = `
let x = 5
let y = 10
let z = (x + y) * 2
print x + y
print z - 15
print (z + x) / (y - 3)
`;

const result = program.run(input);
console.log(JSON.stringify(result, null, 2));

This input program demonstrates:
1. Simple variable declarations (x and y)
2. A more complex variable declaration involving arithmetic and parentheses (z)
3. A simple print statement with addition
4. A print statement with subtraction
5. A complex print statement with nested arithmetic operations and parentheses

Let’s break down what our parser does with this input:

1. It parses each line as a separate statement.
2. For let statements, it recognizes the variable name and the expression assigned to it.
3. For print statements, it parses the expression to be printed.
4. In expressions, it handles:
– Simple numbers and variables
– Binary operations (addition, subtraction, multiplication, division)
– Parentheses for grouping and precedence

The output will be an Abstract Syntax Tree (AST) representing our program. Here’s a simplified version of what the AST might look like:

{
  "type": "Program",
  "statements": [
    {
      "type": "LetStatement",
      "name": "x",
      "value": {"type": "Literal", "value": 5}
    },
    {
      "type": "LetStatement",
      "name": "y",
      "value": {"type": "Literal", "value": 10}
    },
    {
      "type": "LetStatement",
      "name": "z",
      "value": {
        "type": "BinaryExpression",
        "operator": "*",
        "left": {
          "type": "BinaryExpression",
          "operator": "+",
          "left": {"type": "Identifier", "name": "x"},
          "right": {"type": "Identifier", "name": "y"}
        },
        "right": {"type": "Literal", "value": 2}
      }
    },
    {
      "type": "PrintStatement",
      "value": {
        "type": "BinaryExpression",
        "operator": "+",
        "left": {"type": "Identifier", "name": "x"},
        "right": {"type": "Identifier", "name": "y"}
      }
    },
    {
      "type": "PrintStatement",
      "value": {
        "type": "BinaryExpression",
        "operator": "-",
        "left": {"type": "Identifier", "name": "z"},
        "right": {"type": "Literal", "value": 15}
      }
    },
    {
      "type": "PrintStatement",
      "value": {
        "type": "BinaryExpression",
        "operator": "/",
        "left": {
          "type": "BinaryExpression",
          "operator": "+",
          "left": {"type": "Identifier", "name": "z"},
          "right": {"type": "Identifier", "name": "x"}
        },
        "right": {
          "type": "BinaryExpression",
          "operator": "-",
          "left": {"type": "Identifier", "name": "y"},
          "right": {"type": "Literal", "value": 3}
        }
      }
    }
  ]
}

This AST captures the structure of our program, including variable declarations, arithmetic expressions, and print statements. It represents nested expressions as nested objects, preserving the order of operations.

To further test our parser, we could try some edge cases:

const edgeCases = `
let _underscore123 = 42
let emptyParens = ()
print 1 + 2 + 3 + 4 + 5
print ((((1 + 1))))
`;

const edgeCaseResult = program.run(edgeCases);
console.log(JSON.stringify(edgeCaseResult, null, 2));

This tests:
1. Variable names with underscores and numbers
2. Empty parentheses
3. A long chain of additions
4. Deeply nested parentheses

Our parser should handle these cases correctly, demonstrating its robustness and flexibility.

By running these examples through our parser, we can see how it handles various language constructs and expressions, validating that our implementation correctly captures the structure and semantics of our simple language.

Wrapping Up

This is the first step in creating a programming language by building a parser using parser combinators. This approach allows us to make complex parsers from simple, reusable components. While parsing is just the beginning of creating a language, it’s a crucial step that lays the foundation for everything else.

Creating a language is a complex but rewarding task that can enhance your understanding of how programming languages work. You might want to continue adding features to the language, and the modular nature of parser combinators will definitely make extending your parser a manageable process.

関連記事

カテゴリー:

ブログ

情シス求人

  1. チームメンバーで作字やってみた#1

ページ上部へ戻る