As developers, we use programming languages every day, but have you ever considered how they’re actually created? In this post, we’ll explore the first step in language creation by building a parser for a simple demo language using parser combinators in TypeScript.
この記事の目次
Understanding Parser Combinators
Let’s start by understanding what parser combinators are. Essentially, parser combinators are functions that allow us to combine smaller parsers to create more complex ones. This approach enables us to build parsers in a modular and composable way, which is particularly useful when creating parsers for complex languages.
For our project, we will be using the arcsecond
library, a parser combinator library designed for JS/TS.
Project Setup
To begin, let’s set up our project. Create a new directory and initialize it with npm:
mkdir demo-language-parser cd demo-language-parser npm init -y npm install arcsecond typescript @types/node npx tsc --init
These commands set up a new TypeScript project and install the arcsecond
library.
Our Demo Language
For this tutorial, we’ll create a straightforward language capable of defining variables and performing basic arithmetic. Here’s an example of what our language might look like:
let x = 5 let y = 10 print x + y
Our parser will need to handle variable declarations, arithmetic expressions, and print statements.
Building the Parser
Let’s start by importing the necessary functions from arcsecond
:
import { char, letters, digits, sequenceOf, choice, many, many1, str, spaces, optionalWhitespace, between, } from 'arcsecond';
These functions will serve as the building blocks for our parser. Each of these functions is a parser or a combinator that we’ll use to build our language parser. For example, char
creates a parser that matches a single character, while sequenceOf
combines multiple parsers that must match in sequence.
Now, let’s break down our parser into smaller, manageable components.
Parsing Identifiers
First, we’ll create a parser for identifiers (variable names):
const identifier = sequenceOf([ letters, many(choice([letters, digits, char('_')])), ]).map(([first, rest]) => first + rest.join(''));
This parser is more complex than it might first appear:
– We use letters
to match one or more letters at the start of the identifier.
– Then, we use many
combined with choice
to allow any number of letters, digits, or underscores to follow.
– The sequenceOf
combinator ensures these two parts occur in order.
– Finally, we use map
to transform the result into a single string.
This approach ensures our identifiers start with a letter and can contain letters, numbers, and underscores afterward, which is a common convention in many programming languages.
Parsing Numbers
Next, let’s create a parser for numbers:
const number = many1(digits).map(Number);
This parser is simpler:
– many1
ensures we match one or more digits.
– We then use map
to convert the string of digits into an actual number.
This parser will handle integer numbers in our language. If we wanted to extend it to handle floating-point numbers, we’d need to modify it to optionally include a decimal point and more digits.
Parsing Expressions
Now, let’s tackle parsing basic arithmetic expressions:
const expression = choice([ number, identifier, between(char('('), char(')'))(lazy(() => expression)), ]).chain((left) => many( sequenceOf([ optionalWhitespace, choice([char('+'), char('-'), char('*'), char('/')]), optionalWhitespace, lazy(() => expression), ]) ).map((rights) => rights.reduce( (acc, [, operator, , right]) => ({ type: 'BinaryExpression', operator, left: acc, right }), left ) ) );
This parser is the most complex part of our language:
– We start with choice
, allowing an expression to be a number, an identifier, or a parenthesized expression.
– The between
combinator handles parentheses, ensuring expressions can be nested.
– We use lazy
to handle recursive expressions, allowing for arbitrary nesting depth.
– The chain
method allows us to handle binary operations. It takes the left side of an expression and then looks for operators and right-hand expressions.
– We use many
to allow multiple operations in a single expression, like 1 + 2 * 3
.
– Finally, we use reduce
to build up the AST for the expression, creating a tree of binary operations.
This parser can handle complex expressions like (x + 5) * (y - 3)
, building them into an AST that represents the structure and precedence of the operations.
Parsing Statements
Now, let’s create parsers for our statements:
const letStatement = sequenceOf([ str('let'), spaces, identifier, optionalWhitespace, char('='), optionalWhitespace, expression, ]).map(([, , name, , , , value]) => ({ type: 'LetStatement', name, value })); const printStatement = sequenceOf([ str('print'), spaces, expression, ]).map(([, , value]) => ({ type: 'PrintStatement', value })); const statement = choice([letStatement, printStatement]);
Here we define two types of statements:
– letStatement
parses variable declarations. It looks for the ‘let’ keyword, an identifier, an equals sign, and an expression.
– printStatement
parses print statements, looking for the ‘print’ keyword followed by an expression.
– We use sequenceOf
to ensure these elements appear in the correct order, with appropriate spacing.
– The map
function at the end of each parser transforms the parsed elements into an AST node.
Parsing the Program
Finally, let’s create a parser for the entire program:
const program = many(sequenceOf([optionalWhitespace, statement, optionalWhitespace])).map( (statements) => ({ type: 'Program', statements: statements.map(([, stmt]) => stmt) }) );
This parser brings everything together:
– We use many
to allow multiple statements in a program.
– Each statement can be surrounded by optional whitespace.
– The map
function transforms the result into a ‘Program’ node containing an array of statements.
Using Our Parser
Now that we’ve built our parser, let’s use it to parse a more complex program that utilizes all the features we’ve implemented:
const input = ` let x = 5 let y = 10 let z = (x + y) * 2 print x + y print z - 15 print (z + x) / (y - 3) `; const result = program.run(input); console.log(JSON.stringify(result, null, 2));
This input program demonstrates:
1. Simple variable declarations (x
and y
)
2. A more complex variable declaration involving arithmetic and parentheses (z
)
3. A simple print statement with addition
4. A print statement with subtraction
5. A complex print statement with nested arithmetic operations and parentheses
Let’s break down what our parser does with this input:
1. It parses each line as a separate statement.
2. For let
statements, it recognizes the variable name and the expression assigned to it.
3. For print
statements, it parses the expression to be printed.
4. In expressions, it handles:
– Simple numbers and variables
– Binary operations (addition, subtraction, multiplication, division)
– Parentheses for grouping and precedence
The output will be an Abstract Syntax Tree (AST) representing our program. Here’s a simplified version of what the AST might look like:
{ "type": "Program", "statements": [ { "type": "LetStatement", "name": "x", "value": {"type": "Literal", "value": 5} }, { "type": "LetStatement", "name": "y", "value": {"type": "Literal", "value": 10} }, { "type": "LetStatement", "name": "z", "value": { "type": "BinaryExpression", "operator": "*", "left": { "type": "BinaryExpression", "operator": "+", "left": {"type": "Identifier", "name": "x"}, "right": {"type": "Identifier", "name": "y"} }, "right": {"type": "Literal", "value": 2} } }, { "type": "PrintStatement", "value": { "type": "BinaryExpression", "operator": "+", "left": {"type": "Identifier", "name": "x"}, "right": {"type": "Identifier", "name": "y"} } }, { "type": "PrintStatement", "value": { "type": "BinaryExpression", "operator": "-", "left": {"type": "Identifier", "name": "z"}, "right": {"type": "Literal", "value": 15} } }, { "type": "PrintStatement", "value": { "type": "BinaryExpression", "operator": "/", "left": { "type": "BinaryExpression", "operator": "+", "left": {"type": "Identifier", "name": "z"}, "right": {"type": "Identifier", "name": "x"} }, "right": { "type": "BinaryExpression", "operator": "-", "left": {"type": "Identifier", "name": "y"}, "right": {"type": "Literal", "value": 3} } } } ] }
This AST captures the structure of our program, including variable declarations, arithmetic expressions, and print statements. It represents nested expressions as nested objects, preserving the order of operations.
To further test our parser, we could try some edge cases:
const edgeCases = ` let _underscore123 = 42 let emptyParens = () print 1 + 2 + 3 + 4 + 5 print ((((1 + 1)))) `; const edgeCaseResult = program.run(edgeCases); console.log(JSON.stringify(edgeCaseResult, null, 2));
This tests:
1. Variable names with underscores and numbers
2. Empty parentheses
3. A long chain of additions
4. Deeply nested parentheses
Our parser should handle these cases correctly, demonstrating its robustness and flexibility.
By running these examples through our parser, we can see how it handles various language constructs and expressions, validating that our implementation correctly captures the structure and semantics of our simple language.
Wrapping Up
This is the first step in creating a programming language by building a parser using parser combinators. This approach allows us to make complex parsers from simple, reusable components. While parsing is just the beginning of creating a language, it’s a crucial step that lays the foundation for everything else.
Creating a language is a complex but rewarding task that can enhance your understanding of how programming languages work. You might want to continue adding features to the language, and the modular nature of parser combinators will definitely make extending your parser a manageable process.
カテゴリー: