For years I’ve wanted to create my own programming language. Recently I took the time to do so, and a few weeks ago the project reached a milestone: The compiler builds a non-trivial program – and it’s fast! Before that I’d built a simple interpreter for the same language. This is a collection of my thoughts on planning a personal programming language project for others who are just starting out.

A lot of us programmers dream of making our own language just because it’s cool. We’d like to design particular bits of syntax the way we think a language ought to work, or maybe we want to design a domain-specific language for text adventures or similar.

What’s the Goal?

Once you get past the pure daydream stage you have to actually make choices about your implementation: Interpreter or compiler? JIT or AOT? Dynamic or static typing? What language will you use to create your language? Do you use an existing backend?

Beyond these technical questions, what future do you see for your language? There’s nothing wrong with doing it for the sake of learning. Making something other programmers can and will use is quite a different thing: The bar for what’s expected keeps going higher.

Minimum Viable Product?

Language implementers now recognize that modern languages require good tooling and a code ecosystem, not just a good compiler or interpreter.

To get wide adoption, you need a good way to build and run programs and to share library code. (For compiled languages, see Rust’s ‘Cargo’ tool, Crystal’s ‘crystal’ tool, or Zig’s build tools.) Compiled languages largely solve the running-and-dependencies problem since they produce binaries: Go goes so far as to package all required libraries into an executable file as the standard result of a build, with no dynamic linking at all.

Whether compiled or interpreted, you need a good way to pull in libraries during development: rubygems.org for Ruby, pip and PyPI (the Python Package Index) for Python, etc. The Go, D, and Nim languages also have these build and library facilities. Rust’s Cargo tool manages builds and pulls in any needed libraries the build specifies.

Interpreted languages also need ways to distribute programs and run them with their required dependencies. Ruby has Bundler. Python has virtualenv, Poetry, or the whole Conda ecosystem to choose from.

Some basic IDE support is usually expected but most new languages can get away without it at first.

Users expect nicer, more comprehensive documentation these days than they did twenty years ago.

In short, people expect a fully-formed language ecosystem and community.

Mainstream established languages don’t necessarily have a consensus on tooling and library distribution: Java, and even more so C and C++, have many options for libraries and build approaches (Maven for Java and CMake for C++ are nowhere near as nice as Cargo, for instance). They “get away” with it because they have huge communities and heaps of legacy code out in the world.

To decide to pick up something new, programmers want to know, at the least, that they won’t have to suffer a fragmented ecosystem and inferior tooling.

And all this is necessary but not sufficient. You’ll find quite a few extremely nice, well-developed languages you’ve never heard of. Or you’ve heard of them, but nobody at your company has, and they won’t touch one no matter how nice it is.

So, just keep in mind that building a new language others will enthusiastically adopt requires lots of ambition, time, and energy, not to mention skill. You can’t go it alone, either. And I haven’t even touched on marketing and promoting the use of your language.

With that out of the way, I assume you just can’t stop yourself and you still plan on making your own language. That’s cool, I’m doing it too. You don’t need to build out the whole ecosystem as long as you recognize it will stay a hobby project.

Technical Considerations

You will need to consider these aspects of the language project:

  • Interpreter or Compiler
  • Static or Dynamic Typing
  • Implementation language for your interpreter or compiler
  • Backend

Interpreter or Compiler

Most modern interpreters produce instructions for an imaginary machine or “virtual machine.” Generating the instructions – “bytecode” – requires a compilation step.
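To make the bytecode idea concrete, here is a minimal sketch of a stack-based virtual machine in Rust. The instruction set and names are invented for illustration; real VMs (CPython’s, the JVM) are far richer, but the core dispatch loop looks much like this:

```rust
// A minimal stack-based bytecode VM: instructions for an imaginary machine.
// The opcodes here are invented for illustration, not any real VM's format.
#[derive(Clone, Copy)]
enum Op {
    Push(i64), // push a constant onto the stack
    Add,       // pop two values, push their sum
    Mul,       // pop two values, push their product
}

fn run(code: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    for op in code {
        match op {
            Op::Push(n) => stack.push(*n),
            Op::Add => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push(a + b);
            }
            Op::Mul => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push(a * b);
            }
        }
    }
    stack.pop().unwrap()
}

fn main() {
    // "Compile" the expression 2 + 3 * 4 into bytecode, then interpret it.
    let code = [Op::Push(2), Op::Push(3), Op::Push(4), Op::Mul, Op::Add];
    println!("{}", run(&code)); // prints 14
}
```

The compilation step mentioned above is whatever turns source text into that `code` array; the interpreter proper is just the `run` loop.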

The distinction I draw between compilers and interpreters is a practical one: True compilers produce executables, while interpreters compile in memory and invoke their VM immediately. In reality, compiler versus interpreter classifies systems on a spectrum. For instance, both Python and Java run compiled bytecode on a virtual machine. Python will cache the bytecode to reuse later without recompiling if the source code hasn’t changed; the Java compiler will make “.class” files containing bytecode that typically get bundled up into “.jar” files for distribution. The “.jar” files can be executed on the Java Virtual Machine (JVM). The standard way to run Java apps doesn’t require compilation on the user’s system; the compiler and JVM are distributed separately, and most users don’t have a Java compiler installed.

Java still counts as a compiled language in my book: Compilation and execution are totally separate, and Java programs are often distributed separately from the JVM. You still need a Java runtime installed to use a Java application, though. Further along the spectrum, native binaries produced by C++ and Rust compilers need only the most fundamental libraries installed on the user’s system (glibc), and Go-produced binaries not even that.

The choice is: Do you want to require users to install your language in order to run programs written in it, or do you prefer that programs can be distributed as normal executables, with all dependencies included? Interpreted languages don’t normally provide ways to bundle programs with all their dependencies, since there’s no “executable” product as such. (There are many tools to bundle Python applications, for example, and they mostly work, but they’re a whole added layer of complexity.)

This might make it seem like compilers are always the better approach, but if you want to create a scripting language to embed in a larger system, a compiler wouldn’t make much sense: You want users to be able to write script files and have the system execute them, with no extra steps required to prepare the scripts.

Performance is a big consideration as well, with interpreted languages normally running ten to one hundred times slower than binary executables. However, it’s possible for an interpreter to internally compile to high-performance code with a JIT compiler (see LuaJIT). It’s also possible to simply use a very high-performance virtual machine and design your language to avoid expensive-to-interpret features. And on the other hand, a binary executable might be no more than saved bytecode and a VM packaged together, which wouldn’t necessarily run any faster than a typical interpreter, though you’d still benefit from the packaging convenience.

Static or Dynamic Types

Usually interpreted languages eschew static typing, but there’s no reason that has to be the case. It’s just that, having designed a statically typed language, you might as well compile to high-performance native code; the interpreted version wouldn’t run nearly as fast.

If you want to design a language with dynamic typing and meta-programming features, an interpreter is going to be much easier than a compiler – at least if you want the compiler to make fast executables that don’t crash all the time.

A statically typed language will be hard to implement if its types don’t map onto fairly standard ones. For maximum flexibility in your design you’d want to target a virtual machine custom-built to support your language.

In general a dynamically typed language will be easier to implement. You just have to add lots of runtime type checking.
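What “lots of runtime type checking” means in practice: Every value carries a type tag, and every operation inspects the tags before acting. A minimal sketch in Rust, with invented names:

```rust
// Runtime-tagged values for a dynamically typed interpreter.
// Every operation inspects the tags and reports a type error when they don't fit.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Bool(bool),
}

// Suppose '+' means addition on ints and concatenation on strings --
// anything else is a runtime type error.
fn add(a: &Value, b: &Value) -> Result<Value, String> {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => Ok(Value::Int(x + y)),
        (Value::Str(x), Value::Str(y)) => Ok(Value::Str(format!("{x}{y}"))),
        _ => Err(format!("type error: can't add {:?} and {:?}", a, b)),
    }
}

fn main() {
    println!("{:?}", add(&Value::Int(1), &Value::Int(2)));     // Ok(Int(3))
    println!("{:?}", add(&Value::Int(1), &Value::Bool(true))); // Err("type error: ...")
}
```

A static type-checker does the same tag-matching work once, at compile time, which is exactly why the statically typed version can skip these checks and run faster.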

Language Choice for the Implementation

Interpreters

For any at all serious effort you need to pick a language that makes fast programs: C++, C, Rust, perhaps Java or C#. Runners-up would be Go, D, Nim, Crystal, Scala Native, and Common Lisp.

Some languages will make writing an interpreter rather difficult. Rust, for instance, with its strict ownership model, makes managing program state in an interpreter painful, especially if you care about performance. It’s possible, just not the easy path. As I was learning Rust I made two tree-walk interpreters and investigated building a register-based VM interpreter. It wasn’t easy.

C++ won’t tie your hands like Rust will, but you get all of its well-known shortcomings. Many of the bugs Rust is meant to prevent won’t be a problem in practice, and C++ makes it much easier to build fairly safe self-referential data structures, which you’ll need. On the other hand, with Rust I was able to hand-code parsers that needed very little debugging once they compiled, which wasn’t my experience with C++.

If you know modern C++ fairly well, I’d stick with it. Otherwise Java or Scala could be an interesting choice. With GraalVM you can make stand-alone binaries out of your Java “.jar” compiler output, so distributing your language wouldn’t require the JVM. You could still use Java libraries in your project. GraalVM provides an ahead-of-time compiler for JVM bytecode: You feed it “.jar” files and it can make an executable image.

Compilers

For pure compiler projects, I think Rust is a really good choice. You get the power of the Rust “enum” type: Pattern matching plus data storage. You can take advantage of Rust’s functional aspects: A compiler doesn’t need a lot of self-referential data structures or mutable state. Instead, it transforms an input through several transformation steps into a different output (the final or intermediate representation of the program). You barely have to worry about performance, and the type-checker will save you hours of debugging.
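To illustrate what Rust enums buy you here, consider a hypothetical expression AST and a constant-folding pass over it. Each step consumes one tree and produces another – exactly the input-to-output transformation style described above:

```rust
// A tiny expression AST as a Rust enum, plus a constant-folding pass.
// Each transformation step consumes one tree and produces another --
// no shared mutable state, just input -> output.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Fold constant subexpressions: Add(Num, Num) and Mul(Num, Num) become Num.
fn fold(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => match (fold(*a), fold(*b)) {
            (Expr::Num(x), Expr::Num(y)) => Expr::Num(x + y),
            (a, b) => Expr::Add(Box::new(a), Box::new(b)),
        },
        Expr::Mul(a, b) => match (fold(*a), fold(*b)) {
            (Expr::Num(x), Expr::Num(y)) => Expr::Num(x * y),
            (a, b) => Expr::Mul(Box::new(a), Box::new(b)),
        },
        e => e, // Num passes through unchanged
    }
}

fn main() {
    // (2 + 3) * 4 folds to 20 at "compile time".
    let tree = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)))),
        Box::new(Expr::Num(4)),
    );
    println!("{:?}", fold(tree)); // prints Num(20)
}
```

The exhaustive `match` is the payoff: Add a new `Expr` variant and the compiler points at every pass that forgot to handle it.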

Really, any reasonably strictly-typed language will work fine. Since a pure compiler doesn’t require the high performance or mutable state C++ allows, I’d steer clear of C++ unless it’s already your best language.

Backend: Compiler Output

What will your compiler produce? You can output machine instructions directly, but then you’re limited to that one type of machine. Usually compilers emit an IR (intermediate representation) of the program. This can be assembly code you send to an assembler, but it can also be something more generic you send to a “backend” compiler tool, which will optimize, assemble, and link your IR into a working executable.

You could generate the IR as text and invoke the final compilation and linking on it afterward. For instance, the QBE (Quick Backend) tool accepts the text of its IR language and produces assembly code which you can assemble and link into a working executable. It doesn’t have its own assembler and linker, but it does optimize the IR and produces several flavors of assembly.

One type of IR is just another programming language: Nim, for instance, takes this approach and compiles to C or JavaScript. To create a working executable, the compiler has to generate the IR (C) and call a C compiler and linker automatically.

While compiling to C has drawbacks that affect your language design somewhat, you get two big advantages: Portability and speed. Most C compilers will heavily optimize their output, and most platforms have some C compiler implemented for them. You can also pull in lots of C libraries to support your run-time, and you can probably interoperate easily with other languages that already interface with C libraries.
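A compile-to-C backend can be as simple as walking the AST and printing C text. A hypothetical sketch, using a toy expression tree (the names and the tree itself are invented for illustration):

```rust
// Compiling to C as the IR: walk a toy expression tree and emit C source text.
// A real backend would then invoke a C compiler (gcc, tcc, ...) on the output.
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
}

// Emit a C expression for the tree; parentheses sidestep precedence questions.
fn emit_c(e: &Expr) -> String {
    match e {
        Expr::Num(n) => n.to_string(),
        Expr::Add(a, b) => format!("({} + {})", emit_c(a), emit_c(b)),
    }
}

fn main() {
    let tree = Expr::Add(
        Box::new(Expr::Num(1)),
        Box::new(Expr::Add(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)))),
    );
    // Wrap the expression in a complete C program and print it;
    // handing this text to a C compiler yields the final executable.
    let program = format!(
        "#include <stdio.h>\nint main(void) {{ printf(\"%d\\n\", {}); return 0; }}\n",
        emit_c(&tree)
    );
    print!("{program}");
}
```

Statements, functions, and a runtime make the emitter much bigger, but the shape stays the same: One function per AST node kind, each returning or writing C text.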

Or, you could consider generating code using a code generation library. The Cranelift project is a Rust library for JIT (just-in-time) compilation; it includes a toy language implementation example. The LLVM project is the best-known way to generate code to an intermediate representation and then produce executables. It’s a complex subject to learn while also making your first language.

What’s Next?

While developing my language I thought over the different approaches discussed here, weighing their pros and cons. Sometime soon I may post some more details of how I built parts of the compiler. But briefly:

I ended up writing an interpreter and a compiler in Rust, with the compiler targeting generic C. I use TCC (the Tiny C Compiler) to compile, link, and run my generated code very quickly. The same code builds with gcc, which you would use instead if you need highly optimized executables – something I don’t require for testing the compiler output.

Eventually I removed the interpreter because it was too hard to maintain in parallel with the compiler: The Rust interpreter made some new language features difficult enough to implement that I didn’t have time to also build the same features into the compiler and keep the behavior identical. If I were beginning the project now I’d make some different design choices in the interpreter code, but at the time I was just learning Rust, and a few things – like the token types – were overcomplicated.

The parser is a hand-crafted recursive descent parser. It produces an abstract syntax tree that can be type-checked and used to generate code, and it collects errors and recovers from some of them.
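This is not my parser, but a minimal illustration of the recursive descent style: One function per grammar rule, each consuming input and calling the rules below it. For brevity it reads characters directly (a real parser would lex tokens first) and evaluates as it goes instead of building an AST:

```rust
// A minimal recursive-descent parser with correct precedence:
//   expr   := term ('+' term)*
//   term   := factor ('*' factor)*
//   factor := single digit
// It evaluates as it parses; a real compiler would build AST nodes instead.
struct Parser<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<u8> {
        self.input.get(self.pos).copied()
    }
    fn expr(&mut self) -> i64 {
        let mut v = self.term();
        while self.peek() == Some(b'+') {
            self.pos += 1; // consume '+'
            v += self.term();
        }
        v
    }
    fn term(&mut self) -> i64 {
        let mut v = self.factor();
        while self.peek() == Some(b'*') {
            self.pos += 1; // consume '*'
            v *= self.factor();
        }
        v
    }
    fn factor(&mut self) -> i64 {
        let c = self.peek().expect("unexpected end of input");
        assert!(c.is_ascii_digit(), "expected a digit");
        self.pos += 1; // consume the digit
        (c - b'0') as i64
    }
}

fn main() {
    let mut p = Parser { input: b"2+3*4", pos: 0 };
    println!("{}", p.expr()); // prints 14: '*' binds tighter than '+'
}
```

Precedence falls out of the call structure: `expr` calls `term`, which calls `factor`, so `*` binds tighter than `+` without any precedence tables.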

The language itself is a statement language like Go or Python, fairly old-fashioned in some ways.

My goal was to:

  • learn to make a language that’s more than a toy
  • address well-known shortcomings of similar languages
  • add a few modern features not seen in Go
  • get as much type safety as possible while keeping the language very easy to use
  • strongly encourage, but not mandate, immutable values, and minimize side-effects

At this point the language has immutable and mutable variables, and functions as described; arrays are nearly working. I’ve got integer, float, bool and string basic types, and records (structs) are in the works along with set types and enumerations. Currently only “if” and “while” statements control flow. Once enums, arrays and sets are done I’ll add a “for each” sort of statement and a “case” statement for matching enums and numbers.

User defined strong types (newtypes) are in the planning stage. I haven’t decided the best way to add them.

If I get that done, next I’ll add a module system.

This is a flavor of what the code looks like:

fun fib(n: int): int {
	if n = 0 or n = 1 { return n }
	return fib(n - 1) + fib(n - 2)
}

val n = 8
print "Fib of ", n,": ", fib(n)

Newlines can terminate a statement; semicolons can separate statements on the same line.

You declare variables as mutable with “var” or immutable with “val”.

“val” variables can’t be modified inside functions taking them as arguments.

Function parameters are immutable by default: Their values can’t be changed inside the function. Marking a parameter with “var” allows it to be changed inside the function and has that change reflected outside the function. To allow changes to a parameter inside a function while preventing side-effects, mark it with “cpy” to make an internal temporary copy.

Passing a “val” variable as an argument to a “var” parameter is a compile error.

// Here 'a' can be modified in the function and that change will be seen in the outer scope.
// 'b' can't be assigned a new value in the function.
// 'c' can have a new value assigned inside the function but since it is a copy the outer scope 'c' won't change.
fun math(var a: int, b: int, cpy c: int): int {
	val d = 11
	a := d + 9
	c := b + a * d
	while a < b {
		a := b + 1		
	}
	return a + b + c
}