A simple stack-based virtual machine in C++ with a Forth like programming language
This project contains
The instructions are fixed-width at 32-bits and so are the arithmetic
operands.
By default, programs have 1 million cells available for both program text
and data. This means that a virtual machine memory takes up 4MB plus the
data and instruction stacks.
The text and data regions are overlapped, so you can easily write
self-modifying code (early versions actually required self-modification to
be able to return from subroutine calls, just like Knuth’s MIX, but I’ve
since taken liberty to add such modern convenience into the core instruction
set).
There are no registers. This is a stack machine, after all.
As we know from theoretical computer science, a pushdown automaton needs
two stacks to be Turing equivalent. Therefore we employ two as well; one
for the instruction pointer and one for the data. They live separately from
the text and data region, and are only limited by the host process heap
size.
The machine contains no special facilities besides this: It’s inherently
single-threaded and has no protection mechanisms. Its operation is
completely sandboxed, though, except for access to standard output.
The project aim was to create a simple machine and language to play around
with. You can benefit from it by reading the source code, playing with a
language similar to Forth, but conceptually simpler, and finally by seeing
how easy it is to build your own system.
The language is very similar to Forth and PostScript: You basically write
in RPN — reverse Polish notation. Anything not recognized as an
instruction is put on the data stack, so to put the numbers 3 and 2 on the
stack, just write
3 2
To multiply them, just append with an asterix:
3 2 * ; multiplication
This operation pops the topmost two numbers on the stack and replaces them
with the result of the multiplication. To run such a program, you’d need to
include the core library first, since multiplication is defined as a
function:
$ cat tests/core.src your-file.src | sm
6
Labels are identifiers ending with a colon.
They refer to a particular cell in the machine, and you can access their
position, value or execute code from that cell location:
label: ; create a label for the cell at this location
&label ; put ADDRESS of label on top of stack
&label LOAD ; put VALUE of label's cell "label" on top-of-stack
label ; EXECUTE code from label position
So, to put the address of a label on the top of the data stack, just
prepend the label name with an ampersand.
If you want the value of an address, put the address on the TOS (top of
stack) and use the LOAD
instruction to replace the TOS with the value at
the given cell location.
When executing code at a given label position, the machine first puts the
address of the next instruction on top of the instruction stack. This way
you can return from a function call by using the instruction POPIP
:
main: ; program start
print-dot
print-dot
HALT
print-dot:
'.' OUT
'\n' OUT
POPIP ; return from "function"
An idiom for creating variables is to create labels and putting a NOP
at
that location to reserve one memory cell to hold variables. An example of
using a counter variable to implement a loop is given below.
counter: NOP ; reserve 1 word for the variable "counter"
program: 2 &counter STOR ; set counter to two
&counter LOAD 1 ADD &counter STOR ; increment counter by one
; loop counter+1 times
display: '\n' '*' OUT OUT ; print an asterix
1 &counter LOAD SUB &counter STOR ; decrement counter by one
&display &counter LOAD JNZ ; jump to display if not zero
The output of the above program is three stars:
$ ./sm foo.src
*
*
*
You can forward-reference labels. In fact, another idiom is to jump to the
main part of the program at the start of the source.
You can do 72 OUT
to print the letter “H” (72 is the ASCII code for “H”).
Cutting to the chase, a program to print “Hello!” would be:
; Labels are written as a name without whitespace
; and a colon at the end.
main:
72 out ; "H"
101 out ; "e"
108 dup out out ; "ll"
111 out ; "o"
33 out ; "!"
; newline
'\n' out
42 outnum ; print a number
'\n' out ; and newline
; stop program
halt
Notice the use of the HALT
instruction to stop the program.
I’ve implemented a multiplication function in the core library in
tests/core.src
:
mul: ; ( a b -- (a*b) )
mul-res: nop ; placeholder for result
mul-cnt: nop ; placeholder for counter
mul-num: nop
&mul-cnt stor ; b to cnt
dup
&mul-res stor ; a to res
&mul-num stor ; and to num
mul-loop:
; calculate res += a
&mul-res load
&mul-num load +
&mul-res stor
; decrement counter
&mul-cnt load
-1
&mul-cnt stor
; loop until counter is zero
&mul-cnt load
&mul-loop swap -1 jnz
&mul-res load
popip
; ...
*: ; alias for mul
mul
popip
Note that this function needs definitions for the functions +
and -1
.
Recall the program to multiply two numbers. Put the following in a file
hey.src
:
3 2 * outnum
'\n' out
halt
If we concatenate the core library with our program, we get:
$ cat tests/core.src hey.src | ./sm
6
You could implement the whole program without depending on the core library:
; semi-obfuscated multiply and print
; does not depend on any libraries
; re-inventing the wheel can be very educational!
main:
12345 67890 * outnum
'\n' out
halt
; multiplication function w/inner loop
*:
R: nop C: nop N: nop
&C stor dup &R stor &N stor
*-loop:
&R load &N load add &R stor
1 &C load sub &C stor
&C load &*-loop swap 1 swap sub jnz
&R load
popip
While implementing the Karatsuba algorithm should be quite easy, Toom-Cook
multiplication is left as an exercise for the reader.
I think I need to clarify that this project is actually not a joke. Fun,
absolutely, but not a joke.
I just wanted to create a simple virtual machine and from that I grew a
language. It’s very similar to Forth and PostScript, and we all know those
are extremely powerful — particularly Forth!
Building stuff yourself is a powerful way of learning.
The following is a program to generate and print Fibonacci numbers, taken
from tests/fib.src
:
; Print the well-known Fibonacci sequence
;
; Our word size is only 32-bits, so we can't
; count very far.
; Program starts at main, so jump there
&main jmp
; Create label 'count', which refers to this memory
; address.
;
; The NOP (no operation; do nothing) is only used
; to reserve memory space for a variable.
count:
nop
; Initialize the counter by storing 46 at the address of 'count'.
;
; POPIP will pop the instruction pointer, effectively jumping to
; the next location (probably the caller).
count-init:
46 &count stor
popip
; Shorthand for loading the number at 'count' onto the top of the stack.
;
; The "( -- counter)" comment is similar to Forth's comments, explaining
; that no number is expected on the stack, and after running this function,
; a number ("counter") will be on the stack.
count-get: ; ( -- counter )
&count load ; load number
popip
; Shorthand for decrementing the number on the stack
dec: ; ( a -- a-1 )
1 swap sub
popip
; Store top of stack to 'count', do not alter stack
count-set: ; ( counter -- counter )
dup &count stor
popip
; Decrement counter and return it
count-dec: ; ( -- counter )
count-get dec
count-set
popip
; Print number with a newline without altering stack
show: ; ( number -- number )
dup outnum
'\n' out
popip
; Duplicate two top-most numbers on stack
dup2: ; ( a b -- a b a b )
swap ; b a
dup ; b a a
rol3 ; a a b
dup ; a a b b
rol3 ; a b b a
swap ; a b a b
popip
jump-if-nonzero: ; ( dest_address predicate -- )
swap jnz
popip
; The start of our Fibonacci printing program
main:
count-init
0 show ; first Fibonacci number
1 ; second Fibonacci number
loop:
; add top numbers and show
; a b -> a b a b -> a b (a + b)
dup2 add show
; decrement, loop if non-zero
count-dec &loop jump-if-nonzero
I’ve added a HALT
instruction. This replaces the old idiom of looping
forever to signal that a program was finished:
stop: stop ; form 1
stop: &stop jmp ; form 2
halt ; convenience form
Originally, it was an argument of minimalism for not including any halt
instructions.
Secondly, I’ve added a POPIP
instruction along with automatically storing
the next instruction before performing a jump. This effectively let’s you
call and return from subroutines:
boot:
&main jmp halt
foo: bar: baz:
'\n' '!' 'e' 'c' 'i' 'u' 'j' 'e' 'l' 't' 'e' 'e' 'B'
out out out out out out out out out out out out out
popip
main:
foo bar baz
Third, I never bothered to write my own print number function, because it
would require me to write both division and modulus functions in source
first. So I implemented OUTNUM
that prints a number to the output:
123 OUTNUM '\n' OUT ; prints "123\n"
Lacking is proper string handling. One could say that string handling is
not this language’s strongest point.
To compile and run the examples:
$ make all check
To see the low-level machine instructions:
$ ./smr -h
To execute source code on-the-fly:
$ ./sm filename
To compile source to bytecode:
$ ./smc filename
The assembly language is not documented other than in code, because I’m
actively playing with it.
Although the interpreter is slow, it should be possible to convert stack
operations to a register machine. In fact, it should be trivial to compile
programs to native machine code, e.g. x86.
The instructions are found include/instructions.hpp
:
VALUE OPCODE EXPLANATION
0x00000000 NOP do nothing
0x00000001 ADD pop a, pop b, push a + b
0x00000002 SUB pop a, pop b, push a - b
0x00000003 AND pop a, pop b, push a & b
0x00000004 OR pop a, pop b, push a | b
0x00000005 XOR pop a, pop b, push a ^ b
0x00000006 NOT pop a, push !a
0x00000007 IN read one byte from stdin, push as word on stack
0x00000008 OUT pop one word and write to stream as one byte
0x00000009 LOAD pop a, push word read from address a
0x0000000A STOR pop a, pop b, write b to address a
0x0000000B JMP pop a, goto a
0x0000000C JZ pop a, pop b, if a == 0 goto b
0x0000000D PUSH push next word
0x0000000E DUP duplicate word on stack
0x0000000F SWAP swap top two words on stack
0x00000010 ROL3 rotate top three words on stack once left, (a b c) -> (b c a)
0x00000011 OUTNUM pop one word and write to stream as number
0x00000012 JNZ pop a, pop b, if a != 0 goto b
0x00000013 DROP remove top of stack
0x00000014 PUSHIP push a in IP stack
0x00000015 POPIP pop IP stack to current IP, effectively performing a jump
0x00000016 DROPIP pop IP, but do not jump
0x00000017 COMPL pop a, push the complement of a
The instruction set could easily be more minimal, even more so if we allowed
registers. Also, we have taken absolutely no care about the machine code
values for each instruction. A good design would do something cool with
that.
Placed in the public domain in 2010 by the author, Christian Stigen Larsen
http://csl.sublevel3.org