stack machine

A simple stack-based virtual machine in C++ with a Forth like programming language

149
22
C++

Stack-Machine

This project contains

  • A simple, stack-based virtual machine for executing low-level instructions
  • An assembler supporting a Forth / PostScript like language
  • An interpreter able to run compiled programs

Architecture and design

The instructions are fixed-width at 32-bits and so are the arithmetic
operands.

By default, programs have 1 million cells available for both program text
and data. This means that a virtual machine memory takes up 4MB plus the
data and instruction stacks.

The text and data regions are overlapped, so you can easily write
self-modifying code (early versions actually required self-modification to
be able to return from subroutine calls, just like Knuth’s MIX, but I’ve
since taken liberty to add such modern convenience into the core instruction
set).

There are no registers. This is a stack machine, after all.

As we know from theoretical computer science, a pushdown automaton needs
two stacks to be Turing equivalent. Therefore we employ two as well; one
for the instruction pointer and one for the data. They live separately from
the text and data region, and are only limited by the host process heap
size.

The machine contains no special facilities besides this: It’s inherently
single-threaded and has no protection mechanisms. Its operation is
completely sandboxed, though, except for access to standard output.

Aim

The project aim was to create a simple machine and language to play around
with. You can benefit from it by reading the source code, playing with a
language similar to Forth, but conceptually simpler, and finally by seeing
how easy it is to build your own system.

The programming language

The language is very similar to Forth and PostScript: You basically write
in RPN — reverse Polish notation. Anything not recognized as an
instruction is put on the data stack, so to put the numbers 3 and 2 on the
stack, just write

3 2

To multiply them, just append with an asterix:

3 2 * ; multiplication

This operation pops the topmost two numbers on the stack and replaces them
with the result of the multiplication. To run such a program, you’d need to
include the core library first, since multiplication is defined as a
function:

$ cat tests/core.src your-file.src | sm
6

Labels, addresses and their values

Labels are identifiers ending with a colon.

They refer to a particular cell in the machine, and you can access their
position, value or execute code from that cell location:

label:      ; create a label for the cell at this location
&label      ; put ADDRESS of label on top of stack
&label LOAD ; put VALUE of label's cell "label" on top-of-stack
label       ; EXECUTE code from label position

So, to put the address of a label on the top of the data stack, just
prepend the label name with an ampersand.

If you want the value of an address, put the address on the TOS (top of
stack) and use the LOAD instruction to replace the TOS with the value at
the given cell location.

When executing code at a given label position, the machine first puts the
address of the next instruction on top of the instruction stack. This way
you can return from a function call by using the instruction POPIP:

main:       ; program start
  print-dot
  print-dot
  HALT

print-dot:
  '.' OUT
  '\n' OUT
  POPIP     ; return from "function"

Variables and subroutines

An idiom for creating variables is to create labels and putting a NOP at
that location to reserve one memory cell to hold variables. An example of
using a counter variable to implement a loop is given below.

counter: NOP                     ; reserve 1 word for the variable "counter"

program: 2 &counter STOR                       ; set counter to two
         &counter LOAD 1 ADD &counter STOR     ; increment counter by one

; loop counter+1 times

display: '\n' '*' OUT OUT                      ; print an asterix
         1 &counter LOAD SUB &counter STOR     ; decrement counter by one
         &display &counter LOAD JNZ            ; jump to display if not zero

The output of the above program is three stars:

$ ./sm foo.src
*
*
*

You can forward-reference labels. In fact, another idiom is to jump to the
main part of the program at the start of the source.

Hello, world!

You can do 72 OUT to print the letter “H” (72 is the ASCII code for “H”).
Cutting to the chase, a program to print “Hello!” would be:

; Labels are written as a name without whitespace
; and a colon at the end.

main:
   72 out          ; "H"
  101 out          ; "e"
  108 dup out out  ; "ll"
  111 out          ; "o"
   33 out          ; "!"

  ; newline
  '\n' out

  42 outnum     ; print a number
  '\n' out      ; and newline

  ; stop program
  halt

Notice the use of the HALT instruction to stop the program.

Multiplication and core library

I’ve implemented a multiplication function in the core library in
tests/core.src:

mul:            ; ( a b -- (a*b) )
  mul-res: nop  ; placeholder for result
  mul-cnt: nop  ; placeholder for counter
  mul-num: nop

  &mul-cnt stor ; b to cnt
  dup
  &mul-res stor ; a to res
  &mul-num stor ; and to num

  mul-loop:
    ; calculate res += a
    &mul-res load
    &mul-num load +
    &mul-res stor

    ; decrement counter
    &mul-cnt load
    -1
    &mul-cnt stor

    ; loop until counter is zero
    &mul-cnt load
    &mul-loop swap -1 jnz

  &mul-res load
  popip

; ...

*:        ; alias for mul
  mul
  popip

Note that this function needs definitions for the functions + and -1.

Recall the program to multiply two numbers. Put the following in a file
hey.src:

3 2 * outnum
'\n' out
halt

If we concatenate the core library with our program, we get:

$ cat tests/core.src hey.src | ./sm
6

You could implement the whole program without depending on the core library:

; semi-obfuscated multiply and print
; does not depend on any libraries

; re-inventing the wheel can be very educational!

main:
  12345 67890 * outnum
  '\n' out
  halt

; multiplication function w/inner loop
*:
  R: nop C: nop N: nop
  &C stor dup &R stor &N stor

  *-loop:
    &R load &N load add &R stor
    1 &C load sub &C stor
    &C load &*-loop swap 1 swap sub jnz

  &R load
  popip

While implementing the Karatsuba algorithm should be quite easy, Toom-Cook
multiplication is left as an exercise for the reader.

It’s not a joke

I think I need to clarify that this project is actually not a joke. Fun,
absolutely, but not a joke.

I just wanted to create a simple virtual machine and from that I grew a
language. It’s very similar to Forth and PostScript, and we all know those
are extremely powerful — particularly Forth!

Building stuff yourself is a powerful way of learning.

A Fibonacci program

The following is a program to generate and print Fibonacci numbers, taken
from tests/fib.src:

; Print the well-known Fibonacci sequence
;
; Our word size is only 32-bits, so we can't
; count very far.

; Program starts at main, so jump there

&main jmp

; Create label 'count', which refers to this memory
; address.
;
; The NOP (no operation; do nothing) is only used
; to reserve memory space for a variable.

count:
  nop

; Initialize the counter by storing 46 at the address of 'count'.
;
; POPIP will pop the instruction pointer, effectively jumping to
; the next location (probably the caller).

count-init:
  46 &count stor
  popip

; Shorthand for loading the number at 'count' onto the top of the stack.
;
; The "( -- counter)" comment is similar to Forth's comments, explaining
; that no number is expected on the stack, and after running this function,
; a number ("counter") will be on the stack.

count-get: ; ( -- counter )
  &count load     ; load number
  popip

; Shorthand for decrementing the number on the stack

dec: ; ( a -- a-1 )
  1 swap sub
  popip

; Store top of stack to 'count', do not alter stack

count-set: ; ( counter -- counter )
  dup &count stor
  popip

; Decrement counter and return it

count-dec: ; ( -- counter )
  count-get dec
  count-set
  popip

; Print number with a newline without altering stack

show: ; ( number -- number )
  dup outnum
  '\n' out
  popip

; Duplicate two top-most numbers on stack

dup2: ; ( a b -- a b a b )
  swap       ; b a
  dup        ; b a a
  rol3       ; a a b
  dup        ; a a b b
  rol3       ; a b b a
  swap       ; a b a b
  popip

jump-if-nonzero: ; ( dest_address predicate -- )
  swap jnz
  popip

; The start of our Fibonacci printing program

main:
  count-init

  0 show  ; first Fibonacci number
  1       ; second Fibonacci number

  loop:
    ; add top numbers and show
    ; a b -> a b a b -> a b (a + b)
    dup2 add show

    ; decrement, loop if non-zero
    count-dec &loop jump-if-nonzero

Convenience features

I’ve added a HALT instruction. This replaces the old idiom of looping
forever to signal that a program was finished:

stop: stop      ; form 1
stop: &stop jmp ; form 2
halt            ; convenience form

Originally, it was an argument of minimalism for not including any halt
instructions.

Secondly, I’ve added a POPIP instruction along with automatically storing
the next instruction before performing a jump. This effectively let’s you
call and return from subroutines:

boot:
  &main jmp halt

foo: bar: baz:
  '\n' '!' 'e' 'c' 'i' 'u' 'j' 'e' 'l' 't' 'e' 'e' 'B'
  out out out out out out out out out out out out out
  popip

main:
  foo bar baz

Third, I never bothered to write my own print number function, because it
would require me to write both division and modulus functions in source
first. So I implemented OUTNUM that prints a number to the output:

123 OUTNUM '\n' OUT ; prints "123\n"

Lacking is proper string handling. One could say that string handling is
not this language’s strongest point.

Compiling the project

To compile and run the examples:

$ make all check

To see the low-level machine instructions:

$ ./smr -h

To execute source code on-the-fly:

$ ./sm filename

To compile source to bytecode:

$ ./smc filename

The assembly language is not documented other than in code, because I’m
actively playing with it.

Although the interpreter is slow, it should be possible to convert stack
operations to a register machine. In fact, it should be trivial to compile
programs to native machine code, e.g. x86.

Instruction set

The instructions are found include/instructions.hpp:

VALUE       OPCODE  EXPLANATION
0x00000000  NOP     do nothing
0x00000001  ADD     pop a, pop b, push a + b
0x00000002  SUB     pop a, pop b, push a - b
0x00000003  AND     pop a, pop b, push a & b
0x00000004  OR      pop a, pop b, push a | b
0x00000005  XOR     pop a, pop b, push a ^ b
0x00000006  NOT     pop a, push !a
0x00000007  IN      read one byte from stdin, push as word on stack
0x00000008  OUT     pop one word and write to stream as one byte
0x00000009  LOAD    pop a, push word read from address a
0x0000000A  STOR    pop a, pop b, write b to address a
0x0000000B  JMP     pop a, goto a
0x0000000C  JZ      pop a, pop b, if a == 0 goto b
0x0000000D  PUSH    push next word
0x0000000E  DUP     duplicate word on stack
0x0000000F  SWAP    swap top two words on stack
0x00000010  ROL3    rotate top three words on stack once left, (a b c) -> (b c a)
0x00000011  OUTNUM  pop one word and write to stream as number
0x00000012  JNZ     pop a, pop b, if a != 0 goto b
0x00000013  DROP    remove top of stack
0x00000014  PUSHIP  push a in IP stack
0x00000015  POPIP   pop IP stack to current IP, effectively performing a jump
0x00000016  DROPIP  pop IP, but do not jump
0x00000017  COMPL   pop a, push the complement of a

The instruction set could easily be more minimal, even more so if we allowed
registers. Also, we have taken absolutely no care about the machine code
values for each instruction. A good design would do something cool with
that.

License and author

Placed in the public domain in 2010 by the author, Christian Stigen Larsen
http://csl.sublevel3.org