Little Man Computer — Assembly Code

Cover photo isn’t related, but it’s almost as interesting as my post.

Machine Code vs. Assembly Code

You’ll often hear these two terms used interchangeably, but they are not the same. People will often state that machine code is the actual “0’s” and “1’s” a machine executes, and that’s pretty close to being correct. People also think that if you can write assembly you’re a damn genius, and that if you can write machine code you’re the smartest person they know. In short, writing machine code doesn’t make you all that smart; it is cumbersome and slow. Assembly code is close enough to the machine, adds symbols, labels, comments, and other features, and if you really don’t trust your assembler’s output, you can tweak the machine code afterwards. Gross.

Let’s do it anyway. We’ll build an incredibly simple program in both assembly and machine code, and then see why humans don’t write machine code.

Memory Cell    Machine Code    What it does
00             901             Get input from the user and put it in the accumulator
01             205             Subtract (2) the value in memory cell (05) from the accumulator
02             704             Branch if zero (7) to memory cell (04)
03             600             Branch always (6) to memory cell (00)
04             000             Halt execution
05             123             Store valid PIN number

So, the actual file that the LMC simulator would see would be this:

901
205
704
600
000
123
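
To make it concrete how a simulator reads those six numbers, here is a minimal Python sketch. It is not the real LMC simulator: `run_lmc` is a made-up name, it only understands the five instructions this program uses, and it lets the accumulator go negative as a plain integer, which real simulators handle in their own ways.

```python
def run_lmc(program, inputs):
    """Run LMC machine code and return how many inputs were consumed."""
    memory = list(program) + [0] * (100 - len(program))  # 100 cells total
    pending = list(inputs)   # values the "user" will type, in order
    accumulator = 0
    counter = 0              # address of the next instruction

    while True:
        instruction = memory[counter]
        counter += 1
        opcode, operand = divmod(instruction, 100)  # 205 -> (2, 5)

        if instruction == 0:          # 000: halt
            return len(inputs) - len(pending)
        elif instruction == 901:      # 901: read input into the accumulator
            accumulator = pending.pop(0)
        elif opcode == 2:             # 2xx: subtract cell xx
            accumulator -= memory[operand]
        elif opcode == 6:             # 6xx: branch always
            counter = operand
        elif opcode == 7:             # 7xx: branch if accumulator is zero
            if accumulator == 0:
                counter = operand
        else:
            raise ValueError(f"unsupported instruction {instruction:03d}")


pin_checker = [901, 205, 704, 600, 0, 123]
print(run_lmc(pin_checker, [111, 456, 123]))  # halts on the third attempt: 3
```

If the list of inputs runs out before the right PIN shows up, the sketch simply fails with an IndexError instead of waiting for more input.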

This machine code I didn’t write by hand. I wrote an assembly program and then used an assembler to turn it into machine code. Here’s my actual source:

//Pin Number Checker - Basic (PIN = 123). R.Lerner 2023/10/27
getpin	INP         // Get PIN Code from user
	SUB pincode // Subtract correct "pincode"
	BRZ good    // If the value is zero, go to "good"
	BRA getpin  // Otherwise, ask for the pin number again
good	HLT         // Correct password. End the program.
pincode DAT 123	    // Store the valid pin code to "pincode"

Isn’t that nicer? We can add comments all over here (everything after the //, or more often ; in assembly). We aren’t telling the code to go to “memory location 00”, we’re telling it to go to “getpin” or “good”, and we’re also referring to the memory location “05” as “pincode”. Plus, isn’t it a lot easier learning what INP, OUT, HLT, SUB and the like all mean? The numbers are functional but odd, especially for large architectures.

How does an “Assembler” do it?

You’ll notice that the OpCode table from my first blog, and the mnemonics / opcodes above line up pretty closely. INP becomes 901, HLT becomes 000 — but what about SUB? In the other blog I call this opcode “2xx” and here it is “205”… That’s because we’re subtracting, and using memory/cell “05” to get the value to subtract (in this case 123).

Typing “205” hard-codes an absolute address. If ANYTHING changes in this program, that address is at risk of pointing at the wrong cell. What if we update it?

901
205
704
600
000
999
123

Now I’ve added a new memory cell with the value “999”, and it is stored at location “05”. The previous, intended value “123” is now stored in cell “06”. The 205 instruction still subtracts whatever is in cell 05, so running this code is going to check for “999” as the PIN.
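
The shift can be checked with a few lines of Python. This is only an illustration; `operand_target` is a hypothetical helper that looks up the cell an instruction points at.

```python
original = [901, 205, 704, 600, 0, 123]
updated = original[:5] + [999] + original[5:]   # new cell inserted at 05


def operand_target(cells, address):
    """Return the value stored in the cell that the instruction at `address` refers to."""
    operand = cells[address] % 100   # low two digits are the cell number
    return cells[operand]


print(operand_target(original, 1))  # 123
print(operand_target(updated, 1))   # 999
```

The instruction itself never changed; only the data underneath it moved.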

The same is true for line labels. For example, in the original program we BRZ to the label “good”, or in machine code, 704. We’re branching if zero (BRZ or 7xx) to location 04. If we wrote this out in machine code and then changed the program, we might forget to update the new location, and branch to a potentially dangerous piece of code, or into user data.

Assembly allows the use of labels, like “getpin” and “good” and variables like “pincode” that will automatically update when the source is assembled.

Symbol Table

I’m going to speak to how my assembler works; of course, others may work differently. First, the assembler will strip out anything that doesn’t make sense (or throw exceptions). Line indentation, line endings, comments: stuff the assembler doesn’t care about, or need, to do its work.
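
A stripping step like that could be sketched in Python as below. This is an assumption about how such a step might look, not the assembler described here; `clean_source` is a made-up name, and it only handles the // comment style used in the source above.

```python
def clean_source(text):
    """Drop comments, indentation and blank lines; return one token list per line."""
    cleaned = []
    for raw in text.splitlines():
        code = raw.split("//", 1)[0]   # everything after // is a comment
        tokens = code.split()          # also discards tabs and extra spaces
        if tokens:                     # skip lines that were only a comment
            cleaned.append(tokens)
    return cleaned


sample = "//Pin checker\ngetpin\tINP   // Get PIN\n\tSUB pincode // Subtract\n"
print(clean_source(sample))  # [['getpin', 'INP'], ['SUB', 'pincode']]
```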

Next, it generates a symbol table. This takes a few passes over the code. The first step identifies any instructions that point to memory locations, then it identifies labels.

From my code above, my assembler outputs this as the symbol table:

000:GETPIN
004:GOOD
005:PINCODE

My assembler is not perfect. I plan to enhance it eventually. Right now, you can see the first element is 000:GETPIN. This lets my script know that any later occurrence of GETPIN in the code refers to location 00. GOOD is 04, etc.
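
One way to produce a table in that shape is sketched below in Python. It assumes the source has already been reduced to one list of tokens per line, and it guesses that a line starts with a label whenever its first word is not a known mnemonic. The names `MNEMONICS`, `build_symbol_table` and `format_symbol_table` are made up for the example.

```python
# Standard LMC mnemonics; anything else at the start of a line is treated as a label.
MNEMONICS = {"ADD", "SUB", "STA", "LDA", "BRA", "BRZ", "BRP",
             "INP", "OUT", "HLT", "DAT"}


def build_symbol_table(lines):
    """Map each label (uppercased) to the address of the line it sits on."""
    table = {}
    for address, tokens in enumerate(lines):
        first = tokens[0].upper()
        if first not in MNEMONICS:
            table[first] = address
    return table


def format_symbol_table(table):
    """Render the table as 'AAA:NAME' lines, in order of appearance."""
    return [f"{address:03d}:{name}" for name, address in table.items()]


source = [["getpin", "INP"], ["SUB", "pincode"], ["BRZ", "good"],
          ["BRA", "getpin"], ["good", "HLT"], ["pincode", "DAT", "123"]]
print(format_symbol_table(build_symbol_table(source)))
# ['000:GETPIN', '004:GOOD', '005:PINCODE']
```

Because every line of an LMC program occupies exactly one memory cell, the line index doubles as the address.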

What’s wrong with the assembler?

  • The symbol table does not indicate what type each symbol is. If you make a line label and then later use it as a variable, there will be unintended consequences. This also makes the file less consumable, because whatever disassembler you may use down the road will look at it and get its work wrong.
  • In the code the symbols are lowercase; in the symbol table they’re uppercase. I made my symbol table case-insensitive, so even with this file a disassembler would generate code that differs from the original source.

Once it iterates over all the lines of code, marking where the symbols were encountered, it then adds a reference inside of the code. So, the “Intermediate value” would look like this:

901
2%PINCODE%
7%GOOD%
6%GETPIN%
000
123
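
A Python sketch of how that intermediate form could be generated is shown below. It is an illustration under assumptions, not the actual assembler: `to_intermediate`, `ADDRESSED` and `FIXED` are made-up names, only the mnemonics used in this program are covered, and the input is one token list per source line.

```python
ADDRESSED = {"SUB": "2", "BRA": "6", "BRZ": "7"}   # opcode digit + two-digit address
FIXED = {"INP": "901", "HLT": "000"}               # complete instructions


def to_intermediate(lines):
    """Emit opcodes, leaving each address as a %SYMBOL% placeholder."""
    output = []
    for tokens in lines:
        tokens = [token.upper() for token in tokens]
        known = tokens[0] in ADDRESSED or tokens[0] in FIXED or tokens[0] == "DAT"
        if not known:
            tokens = tokens[1:]        # first word was a label, drop it
        mnemonic = tokens[0]
        if mnemonic in FIXED:
            output.append(FIXED[mnemonic])
        elif mnemonic == "DAT":
            value = int(tokens[1]) if len(tokens) > 1 else 0
            output.append(f"{value:03d}")
        else:
            output.append(f"{ADDRESSED[mnemonic]}%{tokens[1]}%")
    return output


source = [["getpin", "INP"], ["SUB", "pincode"], ["BRZ", "good"],
          ["BRA", "getpin"], ["good", "HLT"], ["pincode", "DAT", "123"]]
print(to_intermediate(source))
# ['901', '2%PINCODE%', '7%GOOD%', '6%GETPIN%', '000', '123']
```

An unknown mnemonic raises a KeyError in this sketch; a real assembler would want a friendlier message.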

Actual Assembling

Now you can see the weird middle: the opcodes are appearing, but we’re seeing %PINCODE%, %GOOD% etc. in the code. The “%%” are just delimiters; they make sure nothing is interpreted incorrectly.

The final step is to dig back into the symbol table and add the related values to the end of each opcode (taking off the leading zero). Here, you can see the symbols next to the commands:

901
2(05)  %PINCODE%
7(04)  %GOOD%
6(00)  %GETPIN%
000
123
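
That last substitution can be sketched in Python with a regular expression. Again, this is an illustration rather than the actual script; `parse_symbol_table` and `resolve` are hypothetical names.

```python
import re


def parse_symbol_table(lines):
    """Turn 'AAA:NAME' lines into a {NAME: 'AAA'} lookup."""
    table = {}
    for line in lines:
        address, name = line.split(":")
        table[name] = address
    return table


def resolve(intermediate, table):
    """Replace each %SYMBOL% with its two-digit address."""
    def substitute(match):
        return table[match.group(1)][1:]   # "005" -> "05", leading zero removed
    return [re.sub(r"%(\w+)%", substitute, line) for line in intermediate]


symbols = parse_symbol_table(["000:GETPIN", "004:GOOD", "005:PINCODE"])
print(resolve(["901", "2%PINCODE%", "7%GOOD%", "6%GETPIN%", "000", "123"], symbols))
# ['901', '205', '704', '600', '000', '123']
```

A symbol that is used but never defined raises a KeyError here, which is at least loud.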

Sanity Checks

For an actual binary for the LMC, the only characters that can be in it are 0-9, in groups of three, with a line ending (000, 001 … 998, 999). If anything else is present, the assembler will throw a malformed output warning.

Additionally, after assembly, if more than 100 cells would be used, it will throw an error. The LMC only has room for 100 instructions/data. This is determined through static analysis of the code; a dynamic program (one that changes memory while it runs) could, in theory, break out of this check.

It will also make sure that the total number of digits in the output is evenly divisible by 3, the length of each instruction.

The last check is for HLT. It isn’t perfect, but it tries to ensure there is a HLT (end program) before the data section. This stops data from being executed as code.
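
Checks along these lines could be sketched in Python as follows. This is a simplified illustration with a made-up function name, `sanity_check`. In particular, the HLT check here only confirms that a 000 exists somewhere, and a DAT cell holding zero looks exactly the same, so it is weaker than a check that knows where the data section starts.

```python
import re


def sanity_check(output):
    """Return a list of problems found in assembled LMC output (empty if none)."""
    problems = []
    lines = output.splitlines()

    for number, line in enumerate(lines):
        if not re.fullmatch(r"[0-9]{3}", line):   # exactly three digits per cell
            problems.append(f"cell {number:02d}: malformed value {line!r}")

    if len(lines) > 100:                          # the LMC has 100 cells
        problems.append(f"{len(lines)} cells used, only 100 available")

    if len("".join(lines)) % 3 != 0:              # whole number of 3-digit cells
        problems.append("output length is not a multiple of 3")

    if "000" not in lines:                        # no HLT anywhere
        problems.append("no HLT (000) found")

    return problems


print(sanity_check("901\n205\n704\n600\n000\n123\n"))  # []
```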

Where’s the Linker?

Since these programs have a maximum length of 100 instructions, and have no mechanism for functions / subroutines, I did not think it made sense to build a linker.

Giveaway

I’m going to be reading the comments in this blog series, and when it concludes, I will reach out to the person with the best comment or question. I have one copy of the below book I’ll send out to that person if they will share their address with me. It’s not my book, I just have two of them and want to give one away to somebody who is interested.

