What is Profiling?
Profiling is measuring the performance of code, whether it is a small function or an entire application. You can determine how quickly the code will execute various requests and tune them to meet your needs.
Cycle Counts
The Little Man Computer is easy to do math on, it is 1:1 — one cycle is one operation (though, if you look at the underlying code that performs those functions, some take more steps than others). In a real computer, something may take several cycles to perform.
If you have a 1MHz processor, you’re probably quite old. I know, I have a MOS 6502. That’s 1 million cycles a second clock speed — but if one of your instructions takes two cycles (or more) then that clock speed is less useful for figuring out how fast a computer is — but can really help you tune your program.
For the sake of this discussion, all LMC clock cycles and instruction cycles are the same thing, and each take exactly the same amount of time*.
*Nerd stuff: This isn’t accurate, the code is ran procedural. A real computer would have some sort of clock generator (usually a crystal oscillator) that is used to time everything, and that clock trigger impulse tells the CPU to iterate the Program Counter and move onto the next step. In this case, the program counter is tied to the operations we’re doing (adding, reading, subtracting, etc) and there is minor variability in these steps.
Let’s do Math!
You might remember the OpCode table, where the only math we can do on an LMC is add and subtract — you might need do occasionally do multiplication — but how?
It’s quite easy! Let’s start by doing 2 * 2:
2 + 2 = 4
That seems like it would be easy, lets just go ahead and take the first number and multiply it by the second!
Let’s do 20 * 2:
20 + 20 = 40
Let’s do 2 * 20:
2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 = 40
Wow… Same exact question, same answer, very different results. So, what is the best approach? Let’s build a simple multiplication program and then tune it.
// Multiply Two Numbers, Slowly - R.Lerner 2023/10/28
// Get Numbers to Multiply
INP
STA a
INP
STA b
LDA a
// Store Ouput Number
STA output
// Subtract 1 from "b", so 5x5 is now 5x4, and the output = 5
LDA b
SUB one
STA b
// Since there's no multiplication operand, we have to loop ANDs
MULT BRZ DONE// If zero is in var B jump to done
LDA output
ADD a
STA output
LDA b // Set accum. to "b" for MULT / BRZ compare
SUB one
STA b
BRA MULT// Jump back to multiply loop until b is at zero.
// Done doing math, output
DONE LDA output
OUT
HLT
a DAT
b DAT
output DAT
one DAT 1
I know all of you folks that use higher level languages are saying “gross, I can just do var x = 2*20;
omg.
Yes, you can, but your computer is still doing what I show above and somebody wrote that at some point.
Let’s look at how many operations things take here:
Multiplication | Cycles Required |
1*10 | 85 |
10*1 | 13 ** |
10*2 | 21 *** |
5*100 | 805 |
100*5 | 45 |
20*20 | 400 |
** – For brevity, here’s what is happening with 10*1 during those 13 cycles:
(Entered “10” on keyboard)
1: [10 / +] Read Input to accumulator
2: [10 / +] Setting RAM location ’20’ to accumulator’s value
(Entered “1” on keyboard)
3: [1 / +] Read Input to accumulator
4: [1 / +] Setting RAM location ’21’ to accumulator’s value
5: [1 / +] Load value at location 20 into the accumulator
6: [10 / +] Setting RAM location ’22’ to accumulator’s value
7: [10 / +] Load value at location 21 into the accumulator
8: [1 / +] Subtracting Value 1 from 23 from the accumulator
9: [0 / +] Setting RAM location ’21’ to accumulator’s value
17: [0 / +] Break if Accumulator = Zero to location 17
18: [0 / +] Load value at location 22 into the accumulator
19: [10 / +] Add accumulator to Output
20: [10 / +] Halting
20: [10 / +] 13 total instruction(s).
*** – How about 10*2? I scratched off the items that are in 10*1:
(Entered “10” on keyboard)
1: [10 / +] Read Input to accumulator
2: [10 / +] Setting RAM location ’20’ to accumulator’s value
(Entered “2” on keyboard)
3: [2 / +] Read Input to accumulator
4: [2 / +] Setting RAM location ’21’ to accumulator’s value
5: [2 / +] Load value at location 20 into the accumulator
6: [10 / +] Setting RAM location ’22’ to accumulator’s value
7: [10 / +] Load value at location 21 into the accumulator
8: [2 / +] Subtracting Value 1 from 23 from the accumulator
9: [1 / +] Setting RAM location ’21’ to accumulator’s value
10: [1 / +] Break if Accumulator = Zero to location 17
11: [1 / +] Load value at location 22 into the accumulator
12: [10 / +] Adding Value 10 from 20 to accumulator
13: [20 / +] Setting RAM location ’22’ to accumulator’s value
14: [20 / +] Load value at location 21 into the accumulator
15: [1 / +] Subtracting Value 1 from 23 from the accumulator
16: [0 / +] Setting RAM location ’21’ to accumulator’s value
17: [0 / +] Jump to 09
17: [0 / +] Break if Accumulator = Zero to location 17
18: [0 / +] Load value at location 22 into the accumulator
19: [20 / +] Add accumulator to Output
20: [20 / +] Halting
20: [20 / +] 21 total instruction(s).
What changed between 10*1 and 10*2?
Like we mentioned above, we have to do things differently:
10 * 1 is just “10”
10 * 2 is 10+10
In this case, I took care to put the larger number first, so the jump from 10*1 to 10*2 was negligible, but since we have to ADD each number, we have to loop over and over again until one of these numbers are zero.
Here’s another table to give you an idea:
Anything multiplied by: | Cycle Count |
* 0 | 21 |
* 1 | 13 |
* 2 | 21 |
* 3 | 29 |
* 4 | 37 |
* 5 | 45 |
Now this would be fine, but users aren’t always that quick and don’t read manuals. Because your code might be used by a company you have stock in, you want it to perform the best possible.
So that’s the end of profiling that code. Maybe it is time to write some new code?
Multiplication v2
Here’s some code I wrote over a year ago, it makes a few minor changes that make a massive difference:
- If one of the numbers entered is a zero, the program returns zero and exits
- If the second number provided is larger, it is swapped with the first number, and then the math is done
These changes mean that it is actually slower than the first example for people who know to put the large number first — but it’s also significantly faster than the first for all other use cases.
// Multiply Two Numbers - R.Lerner 2022/07/23 - Reduces cycle count by multiplying large numbers instead of small.
INP
STA a
BRZ ZERO // If zero value, output zero
INP
STA b
BRZ ZERO // If zero value, output zero
SUB a
BRP SWAPVAR // Branch to swap a & b
BRA DOMATH
SWAPVAR LDA b
STA swap
LDA a
STA b
LDA swap
STA a
DOMATH LDA a
STA output
LDA b
SUB one
STA b
MULT BRZ DONE // If zero is in var B jump to done
LDA output
ADD a
STA output
LDA b // Set accum. to "b" for MULT / BRZ compare
SUB one
STA b
BRA MULT // Jump back to multiply loop
ZERO OUT
HLT
DONE LDA output
OUT
HLT
a DAT
b DAT
output DAT
one DAT 1
swap DAT
Multiplication | Cycles Required Version 2 | Cycles Required Version 1 |
1*10 | 23 | 85 |
10*1 | 18 | 13 |
10*2 | 21 | 26 |
5*100 | 55 | 805 |
100*5 | 50 | 45 |
20*20 | 175 | 165 |
I consider this an overall success, unless you have a use case where you have:
- Users that always RTFM, and actually follow it
- Fault / risk tolerance for the times they forget or do not follow it
- Critically short runtime
Comments
How can you improve it? I’d love to see some code samples in the comments!
Giveaway
I’m going to be reading the comments in this blog series, and when it concludes, I will reach out to the person with the best comment or question. I have one copy of the below book I’ll send out to that person if they will share their address with me. It’s not my book, I just have two of them and want to give one away to somebody who is interested.
Leave a Reply