“Why Learning Assembly Language Is Still a Good Idea”

A good article from Randall Hyde (author of “Write Great Code”) on why learning assembly is still a good idea. As a case in point, I’ve been optimising a routine that takes about a fifth of my program’s runtime to see how fast I can get it. Speedups so far, relative to the original C:

1x Plain C (ugly, lots of additions, multiplications and branching loops)
10x Assembly (first cut, uses branches, additions only in a loop)
19x Assembly (second cut, loops are unrolled, simple additions only)
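
To give a rough idea of what unrolling means, here is a minimal sketch in C rather than assembly. The byte-summing routine is only a stand-in, not my actual code, and the names `sum_simple` and `sum_unrolled` are made up for illustration. The unrolled version pays for one loop branch per four additions instead of one per addition.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: one addition and one loop branch per element. */
uint32_t sum_simple(const uint8_t *data, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Unrolled by four: one loop branch per four additions. */
uint32_t sum_unrolled(const uint8_t *data, size_t n)
{
    uint32_t total = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    /* Handle the 0 to 3 elements left over when n is not a multiple of 4. */
    for (; i < n; i++)
        total += data[i];
    return total;
}
```

A modern compiler may well do this on its own at higher optimisation levels, so measure before assuming the hand-unrolled version wins.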

Tonight I’ll probably try interleaving instructions to see whether there are any register stalls lurking that can be tweaked out. If I double-buffer memory access, I hope to speed it up by another 3x or so. It may also be possible to switch to a larger loop (maybe 2x or 4x operations) to save cache space, so there are still a few things to try before it’s close to optimal.
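
The interleaving idea can be sketched in C as well. This is an illustration under assumptions, not my real routine: the function name `sum_interleaved` and the byte-summing workload are hypothetical. The point is that the two accumulators do not depend on each other, so the processor does not have to wait for one addition to finish before starting the next.

```c
#include <stddef.h>
#include <stdint.h>

/* Two independent accumulators: the addition into b does not depend on
 * the result of the addition into a, so the two can overlap instead of
 * forming one long dependency chain through a single register. */
uint32_t sum_interleaved(const uint8_t *data, size_t n)
{
    uint32_t a = 0;
    uint32_t b = 0;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        a += data[i];
        b += data[i + 1];
    }
    /* Pick up the last element when n is odd. */
    if (i < n)
        a += data[i];
    return a + b;
}
```

Whether this helps at all depends on the processor and on what the compiler does with it, so it needs timing on the real hardware.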

Read the article here.

2 Responses to “Why Learning Assembly Language Is Still a Good Idea”


  1. Greg

    Nice. A couple of questions: how long did it take per LOC to get to 10x? (I presume that once it’s down to 1/50 of your runtime you’re just messing around.) Did you try to optimise the C code at all?

  2. Philip

    Hm, quite a while (a night), since I’m not super familiar with the assembly language involved, so I had to figure out the calling conventions, quirks and interfacing with gcc. I gave it a quick burl in C, but the lack of good double-word support means I can’t load 8 bytes at a time, only 4. (I’ll give it a go with a word-sized loop to see what improvement there is.) The gcc is pretty old (2.x), so it optimises less than I’d hope for from the 3.x or 4.x series.

    Now, I could probably do it in under an hour easily. Ironically, now that I’ve optimised it I’ve run into caching issues: the code now fills the data cache unnecessarily and I get some pretty harsh penalties, so I have to figure out how to move my data to an uncached page to stop my fast assembly munging the cache. (Read: benchmarks are speedy, real-world operations are slow!)

    That’ll probably happen while I figure out how to optimise the data layout, since it’s not particularly good at the moment. I’ll probably post some specs and speeds later.
