Game: Which is faster?

Oct. 29th, 2009 | 12:41 pm

Last night, while working with someone I'm tutoring, we got into a conversation about "what is faster?" Below is the assembler generated from two different programs that did exactly the same thing, but in different ways.

The question is: which is faster?
Second question: which do you think is the more maintainable code?

First program:

	.file	"?"
	.text
.globl main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	movq	%rsp, %rbp
	.cfi_offset 6, -16
	.cfi_def_cfa_register 6
	subq	$2097184, %rsp
	leaq	-2097184(%rbp), %rax
	movl	$1048576, %edx
	movl	$64, %esi
	movq	%rax, %rdi
	call	memset
	movl	$0, -20(%rbp)
	jmp	.L2
.L5:
	leaq	-1048608(%rbp), %rax
	movq	%rax, -16(%rbp)
	leaq	-2097184(%rbp), %rax
	movq	%rax, -8(%rbp)
	jmp	.L3
.L7:
	nop
.L3:
	movq	-8(%rbp), %rax
	movzbl	(%rax), %edx
	movq	-16(%rbp), %rax
	movb	%dl, (%rax)
	movq	-16(%rbp), %rax
	leaq	1(%rax), %rdx
	movq	-8(%rbp), %rax
	addq	$1, %rax
	movzbl	(%rax), %eax
	movb	%al, (%rdx)
	movq	-16(%rbp), %rax
	leaq	2(%rax), %rdx
	movq	-8(%rbp), %rax
	addq	$2, %rax
	movzbl	(%rax), %eax
	movb	%al, (%rdx)
	movq	-16(%rbp), %rax
	leaq	3(%rax), %rdx
	movq	-8(%rbp), %rax
	addq	$3, %rax
	movzbl	(%rax), %eax
	movb	%al, (%rdx)
	movq	-16(%rbp), %rax
	leaq	4(%rax), %rdx
	movq	-8(%rbp), %rax
	addq	$4, %rax
	movzbl	(%rax), %eax
	movb	%al, (%rdx)
	movq	-16(%rbp), %rax
	leaq	5(%rax), %rdx
	movq	-8(%rbp), %rax
	addq	$5, %rax
	movzbl	(%rax), %eax
	movb	%al, (%rdx)
	movq	-16(%rbp), %rax
	leaq	6(%rax), %rdx
	movq	-8(%rbp), %rax
	addq	$6, %rax
	movzbl	(%rax), %eax
	movb	%al, (%rdx)
	movq	-16(%rbp), %rax
	leaq	7(%rax), %rdx
	movq	-8(%rbp), %rax
	addq	$7, %rax
	movzbl	(%rax), %eax
	movb	%al, (%rdx)
	addq	$8, -16(%rbp)
	addq	$8, -8(%rbp)
	leaq	-1048608(%rbp), %rax
	addq	$1048576, %rax
	cmpq	-16(%rbp), %rax
	jne	.L7
.L4:
	addl	$1, -20(%rbp)
.L2:
	cmpl	$4095, -20(%rbp)
	jbe	.L5
	movl	$0, %eax
	leave
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.ident	"GCC: (GNU) 4.4.1 20090725 (Red Hat 4.4.1-2)"
	.section	.note.GNU-stack,"",@progbits

Second program:

	.file	"?"
	.text
.globl main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	movq	%rsp, %rbp
	.cfi_offset 6, -16
	.cfi_def_cfa_register 6
	subq	$2097184, %rsp
	leaq	-2097184(%rbp), %rax
	movl	$1048576, %edx
	movl	$64, %esi
	movq	%rax, %rdi
	call	memset
	movl	$0, -20(%rbp)
	jmp	.L2
.L5:
	leaq	-1048608(%rbp), %rax
	movq	%rax, -16(%rbp)
	leaq	-2097184(%rbp), %rax
	movq	%rax, -8(%rbp)
	jmp	.L3
.L4:
	movq	-8(%rbp), %rax
	movzbl	(%rax), %edx
	movq	-16(%rbp), %rax
	movb	%dl, (%rax)
	addq	$1, -16(%rbp)
	addq	$1, -8(%rbp)
.L3:
	leaq	-1048608(%rbp), %rax
	addq	$1048576, %rax
	cmpq	-16(%rbp), %rax
	jne	.L4
	addl	$1, -20(%rbp)
.L2:
	cmpl	$4095, -20(%rbp)
	jbe	.L5
	movl	$0, %eax
	leave
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.ident	"GCC: (GNU) 4.4.1 20090725 (Red Hat 4.4.1-2)"
	.section	.note.GNU-stack,"",@progbits
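
For readers who don't want to reverse-engineer the listings, here is a rough guess at what the two programs looked like in C. The original source wasn't posted, so the names and exact declarations below are mine; what the assembly does show is two 1 MB char buffers on the stack, the source filled with 64 ('@') by memset, and 4096 passes of a byte-by-byte copy (eight bytes per trip through the inner loop in the first program, one per trip in the second).

First program, approximately (manually unrolled copy):

#include <string.h>

#define SIZE (1024 * 1024)

int main(void)
{
	char src[SIZE];
	char dst[SIZE];
	unsigned int i;

	memset(src, 64, SIZE);

	for (i = 0; i < 4096; i++) {
		char *d = dst;
		char *s = src;

		/* Copy eight bytes per pass through the inner loop. */
		while (d != dst + SIZE) {
			d[0] = s[0];
			d[1] = s[1];
			d[2] = s[2];
			d[3] = s[3];
			d[4] = s[4];
			d[5] = s[5];
			d[6] = s[6];
			d[7] = s[7];
			d += 8;
			s += 8;
		}
	}
	return 0;
}

Second program, approximately (one byte at a time):

#include <string.h>

#define SIZE (1024 * 1024)

int main(void)
{
	char src[SIZE];
	char dst[SIZE];
	unsigned int i;

	memset(src, 64, SIZE);

	for (i = 0; i < 4096; i++) {
		char *d = dst;
		char *s = src;

		/* Copy one byte per pass through the inner loop. */
		while (d != dst + SIZE)
			*d++ = *s++;
	}
	return 0;
}

(Neither listing was compiled with optimization turned on. With -O2 or -O3, plus -funroll-loops if you want unrolling, gcc can do the unrolling itself, though as the comments below suggest, how well it actually does is worth measuring rather than assuming.)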

Comments {7}

Dossy

from: dossy
date: Oct. 29th, 2009 08:07 pm (UTC)

The second, obviously ...

WTF is up with the code duplication in the first program? (See how the end of the .L3 block jumps back to the nop at .L7.)

Dreamer of the Day

from: iamo
date: Oct. 29th, 2009 08:15 pm (UTC)

Looks like an unrolled loop to me. Less jumping == fewer pipeline stalls == faster code.

dude

from: jayp39
date: Oct. 29th, 2009 08:22 pm (UTC)

Without knowing much of anything about assembly, the overhead of each instruction, the amount of cache on the CPU doing the benchmarking, or the length of its pipelines, I'm going to guess the one with less branching, which by my very uneducated reading is the first one.
The seemingly duplicate code may in fact be the unrolling of a loop, which in *some* cases can yield performance improvements, assuming the unrolling doesn't make the code too large to fit in cache.

Something like that, anyway.

And of course, the latter is likely to be more maintainable, what with it being less code and less code duplication.

Disclaimer: these are wild guesses coming from someone who has been awake for over 24 hours, and who only glanced at the code since he doesn't really understand assembly anyway.

Lumiere

from: lumiere
date: Oct. 30th, 2009 03:23 am (UTC)

In what context?

Micro-benchmarks can have misleading results when taken out of context. For example, assuming the unrolled code segment runs slightly faster on its own, you might then want to check what effect it has on the L1 cache and other CPU resources, as these could make the overall system slower even if you're spending less time in this one section.

Of course, if that's the hottest inner loop--and it ought to be if you're doing performance tuning at this level--then this is unlikely...
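
(Tangentially: if you do want to put a number on one of these in isolation, a minimal harness looks something like the sketch below. The buffer size and pass count match the listings; the function name copy_once and everything else here is invented, and on a 2009-era glibc you would link with -lrt for clock_gettime.)

#include <stdio.h>
#include <string.h>
#include <time.h>

#define SIZE (1024 * 1024)

static char src[SIZE];
static char dst[SIZE];

/* One pass of the byte-by-byte copy, standing in for either version's inner loop. */
static void copy_once(void)
{
	char *d = dst;
	char *s = src;

	while (d != dst + SIZE)
		*d++ = *s++;
}

int main(void)
{
	struct timespec t0, t1;
	double secs;
	unsigned int i;

	memset(src, 64, SIZE);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < 4096; i++)
		copy_once();
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("4096 passes over 1 MB: %.3f s (dst[0]=%d)\n", secs, dst[0]);
	return 0;
}

But as above, the number that prints tells you nothing about what the copy costs the rest of the program.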

Lover of Ideas

from: omnifarious
date: Oct. 30th, 2009 09:31 am (UTC)

I'm with the other people. The first looks like an unrolled loop. I'm guessing the second is a lot more maintainable.


Of course, with the right options to gcc you can get it to unroll the loop for you and have the best of both worlds.

Brian "Krow" Aker

from: krow
date: Oct. 30th, 2009 04:17 pm (UTC)

You might be surprised about how poorly gcc handles the above :)

The second is more maintainable, though not really by a lot (aka... only about 9 lines of code actually differ).

Lover of Ideas

from: omnifarious
date: Oct. 30th, 2009 10:20 pm (UTC)

You know, after examining things more carefully, I'm guessing the second might well be faster. The second does have some mildly expensive operations on each loop iteration, but the first has several inefficiencies in each section of unrolled code.

It's also clear that neither was compiled with optimizations turned on. gcc can do a much better job than that.

This code is copying the least significant byte of an array of 2^18 4-byte values into a different array of chars or unsigned chars. They are both allocated on the stack.



