Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

Jeff Garzik wrote:

Matt Mackall wrote:

Have you benchmarked this against lib/sha1.c? Please post the results.
Until then, I'm frankly skeptical that your unrolled version is faster
because when I introduced lib/sha1.c the rolled version therein won by
a significant margin and had 1/10th the cache footprint.

See the benchmark tables in patch 0 at the head of this thread. Performance improved by at least 25% in every test, and 40-60% was more common for the 32-bit version (on a Pentium IV).

It's not just the loop unrolling; it's the register allocation and spilling. For comparison, I built SHATransform() from the drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty close to what you tested back then. The resulting code is 49% MOV instructions, and 80% of *those* involve memory. gcc4 is somewhat better, but it still spills a whole lot, both for the 2.6.11 unrolled code and for the current lib/sha1.c.

In contrast, the assembly implementation in this patch only has to go to memory for data and workspace (with one small exception in the F3 rounds), and the workspace has a fifth of the cache footprint of the default implementation.
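
One plausible reading of the "fifth of the cache footprint" figure, offered here as an assumption since the patch itself is not quoted in this message: SHA-1's message schedule can be kept in a rolling 16-word (64-byte) buffer instead of a fully expanded 80-word (320-byte) array. The C sketch below only illustrates that general technique; the names expand_full(), next_w() and schedules_agree() are hypothetical and are not taken from the patch or from lib/sha1.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Full expansion: all 80 schedule words are stored (320 bytes). */
static void expand_full(const uint32_t block[16], uint32_t w[80])
{
    int t;

    assert(block != NULL && w != NULL);
    memcpy(w, block, 16 * sizeof(uint32_t));
    for (t = 16; t < 80; t++)
        w[t] = rol32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1);
}

/*
 * Rolling schedule: only the last 16 words are kept (64 bytes).
 * Must be called with t = 0, 1, ..., 79 in order; returns W[t].
 * Note that slot (t - 16) & 15 is the same slot as t & 15, so the
 * oldest word is overwritten by the newest one.
 */
static uint32_t next_w(uint32_t ring[16], const uint32_t block[16], int t)
{
    if (t < 16)
        ring[t] = block[t];
    else
        ring[t & 15] = rol32(ring[(t - 3) & 15] ^ ring[(t - 8) & 15] ^
                             ring[(t - 14) & 15] ^ ring[t & 15], 1);
    return ring[t & 15];
}

/* Returns 1 if both schedules yield identical W[0..79] for this block. */
static int schedules_agree(const uint32_t block[16])
{
    uint32_t w[80], ring[16];
    int t;

    expand_full(block, w);
    for (t = 0; t < 80; t++)
        if (next_w(ring, block, t) != w[t])
            return 0;
    return 1;
}
```

Calling schedules_agree() on any 16-word input block checks that the 64-byte rolling buffer produces the same schedule words as the 320-byte array, which is where a 1:5 workspace ratio would come from under this assumption.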

Yes. And it also depends on the CPU as well. Testing on a server-class x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce different results than testing on popular but less-capable "value" CPUs.


Good point.  I benchmarked the 32-bit assembly code on a couple more boxes:

=== AMD Duron, average of 5 trials ===
Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
        block  update    (C)  (asm)
    0      16      16    104     72     31%
    1      64      16     52     36     31%
    2      64      64     45     29     36%
    3     256      16     33     23     30%
    4     256      64     27     17     37%
    5     256     256     24     14     42%
    6    1024      16     29     20     31%
    7    1024     256     20     11     45%
    8    1024    1024     19     11     42%
    9    2048      16     28     20     29%
   10    2048     256     19     11     42%
   11    2048    1024     18     10     44%
   12    2048    2048     18     10     44%
   13    4096      16     28     19     32%
   14    4096     256     18     10     44%
   15    4096    1024     18     10     44%
   16    4096    4096     18     10     44%
   17    8192      16     27     19     30%
   18    8192     256     18     10     44%
   19    8192    1024     18     10     44%
   20    8192    4096     17     10     41%
   21    8192    8192     17     10     41%

=== Classic Pentium, average of 5 trials ===
Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
        block  update    (C)  (asm)
    0      16      16    145    144      1%
    1      64      16     72     61     15%
    2      64      64     65     52     20%
    3     256      16     46     39     15%
    4     256      64     39     32     18%
    5     256     256     36     29     19%
    6    1024      16     40     33     18%
    7    1024     256     30     23     23%
    8    1024    1024     29     23     21%
    9    2048      16     39     32     18%
   10    2048     256     29     22     24%
   11    2048    1024     28     22     21%
   12    2048    2048     28     22     21%
   13    4096      16     38     32     16%
   14    4096     256     28     22     21%
   15    4096    1024     28     21     25%
   16    4096    4096     27     21     22%
   17    8192      16     38     32     16%
   18    8192     256     28     22     21%
   19    8192    1024     28     21     25%
   20    8192    4096     27     21     22%
   21    8192    8192     27     21     22%

The improvement isn't as good, but it's still noticeable.
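
For readers recomputing the tables: the Change column is not defined in this message, but it appears to be the relative reduction in cycles per byte, (C - asm) / C, rounded to the nearest percent. The helper below is a small sketch under that assumption; pct_change() is a hypothetical name, not part of the benchmark code.

```c
#include <assert.h>

/*
 * Percentage reduction in cycles/byte going from the C version to the
 * assembly version, rounded to the nearest integer.
 * Assumes c_cycles > 0 and asm_cycles <= c_cycles.
 */
static int pct_change(int c_cycles, int asm_cycles)
{
    assert(c_cycles > 0);
    assert(asm_cycles >= 0 && asm_cycles <= c_cycles);
    return (100 * (c_cycles - asm_cycles) + c_cycles / 2) / c_cycles;
}
```

With the Duron row for test 0, pct_change(104, 72) gives 31, and with the Pentium row for test 0, pct_change(145, 144) gives 1, matching the tables. Since the cycle counts shown are themselves rounded, a value recomputed this way could in principle differ by a point from one computed on the raw measurements.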

--Benjamin Gilbert

References:
- [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64
  - From: Benjamin Gilbert
- [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  - From: Benjamin Gilbert
- Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  - From: Matt Mackall
- Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  - From: Jeff Garzik