Wednesday, June 10, 2015

Tuning Algorithm::Diff

Motivation

There is always a need for speed. While tuning one of my applications in the field of OCR post-processing, the profiler identified the LCS (Longest Common Subsequence) routine as the most time-consuming part. But the big question was: is it possible to make the already fast Algorithm::Diff and Algorithm::Diff::XS even faster?

Reading the code again and again and trying some micro-optimizations brought only small progress.

OK, don't twiddle, try better algorithms.

So I spent a lot of time reading papers and then trying to implement the described algorithms. Each implementation took some time to debug. Some did not work as they should, i.e. they failed test cases. Some were slow in Perl.

After a while I had a collection of various implementations of the so-called Hunt-Szymanski algorithm and improvements of it. This is the same algorithm Algorithm::Diff is based on.

While tuning a bit-vector implementation I compared it to my already tuned version of Algorithm::Diff, and merged part of the bit-vector variant's code back. The benchmark result surprised me: the tuned Algorithm::Diff was only 10 percent slower than the bit-vector variant.

This motivated me to publish it as a reliable and portable pure Perl module under the name LCS::Tiny.

Here are the steps that made it so fast.

Remove features

Algorithm::Diff::LCSidx has two additional parameters:

- a code ref to a key generating function
- a code ref to a compare function

Usually they are not specified and the defaults are used, but there is still overhead for checking and calling them.
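The cost of such optional callbacks can be sketched with a minimal, hypothetical example (not code from Algorithm::Diff): a flexible routine has to check for the code refs and dispatch through them even when only the defaults are used.

```perl
use strict;
use warnings;

# Hypothetical sketch of optional-callback overhead.
# The flexible version checks for and calls optional code refs:
sub eq_flexible {
    my ($x, $y, $keyGen, $cmp) = @_;
    $keyGen //= sub { $_[0] };               # default key generator
    $cmp    //= sub { $_[0] eq $_[1] };      # default comparator
    return $cmp->( $keyGen->($x), $keyGen->($y) );
}

# The stripped version hard-codes the defaults and avoids both
# the defined-or checks and the indirect calls:
sub eq_stripped { return $_[0] eq $_[1] }
```

Both give the same answer in the default case; the stripped variant simply skips the dispatch on every call.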

Inlining

When Algorithm::Diff::LCSidx is called, four subroutines are involved. Inlining them into a single subroutine reduces the lines of code and the call overhead. Of course the code may become less readable.
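A toy illustration of the idea (hypothetical helper names, not the actual Algorithm::Diff subroutines):

```perl
use strict;
use warnings;

# before: a tiny helper is called once per element
sub _key { return $_[0] }    # stands in for a key-generating sub

sub count_uniq {
    my %seen;
    $seen{ _key($_) }++ for @_;
    return scalar keys %seen;
}

# after inlining the helper into the loop: same result, one sub call
# less per element, at the price of some readability in larger routines
sub count_uniq_inlined {
    my %seen;
    $seen{$_}++ for @_;
    return scalar keys %seen;
}
```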

Squeezing

Try to use fewer instructions. Use Devel::NYTProf and benchmarks to verify the changes. This sort of tuning can be very time-consuming.

Here is an example from Algorithm::Diff:


sub _withPositionsOfInInterval
{
    my $aCollection = shift;    # array ref
    my $start       = shift;
    my $end         = shift;
    my $keyGen      = shift;
    my %d;
    my $index;
    for ( $index = $start ; $index <= $end ; $index++ )
    {
        my $element = $aCollection->[$index];
        my $key = &$keyGen( $element, @_ );
        if ( exists( $d{$key} ) )
        {
            unshift ( @{ $d{$key} }, $index );
        }
        else
        {
            $d{$key} = [$index];
        }
    }
    return wantarray ? %d : \%d;
}

This is how it looks after squeezing:

  my $bMatches;
  unshift @{ $bMatches->{$b->[$_]} },$_ for $bmin..$bmax;
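Wrapped with some example data, the squeezed form is self-contained and easy to verify ($b, $bmin and $bmax match the names in the snippet above and are filled in here only for illustration):

```perl
use strict;
use warnings;

my $b = [ split //, 'Choerephon' ];
my ($bmin, $bmax) = (0, $#$b);

my $bMatches;
unshift @{ $bMatches->{ $b->[$_] } }, $_ for $bmin .. $bmax;

# As in the original, the positions of each element end up in
# descending order, because ascending indices are unshifted onto
# the front of each list: $bMatches->{o} is [8, 2].
```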

Resulting module - LCS::Tiny

As bit-vector solutions can have portability problems, I decided to release the now very fast non-bit-vector implementation on CPAN as LCS::Tiny.

Additional speed - LCS::BV

After the release of LCS::Tiny I got back to work on the bit-vector implementation, to make it portable across different word lengths and to support sequences longer than the word length (32 or 64 bits).

I also tried out improvements for a part of the algorithm. One resulted in a significant acceleration and is now implemented in the released module LCS::BV.
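The core idea can be sketched in a few lines of pure Perl. This is a generic bit-parallel computation of the LCS length in the spirit of LCS::BV, but not the module's actual code, and limited here to sequences no longer than the native integer width:

```perl
use strict;
use warnings;

# Bit-vector LCS length: the classic bit-parallel recurrence.
# Sketch only; assumes the first sequence fits into one integer (<= 63 bits).
sub llcs_bv {
    my ($a, $b) = @_;            # array refs of symbols
    my $m = scalar @$a;

    # one bitmask per symbol, marking its positions in @$a
    my %posbits;
    $posbits{ $a->[$_] } |= 1 << $_ for 0 .. $m - 1;

    my $mask = (1 << $m) - 1;
    my $V    = $mask;            # all ones: nothing matched yet
    for my $sym (@$b) {
        my $p = $posbits{$sym} // 0;
        # one word-wide step updates all $m rows of the DP matrix at once
        $V = ( ($V + ($V & $p)) | ($V & ~$p) ) & $mask;
    }

    # the LCS length is the number of zero bits left in $V
    my $len = 0;
    for my $i (0 .. $m - 1) {
        $len++ unless ($V >> $i) & 1;
    }
    return $len;
}
```

For the example pair from the benchmark below, llcs_bv([split //, 'Chrerrplzon'], [split //, 'Choerephon']) returns 7, the length of an LCS such as 'Cherpon'. Because each loop iteration processes a whole machine word of DP cells, the work per character stays constant regardless of how many matches there are.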

Benchmarks

Short sequences

A nice and typical average case for spelling-correction is the comparison of 'Chrerrplzon' with 'Choerephon'.


use Algorithm::Diff;
use Algorithm::Diff::XS;
use String::Similarity;

use Benchmark qw(:all) ;

use LCS::Tiny;
use LCS;
use LCS::BV;

my @data = (
  [split(//,'Chrerrplzon')],
  [split(//,'Choerephon')]
);

my @strings = qw(Chrerrplzon Choerephon);

if (1) {
    cmpthese( 50_000, {
       'LCS' => sub {
            LCS->LCS(@data)
        },
       'LCSidx' => sub {
            Algorithm::Diff::LCSidx(@data)
        },
        'LCSXS' => sub {
            Algorithm::Diff::XS::LCSidx(@data)
        },
        'LCSbv' => sub {
            LCS::BV->LCS(@data)
        },
        'LCStiny' => sub {
            LCS::Tiny->LCS(@data)
        },
        'S::Sim' => sub {
            similarity(@strings)
        },
    });
}
This gives the following result:

$ perl 50_diff_bench.t 
            (warning: too few iterations for a reliable count)
             Rate     LCS  LCSidx LCStiny   LCSXS   LCSbv  S::Sim
LCS        7022/s      --    -71%    -82%    -87%    -87%    -99%
LCSidx    23923/s    241%      --    -40%    -55%    -57%    -98%
LCStiny   39683/s    465%     66%      --    -25%    -29%    -96%
LCSXS     53191/s    657%    122%     34%      --     -4%    -95%
LCSbv     55556/s    691%    132%     40%      4%      --    -94%
S::Sim  1000000/s  14140%   4080%   2420%   1780%   1700%      --
Here the pure Perl LCS::BV beats Algorithm::Diff::XS (written in C).

Worst case

This example exposes the weakness of the algorithms in more complicated cases.

my @data3 = ([qw/a b d/ x 50], [qw/b a d c/ x 50]);
$ perl 50_diff_bench.t 
            (warning: too few iterations for a reliable count)
          Rate     LCS LCStiny  LCSidx   LCSbv   LCSXS
LCS     35.2/s      --    -33%    -38%    -97%    -99%
LCStiny 52.7/s     50%      --     -7%    -95%    -98%
LCSidx  56.5/s     60%      7%      --    -94%    -98%
LCSbv   1020/s   2795%   1836%   1705%      --    -67%
LCSXS   3125/s   8766%   5828%   5428%    206%      --
Here the XS version is fastest, because the stressed part runs in pure C. The bit-vector implementation scales well because of its word-level parallelism. The others fall back to worst-case performance close to the traditional implementation.
