Saturday, June 20, 2015

Minimal Coding versus Indirection

Yesterday I tried to use a C implementation of an LCS algorithm from within Perl via XS.

Reusing an existing library (libmba) sounded nice. It has good documentation and minimal foreign dependencies. libmba::diff promises to work on sequences of anything.

The first problem was getting it to compile, stepping from one missing include to the next. I also had to patch a source file because of some incompatible code for printing error messages.

By default libmba::diff works on byte strings. A comparison with String::Similarity showed only 40% of String::Similarity's speed. Both use the same algorithm ["An O(ND) Difference Algorithm and Its Variations", Eugene Myers, Algorithmica, Vol. 1, No. 2, 1986, pp. 251-266], both are written in C, and both are interfaced from Perl via XS.

But OK, fast enough for a first step; tuning can come later.

In the next step I wanted to support array references instead of strings as input. This requires calling libmba::diff() with references to two subroutines: one for comparing two elements and one for indexing an element. Here I lost (or wasted) some time getting it to compile without warnings (see the question on Stack Overflow).
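
For illustration, the Perl side of the two callbacks could look roughly like this, with the XS glue wrapping them for libmba (LCS::XS::diff is only a hypothetical name here, not a final API):

use strict;
use warnings;

# the two callbacks libmba::diff needs for non-string sequences:
# - idx: return the element of a sequence at a given position
# - cmp: compare two elements (0 meaning equal, memcmp-style, is an
#   assumption on my part)
my $idx = sub { my ($seq, $pos) = @_; return $seq->[$pos] };
my $cmp = sub { my ($x, $y) = @_; return $x eq $y ? 0 : 1 };

# hypothetical call through the XS wrapper:
# my @ses = LCS::XS::diff(\@a, \@b, $idx, $cmp);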

The speed now dropped to 24%: 0.237 MHz versus 1 MHz (i.e. 1 million cases processed per second) for String::Similarity.

What's the difference in detail?

Let's measure rough figures.

The lines of code of String::Similarity:

~/github/String-Similarity-1.04$ cloc --by-file-by-lang .
      17 text files.
      13 unique files.                              
      17 files ignored.

http://cloc.sourceforge.net v 1.62  T=0.20 s (45.6 files/s, 10777.8 lines/s)

[ details snipped ]

---------------------------------------------------------------------
Language           files          blank        comment           code
---------------------------------------------------------------------
make                   1            231            106            702
C                      2            127            193            570
YAML                   2              0              0             41
JSON                   1              0              0             39
Perl                   2             31             24             38
C/C++ Header           1              7             14              4
---------------------------------------------------------------------
SUM:                   9            396            337           1394
---------------------------------------------------------------------


And the same for LCS::XS:

~/github/LCS-XS$ cloc --by-file-by-lang .
      44 text files.
      42 unique files.                              
       8 files ignored.

http://cloc.sourceforge.net v 1.62  T=0.33 s (115.4 files/s, 37434.7 lines/s)

[ details snipped ]

----------------------------------------------------------------------
Language            files          blank        comment           code
----------------------------------------------------------------------
C/C++ Header           25           1109           3652           4317
C                       6            230            252           1376
make                    1            231            111            704
Perl                    3             70             63            146
JSON                    1              0              0             39
YAML                    2              1              1             28
----------------------------------------------------------------------
SUM:                   38           1641           4079           6610
----------------------------------------------------------------------

The C line count is 1376 versus 570 for the same algorithm. The reason is more abstraction: more subroutines, more internal interfaces, more flexibility. This also means more documentation for the interfaces, and more time needed to read that documentation and master the interfaces.

The size of the compiled objects is 7 KB versus 33 KB.

And there is a factor-of-4 difference in speed.



Wednesday, June 10, 2015

Porting some modules to Perl6

Having played around with Perl6 since 2008, I wanted to port two of my CPAN distributions, Set::Similarity and Bag::Similarity, to Perl6. They would be a nice example of using the Perl6 built-in types Set and Bag (also known as multisets).

Last year I visited the Austrian Perl Workshop and also stayed for the two additional days of the hackathon. My intention was to discuss details of the long-missing NFG (Normalization Form Grapheme) with the core developers, but they spent their time discussing the GLR (Grand List Refactor/Redesign).

So I worked on the port of Set::Similarity and Bag::Similarity.

These modules are reference implementations of the similarity coefficients known under the names Dice, Jaccard, cosine and overlap.

The input is always two sequences, which are tokenized if necessary and transformed to sets (arrays of unique tokens) or bags (hashes in Perl5).

In the case of strings as input, they are tokenized into single characters by default, or into n-grams if n is specified.

Splitting into single characters is easy in Perl6 if you know that the method is named comb.


$ perl6
> say "banana".comb;
b a n a n a

Transforming it to a set:

> say "banana".comb.Set;
set(b, a, n)

With sets we can use the intersection operator (the set analogue of a logical and):

> say "banana".comb.Set (&) "anna".comb.Set;
set(a, n)

But Perl6 understands a shorter form:

> say "banana".comb (&) "anna".comb;
set(a, n)

N-grams are not so obvious (but there is more than one way ...):

> my $w = 4;'wollmersdorfer'.comb.rotor($w,$w-1).map({[~] @$_})
woll ollm llme lmer mers ersd rsdo sdor dorf orfe rfer

The Dice coefficient is defined as 2 * count(intersection(A,B)) / (count(A) + count(B)). With the pieces from above we can code:

> my $a="banana".comb; my $b="anna".comb; say ($a (&) $b)/($a.Set + $b.Set)
0.4
Very short, isn't it? Note that Perl6 automatically uses the number of elements of a set in the context of numeric operations.

Next, the Jaccard coefficient is defined as count(intersection(A,B)) / count(union(A,B)). If the operator for intersection is the logical and, then the operator for the union should be the logical or. Let's try:

> my $a = 'banana';my $b = 'anna'; my $union = ($a.comb (|) $b.comb)
set(b, a, n)

Put the pieces together:

> my $a = 'banana';my $b = 'anna'; my $jaccard = ($a.comb (&) $b.comb) / ($a.comb (|) $b.comb);
0.666667

Understanding the code for cosine and overlap is left as an exercise to the reader:

my $cosine = ($a.comb (&) $b.comb) / ($a.comb.Set.elems.sqrt * $b.comb.Set.elems.sqrt);
my $overlap = ($a.comb (&) $b.comb) / ($a.comb.Set.elems, $b.comb.Set.elems).min;

Is this short code really worth its own modules? No, it isn't. So I contributed it to Perl6-One-Liners as the chapter "Text analysis".

Many thanks to the Perl6 hackers who answered my questions.

Tuning Algorithm::Diff

Motivation

There is always a need for speed. While I was tuning one of my applications in the field of OCR postprocessing, the profiler identified the LCS (Longest Common Subsequence) routine as the most time-consuming part. But the big question was: is it possible to make the already fast Algorithm::Diff and Algorithm::Diff::XS even faster?

Reading the code again and again and trying some micro-optimizations brought only small progress.

OK, don't twiddle, try better algorithms.

So I spent a lot of time reading papers and then trying to implement the described algorithms. Each implementation took some time to debug. Some did not work as they should, i.e. they failed test cases. Some were slow in Perl.

After a while I had a collection of various implementations of the so-called Hunt-Szymanski algorithm and improvements to it. This is the same algorithm Algorithm::Diff is based on.

While tuning a bit-vector implementation I compared it to my already tuned version of Algorithm::Diff, and merged part of the bit-vector variant's code back. The benchmark result surprised me: the tuned Algorithm::Diff was now only 10 percent slower than the bit-vector variant.

This motivated me to publish it as a reliable and portable pure Perl module under the name LCS::Tiny.

Here are the steps that made it so fast.

Remove features

Algorithm::Diff::LCSidx has two additional parameters:

- a code ref to a key generating function
- a code ref to a compare function

Usually they are not specified and the defaults are used, but there is still overhead for checking for and calling them.
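
The cost of that indirection is easy to see with a toy benchmark (my own sketch, not code from any of the modules), comparing a direct eq against the same comparison through a code ref:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# a user-supplied compare function, in the style LCSidx would accept
my $cmp = sub { $_[0] eq $_[1] };

my ($x, $y) = ('foo', 'bar');

cmpthese(-1, {
    direct  => sub { my $r = $x eq $y },        # inline comparison
    coderef => sub { my $r = $cmp->($x, $y) },  # through a code ref
});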

Inlining

When Algorithm::Diff::LCSidx is called, four subroutines are involved. Inlining them into a single subroutine reduces the lines of code and the runtime overhead. Of course, the code may become less readable.
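
Schematically (an illustration, not the actual module code), inlining turns a per-element helper call into plain code at the call site:

use strict;
use warnings;

my @list = qw(a b c);

# before: a key-generator helper called once per element
sub _keyGen { return $_[0] }
my @keys = map { _keyGen($_) } @list;

# after inlining: the default key generator just returns the element,
# so the subroutine call disappears completely
my @inlined_keys = @list;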

Squeezing

Try to use fewer instructions. Use Devel::NYTProf and benchmarks to check the changes. This sort of tuning can be very time-consuming.
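
Profiling a benchmark script with Devel::NYTProf looks like this; nytprofhtml then generates an HTML report from the collected profile:

$ perl -d:NYTProf 50_diff_bench.t
$ nytprofhtml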

Here is an example from Algorithm::Diff:


sub _withPositionsOfInInterval
{
    my $aCollection = shift;    # array ref
    my $start       = shift;
    my $end         = shift;
    my $keyGen      = shift;
    my %d;
    my $index;
    for ( $index = $start ; $index <= $end ; $index++ )
    {
        my $element = $aCollection->[$index];
        my $key = &$keyGen( $element, @_ );
        if ( exists( $d{$key} ) )
        {
            unshift ( @{ $d{$key} }, $index );
        }
        else
        {
            $d{$key} = [$index];
        }
    }
    return wantarray ? %d : \%d;
}

This is how it looks after squeezing:

  my $bMatches;
  # map each element of @$b to the list of its positions,
  # newest position first ($_ ascends, unshift prepends)
  unshift @{ $bMatches->{$b->[$_]} },$_ for $bmin..$bmax;

Resulting module - LCS::Tiny

As bit-vector solutions can have portability problems, I decided to release the now very fast non-bit-vector implementation on CPAN as LCS::Tiny.

Additional speed - LCS::BV

After the release of LCS::Tiny I got back to work on the bit-vector implementation, to make it portable across different word lengths and to support sequences longer than the word length (32 or 64 bits).

I also tried out improvements to parts of the algorithm. One resulted in a significant acceleration and is now implemented in the released module LCS::BV.
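
The core of such an implementation is the well-known bit-parallel LCS-length recurrence. Here is a minimal sketch of that recurrence (my own illustration, not the actual LCS::BV code, and limited to first sequences that fit into one integer word):

use strict;
use warnings;

sub lcs_length_bv {
    my ($a, $b) = @_;
    my $m = @$a;
    die "first sequence longer than one word\n" if $m >= 64;
    my $mask = (1 << $m) - 1;

    # bit mask of the positions where each token occurs in @$a
    my %positions;
    $positions{ $a->[$_] } |= 1 << $_ for 0 .. $m - 1;

    my $v = $mask;                               # all ones in the lowest $m bits
    for my $token (@$b) {
        my $u = $v & ($positions{$token} // 0);  # matches that can still extend
        $v = (($v + $u) | ($v - $u)) & $mask;    # the addition carry records a match
    }

    # the LCS length is the number of zero bits left in $v
    my $bits = sprintf '%b', $v;
    my $ones = $bits =~ tr/1//;
    return $m - $ones;
}

print lcs_length_bv([split //, 'Chrerrplzon'], [split //, 'Choerephon']), "\n";  # prints 7

This computes only the length of the LCS; recovering the matching index pairs, as LCS::BV does, needs additional bookkeeping.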

Benchmarks

Short sequences

A nice and typical average case for spelling correction is the comparison of 'Chrerrplzon' with 'Choerephon'.


use Algorithm::Diff;
use Algorithm::Diff::XS;
use String::Similarity;

use Benchmark qw(:all);

use LCS::Tiny;
use LCS;
use LCS::BV;

my @data = (
  [split(//,'Chrerrplzon')],
  [split(//,'Choerephon')]
);

my @strings = qw(Chrerrplzon Choerephon);

if (1) {
    cmpthese( 50_000, {
       'LCS' => sub {
            LCS->LCS(@data)
        },
       'LCSidx' => sub {
            Algorithm::Diff::LCSidx(@data)
        },
        'LCSXS' => sub {
            Algorithm::Diff::XS::LCSidx(@data)
        },
        'LCSbv' => sub {
            LCS::BV->LCS(@data)
        },
        'LCStiny' => sub {
            LCS::Tiny->LCS(@data)
        },
        'S::Sim' => sub {
            similarity(@strings)
        },
    });
}

Gives the following result:

$ perl 50_diff_bench.t 
            (warning: too few iterations for a reliable count)
             Rate     LCS  LCSidx LCStiny   LCSXS   LCSbv  S::Sim
LCS        7022/s      --    -71%    -82%    -87%    -87%    -99%
LCSidx    23923/s    241%      --    -40%    -55%    -57%    -98%
LCStiny   39683/s    465%     66%      --    -25%    -29%    -96%
LCSXS     53191/s    657%    122%     34%      --     -4%    -95%
LCSbv     55556/s    691%    132%     40%      4%      --    -94%
S::Sim  1000000/s  14140%   4080%   2420%   1780%   1700%      --

Here the pure Perl LCS::BV beats Algorithm::Diff::XS (written in C).

Worst case

This example shows the weakness of the algorithms against more complicated cases.

my @data3 = ([qw/a b d/ x 50], [qw/b a d c/ x 50]);

$ perl 50_diff_bench.t 
            (warning: too few iterations for a reliable count)
          Rate     LCS LCStiny  LCSidx   LCSbv   LCSXS
LCS     35.2/s      --    -33%    -38%    -97%    -99%
LCStiny 52.7/s     50%      --     -7%    -95%    -98%
LCSidx  56.5/s     60%      7%      --    -94%    -98%
LCSbv   1020/s   2795%   1836%   1705%      --    -67%
LCSXS   3125/s   8766%   5828%   5428%    206%      --

Here the XS version is fastest, because the stressed part runs in pure C. The bit-vector implementation scales well because of its parallelism. The others fall back to worst-case performance near the traditional implementation.

Thursday, June 4, 2015

Loopify Recursions


During development it is sometimes convenient to use recursion. But in Perl recursion doesn't scale very well; it consumes memory.

It happened to me that one of my CPAN modules, LCS, ran into problems after I extended the test cases to large sequences (more than 200 elements) as input. Avoiding the warning with

no warnings 'recursion';

is easy but not a good solution.
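
The warning in question is easy to reproduce (a minimal example of my own):

$ perl -we 'sub f { my $n = shift; return $n ? 1 + f($n - 1) : 0 } print f(200), "\n"'
Deep recursion on subroutine "main::f" at -e line 1.
200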

Here is the code of the traditional LCS algorithm using recursion. The first subroutine calculates the match matrix $c using nested loops. At the end it calls _print_lcs() to trace back the resulting LCS (Longest Common Subsequence). See the full source code on GitHub.

sub LCS {
  my ($self,$X,$Y) = @_;

  my $m = scalar @$X;
  my $n = scalar @$Y;

  my $c = [];   # (m+1) x (n+1) score matrix, initialized with zeros
  my ($i,$j);
  for ($i=0;$i<=$m;$i++) {
    for ($j=0;$j<=$n;$j++) {
      $c->[$i][$j]=0;
    }
  }
  # fill the matrix with the classic dynamic programming recurrence
  for ($i=1;$i<=$m;$i++) {
    for ($j=1;$j<=$n;$j++) {
      if ($X->[$i-1] eq $Y->[$j-1]) {
        $c->[$i][$j] = $c->[$i-1][$j-1]+1;
      }
      else {
        $c->[$i][$j] = max($c->[$i][$j-1], $c->[$i-1][$j]);
      }
    }
  }
  # trace back through the matrix to collect the matching index pairs
  my $path = $self->_print_lcs($X,$Y,$c,$m,$n,[]);
  return $path;
}

sub max {
  ($_[0] > $_[1]) ? $_[0] : $_[1];
}

sub _print_lcs {
  my ($self,$X,$Y,$c,$i,$j,$L) = @_;

  if ($i==0 || $j==0) { return ([]); }
  if ($X->[$i-1] eq $Y->[$j-1]) {
    $L = $self->_print_lcs($X,$Y,$c,$i-1,$j-1,$L);
    push @{$L},[$i-1,$j-1];
  }
  elsif ($c->[$i][$j] == $c->[$i-1][$j]) {
    $L = $self->_print_lcs($X,$Y,$c,$i-1,$j,$L);
  }
  else {
    $L = $self->_print_lcs($X,$Y,$c,$i,$j-1,$L);
  }
  return $L;
}

Every recursion can also be implemented as a loop. So it should be possible to rewrite the subroutine _print_lcs() so that it doesn't call itself.

First let's analyze the control flow. The recursion begins with the maximum values $i=$m and $j=$n, the lengths of the two sequences, which are also indices into the matching matrix $c. That is, the subroutine begins in the bottom right corner of the matrix and counts $i and $j down towards the upper left corner.

Along the way matching points are recorded in the array $L.

The recursive solution has the convenience that the subresult is calculated first, and only then is the current match pushed onto the end of the subresult. This way the resulting array is sorted from lower to higher indices while the recursion tracks back from higher to lower.

The loopified solution also counts down, but now has to use unshift for recording the match points. See how simple the code looks now (full version on GitHub):

sub _lcs {
  my ($self,$X,$Y,$c,$i,$j,$L) = @_;

  # walk from the bottom right corner of the matrix towards the top left
  while ($i > 0 && $j > 0) {
    if ($X->[$i-1] eq $Y->[$j-1]) {
      unshift @{$L},[$i-1,$j-1];
      $i--;
      $j--;
    }
    elsif ($c->[$i][$j] == $c->[$i-1][$j]) {
      $i--;
    }
    else {
      $j--;
    }
  }
  return $L;
}
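
The loopified _lcs() is a drop-in replacement for _print_lcs(); the call in LCS() stays the same apart from the name:

  my $path = $self->_lcs($X,$Y,$c,$m,$n,[]);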