Montag, 2. Mai 2022

Tuning Levenshtein Distance

On 2022-01-19 I released an improved version of Text::Levenshtein::BV. BV means Bit Vector, because it uses a bit-parallel algorithm with runtime complexity of O(ceil(m/w)*n), where w is the width of a CPU register. It's 20 times faster than the popular Text::Levenshtein using the traditional or simple algorithm with O(n*n) on strings with length 10.

Last summer I developed a variant with the simple algorithm named Levenshtein::Simple, because working on arrays is more flexible than strings. A string is easy to split into an array of characters. The same module could be used for arrays of characters, graphemes, words, lines.

Another reason is, that the Levenshtein alignment (shortest edit script) is more important than just the distance. Most modules on CPAN calculate only the distance for similarity, which is easier and faster with LCS.

It was a surprise that my Levenshtein::Simple without tuning was 50% faster than Text::Levenshtein. During development of an XS implementation I tuned Levenshtein::Simple to beat Text::Fuzzy::PP.

These are the results for the pure Perl (PP) implementations measured in Rate per second:

                         N=10
Text::WagnerFischer     4,931
Text::Levenshtein       6,339
Text::Fuzzy::PP        11,164
Levenshtein::Simple    27,926
Text::Levenshtein::BV 118,153

There are many XS versions on CPAN for calculating Levenshtein distance. Only three of them work without problems. They all use the simple algorithm. Text::Levenshtein::Flexible uses the C implementation of PostgreSQL, but has tricky code. Perl strings at the XS interface can be bytes or UTF-8. In C UTF-8 can be processed by iterating over UTF-8 or convert UTF-8 to 32-bit integers (numerical code points). Working with integers including the conversion is faster.

Measured via Perl the benchmarks are (TL::BVXS is mine, simple is my implementation just for fun, uni is with Unicode strings, bytes with bytes):

                            N=10
Text::Levenshtein::XS    347,539
TL::Flexible           2,725,258
Text::Fuzzy            3,026,401
TL::BVXS_simple        6,056,132
TL::BVXS_uni           8,495,407
TL::BVXS_bytes        12,743,110

The same simple algorithm, also implemented in C is 2 times faster than that of PostgreSQL. BV is 4 times faster (on longer strings the difference is larger). Compared to Text::Levenshtein it's 2000 times faster.

In pure C without the XS interface distance on bytes works at a rate of 25 Mio./second, codepoints ~10 % slower.

Optimising for fun.

Montag, 25. April 2022

Benchmark UTF-8 decoding in Perl

In Perl there are two modules on CPAN for UTF-8 decoding:

- Encode

- Unicode::UTF-8

Unicode::UTF-8 claims to be faster. I was interested how fast it is compared to my own implementation in pure C to decode utf8 (1-6 byte, needing 31 bits) into 32-bit code points.

UTF-8 became a bottleneck and is worth to have a look at. New SIMD implementations can validate up to 4 GB UTF-8 per second with AVX512.

That's the result of a quick benchmark:

$octets: Chſerſplzon chars: 11 bytes 13
                    Rate        Encode Unicode::UTF8      TL::BVXS
Encode         1927529/s            --          -83%          -89%
Unicode::UTF8 11143773/s          478%            --          -36%
TL::BVXS      17311395/s          798%           55%            --

$octets: राज्यराज्य chars: 9 bytes 27
                    Rate        Encode Unicode::UTF8      TL::BVXS
Encode         1592888/s            --          -83%          -90%
Unicode::UTF8  9466311/s          494%            --          -42%
TL::BVXS      16287053/s          922%           72%            --

Mine is fastest with ~215 MB/s but still is far away from SIMD solutions. In my use case the decoding consumes 35 % of the execution time. But SIMD would not help much for short strings.

Trial: port some code to "use standard"

Just wanted to know how it feels to adapt code to use standard;

I selected a small, unpublished module: Levenshtein::Simple. It's in the range of 530 lines of code without tests. A typical reinvent a better wheel, because Text::Levenshtein does not support arrays as input and misses an alignment method, which I needed in a simple and robust way.

Here the changes as diff:

49 changes, most of them explicit syntax for dereferencing.

2 barewords needed quoting: STDOUT, STDERR. But why are they there? They are not needed at all.

The nested ternary needed explicit braces. Now it's better readable.

1 LABEL not allowed as bareword. That's a bug of Guacamole. The label was a relict.

2 Invalid characters in the pod. Copy and paste of a reference. use standard; slurps the source as UTF-8 and it breaks. That's undocumented and the second bug.

In summary simple changes and IMHO better readable now. Nice development tool but not ready for production.

Dienstag, 19. April 2022

Unicode, UTF-8, utf8 and Perl-XS revisited

Are you sure knowing everything about Unicode in Perl? You never can.

Imagine doing everything right in your code, use a UTF-8 boilerplate, use Test::More::UTF8 and then get a 'wide character' warning and the test fails.

What happend?

Boiling it down to diagnostic code, it has something to do with the utf8-flag:

#!perl

use strict;
use warnings;
use utf8;

binmode(STDOUT,":encoding(UTF-8)");
binmode(STDERR,":encoding(UTF-8)");

my $ascii  = 'abc';
my $latin1 = 'äöü';
my $uni    = 'Chſ';

print '$ascii:  ',$ascii,' utf8::is_utf8($ascii): ',utf8::is_utf8($ascii),"\n";
print '$latin1: ',$latin1,' utf8::is_utf8($latin1): ',utf8::is_utf8($latin1),"\n";
print '$uni:    ',$uni,' utf8::is_utf8($uni): ',utf8::is_utf8($uni),"\n";

my $file = 'utf8-flag-ascii.txt';

my $ascii_file;
open(my $in,"<:encoding cannot="" die="" file:="" file="" line="<$in" my="" open="" or="" while="">) {
    chomp($line);
    $ascii_file = $line;
}
close($in);

print '$ascii_file: ',$ascii_file,
    ' utf8::is_utf8($ascii_file): ',utf8::is_utf8($ascii_file),"\n";

This prints:

# with 'use utf8;'

$ascii:  abc utf8::is_utf8($ascii):
$latin1: äöü utf8::is_utf8($latin1): 1
$uni:    Chſ utf8::is_utf8($uni): 1
$ascii_file: abc utf8::is_utf8($ascii_file): 1

Without use utf8:

# without 'use utf8;'

$ascii:  abc utf8::is_utf8($ascii):
$latin1: Ã¤Ã¶Ã¼ utf8::is_utf8($latin1):
$uni:    ChÅ¿ utf8::is_utf8($uni):
$ascii_file: abc utf8::is_utf8($ascii_file): 1

That's known and expected. Perl doesn't set the utf-flag under use utf8 for ASCII string literals.

Let's have a look into the XS interface if we want to process strings in C:

int
dist_any (SV *s1, SV *s2)
{
    int dist;

    STRLEN m;
    STRLEN n;
    /* NOTE:
        SvPVbyte would downgrade (undocumented and destructive)
        SvPVutf8 would upgrade (also destructive)
    */
    unsigned char *a = (unsigned char*)SvPV (s1, m);
    unsigned char *b = (unsigned char*)SvPV (s2, n);

    if (SvUTF8 (s1) || SvUTF8 (s2) ) {
        dist = dist_utf8_ucs (a, m, b, n);
    }
    else {
        dist = dist_bytes (a, m, b, n);
    }
    return dist;
}

With two strings involved we have a problem, if one has the utf-flag set and the other not. With a constraint 'SvUTF8 (s1) || SvUTF8 (s2)' both would be treated as bytes, even in the case that one is utf8 and the other a string literal in the ASCII range (which is also valid UTF-8). In combination with SvPVbyte the utf8 literal would be downgraded. That caused 'wide character' warning, because SvPVbyte changes the original string. The new code is not correct, because it could treat some other encoding as UTF-8. But I decided to be harsh and let the user run into his own problems (not using best practises).

The inconsequent treatment appears if we decode as UTF-8 from a file. Then the utf8-flag is also set for strings in the ASCII range. A brute force comparison of an English and a German dictionary, each 50,000 words resulted in 292,898,337 calls and the script needed 102 seconds runtime. That's a rate of ~3 M/sec. and slow, because it always needs to decode UTF-8, even if 100% of the English words are in the ASCII range. The nice lesson: With a small change in the code the decoding got faster with an overall acceleration of 13% via Perl and 19% in pure C. But the decoding still consumes 35% of the runtime.

Dienstag, 29. März 2022

First impression of "use standard"

The idea of the Perl feature standard is to check if Perl code conforms to a reduced and easier to parse syntax.

If one likes writing explicit syntax it's very near to standard.

It's part of Guacamole and uses Marpa::R2 as a parser. Looking into the source there is a large definition of the Perl syntax. Guacamole fails in many tests on CPAN-Testers, mainly on older Perl versions. This seems because Guacamole uses postfix dereferencing.

Installation via cpanm worked without problems. Same for Marpa::R2.

For a quick try I used one of the test scripts open my editor, 356 lines of code.

First modification is to activate the feature:

use standard;

Next we start the script:


$ perl t/10_basic_distance.t 
File 't/10_basic_distance.t' does not pass Standard Perl.
Parser says:
> Error in SLIF parse: No lexeme found at line 10, column 16
> * String before error: gs;\nuse utf8;\n\nuse standard;\n\nbinmode( STDOUT
> * The error was at line 10, column 16, and at character 0x002c ',', ...
> * here: , ":encoding(UTF-8)");\nbinmode( 'STDERR', ":encod
> Marpa::R2 exception 
    at ~/perl5/perlbrew/[...]/5.32.0/Guacamole.pm line 2103.
> 
> Failed to parse past: STDOUT (char 17, length 1), 
    expected LParen,OpArrow,PackageSep 
    at ~/perl5/perlbrew/[...]/5.32.0/Guacamole.pm line 2119.

Hmm, looks a little hard to read. But we can find the by line-number and position.

Seems it doesn't like STDOUT as a bareword parameter. That's against the documentation allowing STDOUT as a bareword. No problem, just quote it:

-binmode(STDOUT,":encoding(UTF-8)");
-binmode(STDERR,":encoding(UTF-8)");
+use standard;
+
+binmode( 'STDOUT', ":encoding(UTF-8)");
+binmode( 'STDERR', ":encoding(UTF-8)");

Next run, next problem. It finds 3 "forgotten" dereferences without curly brackets. That's fine and exactly what's expected:

-  for my $example (@$examples1) {
+  for my $example (@{$examples1}) {

At least it wants explicit syntax on subroutine calls:

-done_testing;
+done_testing();

In total 6 of 356 lines had to change.

In summary it does a good job and is fast, especially compared to Perl::Critic. But the start time increases to 0.727s compared to 0.069s without it on a system with SSD. Better remove it before release to production.

Maybe it's a better way to wrap the module in a small test method, which reads all Perl files of a source tree and checks the quality.

Sonntag, 14. Februar 2016

Credit the first or last uploader on meta::cpan?

Credit problem

This post is related to the post the "credit the last uploader" problem by RJBS.

A related thread on twitter caused IMHO more misunderstandings than solutions. Twitter is not the best medium to discuss complex problems.

RJBS still did a deep analysis in the above referenced post.

In summary I agree mostly with the opinions. But the suggested solution

MetaCPAN Web should stop showing the name and image of the latest uploader as prominently, and should show the first-come user instead.

isn't IMHO an improvement. It solves a problem and creates a new one.

Persons in the life cycle

At the minimum a distribution on CPAN has an author and an uploader. In many cases the author is the same person as the uploader. Maybe the distribution has users. Also there can be persons submitting issues or patches.

Distributions can also have co-maintainers, persons with the permission to upload new versions of the distribution.

Persons without permission can upload non-authorized versions. They are available but marked as non-authorized.

During the life of a distribution the maintainers (people with upload permission) can change.

In my opinion everybody contributing to a distribution should be attributed, including submitters of issues. Even bad contributions which are removed later improve the quality. In free software with unpaid developers honor and attribution is important for motivation.

Prominent person

The release page of a distribution on meta::cpan shows in the top right corner name and image of one person, the last uploader.

Now RJBS means that the first uploader of a distribution is more important.

Maybe, if the first uploader is still very active. But what does "very active" mean and how cam meta::cpan measure it? It can't.

Imagine distributions first uploaded 15 years ago. Meanwhile maintainers changed, features where added, code completely rewritten. Which person should be displayed?

Only one

If only one can be displayed, who should it be?

In my role as user I prefer the last uploader. For me it is important to get a feeling of quality, which is best associated with a picture and a name of a person. This sounds like pre-justice, and it is. Experience and trust is a main factor of human decisions. The picture and name of a person work like a brand of a computer manufacturer.

But if person A writes most of the code and person B is the last uploader, shouldn't A be displayed? No, as a user I prefer B, because the last uploader did the last quality check.

Solution

Let's look at very successful solutions like github. Github allows repositories owned by organisations (all members have permissions) or single persons. Organisations are not possible on PAUSE without change.

Another idea inspired from github could be to display miniature pictures of all persons. Github displays all contributors. In case of CPAN this could be all current persons with upload permissions. I prefer this solution.

If there is no better solution, nothing should be changed. The change suggested by RJBS can harm the motivation of maintainers, some could feel it like a slap into the face.

Anyway there should be broad consensus and not a decision by one or a few persons.

Edit: Some hours after finishing this post I detected a similar post by Neil Bowers: It takes a community to raise a CPAN module.

Dienstag, 19. Januar 2016

Porting LCS::BV to Perl 6 - a fast core loop for Algorithm::Diff

Christmas happened and Perl6 is now reasonable stable. It's time to use it for real problems, learn more about pros and cons, detect the elegance.

Porting my CPAN distribution LCS::BV is a nice task. It's short code, has full test coverage and allows benchmarks. Also it could be expected that LCS::BV (a bit-vector implementation of the Longest Common Subsequence algorithm) will be faster than the Perl6 port of Algorithm::Diff and the Perl6 module Algorithm::LCS. LCS::BV is as fast as the XS version of A::D, see my blogpost Tuning Algorithm::Diff.

Test dependencies

LCS algorithms are complex beasts. It's nearly impossible to implement them without good test cases. One problem is, that these algorithms find a sequence of maximum length, but only one of maybe more possible solutions. All possible solutions are provided by the slow CPAN distribution LCS and the method allLCS() as explained in a another blog post. So I need to port this to Perl 6 first.

To compare the results during tests Test::Deep is needed. Should be no problem, sure somebody already ported it. Or not? No, there is no Test::Deep on modules.perl6.org. So I need to port this too, or the part needed. Or not?

This is the relevant part of the tests in Perl 5:

use Test::More;
use Test::Deep;

use LCS;
use LCS::BV;

cmp_deeply(
    LCS::BV->LCS(\@a,\@b),
    any(@{LCS->allLCS(\@a,\@b)} ),
    "$a, $b"
);

done_testing;

Trying to write a subroutine from bottom I got the following solution using any():


# say so $[[2,3],[2,4],] eqv $[[[0,1],[2,3],],[[0,1],[2,4],]].any;
  
my sub any_of ($result, $expected) {
  return so $result eqv $expected.any;
}
  
ok(
  any_of(
    LCS::BV::LCS($A, $B),
    LCS::All::allLCS($A, $B),
  ),
  "'$a', '$b'",
);

That's an example how many features of Perl 6 are worth the half of CPAN.

Length of an integer

In Perl 5 bit operations are limited to length of an integer. In most implementations this is 64 bits, but can be 32 bits. If an array is mapped to bits it needs ceil(array_length/integer_width) integers.

This needs the following code in Perl 5:


our $width = int 0.999+log(~0)/log(2);

use integer;
no warnings 'portable'; # for 0xffffffffffffffff

my $positions;

# $a ... array of elements, e.g. ['f','o','o']

$positions->{$a->[$_]}->[$_ / $width] |= 1 << ($_ % $width) 
    for $amin..$amax;

# gives {'f' => [ 0b001 ], 'o' => [ 0b110 ], }

Doing this the first time in Perl 6, I wondered how long integers are and how to find out. Took me very long with the result: infinite. That's awesome.

This part ported to Perl 6:


my $positions;
$positions{$a[$_]} +|= 1 +< $_  for $amin..$amax;

This is approximately half the code. You can imagine that the rest of the code also gets simpler. It also is the half.

Just 30 lines

For rosettacode.org I adapted the code for strings and removed the usual prefix and suffix optimization, see Bit Vector:


sub lcs(Str $xstr, Str $ystr) {
    my ($a,$b) = ([$xstr.comb],[$ystr.comb]);

    my $positions;
    for $a.kv -> $i,$x { $positions{$x} +|= 1 +< $i };

    my $S = +^0;
    my $Vs = [];
    my ($y,$u);
    for (0..+$b-1) -> $j {
        $y = $positions{$b[$j]} // 0;
        $u = $S +& $y;
        $S = ($S + $u) +| ($S - $u);
        $Vs[$j] = $S;
    }

    my ($i,$j) = (+$a-1, +$b-1);
    my $result = "";
    while ($i >= 0 && $j >= 0) {
        if ($Vs[$j] +& (1 +< $i)) { $i-- }
        else {
            unless ($j && +^$Vs[$j-1] +& (1 +< $i)) {
                $result = $a[$i] ~ $result;
                $i--;
            }
            $j--;
        }
    }
    return $result;
}

You can install LCS::BV and LCS::All via Panda or clone them from github.

Summary

Perl 6 has a lot of convenience, allows short and clear code. It's huge. But learning all the nice features is easier than I thought. Understanding e.g. any() and trying it out in the REPL is a magnitude faster, than searching in CPAN, reading docs and trying out some distributions.