Last year I visited the Austrian Perl Workshop and also stayed the additional two days on the hackaton. My intention was to discuss details of the long missing NFG (Normalization Form Grapheme) with the core developers, but they spent their time discussing GLR (Grand List Refactor/Redesign).

So I worked on the port of Set::Similarity and Bag::Similarity.

These modules are reference implementations of similarity coefficients known under the names Dice, Jaccard, cosine and overlap.

Input are always two sequences, which are tokenized if necessary, transformed to sets (arrays of unique tokens in Perl6) or bags (hashes in Perl5).

In case of strings as input they are tokenized into single characters by default, or into n-grams if n is specified.

Splitting into single characters is easy in Perl6, if you know that the method is named comb.

```
$ perl6
> say "banana".comb;
b a n a n a
```

Transforming it to a set

```
> say "banana".comb.Set;
set(b, a, n)
```

With sets we can use the intersection operator (a logical and reduction)

```
> say "banana".comb.Set (&) "anna".comb.Set;
set(a, n)
```

But Perl6 understands it shorter

```
> say "banana".comb (&) "anna".comb;
set(a, n)
```

N-grams are not so obvious (but there is more than one way ...)

```
> my $w = 4;'wollmersdorfer'.comb.rotor($w,$w-1).map({[~] @$_})
woll ollm llme lmer mers ersd rsdo sdor dorf orfe rfer
```

The Dice coefficient is defined as count(intersection(A,B)) / (count(A) + count(B)). With the pieces from above we can code:

```
> my $a="banana".comb; my $b="anna".comb; say ($a (&) $b)/($a.Set + $b.Set)
0.4
```

Very short, isn't it? Note, that Perl6 automatically uses the number of elements of a set in the context of numeric operations.

Next the Jaccard coefficient is defined as count(intersection(A,B)) / count(union(A,B)). If the operator for intersection is the logical and, then the operator for the union should be the logical or. Let's try.

Next the Jaccard coefficient is defined as count(intersection(A,B)) / count(union(A,B)). If the operator for intersection is the logical and, then the operator for the union should be the logical or. Let's try.

```
> my $a = 'banana';my $b = 'anna'; my $union = ($a.comb (|) $b.comb)
set(b, a, n)
```

Put the pieces together:

```
> y $a = 'banana';my $b = 'anna'; my $jaccard = ($a.comb (&) $b.comb) / ($a.comb (|) $b.comb);
0.666667
```

Understanding the code for for cosine and overlap is left as an exercise to the reader:

```
my $cosine = ($a.comb (&) $b.comb) / ($a.comb.Set.elems.sqrt * $b.comb.Set.elems.sqrt);
my $overlap = ($a.comb (&) $b.comb) / ($a.comb.Set.elems, $b.comb.Set.elems).min;
```

Is this short code really worth own modules? No, it isn't. Thus I contributed it to Perl6-One-Liners as chapter Text analysis.

Many thanks to the Perl6 hackers answering my questions.

Many thanks to the Perl6 hackers answering my questions.

## Keine Kommentare:

## Kommentar veröffentlichen