
Revisit Sørensen–Dice coefficient
Open, Needs Triage, Public

Description

The English Wikipedia article Sørensen–Dice coefficient gives an example of applying the coefficient to strings using bigrams.
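
For reference, here is a minimal Python sketch of the bigram approach. It is not the Keruald code from D2052. The function names are hypothetical, and it assumes bigrams are compared as sets (duplicates ignored) and that two strings without any bigram score 0.0.

```python
def bigrams(text: str) -> set[str]:
    """Return the set of adjacent character pairs found in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_bigrams(a: str, b: str) -> float:
    """Sørensen–Dice coefficient computed on the bigram sets of a and b."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        # Assumption: strings too short to yield a bigram are scored 0.0.
        return 0.0
    return 2 * len(grams_a & grams_b) / total


# "night" and "nacht" each yield four bigrams and share only "ht".
print(dice_bigrams("night", "nacht"))  # 0.25
```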

This is what's currently implemented in D2052.

https://pganalyze.com/blog/similarity-in-postgres-and-ruby-on-rails-using-trigrams uses trigrams instead of bigrams and gives more weight to the start of each word by padding the strings with spaces. This approach is implemented in PostgreSQL and Rails.
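
A rough Python sketch of the padded trigram idea, to make the comparison concrete. It assumes the pg_trgm convention of prefixing each word with two spaces and suffixing it with one. It simplifies word splitting to whitespace only, and it applies the Dice formula to the result, which is our choice here and not necessarily the similarity measure those implementations use. The function names are hypothetical.

```python
def padded_trigrams(text: str) -> set[str]:
    """Return the trigrams of each word, padded as '  word '."""
    trigrams = set()
    # Simplification: words are split on whitespace only.
    for word in text.lower().split():
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            trigrams.add(padded[i:i + 3])
    return trigrams


def dice_padded_trigrams(a: str, b: str) -> float:
    """Sørensen–Dice coefficient computed on padded trigram sets."""
    grams_a, grams_b = padded_trigrams(a), padded_trigrams(b)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        # Assumption: two inputs without any word are scored 0.0.
        return 0.0
    return 2 * len(grams_a & grams_b) / total


# The padding creates "  c" and " ca", which only match at a word start.
print(sorted(padded_trigrams("cat")))
print(dice_padded_trigrams("night", "nacht"))
```

With padding, "night" and "nacht" share the word-start trigram "  n" and the word-end trigram "ht ", so they score higher than with plain bigrams, where only "ht" is shared.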

We should determine whether switching to such trigrams would improve our Sørensen–Dice code.

Event Timeline

dereckson moved this task from Backlog to Dev on the good-first-issue board.
dereckson moved this task from Backlog to Feature requests on the Keruald board.

bf7c0c3a38a2 introduced BaseVector::ngrams($count), so we could allow the class to use any n when strings are divided into n-grams.
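
As an illustration of what a configurable n could look like, here is a small Python sketch. It does not reflect the actual BaseVector::ngrams signature or behavior. `ngrams` and `dice_coefficient` are hypothetical helpers, and treating n-grams as sets is an assumption.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return the consecutive n-character slices of text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def dice_coefficient(a: str, b: str, n: int = 2) -> float:
    """Sørensen–Dice coefficient on the n-gram sets of a and b."""
    grams_a, grams_b = set(ngrams(a, n)), set(ngrams(b, n))
    total = len(grams_a) + len(grams_b)
    if total == 0:
        # Assumption: strings shorter than n on both sides are scored 0.0.
        return 0.0
    return 2 * len(grams_a & grams_b) / total


# The same pair of strings scored with n = 1, 2 and 3.
for n in (1, 2, 3):
    print(n, dice_coefficient("night", "nacht", n))
```

For this pair the score falls as n grows: the strings share three single characters and one bigram, but no unpadded trigram. This is one reason to evaluate trigrams together with padding and not on their own.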