
Use PHP 7.4 mb_str_split in OmniString::getBigrams
Closed, Wontfix (Public)

Description

PHP 7.4 introduced mb_str_split(), which does roughly the same thing as the for loop in our OmniString::getBigrams code.

That means:

public function getBigrams () {
    $bigrams = [];

    $len = $this->len();
    for ($i = 0 ; $i < $len - 1 ; $i++) {
        $bigrams[] = mb_substr($this->value, $i, 2, $this->encoding);
    }

    return $bigrams;
}

could be rewritten in PHP 7.4+ as:

public function getBigrams () {
    return mb_str_split($this->value, 2, $this->encoding);
}

Reference: https://wiki.php.net/rfc/mb_str_split

Event Timeline

dereckson created this task.
dereckson updated the task description.

For non-UTF-8 encodings, that seems like a good idea, but the intl extension doesn't provide grapheme_str_split, so we'll probably need to do something like:

Use grapheme, but switch to mbstring for non-UTF-8 encodings
private function canUseGrapheme() : bool {
    return $this->encoding === "UTF-8";
}

public function getBigrams () {
    if (!$this->canUseGrapheme()) {
        return mb_str_split($this->value, 2, $this->encoding);
    }

    $bigrams = [];

    $len = $this->countGraphemes();
    for ($i = 0 ; $i < $len - 1 ; $i++) {
        $bigrams[] = grapheme_substr($this->value, $i, 2);
    }

    return $bigrams;
}

Pitfall: switch only on the encoding, not on which extensions are available, to avoid behavior that differs depending on whether intl is installed.

dereckson claimed this task.

The two functions don't solve the same issue:

  • With mb_str_split: "night" -> [ni gh t]
  • With bigrams: "night" -> [ni ig gh ht]
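
The difference can be reproduced with a short standalone sketch. The helper name slidingBigrams is hypothetical: it only mirrors the for loop from the task description on a plain string and is not part of OmniString. It assumes the mbstring extension and PHP 7.4+.

```php
<?php

// Hypothetical helper mirroring the loop from the task description:
// overlapping windows of two characters.
function slidingBigrams (string $value, string $encoding = "UTF-8") : array {
    $bigrams = [];

    $len = mb_strlen($value, $encoding);
    for ($i = 0 ; $i < $len - 1 ; $i++) {
        $bigrams[] = mb_substr($value, $i, 2, $encoding);
    }

    return $bigrams;
}

// Consecutive, non-overlapping chunks of two characters.
$chunks = mb_str_split("night", 2, "UTF-8");

// Overlapping windows: each character (except the last) starts a bigram.
$bigrams = slidingBigrams("night");

echo implode(" ", $chunks), "\n";   // ni gh t
echo implode(" ", $bigrams), "\n";  // ni ig gh ht
```

mb_str_split advances by the chunk length, while the bigram loop advances by one character, so the former can't replace the latter.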

I'm adding BaseVector::bigrams() as an alias for BaseVector::ngrams(2), with the following associated code:

class BaseVector
public function ngrams (int $n) : Vector {
    if ($n < 1) {
        throw new InvalidArgumentException(
            "n-grams must have a n strictly positive"
        );
    }

    if ($n == 1) {
        return Vector::from($this->map(fn ($value) => [$value]));
    }

    $len = $this->count();
    if ($len <= $n) {
        // We only have one slice.
        return Vector::from([$this->items]);
    }

    return Vector::range(0, $len - $n)
        ->map(fn($i) => array_slice($this->items, $i, $n));
}

If you want to generalize n-grams to a string, calling $string->getGraphemes()->ngrams(4) will work out of the box, as will $string->getCodepoints()->bigrams().
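
For readers without the Vector classes at hand, here is a minimal sketch of the same sliding-window idea on plain arrays. The function name ngramsOf is hypothetical, and mb_str_split with a length of 1 stands in for getCodepoints(); that substitution is an assumption about what the method returns, not a description of the library's implementation.

```php
<?php

// Hypothetical plain-array counterpart of the ngrams() method above.
function ngramsOf (array $items, int $n) : array {
    if ($n < 1) {
        throw new InvalidArgumentException("n must be strictly positive");
    }

    $len = count($items);
    if ($len <= $n) {
        // Same convention as the method above: one slice with every item.
        return [$items];
    }

    $ngrams = [];
    for ($i = 0 ; $i <= $len - $n ; $i++) {
        // Window of $n items starting at offset $i.
        $ngrams[] = array_slice($items, $i, $n);
    }

    return $ngrams;
}

// Code points of the string, one per array entry (PHP 7.4+, mbstring).
$codepoints = mb_str_split("night", 1, "UTF-8");

// [["n", "i"], ["i", "g"], ["g", "h"], ["h", "t"]]
$bigrams = ngramsOf($codepoints, 2);

// Join each window back into a string: ["nigh", "ight"]
$fourgrams = array_map(fn ($gram) => implode("", $gram), ngramsOf($codepoints, 4));
```

Splitting the string into units first and then windowing the resulting array keeps the n-gram logic independent of whether the units are code points or graphemes.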