Thursday, May 12, 2022

Trie Data Structure for string deduplication

Introduction


In Leetcode weekly contest problem, we need to compute the number of unique subarrays of an array nums that match following condition:

  • At most k elements in nums[i..j] are divisible by p, with 0 <= i <= j < n.

Because the same subarray sequence can occur in multiple places in the array, how do we avoid double-counting them?


Unit Test


import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

public class KDivisibleElementsSubarraysTest {
KDivisibleElementsSubarrays KDivisibleElementsSubarrays = new KDivisibleElementsSubarrays();

@Test
public void solution() {
int[] nums = new int[] { 2, 3, 3, 2, 2 };
int k = 2;
int p = 2;
Assertions.assertEquals(11, KDivisibleElementsSubarrays.countDistinct2(nums, k, p));
}
}


Brute-Force solution


We can cache the subarrays in a Set<List<Integer>>. For each subarray, we need to go through the whole subarray to copy it in a list and then compute it's hash. It results an O(N^3) complexity. 


import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class KDivisibleElementsSubarrays {

public int countDistinct(int[] nums, int k, int p) {
int n = 0;
Set<List<Integer>> s = new HashSet<>();
for (int i = 0; i < nums.length; i++) {
for (int j = i; j < nums.length; j++) {
List<Integer> l = subarray(nums, i, j);
if (s.contains(l)) {
continue;
}

if (divisible(l, k, p)) {
n++;
}

s.add(l);
}
}
return n;
}

private List<Integer> subarray(int[] nums, int i, int j) {
return IntStream.range(i, j+1)
.map(k -> nums[k])
.boxed()
.collect(Collectors.toList());
}

private boolean divisible(List<Integer> l, int k, int p) {
return l.stream()
.filter(i -> (i % p) == 0)
.count() <= k;
}
}



String matching algorithm


Rabin-Karp algorithm


The editorial suggests rolling hash from Rabin-Karp String matching algorithm.


For each subarray of a fixed size, we are able to recompute hash code in just O(1) time. We perform following steps to update the hash code:

  • remove the leftmost character from the previous subarray
  • shift the middle characters that are in common
  • add the rightmost character from the next subarray

2 scenarios can happen after looking up the hash code:


  1. Favorable case: Hash code does not match with any other subarrays visited so far. It is necessary that hash codes match if subarrays are the same. So we can move on to the next subarray in O(1) time after assessing a True Negative.

  2. Adversarial case: Hash code matches with an already cached subarray's hash code. But subarrays may have different contents while their hash match. It can be a False Positive. This is because hash function is not injective. We need to compare the 2 subarrays character by character in O(N) time to make sure content is the same.


For more info, refer to CLRS chapter 32, String Matching and section 32.2The Rabin-Karp algorithm.


String matching algorithms


Given a String of size n and a Pattern of size m, with characters from an Alphabet of size |Σ|,  we have following runtime complexity table:


Algorithm          | Preprocessing | Matching

Brute-force        | 0             | O((n-m+1)*m)
Rabin-Karp         | O(m)          | O((n-m+1)*m)
Finite automaton   | O(m*|Σ|)      | O(n)
Knuth-Morris-Pratt | O(m)          | O(n)



Trie Data Structure


To reduce complexity from O(N^3) down to O(N^2), we leverage a trie data structure. It's a tree where each node has only 1 character as value. It potentially has |Σ| children, for each character of the alphabet.


Like in a suffix tree, to build the Trie, we sequentially insert each suffix [i..n-1], 0 <= i < n in the trie root node.

In trie diagram below, some nodes were marked with the beginning index for each suffix from the original string, "23322".



Each subarray [i..j] is a prefix of suffix [i..n-1]. Walking the trie will decide whether the subarray starting at the trie root and ending in the current node should be added or marked as duplicate.


package leetcode.weekly291;

import java.util.HashMap;
import java.util.Map;

public class KDivisibleElementsSubarrays {

static class Trie {
Map<Integer, Trie> children = new HashMap<>();
}

private final Trie root = new Trie();
public int countDistinct(int[] nums, int k, int p) {
int n = 0;
for (int i = 0; i < nums.length; i++) {
n += suffixSubarrays(nums, i, k, p);
}
return n;
}

private int suffixSubarrays(int[] nums, int start, int k, int p) {
int n = 0;
Trie current = root;
int divisible = 0;
for (int i = start; i < nums.length; i++) {
if (nums[i] % p == 0) {
divisible++;
}
if (divisible > k) {
break;
}

Trie child;
if (current.children.containsKey(nums[i])) {
// subarray at [start ... i] is deduplicated
child = current.children.get(nums[i]);
} else {
if (divisible <= k) {
n++;
}
child = new Trie();
current.children.put(nums[i], child);
}
current = child;
}
return n;
}
}


Conclusion


The problem of finding distinct subarrays is an example of string matching. We can use a rolling hash to find potential matches. This works well for inputs with a small number of matches. Due to collision checking, worst-case complexity becomes the same as naive approach.

As an alternative, we can leverage a Trie data structure to deduplicate strings. To build it, we insert each suffix of the original string. We match the longest prefix in common between current suffix and the already inserted suffixes. This allows to process each subarrays starting at a given index at once by just processing the suffix starting at the same index.