Introduction
In the CodeChef GEOMMEAN problem, we need to count the subarrays [l, r] of an array A with the following property:
A_l * ... * A_r = X^(r-l+1)
This article is about a first approach that is bound to fail, but is on the right track.
It is an application of prefix sum + caching. This method usually applies to a one-dimensional array of integers, but it extends easily to a multi-dimensional use case where vectors of integers are tracked instead of scalars.
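As a warm-up, here is a minimal sketch of the prefix sum + map pattern on plain integers, counting subarrays that sum to a given target (the class and method names are mine, not from the submission):

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixSumCount {
    // Counts subarrays a[l..r] with sum equal to target, in O(N):
    // sum(l..r) == target  <=>  prefix[r] - target == prefix[l-1],
    // so we look up earlier prefix sums in a map.
    public static long countSubarrays(int[] a, int target) {
        Map<Long, Integer> seen = new HashMap<>();
        seen.put(0L, 1); // the empty prefix
        long count = 0;
        long prefix = 0;
        for (int value : a) {
            prefix += value;
            count += seen.getOrDefault(prefix - target, 0);
            seen.merge(prefix, 1, Integer::sum);
        }
        return count;
    }
}
```

GEOMMEAN reduces to exactly this pattern once the product condition is rewritten as a sum condition.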
Input sample
3
3 3
3 3 3
4 4
1 2 3 4
4 54
36 81 54 54
Output sample
6
1
6
Input parsing
public static void main(String[] args) throws Exception {
    //InputStream inputStream = System.in;
    InputStream inputStream = new FileInputStream("GEOMMEAN");
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
    PrintWriter writer = new PrintWriter(new BufferedOutputStream(System.out));
    String[] tokens;
    tokens = bufferedReader.readLine().split(" ");
    int T = Integer.parseInt(tokens[0]);
    KostomukshaAndAESCMSU kostomukshaAndAESCMSU = new KostomukshaAndAESCMSU();
    while (T > 0) {
        tokens = bufferedReader.readLine().split(" ");
        int N = Integer.parseInt(tokens[0]);
        int X = Integer.parseInt(tokens[1]);
        tokens = bufferedReader.readLine().split(" ");
        int[] A = new int[N];
        for (int i = 0; i < N; i++) {
            A[i] = Integer.parseInt(tokens[i]);
        }
        writer.println(kostomukshaAndAESCMSU.subArrays(N, X, A));
        T--;
    }
    writer.close();
    inputStream.close();
}
Brute force solutions
As benchmarks, the following solutions run in O(N^2).
With BigInteger
Use BigInteger to work with arbitrary-size numbers.
public long subArrays2(int N, int X, int[] A) {
    long subArrays = 0;
    BigInteger x = BigInteger.valueOf(X);
    for (int L = 0; L < N; L++) {
        BigInteger product = BigInteger.ONE;
        BigInteger power = BigInteger.ONE;
        for (int R = L; R < N; R++) {
            product = product.multiply(BigInteger.valueOf(A[R]));
            power = power.multiply(x);
            if (product.equals(power)) {
                subArrays++;
            }
        }
    }
    return subArrays;
}
With modular arithmetic
Multiplying int values instead of BigInteger has less overhead, so it should run faster. Note this check is not exact: two different products can collide modulo 10^9 + 7, so false positives are possible in theory, though unlikely in practice.
private static final int MOD = 1_000_000_007;

ModularArithmetics modularArithmetics = new ModularArithmetics(MOD);

public long subArrays3(int N, int X, int[] A) {
    long subArrays = 0;
    for (int L = 0; L < N; L++) {
        int product = 1;
        int power = 1;
        for (int R = L; R < N; R++) {
            product = modularArithmetics.multiply(product, A[R]);
            power = modularArithmetics.multiply(power, X);
            if (product == power) {
                subArrays++;
            }
        }
    }
    return subArrays;
}
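The ModularArithmetics helper class is not shown in the article; here is a minimal sketch of the only operation subArrays3 needs from it (the signature is assumed from the call site):

```java
public class ModularArithmetics {
    private final int mod;

    public ModularArithmetics(int mod) {
        this.mod = mod;
    }

    // Multiplies two residues, widening to long to avoid int overflow.
    public int multiply(int a, int b) {
        return (int) ((long) a * b % mod);
    }
}
```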
Log transform, a failed approach
We apply a logarithmic transform to turn the product into a sum:
Sum(log(A_i), i = l ... r) = (r-l+1) log(X)
We call Math.log and cache the prefix values in a Map<Double, Integer>.
public long subArrays1(int N, int X, int[] A) {
    long subArrays = 0;
    Map<Double, Integer> y_1 = new HashMap<>();
    y_1.put(0D, 1);
    double s = 0;
    for (int i = 0; i < N; i++) {
        s += Math.log(A[i]);
        double y = s - (i+1) * Math.log(X);
        int subArrays_i = y_1.getOrDefault(y, 0);
        subArrays += subArrays_i;
        y_1.put(y, subArrays_i+1);
    }
    return subArrays;
}
In the above implementation, I normalized the values by projecting them onto the x = -1 axis. It produces invalid output on the sample input.
Looking at the map
y_1 = {HashMap@482} size = 5
{Double@493} 0.0 -> {Integer@494} 1
{Double@495} -1.3862943611198906 -> {Integer@494} 1
{Double@496} -2.367123614131617 -> {Integer@494} 1
{Double@497} -2.0794415416798357 -> {Integer@494} 1
{Double@498} -2.3671236141316165 -> {Integer@494} 1
we see a duplicated key caused by a precision error: -2.367123614131617 and -2.3671236141316165 should be equal. In subArrays1, the multiplication by i+1 in the line computing y introduced the discrepancy.
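The collision is easy to replay in isolation. The sketch below recomputes y for i = 2 and i = 3 on the second sample test (X = 4, A = {1, 2, 3, 4}); both correspond to the exact ratio 3/32, so they should be equal, yet the doubles differ:

```java
public class PrecisionDemo {
    // Replays the y computation of subArrays1 on X = 4, A = {1, 2, 3, 4}.
    // Mathematically y(2) == y(3) since 6/4^3 == 24/4^4 == 3/32,
    // but the multiplication by (i + 1) leaves a tiny discrepancy.
    public static double y(int i) {
        int[] a = {1, 2, 3, 4};
        double s = 0;
        for (int k = 0; k <= i; k++) {
            s += Math.log(a[k]);
        }
        return s - (i + 1) * Math.log(4);
    }
}
```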
Instead of computing the prefix sum alone, we accumulate the prefix sum of differences log(A_i) - log(X):
public long subArrays2(int N, int X, int[] A) {
    long subArrays = 0;
    Map<Double, Integer> y = new HashMap<>();
    y.put(0D, 1);
    double s = 0;
    for (int i = 0; i < N; i++) {
        s += Math.log(A[i]) - Math.log(X);
        int subArrays_i = y.getOrDefault(s, 0);
        subArrays += subArrays_i;
        y.put(s, subArrays_i+1);
    }
    return subArrays;
}
On the same input, this produces the map below, without the precision error:
y = {HashMap@482} size = 4
{Double@492} 0.0 -> {Integer@493} 1
{Double@494} -1.3862943611198906 -> {Integer@493} 1
{Double@495} -2.0794415416798357 -> {Integer@493} 1
{Double@496} -2.3671236141316165 -> {Integer@497} 2
But this variant eventually fails as well. Here is the counterexample search:
public void generateTests() {
    Random random = new Random();
    int N = 3;
    int MAX = 10;
    while (true) {
        int[] A = new int[N];
        for (int i = 0; i < N; i++) {
            A[i] = 1 + random.nextInt(MAX);
        }
        int X = 1 + random.nextInt(MAX);
        long subArrays2 = subArrays2(N, X, A);
        long subArrays3 = subArrays3(N, X, A);
        if (subArrays2 != subArrays3) {
            System.err.println(String.format("%d != %d %d %s", subArrays2, subArrays3, X, Arrays.toString(A)));
            break;
        }
    }
}
On the following input, subArrays2 returned 0 subarrays while there are 2 matches ([9, 4] and [4, 9], each with product 36 = 6^2):
1
3 6
9 4 9
y = {HashMap@477} size = 4
{Double@490} 0.0 -> {Integer@491} 1
{Double@492} 2.220446049250313E-16 -> {Integer@491} 1
{Double@493} 0.40546510810816483 -> {Integer@491} 1
{Double@494} 0.4054651081081646 -> {Integer@491} 1
The log approach is affected by two risks:
- false negatives: two values that should land in the same bucket do not. For example, they evaluate to 0D and -0D, which have different hashCode values.
- false positives: two values that should not land in the same bucket do. They are mathematically different, but finite precision rounds them to the same floating-point number.
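The 0D / -0D pitfall and the rounding drift can both be demonstrated in isolation (a standalone sketch; the residue below is the 2.220446049250313E-16 key from the map dump above):

```java
public class LogRiskDemo {
    // 0.0 and -0.0 are distinct HashMap keys: Double.equals and
    // Double.hashCode compare the raw bit patterns, which differ.
    public static boolean zeroKeysMatch() {
        return Double.valueOf(0.0).equals(Double.valueOf(-0.0));
    }

    // 9 * 4 == 6 * 6 exactly, yet the log-domain prefix sum of
    // differences is a tiny nonzero residue instead of 0.0.
    public static double residue() {
        return (Math.log(9) - Math.log(6)) + (Math.log(4) - Math.log(6));
    }
}
```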
Prime Factorization, an exact approach
We now work with exact values. Prime factorization performs a transform similar to the logarithm, without the loss of precision.
Reduced problem set
Assume X has only a single prime factor p,
X = p^e
In the prefix sum of differences, we then subtract the exponent e instead of log(X). Note the analogy with this identity, valid for any positive real X:
X = exp(log(X))
A subarray [l, r] is a solution of
Sum(e_i, i = l ... r) = (r-l+1) e
We would use Map<Integer, Integer> instead of Map<Double, Integer>.
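A sketch of this reduced variant, assuming X = p^e with p prime (the class and method names are mine; an element carrying any prime factor other than p resets the cache, since it can never belong to a valid subarray):

```java
import java.util.HashMap;
import java.util.Map;

public class SinglePrimeSubarrays {
    // Counts subarrays [l, r] with A_l * ... * A_r == (p^e)^(r-l+1),
    // tracking the prefix sum of (exponent of p in A_i) - e in a map.
    public static long count(int[] a, int p, int e) {
        Map<Integer, Integer> seen = new HashMap<>();
        seen.put(0, 1);
        long count = 0;
        int s = 0;
        for (int value : a) {
            int exponent = 0;
            while (value % p == 0) {
                value /= p;
                exponent++;
            }
            if (value != 1) {
                // value has a prime factor other than p, so it can never
                // appear in a valid subarray: reset the cache.
                seen.clear();
                seen.put(0, 1);
                s = 0;
                continue;
            }
            s += exponent - e;
            count += seen.getOrDefault(s, 0);
            seen.merge(s, 1, Integer::sum);
        }
        return count;
    }
}
```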
Generalization to vector
Consider X prime factorization
X = Product(p_j ^ e_j, j = 1 ... n)
One key fact: if A_i has a prime divisor that does not divide X, it cannot be a member of any candidate subarray. We discard it and reset the cache while traversing array A.
We can thus reduce the problem to integers of the form
A_i = Product(p_j ^ e_ij, j = 1 ... n)
Prime factorization outputs an exponent vector; its size is the number n of distinct prime factors of X.
e = (e_1, ..., e_n)
For each i = 1 ... N
e_i = (e_i1, ..., e_in)
A subarray [l, r] is a solution of a system of n equations:
For each j = 1 ... n
Sum(e_ij, i = l ... r) = (r-l+1) e_j
This is the same equation as above, but with vectors instead of scalars:
Sum(e_i, i = l ... r) = (r-l+1) e
Caching and counting work the same way, using vector keys instead of Integer keys.
Implementation details
See the prime factorization implementation. Computing the factorization of an integer a by trial division runs in O(a^(1/2)), which is still polynomial. We do not need to factor every array element from scratch, though: we only extract the exponents of the primes appearing in X's factorization, in O(n) where n is the number of distinct prime factors of X. Since n is small, this is effectively constant time per element instead of O(A_i^(1/2)).
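The prime factorization of X itself can be obtained by simple trial division. The article's DivisorPrimeFactorization and PrimeFactor types are not shown, so this minimal sketch returns raw {prime, exponent} pairs instead:

```java
import java.util.ArrayList;
import java.util.List;

public class TrialDivision {
    // Returns the {prime, exponent} pairs of x, in increasing prime
    // order, by trial division up to sqrt(x): O(x^(1/2)).
    public static List<int[]> factor(int x) {
        List<int[]> factors = new ArrayList<>();
        for (int p = 2; (long) p * p <= x; p++) {
            if (x % p != 0) continue;
            int exponent = 0;
            while (x % p == 0) {
                x /= p;
                exponent++;
            }
            factors.add(new int[]{p, exponent});
        }
        if (x > 1) {
            factors.add(new int[]{x, 1}); // one leftover prime > sqrt(x)
        }
        return factors;
    }
}
```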
We can use List<Integer> to represent the vector. I opted for a custom, fixed-size, immutable POJO backed by an int array.
class ExponentVector {
    private final int[] exponents;

    public ExponentVector(int n) {
        exponents = new int[n];
    }

    public ExponentVector(int[] exponents) {
        this(exponents.length);
        System.arraycopy(exponents, 0, this.exponents, 0, exponents.length);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ExponentVector)) return false;
        ExponentVector other = (ExponentVector) o;
        return Arrays.equals(exponents, other.exponents);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(exponents);
    }

    public int size() {
        return exponents.length;
    }
}
Refer to the CodeChef submission for the full implementation.
PrimeFactorization primeFactorization = new DivisorPrimeFactorization();

public long subArrays5(int N, int X, int[] A) {
    List<PrimeFactor> xFactors = primeFactorization.factors(X);
    ExponentVector x = exponents(xFactors);
    int n = xFactors.size();
    ExponentVector s;
    ExponentVector zero = new ExponentVector(n);
    Map<ExponentVector, Integer> y_1 = new HashMap<>();
    y_1.put(zero, 1);
    s = zero;
    long subArrays = 0;
    for (int i = 0; i < N; i++) {
        int[] exponents = new int[n];
        for (int j = 0; j < n; j++) {
            PrimeFactor primeFactor = xFactors.get(j);
            int prime = primeFactor.getPrime();
            int exponent = 0;
            while (A[i] % prime == 0) {
                A[i] /= prime;
                exponent++;
            }
            exponents[j] = exponent;
        }
        if (A[i] != 1) {
            y_1.clear();
            y_1.put(zero, 1);
            s = zero;
            continue;
        }
        ExponentVector a = new ExponentVector(exponents);
        s = subtract(add(s, a), x);
        int subArrays_i = y_1.getOrDefault(s, 0);
        subArrays += subArrays_i;
        y_1.put(s, subArrays_i+1);
    }
    return subArrays;
}
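The exponents, add, and subtract helpers belong to the full submission and are not shown here. Minimal sketches of the vector arithmetic they perform, written against raw int[] for brevity (writing them against the POJO would require an accessor on ExponentVector):

```java
public class VectorOps {
    // Component-wise sum of two exponent vectors.
    public static int[] add(int[] a, int[] b) {
        int[] result = new int[a.length];
        for (int j = 0; j < a.length; j++) {
            result[j] = a[j] + b[j];
        }
        return result;
    }

    // Component-wise difference of two exponent vectors.
    public static int[] subtract(int[] a, int[] b) {
        int[] result = new int[a.length];
        for (int j = 0; j < a.length; j++) {
            result[j] = a[j] - b[j];
        }
        return result;
    }
}
```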
Conclusion
Prefix sum + map is very handy for this problem. It does not work with approximate values stored as Double because of precision issues, but it works with exact values stored as Integer. When multiple fixed-size constraints are required, the approach extends to vectors.
The simple prime factorization algorithm runs in O(X^(1/2)).
Finally, you need to extract necessary conditions as criteria to
- simplify the problem
- reduce the scope of the input
- reduce the runtime complexity, from O(N^2) down to O(N).