Friday, April 29, 2022

String parsing with Finite State-machine

Introduction


Problem C in Codeforces Educational Codeforces Round 125 describes a string processing process with following rules

  • characters only include '(' and ')'
  • find shortest prefix that is either
    • a palindrome
    • a regular sequence: it should be well-formed, like an arithmetic expression
  • repeat parsing process

For a regular sequence, we could maintain a counter. We increment it on opening brackets and decrement on closing brackets. If it reaches 0 after being positive only it is a valid prefix matching a regular sequence. If it becomes negative before reaching 0, sequence is not regular.

For a palindrome, naive palindrome checking complexity is too high, O(N) for a general alphabet on a string of size N, as we need to compare N/2 characters.

We need to leverage the binary & shorted prefix properties of the problems.

Running a case disjunction lets us discover properties of the output.

Finite State-machine


We look at all the possibilities for the first few characters. We draw a decision tree to develop an intuition on parsing logics.


We can then assign a state to every possible pattern of minimal prefix being either a palindrome or regular. Here's the finite state machine version:





See editorial for final solution.

Conclusion



For string parsing problem, it helps to visualize the simple cases via a decision tree. We then expand it to get a more exhaustive picture to include most cases. We can leverage compiler theory / Finite State Machine to extract the possible states. For validation purpose we run a sample input against the FSM to see how the match progresses as input is parsed.

Annex


Dot file for decision tree


digraph G { 

  rankdir="LR"

  Start -> O1; 

  Start -> C1; 

  Start -> E1; 

  O1 -> O2; 

  O1 -> C2; 

  C1 -> OStar;

  OStar -> C3; 

  OStar -> E2; 

  Start [label="^"];

  O1, O2 [label="("];

  C1, C2, C3 [label=")"];

  E1, E2 [label="$"]

  OStar [label="(*"];

}~


Dot file for finite state machine


digraph FSM {

  rankdir=LR;

  Invisible -> Start;


  Start -> S1[label="("];

  S1 -> R[label=")"]

  S1 -> P2[label="("]


  R -> Start[style="dashed"]

  P2 -> Start[style="dashed"]

  Pn -> Start[style="dashed"]


  Start -> P1[label=")"];

  P1 -> P1[label="("]

  P1 -> Pn[label=")"]


  Start -> End[label="EOF"]

  S1 -> End[label="EOF"]

  P1 -> End[label="EOF"]

  


  Invisible[style=invis];

  Start[label="^"];

  End[label="$", shape=doublecircle];

}

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.