This is the first post in a series on coding up algorithms for constructing automata from regexes.
A quick note before we begin: I will link to relevant wikipedia pages, but in my opinion these pages are often not very comprehensible, so please do ignore them if you want and I hope I can give a more simplified and digestible intro to ideas you may be unfamiliar with.
Audience Assumptions: experience with regexes and basic knowledge of finite state machines. Code examples will be in no-frills python.
Intro
I've long been drawn to Finite State Machines (FSM) and Automata Theory. These areas of study enjoy a close relationship with the day to day practicalities of programs because, for all their theoretical nature, they directly specify methods for how we can make regexes usable. But drawing out that relationship is easier said than done! I've stared blankly at many a dense paper, willing myself to understand and achieving precisely nothing.
Anyway, I recently read Andrew Gallant's (AKA burntsushi) excellent and long (very long) post on the internals of the regex-automata crate. Naturally automata are discussed in detail, a central topic being Nondeterministic Finite Automata (NFA). One thing that jumped out to me was the section on future work mentioning Glushkov's NFA construction [2]. I was interested to learn about this, and more so, to investigate the various ways of constructing automata from regexes.
As you may have guessed from this intro, this was not as easy as looking up a few definitions and throwing together some code. It's more: Act 1 Scene 1, Marcus hits a brick wall of understanding. Eventually though I think I've found some clarity on these ideas, and this is what I want to share with you because it's what I wish I'd been able to read when I didn't understand. First off, you may be wondering why this post's title includes Position automata (PA) when I've yet to mention the term? Position automaton is just another name for Glushkov's automaton! Already one bit of clarity provided I hope, btw I'm just going to use the PA abbreviation from now on.
Defining Automata
In this series we'll be following the constructions provided in the paper: 'On the mother of all automata: the position automaton' [1] by Sabine Broda, Markus Holzer, Eva Maia, Nelma Moreira, and Rogério Reis. This paper is both clear and interesting, if you're curious how different automata relate to one another, or about the formal properties of various constructions, then it wont waste your time.
So what is an automata? Let's start with the formal definition and a simple code implementation. An automata
is a set of states that the automata can be in. (capital sigma) is the set of symbols (or alphabet) this automata recognises. So for the string matching regexes we're used to, that is all the characters you can have in a string. (small delta) is the transition function, which tells you how to go from a state and symbol to another state (or set of states). is the "initial" or "start" state that the automata always starts in. is the set of "final", "finishing" or "accepting" states. If a sequence of symbols puts the automata in a final state then the automata "accepts" the sequence. The initial state can be final (a member of ).
Let's implement this in a very literal fashion, to make the connections clear. Supposing states are just integers, and symbols are characters in a string.
class Automata:
def __init__(self):
self.states: set[int] # Q
self.symbols: set[str] # Sigma
self.initial: int # I
self.final: set[int] # F
# delta
@staticmethod
def transition(state: int, symbol: str) -> int | None:
# return None when the given (state, symbol) pair
# does not have a corresponding transition state
...
Simple enough, all we need to provide in order to have an automata are these five elements, and there are many possible ways of structuring this in code. Depending on how we fill out these parameters and define the transition function we'll end up with different kinds of automata. This is an important point I really want to stress, different automata constructions are just different ways of constructing the elements of the five-tuple.
State Diagrams
State diagrams are an intuitive visual aid to understanding automata. We depict all the states, indicating whether they are initial or final, and the transitions between them. Final states will always have a double ring.
Example Automata
Consider the regex "ace"
, which can be converted to:
class Automata:
def __init__(self):
self.states: set[int] = {0, 1, 2, 3}
self.symbols: set[str] = set("abcdef")
self.initial: int = 0
self.final: set[int] = {3}
@staticmethod
def transition(state: int, symbol: str) -> int | None:
table = {
(0, "a"): 1,
(1, "c"): 2,
(2, "e"): 3,
}
return table.get((state, symbol))
This automata is depicted by this state diagram:
Position Automata
Okay, let's get stuck in to the details of PA, starting again with the formal definition:
There's a lot more going on here, but we're just going to break it down element by element. To start with,
Set of States
ab
ab+
a*b
Warning: be careful when reading regexes containing alternation in papers. Alternation in code is often represented bya|b
, but in papers you will seeused. Be careful not to confuse the 'plus' used in papers with the 'one or more' operator in code. Throughout this post I will use |
for alternation.
Okay so what's
Then the corresponding marked regular expression is given by indexing all the symbols, meaning we mark the elements of
Since
Finally we have:
Final States
Let's begin by talking about just ab
, the position index for b
would be in
This does come across in the formal definition, look for where
This kind of set building notation is analogous to comprehensions in Python, we can read the above statement as:
Set Comprehension Examples
A = {e for e in Elems}
B = {j for (i, j) in Items if i == 1}
Now let's break down the parts of this definition starting with
( is a subset of as not all possible words may be accepted by )
With
Defined Recursively
- If
, then ( is the empty word or string). - If
, then where . - If
, then (alternation). - If
, then where is the set of all words where words from are joined with words from (concatenation). Concatenation is sometimes also writen as - If
, then contains zero or more words from .
Using the same marked regex from earlier,
Looking at
Hopefully it's relatively clear that our example regex will only ever have
We're not done I'm afraid! 😵💫
What we actually wanted was
Starting as always with the definition:
We know what
This particular bit of notation is somewhat peculiar, not least because
Python Sketch
import re
def get_language(alpha: re.Pattern) -> set[str]:
...
def nullable(alpha: re.Pattern) -> str | set[str]:
if "" in get_language(alpha):
return ""
return set()
Where
Back to our example
And after all that!
😵
Transition
Last one, and it's a biggie: the transition function
Python Sketch
import re
regex: re.Pattern = ...
POS_0 = ... # symbols with indexes
def follow(alpha: re.Pattern, index: int) -> set[int]:
...
def delta_pos(index: int, symbol: str) -> set[int]:
return {j for j in follow(regex, index) if symbol == POS_0[j]}
Nothing too fancy, and understanding this is just down to grasping
We still don't know what
Aha! This looks familiar, and hopefully this form helps guide the intuition that "following" is just about finding what symbols can follow other symbols in the language of our regex. Essentially
Returning to our example
Let's consider
Which makes
That's almost everything for
At which point you may be exclaiming "not another definition" in frustration! Don't worry,
Yep, we just put
Back to the transition function, let's put in some values and see what we get.
Because
Putting it All Together
Let
The position automaton for
Example
Let
State diagram:
This diagram let's us observe clearly why PA is a non-deterministic finite automata, because multiple states can be returned from the transition function, instead of only ever returning one. For example, if
Example
Let
State diagram:
In this example we can see that, since this regex accepts the empty word
Where To Now?
We made it! Although I don't think all the details we've been through necessarily lend themselves to a natural and intuitive mental model of PA, what they do provide is a clear mechanics of how we can construct these automata. Since using these definitions will firm up our understanding of these ideas, and since we currently we have no proper implementation, our path forward is easy to choose. In the next post of this series we'll work on a more serious Python implementation of the construction of PA from regexes using the definitions we've presented here.
Something of interest, the notions of
Acknowledgements
Thanks to my friend Declan Kolakowski for proof reading the draft of this post.
References
- Broda, S., Holzer, M., Maia, E., Moreira, N. and Reis, R., 2017, July. On the mother of all automata: the position automaton. In International Conference on Developments in Language Theory (pp. 134-146). Cham: Springer International Publishing. link
- Glushkov, V.M., 1961. The abstract theory of automata. Russian Mathematical Surveys, 16(5), p.1.