Fast regular expression index with finite state transducer

March 2, 2024 by Geofrey Ernest

I needed a fast index that supports regular expressions for vince. The data model used by vince is append only, no deletes or updates. This means once collected the data is guaranteed not to change, so I wanted an index that will never be updated/changed, my search brought me to vellum a go library implementing a FST (finite state transducer).

In this post I will explain how vince uses vellum to offer fast exact and regular expression search that powers property filter feature of vince web analytics api.

Our data model

syntax = "proto3";
package v1;

message Data {
  int64 timestamp = 1;
  int64 id = 2;
  optional bool bounce = 3;
  bool session = 4;
  bool view = 5;
  double duration = 6;

  string browser = 19;
  string browser_version = 20;
  string city = 26;
  string country = 23;
  string device = 18;
  string domain = 25;
  string entry_page = 9;
  string event = 7;
  string exit_page = 10;
  string host = 27;
  string os = 21;
  string os_version = 22;
  string page = 8;
  string referrer = 12;
  string region = 24;
  string source = 11;
  string utm_campaign = 15;
  string utm_content = 16;
  string utm_medium = 14;
  string utm_source = 13;
  string utm_term = 17;
  string tenant_id = 28;

  message List { repeated Data items = 1; }
}

All string fields are referred to as properties/dimensions. They help breakdown and digest web analytics metrics. For instance, Aggregate visitors by country.

We use Apache Arrow for in memory representation and Apache Parquet for on disc (cold) storage. We keep an index for each property.

Data is collected in big chunks called granules before being stored in disc. The size of granule is configurable, by default it is 256MiB(This is in memory size of arrow.Record + index) . When the granule is reached we convert it into parquet file with a single row group. All properties are encoded as dictionaries and all data fields/columns are compressed using zstd.

example snippets below are approximate of the actual implementation in vince. We actually use array.Dictionary for property columns.

Index mapping

Since we store the data in columns and property values are string dictionaries, for faster search we need to map between value => row_id, but we know that same value can appear on multiple rows so the mapping is value => [row_1,row_2,row_n]

We use roaring.Bitmap for efficient storing/serializing of rows mapping. This is how it looks like in Go.

type ColumnIndex struct {
	mapping map[string]*roaring.Bitmap
}

We can now add a method to build the mapping

unc (c *ColumnIndex) IndexProperty(prop *array.String) {
	for row := 0; row < prop.Len(); row++ {
		value := prop.Value(row)
		bitmap, ok := c.mapping[value]
		if !ok {
			// Create a new bitmap
			bitmap = new(roaring.Bitmap)
			c.mapping[value] = bitmap
		}
		bitmap.Add(uint32(row))
	}
}

Vellum

Now we have our mapping we can build a vellum fst from it. One challenge we have is vellum only accept uint64 as values. To work around this we store bitmaps together with the the vellum index so we can use position of the bitmaps as value.

First we need to update ColumnIndex to store the bitmaps and serialized vellum fst

type ColumnIndex struct {
	mapping map[string]*roaring.Bitmap
	// bitmaps positioned by sorted keys of mapping
	values []*roaring.Bitmap
	//serialized vellum fst
	fst []byte
}

Now, our updated index method will look like this

error handling is omitted for brevity

func (c *ColumnIndex) IndexProperty(prop *array.String) {
	for row := 0; row < prop.Len(); row++ {
		value := prop.Value(row)
		bitmap, ok := c.mapping[value]
		if !ok {
			// Create a new bitmap
			bitmap = new(roaring.Bitmap)
			c.mapping[value] = bitmap
		}
		bitmap.Add(uint32(row))
	}
	var out bytes.Buffer
	index, _ := vellum.New(&out, nil)
	// vellum wants keys to be lexical sorted
	keys := make([]string, 0, len(c.mapping))
	for k := range c.mapping {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	c.values = make([]*roaring.Bitmap, len(keys))
	for i, k := range keys {
		// store the value itmap
		c.values[i] = c.mapping[k]
		// Note we are storing the index where the value bitmap is.
		index.Insert([]byte(k), uint64(i))
	}
	index.Close()
	c.fst = out.Bytes()
}

Regular expression search

With out []*roaring.Bitmap slice and serialized vellum fst we can build our searching method. Our search goal is to find all rows which a matching term was found.

func (c *ColumnIndex) Search(term string) *roaring.Bitmap {
	// load vellum fst. Please move this out, you only need to load fst once and
	// reuse it multiple times.
	fst, _ := vellum.Load(c.fst)
	// compile regex transducer using github.com/blevesearch/vellum/regexp
	regexTransducer, _ := regexp.New(term)
	var b *roaring.Bitmap
	it, err := fst.Search(regexTransducer, nil, nil)
	for err == nil {
		_, r := it.Current()
		if b == nil {
			b = c.values[r].Clone()
		} else {
			b.And(c.values[r])
		}
		err = it.Next()
	}
	if b == nil {
		b = new(roaring.Bitmap)
	}
	return b
}

This is all you need to introduce fast regular expression search in your Go application. Please visit vellum documentation for more details on the library and its features.