Safe pattern matching in CloudTube - cadence's weblog (personal blog)

Today I implemented video filters for CloudTube.

This was more of a challenge than it appears, despite the large size and scope of that commit.

On New CloudTube, I want all its features to be available without people needing to enable JavaScript. That includes filters.

When people create a filter it is because they want to avoid some certain kind of content. Rather than matching titles exactly, some form of pattern matching against the title is extremely useful. What kinds of pattern matching do you know about already?

Here's the only ones I found in my research:

Regular expressions
Globs

Yeah, really. I thought there would be more.

Globs

Globs are most commonly used to match paths of files, like **/*.png. The disadvantage here is that video titles are not filesystem paths. In particular, ** matches / and * does not, which is completely useless for matching video titles. I am also concerned about platform compatibility with that, since Windows's path separator is \.

Globs aren't what we want here.

Regular expressions

Regular expressions are what we want! They're commonplace, so at least some people will already be familiar with how to use them precisely. They're straightforward to use, so new people will be able to work out how to match against a simple video title quite quickly. They don't interfere with the alphabet, so people who don't care about pattern matching won't be interfered with. They're great from the perspective of the people using them.

Unfortunately, we can't have them.

Regular expressions have no guarantees about how long they will take to run. While running, they block thread execution, which is very bad in the context of the node.js event loop.

People being able to provide arbitrary regular expressions for the server to match is a documented security issue, called "regular expression denial of service", or "redos" for short.

There's a fair bit written about this, but you should read Cloudflare's article now if you don't understand the principles.

A regular expressions represents an NFA, a non-deterministic finite state automaton. The key word here is "non-deterministic", which means that when the expression gets to a certain point, there are multiple ways in which it can continue. In DFAs, the matching always advances towards the end of the string, so you know that it will end at some point. With NFAs, this is not so. At the branch, any of the paths may lead to a match, and all of the paths need to be checked to see if they match. This means that every time there is a non-deterministic transition, the amount of computing you must do to verify the expression will increase significantly.

Cloudflare's report is about an incident that happened by accident. On CloudTube, the patterns and video titles can be set by anyone, so if I were naively using regular expressions, it would be extremely easy to force catastrophic backtracking and make the server unresponsive.

Possible solution: Limiting regular expression runtime

If the expression can be run in another thread, leaving the main thread alone to respond to other requests normally, then it can also be terminated if it exceeds a runtime that would be dangerous.

This method still has disadvantages, though, and one of these is that each regular expression will try to consume 100% of the CPU's time until it terminates.

My solution

My solution was to create a new language for patterns. Again, here is the documentation for it.

The key feature of this language is that every step is deterministic. So for each part of the pattern that is processed, the expression will always, unconditionally, advance closer to the end of the string that is being matched. This way, runtime should never be an issue.

I was also able to recognise problems with the expression and point them out, if the person makes a mistake in the syntax while writing the pattern.

My pattern grammar cannot recognise the regular languages, but since we're matching video titles here, and it's not crucial to get a 100% hit rate, I think that's okay. Some fairly expressive patterns are still possible, and if I see reports of legitimate use cases, it is possible to extend the language's capabilities to match more things even with no backtracking.

Conclusion

It's pretty late and my brain isn't working well so I hope this post all made sense.

Cadence