Regular Expressions On Steroids
02:31 PM | by Ivan Kudryavtsev |Is your code fast enough? Modern development reality, especially when thinking about web development desn’t care about effective code because there are a lot of cheap ways to increase the performance of applications, for example one could optimize database structure, add sophisticated indexes and so on. But sometimes, cheap ways don’t give you the clue to performance tweaking. In this post I will cover the performance of regular expressions and show you the task which requires every optimization available.
Last month we have implemented the parental control feature for our in-house ISP Zing. We have decided to store black and whitelisted data in MySQL database such as URLs, domain names, badwords and so on. This solution is OK for us since it gives simple way to manipulate data (update and change) and simple way to request data to do different checks.
Next, we have implemented SQUID-based ICAP module which implements next functions:
- checks headers on blacklisted/whitelisted domains, URLs, keywords;
- checks content for badwords
- replaces content if required.
We had implemented the module in C/C++ as C-ICAP plugin which works with SQUID 3. All had functioned perfectly but one thing was annoying. We decided to use regular expressions to check content for badwords and we have implemented PCRE library find/replace functions to complete the tasks (by the way, PCRE doesn’t handle replace by default, so we have ported replace function from PHP function source code).
We had been very upset when found the PCRE checks web page of 1MB near to 0.2 second on our Xeon 3.2 GHz processor (regexes were about 10-20 KB of size). We had realized that we will not be able to serve even 100 simultaneos connects on our SQUID if we will not found the way to optimize it. We also had found that standard UNIX grep works very fast and uses enough of regular expressions syntax to handle our tasks (it takes about 0.001-0.005 second to handle the same document).
Trying to find the understanding of poor performance of PCRE we have moved to mathematical basis of PCRE – finite machines. The theory explains the difference between DFA and NFA approaches used by regular expression engines. To sum up, DFA algorithms are much more faster then NFA, and PCRE is NFA engine. So we had found that we are not able to use PCRE to handle our task and tried to found another tool to complete the module.
We reviewed two concurrent ideas: the first assumed to use grep source codebase to implement regex engine, but this way wasn’t convinient since grep includes global variables and not thread-safe (since utilitity doesn’t need multithreading), the other assumed to use special library and we have found one – RE2, produced by Google when he met the same difficultes as we did. RE2 is C++ library with clear syntax and usage guidelines. The library is thread safe by design, so we have decided to use it in our module.
The implementation of our C-ICAP module with RE2 engine allowed us to achive even faster performance then required with easy to use integration process.

You must be logged in to post a comment.