RegEx + bbcode is bad

From the last article , I've addressed the blocking issue of Rosemary.

But the problem still persists! So I did a profiling instead.

Following this documentation, I was able to generate a report which looked like this:

[Summary]:
  ticks  total  nonlib   name
 41027   88.8%   89.1%  JavaScript
  4533    9.8%    9.8%  C++
   199    0.4%    0.4%  GC
   160    0.3%          Shared libraries
   494    1.1%          Unaccounted
.
.
.
[JavaScript]:
  ticks  total  nonlib   name
 40064   86.7%   87.0%  RegExp: \\[([a-z][0-9a-z]*?)([\\s\\S]*?)\\]([\\s\\S]*?)\\[\\/\\1\\]
    37    0.1%    0.1%  KeyedLoadIC: A keyed load IC from the snapshot {1}
    32    0.1%    0.1%  Builtin: CallFunction_ReceiverIsNotNullOrUndefined
    23    0.0%    0.0%  Builtin: CallFunction_ReceiverIsAny
    19    0.0%    0.0%  LazyCompile: ~deserializeObject /home/penguin/works/blog/node_modules/bson/lib/bson/parser/deserializer.js:37:33
    17    0.0%    0.0%  KeyedLoadIC: A keyed load IC from the snapshot
    16    0.0%    0.0%  Stub: LoadICStub
    15    0.0%    0.0%  Stub: CEntryStub
.
.
.
[Bottom up (heavy) profile]:
 Note: percentage shows a share of a particular caller in the total
 amount of its parent calls.
 Callers occupying less than 2.0% are not shown.

  ticks parent  name
 40064   86.7%  RegExp: \\[([a-z][0-9a-z]*?)([\\s\\S]*?)\\]([\\s\\S]*?)\\[\\/\\1\\]
 40064  100.0%    v8::internal::Runtime_RegExpExecMultiple(int, v8::internal::Object**, v8::internal::Isolate*)
 39260   98.0%      LazyCompile: *StringReplaceGlobalRegExpWithFunction native regexp.js:297:47
 39260  100.0%        LazyCompile: *[Symbol.replace] native regexp.js:377:23
 39260  100.0%          LazyCompile: *replace native string.js:129:23
 32394   82.5%            LazyCompile: *parse /home/penguin/works/blog/blog/modular/bbcode/parser.js:63:7
  4449   11.3%            LazyCompile: ~bbcode_match /home/penguin/works/blog/blog/modular/bbcode/parser.js:42:21
  2417    6.2%            LazyCompile: *bbcode_match /home/penguin/works/blog/blog/modular/bbcode/parser.js:42:21
   804    2.0%      LazyCompile: ~StringReplaceGlobalRegExpWithFunction native regexp.js:297:47
   804  100.0%        LazyCompile: *[Symbol.replace] native regexp.js:377:23
   804  100.0%          LazyCompile: *replace native string.js:129:23
   804  100.0%            LazyCompile: ~bbcode_match /home/penguin/works/blog/blog/modular/bbcode/parser.js:42:21

  2138    4.6%  v8::internal::RegExpMacroAssembler::CaseInsensitiveCompareUC16(unsigned char*, unsigned char*, unsigned long, v8::internal::Isolate*)
  2138  100.0%    v8::internal::Runtime_RegExpExecMultiple(int, v8::internal::Object**, v8::internal::Isolate*)
  2103   98.4%      LazyCompile: *StringReplaceGlobalRegExpWithFunction native regexp.js:297:47
  2103  100.0%        LazyCompile: *[Symbol.replace] native regexp.js:377:23
  2103  100.0%          LazyCompile: *replace native string.js:129:23
  1732   82.4%            LazyCompile: *parse /home/penguin/works/blog/blog/modular/bbcode/parser.js:63:7
   248   11.8%            LazyCompile: ~bbcode_match /home/penguin/works/blog/blog/modular/bbcode/parser.js:42:21
   123    5.8%            LazyCompile: *bbcode_match /home/penguin/works/blog/blog/modular/bbcode/parser.js:42:21

The RegEx of BBCode matching

It seemed like this peculiar Regex is using 86% of the CPU time. And the location is in a file called bbcode. Hmm, seems like it is not a good idea to use a RegEx here.
Here is the code snippet I used to parse the bbcode:

class BBCodeParser
{
    constructor( scopeObj ) { ... };

    static get stackMatch()
    {
        return /\[([a-z][0-9a-z]*?)([\s\S]*?)\]([\s\S]*?)\[\/\1\]/ig;
    }

    static get typeMatch()
    {
        return /([a-z][0-9a-z]*)\=\"([^"]+)\"/ig;
    }

    static bbcode_match( str, handler )
    {
        return str && str.replace( BBCodeParser.stackMatch, handler );
    }
    .
    .
    .

This is bad, soooo bad. Now I am half-doomed for using bbcode to store all of my blog entries.

Abandoning RegEx

Cache everything! Trading off memories over computing power is always an option. Because blog entries are static after it is published. I should be able to compile it on publish.

However I want to try something. Because RegEx is notorious for it's slowness. I am gonna try to change this piece of code to iteration matching, and see if it help.

( half day later )

Since this is a semi document parser. The recursive matching is done outside this scope ( nested levels are done asynchronously ). The sole purpose of this piece of code is to match the out most token. Treating any nested content as the content of the parent node:

Replacement code

The BLABBLAB thing is just a replacement of console.log. That way I could saftly grep unwanted console.log next time.

diff --git a/blog/modular/bbcode/parser.js b/blog/modular/bbcode/parser.js
index a2a8282..3fde43b 100644
--- a/blog/modular/bbcode/parser.js
+++ b/blog/modular/bbcode/parser.js
@@ -18,6 +18,12 @@ const decodeStr = function( str )
 		;
 };
 
+const IsToken = function( c )
+{
+	var v = c.charCodeAt( 0 );
+	return ( 97 <= v && v <= 122 ) || ( 48 <= v && v <= 57 );
+};
+
 class BBCodeParser
 {
 	constructor( scopeObj )
@@ -29,11 +35,6 @@ class BBCodeParser
 		})();
 	};
 
-	static get stackMatch()
-	{
-		return /\[(*?)([\s\S]*?)\]([\s\S]*?)\[\/\1\]/ig;
-	}
-
 	static get typeMatch()
 	{
 		return /(*)\=\"([^"]+)\"/ig;
@@ -41,7 +42,163 @@ class BBCodeParser
 
 	static bbcode_match( str, handler )
 	{
-		return str && str.replace( BBCodeParser.stackMatch, handler );
+		// This function serve as the following RegEx:
+		// = /\[(*?)([\s\S]*?)\]([\s\S]*?)\[\/\1\]/ig;
+		// and because this regex is slow
+		// will now use iteration instead
+
+		var l = str && str.length;
+		var strOut = "";
+
+		var tokenOpened = "";
+		var tokenRCounter = 0;
+
+		var tContent = "";
+		var tOptions = "";
+		var cTTemp;
+
+		for( var i = 0; i < l; i ++ )
+		{
+			var c = str;
+
+			if( c == "[" )
+			{
+				var TokenClosing = "";
+				var mOptions = "";
+				var mToken = "";
+
+				cTTemp = str[ ++ i ];
+
+				if( cTTemp == "/" )
+				{
+					TokenClosing = cTTemp;
+					i ++;
+				}
+
+				// Get Token Name
+				for( cTTemp = str; i < l; i ++, cTTemp = str )
+				{
+					if( IsToken( cTTemp ) )
+					{
+						mToken += cTTemp;
+						continue;
+					}
+
+					break;
+				}
+
+				// No token opened but we got a closing token??!
+				if( !tokenOpened && TokenClosing )
+				{
+					throw new Error( "Syntax Error, Unexpected closing [/" + mToken + "]" );
+				}
+
+				// skip this if we didn't get a token
+				if( !mToken )
+				{
+					// BLABBLAB( "-------> " + cTTemp );
+					if( tokenOpened ) tContent += "[" + cTTemp;
+					else strOut += "[" + cTTemp;
+
+					continue;
+				}
+
+				// BLABBLAB( "Got Token: " + mToken );
+
+				// If said token is not the opened token
+				if( tokenOpened )
+				{
+					var tSkip = ( tokenOpened != mToken );
+
+					if( !tSkip )
+					{
+						if( TokenClosing && 0 < tokenRCounter )
+						{
+							tokenRCounter --;
+							tSkip = true;
+						}
+						// Nested token by the same name
+						// i.e. tokenOpened == mToken
+						else if( !TokenClosing )
+						{
+							tokenRCounter ++;
+							tSkip = true;
+						}
+					}
+
+					if( tSkip )
+					{
+						tContent += "[" + TokenClosing + mToken + "]";
+						continue;
+					}
+				}
+
+				// BLABBLAB( "After Token Name: \"" + cTTemp + "\"" );
+
+				// Closing token does not allow options to be set
+				if( TokenClosing && cTTemp == "]" && mToken == tokenOpened )
+				{
+					// BLABBLAB( "TokenClosed[" + mToken + "]" );
+					// BLABBLAB( "{{{" );
+					// BLABBLAB( tOptions );
+					// BLABBLAB( tContent );
+					// BLABBLAB( "}}}" );
+
+					strOut += handler( mToken, tOptions, tContent );
+
+					mToken
+						= tokenOpened
+						= tOptions
+						= tContent
+						= "";
+					continue;
+				}
+
+				// consume the token options
+				for( cTTemp = str; i < l; i ++, cTTemp = str )
+				{
+					// Unexpected open bracket, invalidate this matching
+					if( cTTemp == "[" )
+					{
+						// BLABBLAB( "\"" + mToken + "\" is in fact not a token" );
+
+						strOut += "[" + mToken;
+
+						// Step back and rematch from here
+						i --;
+
+						mToken
+							= tOptions
+							= tContent
+							= "";
+
+						break;
+					}
+					else if( cTTemp == "]" )
+					{
+						tokenOpened = true;
+						// BLABBLAB( "Token[" + mToken + "] Option: " + tOptions );
+						break;
+					}
+
+					tOptions += cTTemp;
+				}
+
+				tokenOpened = mToken;
+			}
+			else if( tokenOpened )
+			{
+				tContent += c;
+			}
+			else strOut += c;
+		}
+
+		if( tokenOpened )
+		{
+			throw new Error( "Syntax Error, Missing: [/" + tokenOpened + "]" );
+		}
+
+		return strOut;
 	}
 
 	static ParseProps( prop )
@@ -90,7 +247,7 @@ class BBCodeParser
 		var tokens = {};
 
 		// Tokenizer
-		var __parse = function( matches, type, props, content )
+		var __parse = function( type, props, content )
 		{
 			var name = type.toLowerCase();
 
@@ -139,5 +296,4 @@ class BBCodeParser
 	}
 }
 
-
 module.exports = BBCodeParser;

The advantage of parsing the syntax manually allows me to detect wether there is a syntax error on the document. But I am not going to do that. From the above code I used a passive matching that just consumes everything unexpected ( such as brackets [] ) as content. However some valid but unmatched open / closing tags like [/this] will throw an exception.

But since the bbcodes are generated dynamically in AstroEdit. I shouldn't worried much about invalid syntax. Actually, I could just assumed that the syntax is always right!

Test result

root@debian:/home/penguin# ab -r -k -c 20 -n 250 http://blog.botanical.astropenguin.net/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking blog.botanical.astropenguin.net (be patient)
Completed 100 requests
Completed 200 requests
Finished 250 requests


Server Software:        lighttpd/1.4.39
Server Hostname:        blog.botanical.astropenguin.net
Server Port:            80

Document Path:          /
Document Length:        17622 bytes

Concurrency Level:      20
Time taken for tests:   4.027 seconds
Complete requests:      250
Failed requests:        0
Keep-Alive requests:    250
Total transferred:      4471770 bytes
HTML transferred:       4405500 bytes
Requests per second:    62.08 [#/sec] (mean)
Time per request:       322.177 [ms] (mean)
Time per request:       16.109 [ms] (mean, across all concurrent requests)
Transfer rate:          1084.36 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   4.5      0      18
Processing:    47  317 558.0    198    3945
Waiting:       46  314 555.2    197    3929
Total:         47  318 561.5    198    3963

Percentage of the requests served within a certain time (ms)
  50%    198
  66%    218
  75%    230
  80%    240
  90%    276
  95%    740
  98%   2984
  99%   3952
 100%   3963 (longest request)

Damn! That was like 1900% performance boost! I am so happy with that.

Now the profiling:

[JavaScript]:
  ticks  total  nonlib   name
     3    0.4%    0.4%  Builtin: CallFunction_ReceiverIsNotNullOrUndefined
     1    0.1%    0.1%  Stub: StoreFieldStub
     1    0.1%    0.1%  Stub: LoadFieldStub
     1    0.1%    0.1%  Stub: LoadConstantStub
     1    0.1%    0.1%  Stub: FastNewContextStub
     1    0.1%    0.1%  LazyCompile: ~fs.Stats fs.js:144:20
     1    0.1%    0.1%  LazyCompile: ~emitTwo events.js:104:17
     1    0.1%    0.1%  LazyCompile: ~emit events.js:136:44
     1    0.1%    0.1%  LazyCompile: ~StringDecoder.write string_decoder.js:64:41
     1    0.1%    0.1%  LazyCompile: ~RoundRobinHandle.handoff cluster.js:186:46
     1    0.1%    0.1%  LazyCompile: ~<anonymous> cluster.js:197:55
     1    0.1%    0.1%  LazyCompile: *slice native string.js:242:21
     1    0.1%    0.1%  LazyCompile: *<anonymous> cluster.js:745:18
     1    0.1%    0.1%  Handler: An IC handler from the snapshot

[Summary]:
  ticks  total  nonlib   name
    16    2.0%    2.0%  JavaScript
   783   95.5%   96.2%  C++
     2    0.2%    0.2%  GC
     6    0.7%          Shared libraries
    15    1.8%          Unaccounted

[Bottom up (heavy) profile]:
 Note: percentage shows a share of a particular caller in the total
 amount of its parent calls.
 Callers occupying less than 2.0% are not shown.

  ticks parent  name
   656   80.0%  syscall

    28    3.4%  node::ContextifyScript::New(v8::FunctionCallbackInfo<v8::Value> const&)
    28  100.0%    v8::internal::Builtin_HandleApiCallConstruct(int, v8::internal::Object**, v8::internal::Isolate*)
    28  100.0%      LazyCompile: ~runInThisContext node.js:346:28
    28  100.0%        LazyCompile: ~NativeModule.compile node.js:427:44
    28  100.0%          LazyCompile: ~NativeModule.require node.js:361:34
     5   17.9%            LazyCompile: ~startup node.js:12:19
     5   17.9%            Function: ~<anonymous> module.js:1:11
     3   10.7%            Function: ~<anonymous> util.js:1:11
     3   10.7%            Function: ~<anonymous> internal/child_process.js:1:11
     2    7.1%            LazyCompile: ~setupGlobalVariables node.js:215:32
     2    7.1%            LazyCompile: ~Module._load module.js:381:24
     2    7.1%            Function: ~<anonymous> stream.js:1:11
     2    7.1%            Function: ~<anonymous> cluster.js:1:11
     1    3.6%            LazyCompile: ~setupGlobalTimeouts node.js:243:31
     1    3.6%            LazyCompile: ~listenAfterLookup net.js:1393:29
     1    3.6%            Function: ~<anonymous> timers.js:1:11
     1    3.6%            Function: ~<anonymous> child_process.js:1:11

Looks good to me. Note that I didn't even cached the article yet. Aaaand it seemed I don't have to cache it now:P

Tag(s): regexp bbcode

斟酌鵬兄

Fri Jun 03 2016 09:19:13 GMT+0000 (Coordinated Universal Time)

Last modified: Wed Jan 04 2017 02:14:28 GMT+0000 (Coordinated Universal Time)