RegEx + bbcode is bad
From the last article , I've addressed the blocking issue of Rosemary.
But the problem still persists! So I did a profiling instead.
Following this documentation, I was able to generate a report which looked like this:
Here is the code snippet I used to parse the bbcode:
This is bad, soooo bad. Now I am half-doomed for using bbcode to store all of my blog entries.
However I want to try something. Because RegEx is notorious for it's slowness. I am gonna try to change this piece of code to iteration matching, and see if it help.
( half day later )
Since this is a semi document parser. The recursive matching is done outside this scope ( nested levels are done asynchronously ). The sole purpose of this piece of code is to match the out most token. Treating any nested content as the content of the parent node:
The advantage of parsing the syntax manually allows me to detect wether there is a syntax error on the document. But I am not going to do that. From the above code I used a passive matching that just consumes everything unexpected ( such as brackets [] ) as content. However some valid but unmatched open / closing tags like [/this] will throw an exception.
But since the bbcodes are generated dynamically in AstroEdit. I shouldn't worried much about invalid syntax. Actually, I could just assumed that the syntax is always right!
Damn! That was like 1900% performance boost! I am so happy with that.
Now the profiling:
Looks good to me. Note that I didn't even cached the article yet. Aaaand it seemed I don't have to cache it now:P
But the problem still persists! So I did a profiling instead.
Following this documentation, I was able to generate a report which looked like this:
[Summary]:
ticks total nonlib name
41027 88.8% 89.1% JavaScript
4533 9.8% 9.8% C++
199 0.4% 0.4% GC
160 0.3% Shared libraries
494 1.1% Unaccounted
.
.
.
[JavaScript]:
ticks total nonlib name
40064 86.7% 87.0% RegExp: \\[([a-z][0-9a-z]*?)([\\s\\S]*?)\\]([\\s\\S]*?)\\[\\/\\1\\]
37 0.1% 0.1% KeyedLoadIC: A keyed load IC from the snapshot {1}
32 0.1% 0.1% Builtin: CallFunction_ReceiverIsNotNullOrUndefined
23 0.0% 0.0% Builtin: CallFunction_ReceiverIsAny
19 0.0% 0.0% LazyCompile: ~deserializeObject /home/penguin/works/blog/node_modules/bson/lib/bson/parser/deserializer.js:37:33
17 0.0% 0.0% KeyedLoadIC: A keyed load IC from the snapshot
16 0.0% 0.0% Stub: LoadICStub
15 0.0% 0.0% Stub: CEntryStub
.
.
.
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
40064 86.7% RegExp: \\[([a-z][0-9a-z]*?)([\\s\\S]*?)\\]([\\s\\S]*?)\\[\\/\\1\\]
40064 100.0% v8::internal::Runtime_RegExpExecMultiple(int, v8::internal::Object**, v8::internal::Isolate*)
39260 98.0% LazyCompile: *StringReplaceGlobalRegExpWithFunction native regexp.js:297:47
39260 100.0% LazyCompile: *[Symbol.replace] native regexp.js:377:23
39260 100.0% LazyCompile: *replace native string.js:129:23
32394 82.5% LazyCompile: *parse /home/penguin/works/blog/blog/modular/bbcode/parser.js:63:7
4449 11.3% LazyCompile: ~bbcode_match /home/penguin/works/blog/blog/modular/bbcode/parser.js:42:21
2417 6.2% LazyCompile: *bbcode_match /home/penguin/works/blog/blog/modular/bbcode/parser.js:42:21
804 2.0% LazyCompile: ~StringReplaceGlobalRegExpWithFunction native regexp.js:297:47
804 100.0% LazyCompile: *[Symbol.replace] native regexp.js:377:23
804 100.0% LazyCompile: *replace native string.js:129:23
804 100.0% LazyCompile: ~bbcode_match /home/penguin/works/blog/blog/modular/bbcode/parser.js:42:21
2138 4.6% v8::internal::RegExpMacroAssembler::CaseInsensitiveCompareUC16(unsigned char*, unsigned char*, unsigned long, v8::internal::Isolate*)
2138 100.0% v8::internal::Runtime_RegExpExecMultiple(int, v8::internal::Object**, v8::internal::Isolate*)
2103 98.4% LazyCompile: *StringReplaceGlobalRegExpWithFunction native regexp.js:297:47
2103 100.0% LazyCompile: *[Symbol.replace] native regexp.js:377:23
2103 100.0% LazyCompile: *replace native string.js:129:23
1732 82.4% LazyCompile: *parse /home/penguin/works/blog/blog/modular/bbcode/parser.js:63:7
248 11.8% LazyCompile: ~bbcode_match /home/penguin/works/blog/blog/modular/bbcode/parser.js:42:21
123 5.8% LazyCompile: *bbcode_match /home/penguin/works/blog/blog/modular/bbcode/parser.js:42:21
The RegEx of BBCode matching
It seemed like this peculiar Regex is using 86% of the CPU time. And the location is in a file called bbcode. Hmm, seems like it is not a good idea to use a RegEx here.Here is the code snippet I used to parse the bbcode:
class BBCodeParser
{
constructor( scopeObj ) { ... };
static get stackMatch()
{
return /\[([a-z][0-9a-z]*?)([\s\S]*?)\]([\s\S]*?)\[\/\1\]/ig;
}
static get typeMatch()
{
return /([a-z][0-9a-z]*)\=\"([^"]+)\"/ig;
}
static bbcode_match( str, handler )
{
return str && str.replace( BBCodeParser.stackMatch, handler );
}
.
.
.This is bad, soooo bad. Now I am half-doomed for using bbcode to store all of my blog entries.
Abandoning RegEx
Cache everything! Trading off memories over computing power is always an option. Because blog entries are static after it is published. I should be able to compile it on publish.However I want to try something. Because RegEx is notorious for it's slowness. I am gonna try to change this piece of code to iteration matching, and see if it help.
( half day later )
Since this is a semi document parser. The recursive matching is done outside this scope ( nested levels are done asynchronously ). The sole purpose of this piece of code is to match the out most token. Treating any nested content as the content of the parent node:
Replacement code
The BLABBLAB thing is just a replacement of console.log. That way I could saftly grep unwanted console.log next time.
diff --git a/blog/modular/bbcode/parser.js b/blog/modular/bbcode/parser.js
index a2a8282..3fde43b 100644
--- a/blog/modular/bbcode/parser.js
+++ b/blog/modular/bbcode/parser.js
@@ -18,6 +18,12 @@ const decodeStr = function( str )
;
};
+const IsToken = function( c )
+{
+ var v = c.charCodeAt( 0 );
+ return ( 97 <= v && v <= 122 ) || ( 48 <= v && v <= 57 );
+};
+
class BBCodeParser
{
constructor( scopeObj )
@@ -29,11 +35,6 @@ class BBCodeParser
})();
};
- static get stackMatch()
- {
- return /\[(*?)([\s\S]*?)\]([\s\S]*?)\[\/\1\]/ig;
- }
-
static get typeMatch()
{
return /(*)\=\"([^"]+)\"/ig;
@@ -41,7 +42,163 @@ class BBCodeParser
static bbcode_match( str, handler )
{
- return str && str.replace( BBCodeParser.stackMatch, handler );
+ // This function serve as the following RegEx:
+ // = /\[(*?)([\s\S]*?)\]([\s\S]*?)\[\/\1\]/ig;
+ // and because this regex is slow
+ // will now use iteration instead
+
+ var l = str && str.length;
+ var strOut = "";
+
+ var tokenOpened = "";
+ var tokenRCounter = 0;
+
+ var tContent = "";
+ var tOptions = "";
+ var cTTemp;
+
+ for( var i = 0; i < l; i ++ )
+ {
+ var c = str;
+
+ if( c == "[" )
+ {
+ var TokenClosing = "";
+ var mOptions = "";
+ var mToken = "";
+
+ cTTemp = str[ ++ i ];
+
+ if( cTTemp == "/" )
+ {
+ TokenClosing = cTTemp;
+ i ++;
+ }
+
+ // Get Token Name
+ for( cTTemp = str; i < l; i ++, cTTemp = str )
+ {
+ if( IsToken( cTTemp ) )
+ {
+ mToken += cTTemp;
+ continue;
+ }
+
+ break;
+ }
+
+ // No token opened but we got a closing token??!
+ if( !tokenOpened && TokenClosing )
+ {
+ throw new Error( "Syntax Error, Unexpected closing [/" + mToken + "]" );
+ }
+
+ // skip this if we didn't get a token
+ if( !mToken )
+ {
+ // BLABBLAB( "-------> " + cTTemp );
+ if( tokenOpened ) tContent += "[" + cTTemp;
+ else strOut += "[" + cTTemp;
+
+ continue;
+ }
+
+ // BLABBLAB( "Got Token: " + mToken );
+
+ // If said token is not the opened token
+ if( tokenOpened )
+ {
+ var tSkip = ( tokenOpened != mToken );
+
+ if( !tSkip )
+ {
+ if( TokenClosing && 0 < tokenRCounter )
+ {
+ tokenRCounter --;
+ tSkip = true;
+ }
+ // Nested token by the same name
+ // i.e. tokenOpened == mToken
+ else if( !TokenClosing )
+ {
+ tokenRCounter ++;
+ tSkip = true;
+ }
+ }
+
+ if( tSkip )
+ {
+ tContent += "[" + TokenClosing + mToken + "]";
+ continue;
+ }
+ }
+
+ // BLABBLAB( "After Token Name: \"" + cTTemp + "\"" );
+
+ // Closing token does not allow options to be set
+ if( TokenClosing && cTTemp == "]" && mToken == tokenOpened )
+ {
+ // BLABBLAB( "TokenClosed[" + mToken + "]" );
+ // BLABBLAB( "{{{" );
+ // BLABBLAB( tOptions );
+ // BLABBLAB( tContent );
+ // BLABBLAB( "}}}" );
+
+ strOut += handler( mToken, tOptions, tContent );
+
+ mToken
+ = tokenOpened
+ = tOptions
+ = tContent
+ = "";
+ continue;
+ }
+
+ // consume the token options
+ for( cTTemp = str; i < l; i ++, cTTemp = str )
+ {
+ // Unexpected open bracket, invalidate this matching
+ if( cTTemp == "[" )
+ {
+ // BLABBLAB( "\"" + mToken + "\" is in fact not a token" );
+
+ strOut += "[" + mToken;
+
+ // Step back and rematch from here
+ i --;
+
+ mToken
+ = tOptions
+ = tContent
+ = "";
+
+ break;
+ }
+ else if( cTTemp == "]" )
+ {
+ tokenOpened = true;
+ // BLABBLAB( "Token[" + mToken + "] Option: " + tOptions );
+ break;
+ }
+
+ tOptions += cTTemp;
+ }
+
+ tokenOpened = mToken;
+ }
+ else if( tokenOpened )
+ {
+ tContent += c;
+ }
+ else strOut += c;
+ }
+
+ if( tokenOpened )
+ {
+ throw new Error( "Syntax Error, Missing: [/" + tokenOpened + "]" );
+ }
+
+ return strOut;
}
static ParseProps( prop )
@@ -90,7 +247,7 @@ class BBCodeParser
var tokens = {};
// Tokenizer
- var __parse = function( matches, type, props, content )
+ var __parse = function( type, props, content )
{
var name = type.toLowerCase();
@@ -139,5 +296,4 @@ class BBCodeParser
}
}
-
module.exports = BBCodeParser;
The advantage of parsing the syntax manually allows me to detect wether there is a syntax error on the document. But I am not going to do that. From the above code I used a passive matching that just consumes everything unexpected ( such as brackets [] ) as content. However some valid but unmatched open / closing tags like [/this] will throw an exception.
But since the bbcodes are generated dynamically in AstroEdit. I shouldn't worried much about invalid syntax. Actually, I could just assumed that the syntax is always right!
Test result
root@debian:/home/penguin# ab -r -k -c 20 -n 250 http://blog.botanical.astropenguin.net/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking blog.botanical.astropenguin.net (be patient)
Completed 100 requests
Completed 200 requests
Finished 250 requests
Server Software: lighttpd/1.4.39
Server Hostname: blog.botanical.astropenguin.net
Server Port: 80
Document Path: /
Document Length: 17622 bytes
Concurrency Level: 20
Time taken for tests: 4.027 seconds
Complete requests: 250
Failed requests: 0
Keep-Alive requests: 250
Total transferred: 4471770 bytes
HTML transferred: 4405500 bytes
Requests per second: 62.08 [#/sec] (mean)
Time per request: 322.177 [ms] (mean)
Time per request: 16.109 [ms] (mean, across all concurrent requests)
Transfer rate: 1084.36 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 4.5 0 18
Processing: 47 317 558.0 198 3945
Waiting: 46 314 555.2 197 3929
Total: 47 318 561.5 198 3963
Percentage of the requests served within a certain time (ms)
50% 198
66% 218
75% 230
80% 240
90% 276
95% 740
98% 2984
99% 3952
100% 3963 (longest request)Damn! That was like 1900% performance boost! I am so happy with that.
Now the profiling:
[JavaScript]:
ticks total nonlib name
3 0.4% 0.4% Builtin: CallFunction_ReceiverIsNotNullOrUndefined
1 0.1% 0.1% Stub: StoreFieldStub
1 0.1% 0.1% Stub: LoadFieldStub
1 0.1% 0.1% Stub: LoadConstantStub
1 0.1% 0.1% Stub: FastNewContextStub
1 0.1% 0.1% LazyCompile: ~fs.Stats fs.js:144:20
1 0.1% 0.1% LazyCompile: ~emitTwo events.js:104:17
1 0.1% 0.1% LazyCompile: ~emit events.js:136:44
1 0.1% 0.1% LazyCompile: ~StringDecoder.write string_decoder.js:64:41
1 0.1% 0.1% LazyCompile: ~RoundRobinHandle.handoff cluster.js:186:46
1 0.1% 0.1% LazyCompile: ~<anonymous> cluster.js:197:55
1 0.1% 0.1% LazyCompile: *slice native string.js:242:21
1 0.1% 0.1% LazyCompile: *<anonymous> cluster.js:745:18
1 0.1% 0.1% Handler: An IC handler from the snapshot
[Summary]:
ticks total nonlib name
16 2.0% 2.0% JavaScript
783 95.5% 96.2% C++
2 0.2% 0.2% GC
6 0.7% Shared libraries
15 1.8% Unaccounted
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
656 80.0% syscall
28 3.4% node::ContextifyScript::New(v8::FunctionCallbackInfo<v8::Value> const&)
28 100.0% v8::internal::Builtin_HandleApiCallConstruct(int, v8::internal::Object**, v8::internal::Isolate*)
28 100.0% LazyCompile: ~runInThisContext node.js:346:28
28 100.0% LazyCompile: ~NativeModule.compile node.js:427:44
28 100.0% LazyCompile: ~NativeModule.require node.js:361:34
5 17.9% LazyCompile: ~startup node.js:12:19
5 17.9% Function: ~<anonymous> module.js:1:11
3 10.7% Function: ~<anonymous> util.js:1:11
3 10.7% Function: ~<anonymous> internal/child_process.js:1:11
2 7.1% LazyCompile: ~setupGlobalVariables node.js:215:32
2 7.1% LazyCompile: ~Module._load module.js:381:24
2 7.1% Function: ~<anonymous> stream.js:1:11
2 7.1% Function: ~<anonymous> cluster.js:1:11
1 3.6% LazyCompile: ~setupGlobalTimeouts node.js:243:31
1 3.6% LazyCompile: ~listenAfterLookup net.js:1393:29
1 3.6% Function: ~<anonymous> timers.js:1:11
1 3.6% Function: ~<anonymous> child_process.js:1:11Looks good to me. Note that I didn't even cached the article yet. Aaaand it seemed I don't have to cache it now:P
Fri Jun 03 2016 09:19:13 GMT+0000 (Coordinated Universal Time)
Last modified: Wed Jan 04 2017 02:14:28 GMT+0000 (Coordinated Universal Time)
Comments
No comments here.
Do you even comment?
website:
Not a valid website
Invalid email format
Please enter your email
*Name:
Please enter a name
Submit
抱歉,Google Recaptcha 服務被牆掉了,所以不能回覆了