Building a better Tag-Parser

The Problem

If you read the code of my Extensible BBCode module for [[Drupal]], you'll notice that my tag parsing algorithm is kind of complex. In two steps, the tags are first "paired" (inserting a matching ID into the opening and closing tag) and then "rendered", all by an evaluated regular expression that calls another function. The process certainly ensures that all tags are balanced - and it even allows some tags to be rendered before others are, regardless of nesting.

But for all its voodoo magic and messing about, it is vulnerable to the same bug as most primitive BBCode parsers. Namely, allowing input that will result in this output:

Now, the [[quirks mode]] of all [[layout engines]] are very good at dealing with simple cases like that. Inline markup will be rendered correctly even with overlapping elements.

But what if it includes block elements? In [[phpBB]], I recently noticed that it is extremely easy to disrupt the page layout using just two simple tags; [spoiler] and [quote].

The following overlapping input will be accepted:

The following output will (roughly) be generated:

Unlike the inline markup, several major layout engines ([[Gecko (layout engine)|]], [[WebKit]] and [[Presto (layout engine)|]] are the ones tested) take issue with this. The quirk mode behavior in all three drops the three unmatched div tags inside the blockquote element, and then applies the three closing div tags to parent div elements higher up in the tree. This naturally wreaks havoc on the page. (I'd almost consider this a mild denial-of-service vulnerability because any malicious user can break the page layout.)

XBBCode is vulnerable to the same, because in spite of all its careful "pair matching", it never actually checks that the tag tree is nested correctly.

The Algorithm

So now I'm working on a much simpler and in my opinion more stable approach. I can only suppose it hadn't occurred to me earlier because I was thinking that BBCode had to be rendered "in-place", replacing each tag in the string where it was found. It seems that a better approach is to first find all tags, and then gradually build the output from substrings of the original text.

In Pseudocode (the actual PHP code is not as clean, alas):

You can see that even though the algorithm and the data it operates on is iterative, the text is rendered recursively from the inside out: Whenever a tag is closed, it is rendered and the output appended to the parent element's content.

If the tag being closed lies further down the stack, then all intermediate unclosed tags are discarded (though any complete tags inside them are kept). In the end, the same is considered to happen to the virtual "root" tag, that encloses the entire input - unclosed tags are discarded. (Note that "discarded" means "displayed in the output as unrendered BBCode", as is customary.)

The Costs

The first and greatest downside to this method is that tags cannot be weighted anymore. Every tag renderer will receive its content with all nested BBCode tags already rendered. Deferring the rendering of a tag is just not feasible in this algorithm.

However, this can be trivially circumvented by using a feature that is a lot easier to implement. Tags can be given a "nocode" property, which will tell this algorithm to ignore any tags nested inside it. The tag renderer could then independently run the BBCode filter on its content to render the tags that were left unrendered earlier.

(I'm not yet at the point where a dangling nocode tag will not stop enclosed tags from being rendered. It's not trivial, alas.)

News Category:

Drupal

Keywords:

xbbcode

Building a better Tag-Parser

The Problem

The Algorithm

The Costs

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112