Using XSLT 2.0 to remove HTML Tags from a Value

Building applications in SharePoint, as we do here at CorasWorks, often requires dealing with the output of what SharePoint calls “Rich Text” and “Enhanced Rich Text” fields. If you attempt any XSL transformation on XML data that includes HTML content, you run the risk of the HTML causing errors when the parser attempts to process it. One old trick is to use the “disable-output-escaping” attribute, but this can be limiting.

Enter XSLT 2.0, which is supported within the CorasWorks Application Service (CAPS), and a simple yet elegant solution is only six lines of XSL away!

Take an example column value you get from SharePoint for one of these rich text fields:

<div   class="FDRDS43543fSDF"><font size="3">This is what I <em>really </em>need to test –   does it <font color="#ff0000">strip</font>   out   the <strong>HTML </strong>in here…?</font></div>

For the purposes of your transformed output though, you really just need the raw text, stripped of all HTML tags; more like this:

This is what I really need to test – does it strip out the HTML in here…?

The way to achieve this result with XSLT 2.0 is:

<xsl:variable name="StripHTML"><![CDATA[<\s*\w.*?>|<\s*/\s*\w\s*.*?>]]></xsl:variable>
<xsl:analyze-string select="@ows_Body" regex="{$StripHTML}">
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>
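To try the snippet outside of a larger transform, it can be dropped into a complete stylesheet. The following is a minimal sketch, not the CAPS configuration itself: it assumes an XSLT 2.0 processor, plain-text output, and an input document in which some elements carry an ows_Body attribute (the `//*[@ows_Body]` selection is an assumption made for illustration).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Same expression as in the article: any opening or closing tag. -->
  <xsl:variable name="StripHTML"><![CDATA[<\s*\w.*?>|<\s*/\s*\w\s*.*?>]]></xsl:variable>

  <xsl:template match="/">
    <!-- Assumption: process every element that has an ows_Body attribute. -->
    <xsl:for-each select="//*[@ows_Body]">
      <xsl:analyze-string select="@ows_Body" regex="{$StripHTML}">
        <!-- Only the text between tags is written out. -->
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
      <!-- One line of output per field value. -->
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Selecting only elements that actually have the attribute keeps xsl:analyze-string from being handed an empty sequence, which XSLT 2.0 processors treat as an error.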

The analyze-string element applies a regular expression (set via the “regex” attribute) to an input string (set via the “select” attribute). Every substring of the input then falls into one of two buckets: those that match the regular expression (handled by xsl:matching-substring) and those that do not (handled by xsl:non-matching-substring).

In this use, the regular expression catches any opening or closing HTML tag, regardless of name, attributes, etc. Because no instruction is defined for the matching substrings, they are effectively dropped. The non-matching substrings are looped through and output as-is, leaving a nicely scrubbed value, devoid of any HTML tags.
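Dropping the tags leaves the surrounding whitespace untouched, so runs of spaces in the source value survive into the output. One possible variation, sketched below under stated assumptions, handles both buckets: each matched tag is replaced with a single space, and the joined result is passed through normalize-space(). The function name local:strip-tags, the urn:example:local namespace, and the `//*[@ows_Body]` selection are hypothetical and used only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="xs local">

  <xsl:output method="text"/>

  <!-- Same expression as in the article: any opening or closing tag. -->
  <xsl:variable name="StripHTML"><![CDATA[<\s*\w.*?>|<\s*/\s*\w\s*.*?>]]></xsl:variable>

  <!-- Hypothetical helper: returns the text of $html with tags removed. -->
  <xsl:function name="local:strip-tags" as="xs:string">
    <xsl:param name="html" as="xs:string"/>
    <xsl:variable name="pieces" as="xs:string*">
      <xsl:analyze-string select="$html" regex="{$StripHTML}">
        <!-- A tag becomes one space so adjacent words stay separated. -->
        <xsl:matching-substring>
          <xsl:sequence select="' '"/>
        </xsl:matching-substring>
        <!-- Text between tags is kept unchanged. -->
        <xsl:non-matching-substring>
          <xsl:sequence select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <!-- Collapse runs of whitespace and trim both ends. -->
    <xsl:sequence select="normalize-space(string-join($pieces, ''))"/>
  </xsl:function>

  <xsl:template match="/">
    <!-- Assumption: process every element that has an ows_Body attribute. -->
    <xsl:for-each select="//*[@ows_Body]">
      <xsl:value-of select="local:strip-tags(@ows_Body)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The trade-off is that a tag inside a word (for example, a single bolded letter) would introduce a space there, so this variation suits values where tags fall on word boundaries.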

The new analyze-string element in XSLT 2.0 is a useful and powerful upgrade over XSLT 1.0, and one we’re pleased to have access to within SharePoint thanks to CAPS!
