Two tricky techniques for preserving character entities in XSLT 2.0

Thanks to a recent story by Bob DuCharme for XML.com, entitled "Entity and Character References," whose focus is XSLT 2.0, I found myself pondering a problem typical for those who take XML documents through multiple parsers while working through various transformations or operations. DuCharme succinctly observes that a parser's job is to take entity references (in SGML, those symbolic names that start with an ampersand and end with a semicolon, like the character entities &amp; for ampersand and &lt; for the less-than symbol) and replace them with their values. Trouble is, if you're trying to create output that needs and expects character entities in the final document, you're in a bit of a pickle if a parser somewhere early in the chain replaces &amp; with "&" and &lt; with "<".

But there is a two-step maneuver that makes this relatively easy to handle, without having to store those items as unparsed character data in CDATA sections or resort to XSLT's disable-output-escaping attribute. Step number one is to use numeric character references rather than character entities in the source -- that is, &#38; rather than &amp; and &#60; rather than &lt; (38 and 60 are the code points for those characters in ISO-Latin-1, and in Unicode as well) -- so that you can use XSLT to transform them exactly as you wish during a final editing pass (or at least, in a pass that follows the last parser that might otherwise make substitutions you don't want).

Step number two depends on the character map feature in XSLT 2.0, which lets you tell the serializer to write a string of your choosing whenever a specific character appears in the output. A parser resolves a numeric character reference to the character it names without consulting any entity declarations, so that character travels through the transformation intact, and the character map can turn it into a character entity just as the result is written out. A character map basically defines a substitution table for the XSLT processor: when it finds a listed character while serializing the result tree, it writes the corresponding replacement string instead, with no further escaping. Thus, the following example:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output use-character-maps="num2ent"/>

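  <!-- Replacement strings are written to the output without escaping, so the
       ampersand that starts each entity is itself written as &amp; here. -->
  <xsl:character-map name="num2ent">
    <xsl:output-character character="&#38;" string="&amp;amp;"/>
    <xsl:output-character character="&#60;" string="&amp;lt;"/>
  </xsl:character-map>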

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

This stylesheet does nothing more than copy the entire source tree to the result tree verbatim, except that on output it replaces the two characters specified with the desired character entities. Obviously, thanks to Mr. DuCharme, you can grab this code and add whatever <xsl:output-character...> replacements you want, and you've got a handy-dandy tool. This is particularly useful when you have to run content through other applications (like MS Office components) that may not perform entirely sensible replacements for you, or when you want to create markup as final output (something anybody who teaches markup must do all the time). Very handy indeed!
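
As a sketch of that kind of extension, the stylesheet below applies the same identity transform with two different mappings. The choice of characters (a non-breaking space and an e-acute) and the map name latin2ent are assumptions made for illustration, not part of DuCharme's example. Each replacement string starts with &amp;amp; because the parser that reads the stylesheet resolves that to a bare ampersand, and the character map then writes the string to the output without escaping it.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="xml" use-character-maps="latin2ent"/>

  <!-- Hypothetical additions: at serialization time each listed character
       is replaced by the literal string given, with no escaping applied. -->
  <xsl:character-map name="latin2ent">
    <xsl:output-character character="&#160;" string="&amp;nbsp;"/>
    <xsl:output-character character="&#233;" string="&amp;eacute;"/>
  </xsl:character-map>

  <!-- Identity template: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Run through an XSLT 2.0 processor against an input such as <p>caf&#233;&#160;au lait</p>, this should write the paragraph as <p>caf&eacute;&nbsp;au lait</p>. Because nbsp and eacute are not predefined in XML, whatever reads that output needs a DTD that declares them (the XHTML DTDs do), or the result will not parse.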


Ed Tittel is a writer, trainer, and consultant based in Austin, TX, who writes and teaches on XML and related vocabularies and applications. E-mail Ed at etittel@lanw.com.


This was first published in July 2004
