Re: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions


Subject: Re: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions
From: John Besch <jbesch@xxxxxxx>
Date: Mon, 12 Jun 2006 15:25:34 -0400

> How, for example, to use a useful syntax like
>   matches(.,'\p{Script:Arabic}+') ?
>
>schema-2 says: http://www.w3.org/TR/xmlschema-2/#regexs
>
>[Definition:] [Unicode Database] groups code points into a number of
>blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul
>Jamo, CJK Compatibility, etc. The set containing all characters that
>have block name X (with all white space stripped out), can be identified
>with a block escape \p{IsX}. The complement of this set is specified
>with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).
>...
>For example,
>the "block escape" for identifying the ASCII characters is \p{IsBasicLatin}.
>
>so that would be \p{IsArabic}
>
>David
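
To check my reading of the quoted passage, here is a minimal sketch of the block-escape syntax, using only the two blocks named above (Arabic and Basic Latin). The template name "main" and the sample strings are my own placeholders, and I am assuming an XSLT 2.0 processor that can be started from a named template. Since matches() succeeds if the pattern matches anywhere in the string, the patterns are anchored with ^ and $ so that every character has to fall inside the block.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: exercises the \p{IsX} block escapes from the quoted passage. -->
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- "main" is a hypothetical entry point; run it as the initial template. -->
  <xsl:template name="main">
    <!-- ASCII sample string tested against the Basic Latin block. -->
    <xsl:text>BasicLatin: </xsl:text>
    <xsl:value-of select="matches('abc', '^\p{IsBasicLatin}+$')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Four letters from the Arabic block (U+0633, U+0644, U+0627, U+0645),
         written as character references to avoid encoding problems. -->
    <xsl:text>Arabic: </xsl:text>
    <xsl:value-of select="matches('&#x633;&#x644;&#x627;&#x645;', '^\p{IsArabic}+$')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Complement form: \P{IsX} matches any character outside block X. -->
    <xsl:text>Outside BasicLatin: </xsl:text>
    <xsl:value-of select="matches('&#x633;', '^\P{IsBasicLatin}$')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Without the ^ and $ anchors, a test such as matches(., '\p{IsBasicLatin}+') is true for any string that contains at least one ASCII character, whatever else is in it.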



I want to use the above construct to detect Japanese characters, so I am using the
following XSLT stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="xml" indent="yes" encoding="UTF-8" />
     <xsl:template match="/text">
        <xsl:for-each select="tokenize(.,'\s+')">
          <word>
            <xsl:attribute name="language">
              <xsl:choose>
                 <xsl:when test="matches(.,'\p{IsCJKCompatibility}+')">Japanese</xsl:when>
                 <xsl:when test="matches(.,'\p{IsBasicLatin}+')">Latin</xsl:when>
                 <xsl:otherwise>Unknown</xsl:otherwise>
              </xsl:choose>
            </xsl:attribute>
          </word>
        </xsl:for-each>
     </xsl:template>
</xsl:stylesheet>

However, the Japanese characters in my input, which are encoded in UTF-8, come out flagged as Latin
or Unknown.  What am I doing wrong?  How do I get this to recognize the Japanese characters?

Thanks for any help you can offer.

John Besch
