XML Schema Regular Expressions
The W3C XML Schema standard defines its own regular expression flavor. You can use it in the pattern facet of simple type definitions in your XML schemas. For example, the following defines the simple type “SSN”, using a regular expression to require the element to contain text in the format of a US social security number.
<xsd:simpleType name="SSN">
  <xsd:restriction base="xsd:token">
    <xsd:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
  </xsd:restriction>
</xsd:simpleType>
Compared with other regular expression flavors, the XML schema flavor is quite limited in features. Since it’s only used to validate whether an entire element matches a pattern or not, rather than for extracting matches from large blocks of data, you won’t really miss the features often found in other flavors. The limitations allow schema validators to be implemented with efficient text-directed engines.
Particularly noteworthy is the complete absence of anchors like the caret and dollar, word boundaries, and lookaround. XML schema always implicitly anchors the entire regular expression. The regex must match the whole element for the element to be considered valid. If you have the pattern regexp, the XML schema validator will apply it in the same way as, say, Perl, Java or .NET would apply the pattern ^regexp$. If you want to accept all elements that have regex anywhere in their contents, you’ll need to use the regular expression .*regex.* instead. The two .* tokens expand the match to cover the whole element, assuming it doesn’t contain line breaks. If you want to allow line breaks, you can use something like [\s\S]*regex[\s\S]* instead. Combining a shorthand character class with its negated version results in a character class that matches anything.
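The implicit anchoring can be approximated outside a schema validator. The sketch below uses Python’s re.fullmatch as a stand-in; Python’s regex flavor is not the XML Schema flavor, and the helper name xsd_like_match is made up for illustration, so treat this as a demonstration of the anchoring behavior only.

```python
import re

def xsd_like_match(pattern: str, text: str) -> bool:
    """Approximate XML Schema's implicit anchoring: the whole text must match."""
    return re.fullmatch(pattern, text) is not None

ssn = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

# The whole value matches the pattern, so it is accepted.
print(xsd_like_match(ssn, "123-45-6789"))

# Extra text around the number is rejected, as if the pattern were ^...$.
print(xsd_like_match(ssn, "SSN: 123-45-6789"))

# Wrapping the regex in .* accepts it anywhere on a single line...
print(xsd_like_match(r".*regex.*", "a regex here"))

# ...but the dot stops at a line break, so this value is rejected...
print(xsd_like_match(r".*regex.*", "a regex\nhere"))

# ...while [\s\S]* also spans line breaks.
print(xsd_like_match(r"[\s\S]*regex[\s\S]*", "a regex\nhere"))
```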
XML schemas do not provide a way to specify matching modes. The dot never matches line breaks, and patterns are always applied case sensitively. If you want to apply literal case insensitively, you’ll need to rewrite it as [lL][iI][tT][eE][rR][aA][lL].
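Writing out such a rewrite by hand gets tedious for longer words. The following is a minimal sketch of a generator for it, written in Python. The function name case_insensitive_literal and the set of characters it escapes are assumptions made for this example, not part of any XML library.

```python
import re

# Characters assumed to need a backslash when used literally outside a class.
XSD_META = set("\\|.?*+(){}[]")

def case_insensitive_literal(literal: str) -> str:
    """Turn a literal string into a pattern that matches it in any letter case."""
    parts = []
    for ch in literal:
        lower, upper = ch.lower(), ch.upper()
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            # Cased letter: list both forms in a character class.
            parts.append(f"[{lower}{upper}]")
        elif ch in XSD_META:
            # Metacharacter: escape it so it is taken literally.
            parts.append("\\" + ch)
        else:
            # Digits, punctuation and uncased characters stay as they are.
            parts.append(ch)
    return "".join(parts)

pattern = case_insensitive_literal("literal")
print(pattern)

# Sanity check with Python's own engine, which accepts the same syntax here.
print(re.fullmatch(pattern, "LiTeRaL") is not None)
```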
XML regular expressions don’t have any tokens like \xFF or \uFFFF to match particular (non-printable) characters. You have to add them as literal characters to your regex. If you are entering the regex into an XML file using a plain text editor, then you can use XML numeric character references such as &#xFF;. Otherwise, you’ll need to paste in the characters from a character map.
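If you generate schemas from a script, you can produce XML numeric character references of the form &#xHH; instead of pasting characters. The sketch below is one way to do that in Python. The helper name char_ref and the bare pattern element are made up for this illustration, and the XML parser from the standard library is used only to confirm that the references turn back into the intended characters.

```python
import xml.etree.ElementTree as ET

def char_ref(ch: str) -> str:
    """Return the XML hexadecimal character reference for one character."""
    return f"&#x{ord(ch):X};"

# A character class holding a no-break space (U+00A0) and y with diaeresis (U+00FF).
pattern_value = "[" + char_ref("\u00A0") + char_ref("\u00FF") + "]+"
print(pattern_value)

# Hypothetical stand-in for a pattern facet, without namespaces.
xml_snippet = f'<pattern value="{pattern_value}"/>'

# The XML parser resolves the references before the regex engine sees them.
parsed_value = ET.fromstring(xml_snippet).get("value")
print(parsed_value == "[\u00A0\u00FF]+")
```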
Lazy quantifiers are not available. Since the pattern is anchored at the start and the end of the subject string anyway, and only a success/failure result is returned, the only potential difference between a greedy and lazy quantifier would be performance. You can never make a fully anchored pattern match or fail by changing a greedy quantifier into a lazy one or vice versa.
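You can see this for yourself in a flavor that does have lazy quantifiers. The sketch below uses Python, with re.fullmatch standing in for the implicit anchoring of XML Schema. The sample strings are arbitrary.

```python
import re

greedy = r"<.+>"
lazy = r"<.+?>"
samples = ["<a>", "<a><b>", "<a", "a>", "<>"]

# Collect the accept/reject verdict of each pattern on each sample.
greedy_verdicts = [re.fullmatch(greedy, s) is not None for s in samples]
lazy_verdicts = [re.fullmatch(lazy, s) is not None for s in samples]

# When the whole string must match, both quantifiers give the same verdicts.
print(greedy_verdicts == lazy_verdicts)

# Without anchoring, the two quantifiers find different matches.
unanchored_greedy = re.search(greedy, "<a><b>").group()
unanchored_lazy = re.search(lazy, "<a><b>").group()
print(unanchored_greedy, unanchored_lazy)
```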
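XML Schema regular expressions support character classes (including shorthand classes, ranges, and negated classes), character class subtraction, the dot, alternation, grouping, greedy quantifiers, and Unicode properties and blocks.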
Note that the regular expression functions available in XQuery and XPath use a different regular expression flavor. This flavor is a superset of the XML Schema flavor described here. It adds some of the features that are available in many modern regex flavors, but not in the XML Schema flavor.
XML Character Classes
Despite its limitations, XML schema regular expressions introduce two handy features. The special shorthand character classes \i and \c make it easy to match XML names. No other regex flavor supports these.
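In a flavor without these shorthands, you have to spell out the character classes yourself. The sketch below is an ASCII-only approximation in Python: it assumes \i is replaced with letters, underscore and colon, and \c with those plus digits, dot and hyphen. The real \i and \c also cover many non-ASCII characters, which this example deliberately ignores.

```python
import re

# ASCII-only stand-ins for \i (name start) and \c (name character).
NAME_START = r"[A-Za-z_:]"
NAME_CHAR = r"[A-Za-z0-9_:.\-]"

# Rough equivalent of the XML Schema pattern \i\c* for ASCII names.
ascii_xml_name = re.compile(NAME_START + NAME_CHAR + "*")

def looks_like_xml_name(text: str) -> bool:
    """Check whether the whole text is an ASCII XML name."""
    return ascii_xml_name.fullmatch(text) is not None

print(looks_like_xml_name("xsd:simpleType"))  # letters and a colon
print(looks_like_xml_name("my-name.v2"))      # hyphen, dot and digit after the start
print(looks_like_xml_name("9lives"))          # a digit cannot start a name
```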
Character class subtraction makes it easy to match a character that is in a certain list, but not in another list. For example, [a-z-[aeiou]] matches a lowercase English consonant. This feature is now also available in the .NET regex engine. It is particularly handy when working with Unicode properties. For example, [\p{L}-[\p{IsBasicLatin}]] matches any letter that is not an English letter.
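Flavors without character class subtraction need a workaround for the same effect. The sketch below shows two in Python, whose re module supports neither subtraction nor \p{...}: a negative lookahead for the consonant example, and a plain function for the Unicode example. The function name is made up for illustration, and it assumes that str.isalpha is an acceptable substitute for \p{L}.

```python
import re

# Emulates [a-z-[aeiou]]: a lowercase letter that is not a vowel.
consonant = re.compile(r"(?![aeiou])[a-z]")

def is_consonant(ch: str) -> bool:
    return consonant.fullmatch(ch) is not None

def is_non_basic_latin_letter(ch: str) -> bool:
    """Emulates [\\p{L}-[\\p{IsBasicLatin}]] for a single character."""
    # Basic Latin is the block U+0000 to U+007F.
    return len(ch) == 1 and ch.isalpha() and ord(ch) > 0x7F

print(is_consonant("b"), is_consonant("a"))
print(is_non_basic_latin_letter("\u00e9"), is_non_basic_latin_letter("e"))
```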