Character Class Intersection
Character class intersection is supported by Java, and by Ruby 1.9 and later. It makes it easy to match any single character that must be present in two sets of characters. The syntax for this is [class&&[intersect]]
. You can use the full character class syntax within the intersected character class.
If the intersected class does not need a negating caret, then Java and Ruby allow you to omit the nested square brackets: [class&&intersect]
.
The character class [a-z&&[^aeiuo]]
matches a single letter that is not a vowel. In other words: it matches a single consonant. Without character class subtraction or intersection, the only way to do this would be to list all consonants: [b-df-hj-np-tv-z]
.
The character class [\p{Nd}&&[\p{IsThai}]]
matches any single Thai digit. [\p{IsThai}&&[\p{Nd}]]
does exactly the same.
Intersection of Multiple Classes
You can intersect the same class more than once. [0-9&&[0-6&&[4-9]]]
is the same as [4-6]
as those are the only digits present in all three parts of the intersection. In Java and Ruby you can write the same regex as [0-9&&[0-6]&&[4-9]]
, [0-9&&[0-6&&4-9]]
, [0-9&&0-6&&[4-9]]
, or just [0-9&&0-6&&4-9]
. The nested square brackets are only needed if one of the parts of the intersection is negated.
If you do not use square brackets around the right hand part of the intersection, then there is no confusion that the entire remainder of the character class is the right hand part of the intersection. If you do use the square brackets, you could write something like [0-9&&[12]56]
. In Ruby, this is the same as [0-9&&1256]
. But Java has bugs that cause it to treat this as [0-9&&56]
, completely ignoring the nested brackets.
The characters 56
in [0-9&&[12]56]
are an error. This way there is no ambiguity about their meaning.
You also shouldn’t put && at the very start or very end of the regex. Ruby treats [0-9&&]
and [&&0-9]
as intersections with an empty class, which matches no characters at all. Java ignores leading and trailing && operators.
Intersection in Negated Classes
The character class [^1234&&[3456]]
is both negated and intersected. In Java, negation takes precedence over intersection. Java reads this regex as “(not 1234) and 3456”. Thus in Java this class is the same as [56]
and matches the digits 5 and 6. In Ruby, intersection takes precedence over negation. Ruby reads [^1234&&3456]
as “not (1234 and 3456)”. Thus in Ruby this class is the same as [^34]
which matches anything except the digits 3 and 4.
If you want to negate the right hand side of the intersection, then you must use square brackets. Those automatically control precedence. So Java and Ruby all read [1234&&[^3456]]
as “1234 and (not 3456)”. Thus this regex is the same as [12]
.
Notational Compatibility with Other Regex Flavors
The ampersand has no special meaning in character classes in any other regular expression flavors discussed in this tutorial. The ampersand is simply a literal, and repeating it just adds needless duplicates. All these flavors treat [1234&&3456]
as identical to [&123456]
.
Strictly speaking, this means that the character class intersection syntax is incompatible with the majority of other regex flavors. But in practice there’s no difference, because there is no point in using two ampersands in a character class when you just want to add a literal ampersand. A single ampersand is still treated as a literal by Java and Ruby.