Matching Floating Point Numbers with a Regular Expression
This example shows how you can avoid a common mistake often made by people inexperienced with regular expressions. As an example, we will try to build a regular expression that can match any floating point number. Our regex should also match integers and floating point numbers where the integer part is not given. We will not try to match numbers with an exponent, such as 1.5e8 (150 million in scientific notation).
At first thought, the following regex seems to do the trick: [-+]?[0-9]*\.?[0-9]*
This defines a floating point number as an optional sign, followed by an optional series of digits (integer part), followed by an optional dot, followed by another optional series of digits (fraction part).
Spelling out the regex in words makes it obvious: everything in this regular expression is optional. This regular expression considers a sign by itself or a dot by itself as a valid floating point number. In fact, it even considers an empty string as a valid floating point number. If you tried to use this regex to find floating point numbers in a file, you’d get a zero-length match at every position in the text where no floating point number occurs.
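As a rough illustration of this problem, here is a minimal Python sketch using the standard `re` module. The sample strings and variable names are assumptions chosen for the demonstration and do not come from the text.

```python
import re

# The naive pattern from the text: every element is optional.
naive = re.compile(r"[-+]?[0-9]*\.?[0-9]*")

# A lone sign, a lone dot, and even the empty string are accepted in full.
accepted = [s for s in ["", "+", ".", "3.14"] if naive.fullmatch(s)]

# Searching longer text yields zero-length matches where no number starts.
matches = naive.findall("abc 1.5")
empty_count = matches.count("")

print(accepted)
print(matches)
print(empty_count)
```

The real number is found, but it is surrounded by empty matches that any calling code would have to filter out.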
Not escaping the dot is also a common mistake. A dot that is not escaped matches any character, including a dot. If we had not escaped the dot, both 4.4 and 4X4 would be considered floating point numbers.
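The effect of the missing backslash can be checked directly. This is a small Python sketch under the assumption that whole strings are tested with `re.fullmatch`. The names `escaped` and `unescaped` are illustrative only.

```python
import re

# \. matches only a literal dot.
escaped = re.compile(r"[-+]?[0-9]*\.?[0-9]*")
# An unescaped . matches any character (in Python, any except a newline by default).
unescaped = re.compile(r"[-+]?[0-9]*.?[0-9]*")

for text in ["4.4", "4X4"]:
    print(text, bool(escaped.fullmatch(text)), bool(unescaped.fullmatch(text)))
```

Only the unescaped version accepts 4X4, because the X is consumed by the dot.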
When creating a regular expression, it is more important to consider what it should not match than what it should. The above regex does match a proper floating point number in full, because the quantifiers are greedy and consume as many digits as they can. But it also matches many things we do not want, which we have to exclude.
Here is a better attempt: [-+]?([0-9]*\.[0-9]+|[0-9]+)
This regular expression matches an optional sign followed by one of two alternatives: zero or more digits, a dot, and one or more digits (a floating point number with an optional integer part), or one or more digits on their own (an integer).
This is a far better definition. Any match must include at least one digit. There is no way around the [0-9]+ part. We have successfully excluded the matches we do not want: those without digits.
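A quick way to gain confidence in this definition is to test it against strings that should and should not be accepted. The sketch below assumes Python's `re` module. The two sample lists are made up for illustration.

```python
import re

better = re.compile(r"[-+]?([0-9]*\.[0-9]+|[0-9]+)")

# Illustrative samples, not taken from the text.
should_match = ["3.14", ".5", "42", "-7", "+0.25"]
should_not_match = ["", "+", ".", "-.", "5."]

results = {s: bool(better.fullmatch(s)) for s in should_match + should_not_match}

for sample, ok in results.items():
    print(repr(sample), ok)
```

Note that 5. is rejected as well: the dot in this regex must be followed by at least one digit.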
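We can optimize this regular expression as: [-+]?[0-9]*\.?[0-9]+
It accepts the same strings as the version with alternation: when the dot is absent, [0-9]* followed by [0-9]+ still requires at least one digit.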
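One way to check that the shorter regex accepts the same strings as the alternation is a brute-force comparison over every short string built from a small alphabet. This Python sketch is only a sanity check under those assumptions (alphabet `+-.01`, lengths 0 through 4), not a proof.

```python
import re
from itertools import product

alternation = re.compile(r"[-+]?([0-9]*\.[0-9]+|[0-9]+)")
optimized = re.compile(r"[-+]?[0-9]*\.?[0-9]+")

alphabet = "+-.01"   # assumed test alphabet: signs, a dot, and two digits
disagreements = []
for length in range(5):  # lengths 0 through 4
    for chars in product(alphabet, repeat=length):
        candidate = "".join(chars)
        # Record any string that one pattern accepts and the other rejects.
        if bool(alternation.fullmatch(candidate)) != bool(optimized.fullmatch(candidate)):
            disagreements.append(candidate)

print(disagreements)  # an empty list means the two patterns agreed on every candidate
```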
If you also want to match numbers with exponents, you can use: [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
Notice how I made the entire exponent part optional by grouping it together, rather than making each element in the exponent optional.
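To see why the grouping matters, the sketch below compares the grouped regex with a hypothetical variant in which each exponent element is optional on its own. The `loose` pattern and the sample strings are assumptions made for this comparison and do not appear in the text.

```python
import re

grouped = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")
# Hypothetical counterexample: each exponent element is optional separately.
loose = re.compile(r"[-+]?[0-9]*\.?[0-9]+[eE]?[-+]?[0-9]*")

samples = ["1.5e8", "2E-3", "-.5e+10", "1.5e", "1.5+3"]

# Map each sample to (accepted by grouped, accepted by loose).
table = {s: (bool(grouped.fullmatch(s)), bool(loose.fullmatch(s))) for s in samples}

for sample, (g, l) in table.items():
    print(sample, g, l)
```

The loose variant accepts 1.5e and 1.5+3 as complete numbers, while the grouped regex only allows an exponent that has its e and at least one digit.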
Finally, if you want to validate whether a particular string holds a floating point number, rather than find a floating point number within longer text, you’ll have to anchor your regex: ^[-+]?[0-9]*\.?[0-9]+$
or ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$
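The difference between finding and validating can be sketched in Python as follows. The helper name `is_float` and the sample inputs are hypothetical. One Python-specific detail worth knowing: `$` also matches just before a newline at the very end of the string, so `re.fullmatch` is the stricter choice when that matters.

```python
import re

FLOAT = r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?"

finder = re.compile(FLOAT)                   # finds a number inside longer text
validator = re.compile("^" + FLOAT + "$")    # anchored: the string must be a number


def is_float(text: str) -> bool:
    """Hypothetical helper: True if the anchored regex matches the string."""
    return validator.search(text) is not None


found = finder.search("price: 3.50 USD")
print(found.group())                  # the number located inside the text
print(is_float("3.50"))               # a bare number validates
print(is_float("price: 3.50 USD"))    # surrounding text is rejected
print(is_float("3.50\n"))             # in Python, $ tolerates one trailing newline
print(finder.fullmatch("3.50\n"))     # fullmatch does not
```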