Javascript Tutorial -- Regular Expressions in Javascript



In the next lesson, we will discuss forms, which are a way of obtaining user input in Javascript. Unfortunately, there are a number of people on-line who like to find ways to break web applications for fun, fame or money -- or more often, some combination of these. The competent web developer, therefore, must always have security in mind when writing applications that will be used on public-facing (or even internal) web sites.

On the front lines of web application security is user input validation. Buffer overflows, SQL injection and cross-site scripting are just some of the many ways to break a web app through user input. Fortunately, it is not too difficult to prevent many of these attacks through a tool called "regular expressions".

Regular expressions, or "regexes" as they are frequently called, are a pattern matching tool that has been used by sys admins and developers on Unix and Unix-like operating systems since the dawn of time (January 1st, 1970, IIRC). The idea behind using a regex to help secure your scripts is that it is possible to describe in somewhat generic terms the type of data you expect to receive from your users. For example, a date should always consist of a month (a number between 1 and 12 that may or may not contain a leading zero), a slash ("/"), a day (a number between 1 and 31 that may or may not contain a leading zero -- we'll ignore that some months only have 30 days and that February only has 28 or 29 days for now), another slash, and a two- or four-digit year. It is possible to create a regular expression to search for this combination of numbers and slashes, and we will do so later. For now, let's start with something a little easier.

Suppose you had an array of sentences, and you wanted to print any of the sentences that contained the word "cat". The following regex would match "cat":
        /cat/
      
OK, that was easy enough. Do you think this pattern would match the word "caaaaat"? Nope -- regexes are very literal. The pattern above is looking for some sequence of characters that contains the letter "c" followed immediately by the letter "a" followed immediately by the letter "t". What about the word "catatatat"? Yep -- while the word "catatatat" contains several additional letters after the substring "cat", it still contains the exact sequence we were looking for at the very beginning of the word.
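Here is a minimal sketch of the "array of sentences" idea. The sentences are made-up sample data, and the "match" method used here is the one covered near the end of this lesson; it returns null when the pattern is not found.

```javascript
// Hypothetical sample data; any array of strings would do.
var sentences = [
  "The cat sat on the mat.",
  "The dog barked all night.",
  "Please concatenate these strings.",
  "I saw a caaaaat."
];

var found = [];
for (var i = 0; i < sentences.length; i++) {
  // match() returns null when /cat/ does not occur anywhere in the string
  if (sentences[i].match(/cat/) !== null) {
    found.push(sentences[i]);
  }
}

console.log(found);
```

Note that "concatenate" counts as a match, because the letters "c", "a", "t" appear together inside it, while "caaaaat" does not.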

What if you wanted to find the word "cat" only at the beginning or only at the end of the sentence? There are a number of special characters defined in most regex implementations, including the so-called "anchor characters" "^" and "$". The "^" character anchors the pattern to the beginning of the string and the "$" anchors it to the end. So, the regex...:
        /^cat/
      
...would find the word "cat" at the beginning of a string, and the regex...:
        /cat$/
      
...would find it at the end of a string. What do you suppose would happen if you used the pattern...:
        /^cat$/
      
...to match something? You guessed it -- this regex would match a string composed solely of the word "cat". This is actually a very useful regex (okay, not the "cat" part per se, but some pattern bounded by these two anchor characters), as we'll see a little later in this lesson.
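To see the three anchored patterns side by side, here is a small sketch. The function name "anchorReport" and the sample words are my own inventions for illustration.

```javascript
// Hypothetical helper: reports which of the three anchored patterns match.
function anchorReport(text) {
  return {
    start: text.match(/^cat/) !== null,  // "cat" at the beginning
    end: text.match(/cat$/) !== null,    // "cat" at the end
    exact: text.match(/^cat$/) !== null  // nothing but "cat"
  };
}

var samples = ["cat", "catalog", "bobcat", "concatenate"];
for (var i = 0; i < samples.length; i++) {
  var r = anchorReport(samples[i]);
  console.log(samples[i] + ": start=" + r.start + ", end=" + r.end + ", exact=" + r.exact);
}
```

Only "cat" passes all three tests; "catalog" passes the first, "bobcat" passes the second, and "concatenate" passes none, even though it contains "cat" in the middle.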

Unfortunately, when validating user input, you typically aren't looking for specific words in your input data. Most of the time, you are looking for data that is formatted in a specific way -- like all letters, all lower (or upper) case letters, all digits, a three digit sequence followed by a hyphen followed by a four digit sequence, etc. Fortunately, there is a way to encode these types of patterns in a regex too, using what is called a "character class".

A character class is a pattern enclosed in square brackets. For example, this regex would match any single upper or lower case letter:
        /^[a-zA-Z]$/
      
...and this regex would match zero or more numeric digits:
        /^[0-9]*$/
      
Did you notice the asterisk after the numeric character class? That is what allows you to match zero or more occurrences of the preceding character class. If you want to specify that there must be at least one digit present in the pattern, use a plus sign instead. Finally, instead of an asterisk or a plus sign, you can use a number enclosed in curly braces ("{ }") to require a specific number of digits, or you can use the form "{x,y}" (with no space after the comma) to require at least x and no more than y occurrences of the preceding character class. Here are some examples to help clarify:
        //match at least one letter a-z or A-Z
        /^[a-zA-Z]+$/

        //match at least 2 and no more than 5 digits
        /^[0-9]{2,5}$/

        //match exactly 17 letters and/or digits
        /^[a-zA-Z0-9]{17}$/
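If you want to try these patterns against some input, something like the following sketch should work. The helper name "matches" and the test strings are assumptions of mine, not part of the lesson.

```javascript
// Hypothetical helper: true when the pattern finds a match in the text.
function matches(pattern, text) {
  return text.match(pattern) !== null;
}

var lettersOnly = /^[a-zA-Z]+$/;      // at least one letter
var anyDigits = /^[0-9]*$/;           // zero or more digits
var twoToFiveDigits = /^[0-9]{2,5}$/; // between 2 and 5 digits

console.log(matches(lettersOnly, "Hello"));      // true
console.log(matches(lettersOnly, ""));           // false: "+" needs at least one letter
console.log(matches(anyDigits, ""));             // true: "*" accepts zero digits
console.log(matches(twoToFiveDigits, "7"));      // false: too short
console.log(matches(twoToFiveDigits, "12345"));  // true
console.log(matches(twoToFiveDigits, "123456")); // false: too long

// With a space after the comma, the braces are no longer read as a repeat count.
console.log(matches(/^[0-9]{2, 5}$/, "123"));    // false
```

The difference between "*" and "+" matters most for empty input: a field validated with "*" will happily accept nothing at all.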
      
One last important thing to remember: some characters have special meanings inside a regex. In particular, we have already seen how the anchor characters specify the beginning or end of the string and how the hyphen specifies a range of characters inside a character class. Also, although it may not have been immediately obvious, the slash character is used to designate the boundaries of the pattern to match against. How then do you match these special characters inside a regex? In the Unix world, preceding a special character with a backslash ("\") transforms it into its literal value. So, this regex...:
        /[a\-z]*/
      
...means to literally match any sequence of the characters "a", a hyphen, or "z", rather than the more common "any lower case letter".
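A quick sketch comparing the escaped and unescaped hyphen, plus an escaped slash. I have added the "^" and "$" anchors so that the whole string has to fit the pattern; the variable names and sample strings are made up for illustration.

```javascript
var escapedHyphen = /^[a\-z]*$/; // only the characters "a", "-" and "z"
var letterRange = /^[a-z]*$/;    // any lower case letter
var oneHalf = /^1\/2$/;          // the slash is escaped so it does not end the pattern

console.log("a-z-a".match(escapedHyphen) !== null); // true
console.log("a-z-a".match(letterRange) !== null);   // false: "-" is not a letter
console.log("hello".match(escapedHyphen) !== null); // false: "h" is not allowed
console.log("hello".match(letterRange) !== null);   // true
console.log("1/2".match(oneHalf) !== null);         // true
```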

So...now let's do something useful with regexes. Suppose we want to match a numerical date, like "12/07/1941". We don't care what the date really is; we just want to match a date in this format. We could use the following regex to do this:
        /[0-9]{1,2}\/[0-9]{1,2}\/([0-9]{2}|[0-9]{4})/
      
In all honesty, this isn't a really good regex. The way the example above is written, "00/00/0000" would be a valid date. Unfortunately, there is no zero-th day of the zero-th month in 0000, or any other year. And 99/99/9999 would also be a valid date even though it is just as absurd as 00/00/0000. We could write a more thorough regex to test for cases such as these, but this web page is just intended to be an introduction to regexes, and I figured this regex was complicated enough for now.
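For the curious, here is one possible way to tighten things up. This is only a sketch of my own (the names "looseDate" and "stricterDate" are made up): it adds the anchor characters and restricts the month to 1-12 and the day to 1-31, but it still knows nothing about month lengths or leap years.

```javascript
// The pattern from above: unanchored, any digits allowed.
var looseDate = /[0-9]{1,2}\/[0-9]{1,2}\/([0-9]{2}|[0-9]{4})/;

// A stricter sketch: month 1-12, day 1-31, two- or four-digit year,
// anchored so nothing may come before or after the date.
var stricterDate = /^(0?[1-9]|1[0-2])\/(0?[1-9]|[12][0-9]|3[01])\/([0-9]{4}|[0-9]{2})$/;

console.log("99/99/9999".match(looseDate) !== null);        // true
console.log("abc12/07/1941xyz".match(looseDate) !== null);  // true: no anchors
console.log("12/07/1941".match(stricterDate) !== null);     // true
console.log("1/5/99".match(stricterDate) !== null);         // true
console.log("00/00/0000".match(stricterDate) !== null);     // false
console.log("99/99/9999".match(stricterDate) !== null);     // false
console.log("2/30/2024".match(stricterDate) !== null);      // true: format only, not a real date
```

As the last line shows, a regex alone is a format check; deciding whether a date really exists takes ordinary code on top of it.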

So far, we have just scratched the surface of regexes. They are incredibly powerful, but regexes are essentially a programming language unto themselves. O'Reilly has entire books devoted to the subject of crafting elegant, efficient regexes. To be honest, while I am reasonably good at putting together a regex, I am not truly qualified to write a thorough discussion on the subject. If you really want to learn more about designing regexes, I would suggest searching on-line for the subject (try Steve Ramsey's Guide to Regular Expressions or Google) or maybe a quick trip to Amazon.com for a more in-depth look. If you have a *nix computer at your disposal with Perl installed, you can also try "man perlretut" for a pretty good regular expression tutorial.

So...how do we put regexes to work in Javascript? If you recall from the last lesson, any variable that contains text data is actually a string object, and string objects have a number of methods available for you to use in your scripts. One of these methods is the "match" method, which is used to match the string object against a regular expression. Here is an example:
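        <script type="text/javascript">
          <!--
            var IpAddress = "192.168.192.168";
            // Four groups of one to three digits separated by literal dots.
            // This checks the format only, not that each number is 255 or less.
            var Regex = /^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/;

            // match() returns null (which counts as false) when there is no match
            if (IpAddress.match(Regex))
              {
                document.write(IpAddress + " is a valid IP address.<br>");
              }
            else
              {
                document.write(IpAddress + " is not a valid IP address.<br>");
              }
          //-->
        </script>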
      
...and here is a sample script using regexes.
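If you are wondering why "match" can be used directly inside an "if" statement, this short sketch shows what it returns. The test strings are my own; the pattern is the one from the example above.

```javascript
var Regex = /^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/;

// On success, match() returns an array whose first element is the matched text.
var good = "192.168.192.168".match(Regex);
console.log(good[0]); // 192.168.192.168

// On failure, match() returns null, which an "if" statement treats as false.
var bad = "192.168.one.168".match(Regex);
console.log(bad); // null

// The pattern only checks the format, so out-of-range numbers still pass.
var outOfRange = "999.999.999.999".match(Regex);
console.log(outOfRange !== null); // true
```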