Why Sequence Matters in Regular Expressions

July 21, 2023
Rss Fetcher

And here’s how to use them correctly

A digital magnifier researching programming source code. — Image generated by Gencraft

Sometimes, we underestimate the importance of the order of the characters in a regular expression pattern. Sometimes… Okay, let’s say that none of us has ever thought about it. Come on, let’s face it.

Example

During a code review on a Java project, carried out with the support of Fortify SCA, a Header Manipulation finding came up, one of the typical problems you get when you don’t sanitize input data.

The code in question looked very similar to the following:

protected void error(HttpServletRequest request, HttpServletResponse response, Error error)
    throws ServletException {
  try {
    String errorMessage = error.getMessage();
    log(errorMessage);

    // The content type is copied from the request into the response unchecked
    response.setContentType(request.getContentType());
    response.getWriter().print(errorMessage);
  } catch (Exception e) {
    throw new ServletException(e);
  }
}

The problem is that the ContentType is taken from a request and inserted into a response, without checking its content, which could be dangerous (I will talk about it in detail, maybe in a separate article).

The developer accepted this report and implemented “a particular filter using a RegEx because it is powerful and customizable.”

His solution was, therefore, to create the following method to sanitize that field:

public static String sanitizeContentType(String input) {
  return input
    .replaceAll("[^a-zA-Z0-9;=-\\\\/]", "")
    .replaceAll("\\s{2,}", " ")
    .replaceAll("\\r", "")
    .replaceAll("\\n", "");
}

In detail:

[^a-zA-Z0-9;=-\\/] is meant to match every character other than semicolons, equals signs, minus signs, slashes, backslashes, digits, and the letters a to z, both lowercase and uppercase.
\s{2,} matches every run of two or more whitespace characters.
\r matches the carriage return.
\n matches the newline.

Since I don’t trust much in general, and above all I don’t see why you would rewrite something when several more efficient and advanced libraries already do this kind of thing, I decided to run a little test.

Test

As usual, I created a small program to do the tests:

public class RegExSanitizer {
  public static void main(String[] args) {
    if (args.length == 0) {
      System.out.println("Usage is: java RegExSanitizer input");
      System.exit(0);
    }

    String input2sanitize = args[0];
    System.out.println("String to sanitize: " + input2sanitize);
    System.out.println("Sanitized string: " + sanitize(input2sanitize));
  }

  // Same pattern chain as the developer's sanitizeContentType method
  public static String sanitize(String input) {
    return input
      .replaceAll("[^a-zA-Z0-9;=-\\\\/]", "")
      .replaceAll("\\s{2,}", " ")
      .replaceAll("\\r", "")
      .replaceAll("\\n", "");
  }
}

Since the function is meant to sanitize input, the first test passed it a rather strange string, though not that strange, to tell the truth.

C:\RegExSanitizer> javac RegExSanitizer.java
C:\RegExSanitizer> java RegExSanitizer Bob%%0d%00d%0aa<script>alert('document.domain')</script>
String to sanitize: Bob%%0d%00d%0aa<script>alert('document.domain')</script>
Sanitized string: Bob0d00d0aascript>alertdocumentdomain/script>

The first thing that catches the eye is that the closing angle brackets (>) have not been removed. Not a good start.

I did some tests with trusty Regex101, starting from the regex created by the developer and studying the pattern. The nice thing about Regex101 is that hovering over any part of the pattern highlights it and shows its meaning. In addition, the EXPLANATION box on the right explains it in detail, point by point.

And that’s exactly how I discovered this:

=-\ matches a single character in the range between = (index 61) and \ (index 92) (case sensitive)

That is, the sequence =-\ is not read as three separate characters but as a range, so it matches any character between index 61 and index 92.

Looking at the ASCII table, there are several characters between index 61 and index 92, including the right angle bracket (>) at index 62. Those who work with XSS have probably already guessed where this is going, given the use of < and > in certain payloads.
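To see the range concretely, here is a minimal sketch that lists every printable ASCII character a character class accepts. The class name RangeProbe and the helper matchedChars are hypothetical names made up for this illustration; only java.util.regex.Pattern comes from the standard library.

```java
import java.util.regex.Pattern;

public class RangeProbe {
    // Hypothetical helper: returns every printable ASCII character (32..126)
    // that the given character class matches, in ascending order.
    public static String matchedChars(String charClass) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c = 32; c < 127; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hyphen in the middle: a range from '=' (61) to the backslash (92).
        // Prints the 32 characters from '=' through the backslash,
        // including '>', '?', '@', A-Z and '['.
        System.out.println(matchedChars("[=-\\\\]"));

        // Hyphen at the end: three literal characters.
        // Prints the hyphen, the equals sign and the backslash.
        System.out.println(matchedChars("[=\\\\-]"));
    }
}
```

The first call returns 32 characters, the second only three, which is the whole difference between a hyphen that sits between two characters and one that sits at the end of the class.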

To fix all this, I reordered the pattern so that the hyphen comes last, where it is treated as a literal character instead of a range operator:

[^\\/a-zA-Z0-9;=-]
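As a quick check, the sketch below runs the original and the reordered pattern side by side. The class name PatternOrderCheck, the helper strip, and the second, Content-Type-like input are assumptions added for illustration; the first input is the test string used earlier.

```java
public class PatternOrderCheck {
    // Regex [^a-zA-Z0-9;=-\\/]: the hyphen creates a range from '=' to the backslash.
    static final String ORIGINAL = "[^a-zA-Z0-9;=-\\\\/]";
    // Regex [^\\/a-zA-Z0-9;=-]: the trailing hyphen is a literal character.
    static final String REORDERED = "[^\\\\/a-zA-Z0-9;=-]";

    // Hypothetical helper: removes every character the pattern matches.
    public static String strip(String pattern, String input) {
        return input.replaceAll(pattern, "");
    }

    public static void main(String[] args) {
        String payload = "Bob%%0d%00d%0aa<script>alert('document.domain')</script>";
        // prints Bob0d00d0aascript>alertdocumentdomain/script>
        System.out.println(strip(ORIGINAL, payload));
        // prints Bob0d00d0aascriptalertdocumentdomain/script
        System.out.println(strip(REORDERED, payload));

        String header = "text/html; charset=utf-8<script>";
        // prints text/html;charset=utf8script>
        System.out.println(strip(ORIGINAL, header));
        // prints text/html;charset=utf-8script
        System.out.println(strip(REORDERED, header));
    }
}
```

Note the second pair of results: the original pattern not only lets > through, it also strips the hyphen it was supposed to allow, because that hyphen was consumed as the range operator.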

And they all lived happily ever after.

Conclusion

Regular expressions remain fantastic things and a world to be discovered. Their usefulness is immense, and I would use them even to ask someone for the time.

The fact remains that there are many libraries far more reliable than we are at sanitizing input, and there is probably no need to reinvent the wheel every time.

For heaven’s sake, nothing is perfect. Maybe, even using one of these libraries, you will find an improperly sanitized input.

And then you deserve the applause, because you won.

Why Sequence Matters in Regular Expressions was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.
