July 5th, 2006
Stripping HTML Tags From User Inputs
You want to believe the best about your users, and for most of them it is appropriate to do so. Unfortunately, there’s always the danger of people coming to the sites you create with the intent to stir up trouble. To a lesser degree, you also have to be concerned about innocent users submitting forms with inadvertently dangerous inputs. One (of many!) ways troublemakers can attempt to cause mischief is to enter unexpected things into web forms. In this piece, I’m going to specifically address user inputs that attack dynamic web pages, but know that databases, mail services, and almost anything else that exists on or interacts with your web server is a potential target.
As a simple example, imagine a guest book that allows visitors to leave comments. Those comments are then automatically displayed on the website without the owner of the site needing to lift a finger. Efficient. Dynamic fresh content. This feature has some things going for it.
Now say a user enters the following text for a comment:
<script>window.location = "http://www.spamcity.com";</script>
Uh oh. If this is added to the page as a “comment”, visitors who come to this page are redirected to another site, and probably not one they want to visit. So how do you prevent this? A simple and broadly effective approach is to remove all tags from the data a user inputs. In the example above, the user’s comment would display window.location = "http://www.spamcity.com"; on the screen, but visitors would not be redirected.
The means for achieving this differ a little depending on the back-end technology you are using. PHP has a built-in function to handle this issue called strip_tags. Use a line similar to this:
$cleaned_string = strip_tags($_REQUEST["user_string"]);
ColdFusion has a function, HTMLEditFormat, that does something similar. Instead of removing tags, it escapes them so that the code itself is displayed rather than executed. This is great for pieces like this one, where I am displaying examples of code that I don’t want to execute, but we can do better. Using a regular expression replacement, you can duplicate the functionality of PHP’s strip_tags.
<cfset cleaned_string = REReplace(form.user_string, "<[^>]*>", "", "ALL")>
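The regular expression approach also works in ASP. Set the pattern and turn on global matching so that every tag is replaced, not just the first one.
Set objRegExp = New RegExp
objRegExp.Pattern = "<[^>]*>"
objRegExp.Global = True
cleanedString = objRegExp.Replace(userString, "")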
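To make the difference between stripping and escaping concrete, here is a minimal sketch in Python, which is not one of the platforms covered above. It reuses the same <[^>]*> pattern as the ColdFusion example. The helper names strip_tags_simple and escape_tags are hypothetical, made up for this illustration, and the comment string is a redirect attack like the one described earlier.

```python
import html
import re

# Same pattern as the ColdFusion example: "<", anything that is not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags_simple(text: str) -> str:
    """Remove anything that looks like an HTML tag (hypothetical helper)."""
    return TAG_PATTERN.sub("", text)


def escape_tags(text: str) -> str:
    """Escape markup so the browser displays it instead of interpreting it."""
    return html.escape(text)


# A guest book comment that tries to redirect visitors.
comment = '<script>window.location = "http://www.spamcity.com";</script>'

stripped = strip_tags_simple(comment)  # tags removed, inner text kept
escaped = escape_tags(comment)         # tags kept, but rendered harmless

print(stripped)
print(escaped)
```

Stripping leaves only the text between the tags, while escaping keeps the whole comment visible as literal text. Which one you want depends on whether visitors should ever see the markup a user typed.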
Stripping tags is not a panacea for every malicious entry a user could make, but it does let you quickly shore up a weakness that is extremely easy to exploit. If you aren’t currently stripping tags from your user input data, I highly recommend starting.
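As a rough illustration of why the simple <[^>]*> pattern is not a cure-all, the Python sketch below (again with a hypothetical helper name, strip_tags_simple) tries two edge cases. The first is a tag that is never closed, which the pattern does not match and which a browser may still treat as markup once the surrounding page supplies a closing bracket. The second is harmless text that happens to contain angle brackets.

```python
import re

# The same simple tag pattern used in the regular expression examples.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags_simple(text: str) -> str:
    """Remove anything that looks like a complete HTML tag (hypothetical helper)."""
    return TAG_PATTERN.sub("", text)


# Case 1: no closing ">" in the input, so the pattern never matches
# and the input passes through unchanged.
unclosed = '<img src="x" onerror="alert(1)"'
result_unclosed = strip_tags_simple(unclosed)

# Case 2: innocent text with "<" and ">" is treated as one big tag,
# so legitimate content between the brackets is lost.
comparison = "3 < 5 and 7 > 6"
result_comparison = strip_tags_simple(comparison)

print(repr(result_unclosed))
print(repr(result_comparison))
```

Neither case is a reason to skip tag stripping. They are a reminder to pair it with other checks, such as escaping whatever remains before it is written to the page.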