PHP Strip Attributes Class For XML and HTML
In the process of making this site, I needed a facility to strip attributes from HTML elements. My first stop was on the strip_tags page in the PHP manual. However, the functions on there were pretty poor and borked out a lot. Google didn't provide any better results, so I ended up having to make one. The results are pretty good. After a few tweaks I ran over 10,000 tests on different web pages and didn't have any problems.
The class will work on any form of XML markup, not just HTML. This class doesn't strip tags. To do so you will need to use it in conjunction with strip_tags. It picks up any form of invalid attributes that get rendered by browsers, like name=value and name = "value". So you can make sure you're not gonna get owned by progressive HTML injectors and XSS kiddies :)
For the sake of making it easy to use, I just botched a required function that escapes strings for regular expressions into the top of the class. For some reason PCRE doesn't escape properly. So, you may want to move this function elsewhere.
The class is configured through three variables: the allow variable sets attributes that are to be allowed on all elements, the exceptions variable sets attributes that are to be allowed on specific elements, and the ignore variable sets elements that are to be totally ignored. For example, a configuration can accept id and class attributes on any element, src and alt attributes on img elements, href and title on a elements, and not ignore any tags.
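A minimal sketch of how those three rules combine for a single attribute. The function name attributeIsKept and its signature are hypothetical and are not part of the class; only the allow, exceptions and ignore settings come from the description above.

```php
<?php
// Hypothetical helper, not part of the class: decides whether one attribute
// survives, following the allow / exceptions / ignore rules described above.
function attributeIsKept($element, $attribute, array $allow, array $exceptions, array $ignore)
{
    $element   = strtolower($element);
    $attribute = strtolower($attribute);

    // Ignored elements are left untouched, so all of their attributes stay.
    if (in_array($element, $ignore, true)) {
        return true;
    }

    // Attributes allowed on every element.
    if (in_array($attribute, $allow, true)) {
        return true;
    }

    // Attributes allowed only on specific elements.
    return array_key_exists($element, $exceptions)
        && in_array($attribute, $exceptions[$element], true);
}

$allow      = array('id', 'class');
$exceptions = array(
    'img' => array('src', 'alt'),
    'a'   => array('href', 'title'),
);
$ignore     = array();

var_dump(attributeIsKept('img', 'src', $allow, $exceptions, $ignore));     // bool(true)
var_dump(attributeIsKept('div', 'onclick', $allow, $exceptions, $ignore)); // bool(false)
```

Checking the exceptions with array_key_exists avoids a notice for elements that have no exception defined.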
Version History
| Version | Release | Notes |
|---|---|---|
| 0.1 | 27 Jan 2009 | Initial release |
| 0.2 | 16 Mar 2009 | Expanded support for XML node names (such as namespaces), fixed a bug with finding self-closing tags and expanded support for malformed attributes. |
| 0.2.1 | 23 Oct 2009 | Fixed parsing of elements that contain new lines. |
This class is available under the MIT License.
Comments
Is this released under a gpl compatible license?
Great class, but 2 issues.
1. The function isException returns an error and ends up stripping the tag entirely if it does not have an exception. Fix this by replacing:
if( $this->exceptions[$element_name] ) {
with:
if( array_key_exists($element_name, $this->exceptions) ) {
2. The class drops tags with numbers in them, such as h[1-5]. To fix this, in the findElements function, replace this:
preg_match_all( "/]*)>/i", $this->str, $elements );
with this:
preg_match_all( "/]*)>/i", $this->str, $elements );
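To illustrate the kind of change described in point 2, here is a small sketch. Both patterns are illustrative assumptions written for this example and are not the class's actual expressions; the only point is that the element-name group has to accept digits as well as letters.

```php
<?php
$str = '<h1 class="title">Hi</h1><p id="x">text</p>';

// Letters only: the name group cannot match "h1", so the heading is skipped.
preg_match_all('/<([a-z]+)(\s[^>]*)?>/i', $str, $lettersOnly);

// Letters and digits: "h1" through "h6" are now picked up as well.
preg_match_all('/<([a-z][a-z0-9]*)(\s[^>]*)?>/i', $str, $withDigits);

print_r($lettersOnly[1]); // element names found: p
print_r($withDigits[1]);  // element names found: h1, p
```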
Thanks,
Hi Matt. Thanks for your bug fixes. I'll make the necessary changes and upload the new version. I might as well add support for namespaces while I'm at it.
Re. the licensing, I haven't put it under one yet, but you can do pretty much anything you want with this.
It strips the forward slash off self-closing tags.
Hi there.
Great class - helped me a lot.
Just one thing: I used it on an XML string, and it took away the ? at the end of the XML declaration.
The closing ?> became >. That shouldn't happen, right?
Apart from that, thank you.
Actually, you can just ignore these elements with the ignore property.
Thanks for your input, Sebastian. The problem with the XML declaration raises the question of XML processing instructions and DTDs. I will check this out later on today and post an updated class.
thanks a lot! It has been very useful :-)
hi again,
I've just found that if no attributes have to be removed, a notice is raised, which can cause a "headers already sent" error (tested with an app made with CakePHP). The fix is at line 55:
# Return the XML if there were no attributes to remove
if(count($nodes) > 0)
return $nodes;
else
return $this->str;
greetings
I don't know how to use it, I want to strip all inline-style attributes from my html page, what do I do? Thanks
Like the core PHP strip_tags function this class requires you to define what attributes you want to keep. For example if you have the following html
<div style="background: red;"><div>
You could remove the style attributes simply by running the following code, where $str is the above HTML snippet:
If there are some attributes you want to preserve, you can define them like so:
In that example any id and class attributes will be preserved.
You can find out more about the functionality of this class by reading the above documentation.
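For readers who want to see the idea without the class to hand, here is a rough stand-in. The function stripAttributesSketch and both regular expressions are assumptions made for this illustration; they are not the class's code, are not hardened against hostile markup, and only support a global allow list.

```php
<?php
// Hypothetical stand-in for the class: keeps only the attributes named in
// $allow and drops every other attribute from every opening tag.
function stripAttributesSketch($str, array $allow = array())
{
    // Opening tag: name, a run of attributes, optional self-closing slash.
    $tagPattern = '~<([a-z][a-z0-9:-]*)((?:\s+[^\s=>/]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+))?)*)\s*(/?)>~i';
    // One attribute: a name, optionally followed by = and a value.
    $attPattern = '~([^\s=/]+)(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+))?~';

    return preg_replace_callback($tagPattern, function ($m) use ($allow, $attPattern) {
        $kept = array();
        preg_match_all($attPattern, $m[2], $atts, PREG_SET_ORDER);
        foreach ($atts as $att) {
            // $att[0] is the literal attribute text, $att[1] is its name.
            if (in_array(strtolower($att[1]), $allow, true)) {
                $kept[] = $att[0];
            }
        }
        $attributes = $kept ? ' ' . implode(' ', $kept) : '';
        $selfClose  = $m[3] !== '' ? ' /' : '';
        return '<' . $m[1] . $attributes . $selfClose . '>';
    }, $str);
}

$str = '<div style="background: red;" id="main">Hello</div>';

echo stripAttributesSketch($str), "\n";                       // <div>Hello</div>
echo stripAttributesSketch($str, array('id', 'class')), "\n"; // <div id="main">Hello</div>
```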
can it be written in one function? like..
public function strip( $str, $allow=null, $exception=null, $ignore=null ) {
$this->allow=$allow;
$this->exception=$exception;
$this->ignore =$ignore ;
// statements here
}
Yes. You can do that if you want to, but there is not really much point in it.
Very nice and clean script. Hope it will help me in my current project...Thanks a lot
Thank you! I was trying to figure this out on my own and was going a little insane :)
Hi,
I had a problem with this class earlier today and was able to resolve it fairly quickly. It came down to the findAttributes() method, which seems unable to tell the different sorts of value delimiters apart (if any are present). I have replaced it with the following in my implementation. It does require that all newlines are replaced with whitespace first, which is easy enough, like so:
$remove = array("\r", "\n");
$content = str_replace($remove, " ", $content);
The code I now have is as follows:
private function findAttributes( $nodes ) {
    for($i = 0; $i < count($nodes); $i++) {
        $att_string = $nodes[$i]['attributes'];
        $nodes[$i]['attributes'] = null;
        $atts = array();
        for($j = 0; $j < strlen($att_string); $j++) {
            if($att_string[$j] === "=") {
                $key = "";
                $key_st_idx = $j-1;
                while($att_string[$key_st_idx] == " ") { $key_st_idx--; }
                for($k = $key_st_idx; $k >= 0; $k--) {
                    if($att_string[$k] == " " || $att_string[$k] == "'" || $att_string[$k] == "\"" || $k == 0) {
                        //set key
                        $key = trim(substr($att_string, $k+1, $key_st_idx-$k));
                        //break loop
                        $k = -1;
                    }
                }
                $val_st_idx = $j+1;
                $value_delim = null;
                $value = "";
                while($att_string[$val_st_idx] == " ") { $val_st_idx++; }
                if($att_string[$val_st_idx] == "\"" || $att_string[$val_st_idx] == "'") {
                    $value_delim = $att_string[$val_st_idx];
                }
                for($k = $val_st_idx+1; $k < strlen($att_string); $k++) {
                    if(($value_delim !== null && $att_string[$k] == $value_delim) || ($value_delim === null && ($att_string[$k] == " " || $k >= strlen($att_string)-1))) {
                        //set $value
                        $value = trim(substr($att_string, $val_st_idx, ($k-$val_st_idx)+1));
                        //break loop
                        $k = strlen($att_string);
                    }
                }
                if(is_string($key) && is_string($value) && strlen($key) > 0 && strlen($value) > 0) {
                    $atts[] = array('literal' => $key."=".$value, 'name' => $key, 'value' => $value);
                }
            }
        }
        if(count($atts) > 0) { $nodes[$i]['attributes'] = $atts; }
    }
    return $nodes;
}
How can I integrate this great class into the CodeIgniter web framework?
You would have to ask in a CodeIgniter forum. It should be an easy class to deploy if you have decent PHP knowledge.
If I remove the ampersand from line 65, will the class retain its functionality? (I'm trying to make this PHP 4 compatible.)
Hi Will,
No that will break it in PHP 4.
You need to add the key in the foreach and then modify the $nodes array using the key to find the current node, like this:
Thank you so much
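The PHP 4 workaround described in the reply to Will can be sketched roughly as follows. The $nodes structure and the trim() call are placeholders invented for this example; the class's real loop body may differ.

```php
<?php
// Hypothetical node list; the class's real $nodes structure may differ.
$nodes = array(
    array('name' => 'div', 'attributes' => ' id="a"'),
    array('name' => 'img', 'attributes' => ' src="b.png"'),
);

// PHP 5 allows: foreach ($nodes as &$node) { ... }
// PHP 4 has no by-reference foreach, so iterate with the key and
// write each change back through the array itself.
foreach ($nodes as $key => $node) {
    $nodes[$key]['attributes'] = trim($node['attributes']);
}

print_r($nodes);
```

Writing through $nodes[$key] changes the original array, whereas assigning to $node would only change a copy.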
Great script :)
Nice script David!
But there is one serious problem I suppose - it does not work with the multiline tags, like:
Of course I can remove all CRLF's before processing, but it will ruin the formatting.
Is there a more elegant way to process such documents?
Hi,
This is a small bug that has now been fixed.
A more elegant way to do this would be to use the DOM API, but it is not good at parsing poorly formatted documents.
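For comparison, a DOM-based version might look something like the sketch below. It assumes PHP's DOM extension is installed, and the function name stripAttributesWithDom is made up for this example. As noted, libxml repairs or rejects badly formed markup in its own way, so the output can differ from the input in more than just attributes, and loadHTML makes its own assumptions about character encoding.

```php
<?php
// Sketch only: removes every attribute not listed in $allow using the DOM extension.
function stripAttributesWithDom($html, array $allow = array())
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Malformed markup raises warnings; collect them quietly instead.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // loadHTML wraps a fragment in html and body elements.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    foreach ($body->getElementsByTagName('*') as $element) {
        // Collect the names first: removing while iterating the live
        // attribute map can skip entries.
        $names = array();
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array(strtolower($name), $allow, true)) {
                $element->removeAttribute($name);
            }
        }
    }

    // Serialize only the children of body, not the wrapper elements.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo stripAttributesWithDom('<p style="color:red" class="x">Hi</p>', array('class'));
```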
Thanks Dave! After that fix it is unbeatable!