SEM Labs

Handcrafted Pixels, Code & Title Tags

PHP Strip Attributes Class For XML and HTML

In the process of making this site, I needed a facility to strip attributes from HTML elements. My first stop was on the strip_tags page in the PHP manual. However, the functions on there were pretty poor and borked out a lot. Google didn't provide any better results, so I ended up having to make one. The results are pretty good. After a few tweaks I ran over 10,000 tests on different web pages and didn't have any problems.

The class will work on any form of XML markup, not just HTML. This class doesn't strip tags; to do that you will need to use it in conjunction with strip_tags. It picks up any form of invalid attribute that browsers still render, like name=value and name = "value". So you can make sure you're not gonna get owned by progressive HTML injectors and XSS kiddies :)

For the sake of making it easy to use, I just bodged in a required function that escapes strings for regular expressions at the top of the class. For some reason PCRE doesn't escape properly. So, you may want to move this function elsewhere.

In the above example, the allow variable sets attributes that are allowed on all elements, the exceptions variable sets attributes that are allowed on specific elements, and the ignore variable sets elements that are to be ignored entirely. So, the example will accept id and class attributes on any element, src and alt attributes on img elements, and href and title attributes on a elements, and it will not ignore any tags.
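
As a rough sketch of how those three settings fit together, the decision for a single attribute could look something like this. It is not the class's actual source: attributeIsKept() and its parameters are hypothetical names made up for illustration, and I'm assuming that "ignored" means every attribute on that element is left alone.

```php
<?php
// Hypothetical helper, not part of the StripAttributes class: it only
// illustrates the allow / exceptions / ignore rules described above.
function attributeIsKept( $element, $attribute, array $allow, array $exceptions, array $ignore ) {
	$element = strtolower( $element );
	$attribute = strtolower( $attribute );

	// Assumption: ignored elements are left completely untouched
	if( in_array( $element, $ignore, true ) )
		return true;

	// Attributes allowed on every element
	if( in_array( $attribute, $allow, true ) )
		return true;

	// Attributes allowed on this specific element only
	return array_key_exists( $element, $exceptions )
		&& in_array( $attribute, $exceptions[$element], true );
}

// Settings matching the example described above
$allow = array( 'id', 'class' );
$exceptions = array(
	'img' => array( 'src', 'alt' ),
	'a' => array( 'href', 'title' )
);
$ignore = array();

var_dump( attributeIsKept( 'img', 'src', $allow, $exceptions, $ignore ) );     // bool(true)
var_dump( attributeIsKept( 'div', 'onclick', $allow, $exceptions, $ignore ) ); // bool(false)
```

The exceptions lookup goes through array_key_exists() so that elements without an entry simply fall through to "not kept" instead of raising a notice.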

Version History

Version  Release      Notes
0.1      27 Jan 2009  Initial release
0.2      16 Mar 2009  Expanded support for XML node names (such as namespaces), fixed a bug with finding self-closing tags and expanded support for malformed attributes.
0.2.1    23 Oct 2009  Fixed parsing of elements that contain new lines.

This class is available under the MIT License.

Comments

Matt Replied at 11:05 AM on 11 Mar 2009

Is this released under a gpl compatible license?

matt Replied at 12:36 PM on 11 Mar 2009

Great class, but 2 issues.

1. The function isException returns an error and ends up stripping the tag entirely if it does not have an exception. Fix this by replacing:

if( $this->exceptions[$element_name] ) {

with:

if( array_key_exists($element_name, $this->exceptions) ) {

2. The class drops tags with numbers in them, such as h[1-5]. To fix this, in the findElements function, replace this:

preg_match_all( "/]*)>/i", $this->str, $elements );

with this:

preg_match_all( "/]*)>/i", $this->str, $elements );

Thanks,
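
For anyone following along, here is a small sketch of the general idea behind the second fix. Both patterns below are assumptions written for illustration, not the patterns from the class or from the comment above: the first only allows letters in the element name, the second also allows digits after the first letter.

```php
<?php
// Assumed patterns for illustration only; not the ones used by the class
$letters_only = "/<([a-z]+)([^>]*)>/i";
$with_digits = "/<([a-z][a-z0-9]*)([^>]*)>/i";

$html = '<h1 class="title">Hi</h1><p id="x">text</p>';

preg_match_all( $letters_only, $html, $before );
preg_match_all( $with_digits, $html, $after );

print_r( $before[1] ); // element names captured by the letters-only pattern
print_r( $after[1] );  // element names captured once digits are allowed
```

With these particular patterns the letters-only version reads h1 as h and treats the 1 as the start of the attribute string, which is one way a heading can end up mishandled.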

David Replied at 7:56 PM on 11 Mar 2009

Hi Matt. Thanks for your bug fixes. I'll make the necessary changes and upload the new version. I might as well add support for namespaces while I'm at it.

Re. the licensing, I haven't put it under one yet, but you can do pretty much anything you want with this.

BOb Replied at 10:22 AM on 20 Mar 2009

It strips the forward slash off of and makes it .

Sebastian Replied at 10:03 AM on 1 Apr 2009

Hi there.

Great class - helped me a lot.

Just one thing - I used it on an XML string, and it took away the ? on the end of the XML declaration.

became - that shouldn't happen, right?

Apart from that - thank you.

David Replied at 4:16 PM on 20 Apr 2009

Actually, you can just ignore these elements with the ignore property.

David Replied at 10:21 AM on 1 Apr 2009

Thanks for your input Sebastian. The problem with the XML declaration raises the question about XML processing instructions and DTDs. I will check this out later on today and post an updated class.

fesja Replied at 5:43 PM on 11 May 2009

thanks a lot! It has been very useful :-)

fesja Replied at 9:45 PM on 11 May 2009

hi again,

I've just found that if there is no attribute to remove, PHP raises a notice, which can cause a "headers already sent" error (tested with an app made with CakePHP). The solution is at line 55:

# Return the XML if there were no attributes to remove
if( count( $nodes ) > 0 )
	return $nodes;
else
	return $this->str;

greetings

lovelf Replied at 12:30 PM on 4 Jun 2009

I don't know how to use it, I want to strip all inline-style attributes from my html page, what do I do? Thanks

David Replied at 12:24 AM on 5 Jun 2009

Like the core PHP strip_tags function, this class requires you to define which attributes you want to keep. For example, if you have the following HTML:

<div style="background: red;"></div>

You could remove the style attributes simply by running the following code, where $str is the above html snippet:

$sa = new StripAttributes();
$str = $sa->strip( $str );

If there are some attributes you want to preserve, you can define them like so:

$sa = new StripAttributes();
$sa->allow = array( 'id', 'class' );
$str = $sa->strip( $str );

In that example any id and class attributes will be preserved.

You can find out more about the functionality of this class by reading the above documentation.

Ronald Replied at 6:47 PM on 8 Jun 2009

can it be written in one function? like..

public function strip( $str, $allow=null, $exceptions=null, $ignore=null ) {
	$this->allow = $allow;
	$this->exceptions = $exceptions;
	$this->ignore = $ignore;
	// statements here
}

David Replied at 1:45 PM on 16 Jun 2009

Yes. You can do that if you want to, but there is not really much point in it.

Gulshan Replied at 6:41 PM on 24 Jun 2009

Very nice and clean script. Hope it will help me in my current project...Thanks a lot

Alex Replied at 4:20 PM on 29 Jul 2009

Thank you! I was trying to figure this out on my own and was going a little insane :)

Sim Replied at 12:58 PM on 5 Aug 2009

Hi,

I had a problem with this class earlier today and was able to resolve it fairly quickly. It came down to the findAttributes() method, which seems unable to tell the different sorts of attribute delimiters apart (if any are present). I have replaced it with the following in my implementation. However, it does require that all newlines are replaced with whitespace first, which is easy enough:

$remove = array("\r", "\n");
$content = str_replace($remove, " ", $content);

The code I now have is as follows:

private function findAttributes( $nodes )
{
	for($i = 0; $i < count($nodes); $i++)
	{
		$att_string = $nodes[$i]['attributes'];
		$len = strlen($att_string);
		$nodes[$i]['attributes'] = null;
		$atts = array();
		for($j = 0; $j < $len; $j++)
		{
			//every "=" marks a name/value pair
			if($att_string[$j] === "=")
			{
				$key = "";
				$key_st_idx = $j-1;
				//skip whitespace between the name and the "="
				while($key_st_idx > 0 && $att_string[$key_st_idx] == " ")
				{
					$key_st_idx--;
				}

				for($k = $key_st_idx; $k >= 0; $k--)
				{
					if($att_string[$k] == " "
					|| $att_string[$k] == "'"
					|| $att_string[$k] == "\""
					|| $k == 0)
					{
						//set key
						$key = trim(substr($att_string, $k+1, $key_st_idx-$k));
						//break loop
						$k = -1;
					}
				}

				$val_st_idx = $j+1;
				$value_delim = null;
				$value = "";
				//skip whitespace between the "=" and the value
				while($val_st_idx < $len && $att_string[$val_st_idx] == " ")
				{
					$val_st_idx++;
				}
				if($val_st_idx < $len && ($att_string[$val_st_idx] == "\"" || $att_string[$val_st_idx] == "'"))
				{
					$value_delim = $att_string[$val_st_idx];
				}
				for($k = $val_st_idx+1; $k < $len; $k++)
				{
					if(($value_delim !== null && $att_string[$k] == $value_delim)
					|| ($value_delim === null && ($att_string[$k] == " " || $k >= $len-1)))
					{
						//set $value
						$value = trim(substr($att_string, $val_st_idx, ($k-$val_st_idx)+1));
						//break loop
						$k = $len;
					}
				}
				if(is_string($key) && is_string($value) && strlen($key) > 0 && strlen($value) > 0)
				{
					$atts[] = array('literal' => $key."=".$value, 'name' => $key, 'value' => $value);
				}
			}
		}
		if(count($atts) > 0)
		{
			$nodes[$i]['attributes'] = $atts;
		}
	}
	return $nodes;
}

chngr Replied at 1:37 PM on 9 Aug 2009

How can i integrate this great class to CodeIgniter web framework ?

David Replied at 2:22 PM on 10 Aug 2009

You would have to ask in a CodeIgniter forum. It should be an easy class to deploy if you have decent PHP knowledge.

Will Replied at 5:32 PM on 25 Aug 2009

If I remove the ampersand from line 65, will the class retain its functionality? (I'm trying to make this PHP 4 compatible.)

David Replied at 5:34 PM on 26 Aug 2009

Hi Will,
No, that will break it in PHP 4.

You need to add the key in the foreach and then modify the $nodes array using the key to find the current node, like this:


# PHP 4 has no visibility keywords, so drop "private" there
private function findAttributes( $nodes ) {

	# Extract attributes, writing back via the key rather than a reference
	foreach( $nodes as $key => $node ) {
		preg_match_all( "/([^ =]+)\s*=\s*[\"|']{0,1}([^\"']*)[\"|']{0,1}/i", $node['attributes'], $attributes );

		# Stays null when the node has no attributes
		$atts = null;
		if( $attributes[1] ) {
			$atts = array();
			foreach( $attributes[1] as $att_key => $att ) {
				$literal = $attributes[0][$att_key];
				$attribute_name = $attributes[1][$att_key];
				$value = $attributes[2][$att_key];
				$atts[] = array( 'literal' => $literal, 'name' => $attribute_name, 'value' => $value );
			}
		}

		$nodes[$key]['attributes'] = $atts;
	}

	return $nodes;
}

aneildo Replied at 4:17 PM on 10 Oct 2009

Thank you so much

Great script :)

Sergey Replied at 9:16 AM on 22 Oct 2009

Nice script David!

But there is one serious problem I suppose - it does not work with the multiline tags, like:

Of course I can remove all CRLF's before processing, but it will ruin the formatting.

Is there a more elegant way to process such documents?

David Replied at 2:51 PM on 23 Oct 2009

Hi,

This is a small bug that has now been fixed.

A more elegant way to do this would be to use the DOM API, but it is not good at parsing poorly formatted documents.
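
To give an idea of what the DOM route might look like, here is a minimal sketch using PHP's DOM extension. stripAttributesDom() is a hypothetical function written for this example, not part of the class, and it only handles a flat list of allowed attributes. It also assumes HTML input: the parser will repair or rearrange broken markup as it sees fit, which is exactly the drawback mentioned above.

```php
<?php
// Sketch only: stripAttributesDom() is a hypothetical name, not part of the class.
function stripAttributesDom( $html, array $allow = array() ) {
	$doc = new DOMDocument();

	// loadHTML() warns about sloppy markup, so collect errors quietly
	$previous = libxml_use_internal_errors( true );
	$doc->loadHTML( '<body>' . $html . '</body>' );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	$body = $doc->getElementsByTagName( 'body' )->item( 0 );

	foreach( $body->getElementsByTagName( '*' ) as $element ) {
		// Copy the names first so removals don't disturb the live attribute map
		$names = array();
		foreach( $element->attributes as $attribute )
			$names[] = $attribute->nodeName;

		foreach( $names as $name ) {
			if( !in_array( strtolower( $name ), $allow, true ) )
				$element->removeAttribute( $name );
		}
	}

	// Serialise only the children of <body>, not the wrapper the parser adds
	$out = '';
	foreach( $body->childNodes as $child )
		$out .= $doc->saveHTML( $child );

	return $out;
}

echo stripAttributesDom( '<div id="a" style="color: red;"><p onclick="x()">Hi</p></div>', array( 'id' ) );
```

Multi-line tags are no problem here because the parser, not a regular expression, decides where a tag ends. The serialiser may add or drop whitespace between elements, though, so the original formatting is not guaranteed to survive.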

Sergey Replied at 10:27 AM on 23 Oct 2009

Thanks Dave! After that fix it is unbeatable!

James Replied at 7:54 AM on 27 Jul 2010

Thanks!

Instead of using the reg_escape function, you can use:

preg_quote( $node['literal'], '/' )
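
A quick sketch of that suggestion in isolation. The sample tag and the $literal value are made up for illustration; $literal stands in for $node['literal'] from the comment above.

```php
<?php
// Sample data invented for this example
$tag = '<a href="page.php?id=1&x=[2]" class="link">';
$literal = 'href="page.php?id=1&x=[2]"';

// Passing '/' as the second argument also escapes the pattern delimiter
$pattern = '/\s*' . preg_quote( $literal, '/' ) . '/';

echo preg_replace( $pattern, '', $tag ); // <a class="link">
```

Without the escaping, characters such as ?, . and [ in the attribute value would be treated as regular expression syntax rather than literal text.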

Jason Replied at 7:00 PM on 25 Aug 2010

Cornelius Parkin Replied at 5:10 AM on 4 Mar 2011

Very cool class :-). Thanks for that...

bobbo Replied at 10:14 AM on 14 Apr 2011

A bug: line 59 throws an error if there are no attributes to remove.

if( !$nodes[0] )

works better as

if( empty($nodes) )
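
To see the difference between the two checks, here is a small standalone sketch. The $nodes array and the error handler are set up just for this example and are not taken from the class.

```php
<?php
// Collect anything PHP reports instead of printing it
$messages = array();
set_error_handler( function( $errno, $errstr ) use ( &$messages ) {
	$messages[] = $errstr;
	return true;
} );

// Mimics the case where no attributes were found
$nodes = array();

$old_check = !$nodes[0];      // reads a missing index, so PHP reports it
$new_check = empty( $nodes ); // asks the same question without reading index 0

restore_error_handler();

var_dump( $old_check, $new_check ); // both bool(true)
echo count( $messages ) . " message(s) reported\n";
```

Both checks come out true, but only the first one makes PHP complain about the missing index, and that complaint is what ends up in the output.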

mattw Replied at 1:13 PM on 14 Oct 2011

Thanks so much! You just saved me hours!!
