SEM Labs

Handcrafted Pixels, Code & Title Tags

A while back someone contacted me about their WordPress blog borking out on some of their posts. After a bit of poking about it became apparent that this was because WordPress doesn't allow high Unicode characters in the URL. At first, I thought this would just be a change to a line in .htaccess, but there are a couple of other things that need to be changed too.

In the process of making this site, I needed a facility to strip attributes from HTML elements. My first stop was the strip_tags page in the PHP manual. However, the functions on there were pretty poor and borked out a lot. Google didn't provide any better results, so I ended up having to make one. The results are pretty good. After a few tweaks I ran over 10,000 tests on different web pages and didn't have any problems.
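To give a rough idea of what stripping attributes involves, here is a minimal sketch. It is not the PHP function described above; it is written in Python so it can be run on its own, and the names `AttributeStripper`, `strip_attributes` and the `keep` whitelist are made up for illustration. It assumes a real parser is preferable to regular expressions and simply re-emits the markup without the unwanted attributes.

```python
from html import escape
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit HTML with every attribute removed except those in `keep`."""

    def __init__(self, keep=()):
        # convert_charrefs=False so entities such as &amp; pass through untouched.
        super().__init__(convert_charrefs=False)
        self.keep = {name.lower() for name in keep}
        self.parts = []

    def _render(self, tag, attrs, closing):
        kept = []
        for name, value in attrs:  # the parser lowercases tag and attribute names
            if name not in self.keep:
                continue
            if value is None:  # bare attribute, e.g. <input disabled>
                kept.append(f" {name}")
            else:  # values arrive unescaped, so escape them again
                kept.append(f' {name}="{escape(value, quote=True)}"')
        return f"<{tag}{''.join(kept)}{closing}"

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs, ">"))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_attributes(html, keep=()):
    """Return `html` with all attributes removed except the names in `keep`."""
    parser = AttributeStripper(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<p class="x" id="y">Hi <a href="/a?b=1&amp;c=2" onclick="evil()">link</a></p>'
print(strip_attributes(sample, keep=("href",)))
```

Passing no `keep` list strips everything; a whitelist tends to be safer than a blacklist here, because unknown attributes are dropped by default.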

A common task in SEO scripting and dealing with APIs is downloading paged data – page iteration. About a year ago I created a class to make this task a bit easier. It supports downloading paginated data that uses GET or POST to move the cursor on. To make sure it doesn't whir away when there is no data left, it has a callback function that is called after each URL has been downloaded. You can use this to run a regex or whatever on each page to make sure there is still data to be scraped.
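The callback idea can be sketched in a few lines. This is not the class itself, just an illustration in Python; `iterate_pages`, `fetch`, `still_has_data` and the `max_pages` safety cap are hypothetical names, and the sketch only covers the GET case with the page number in the URL. A fake `fetch` stands in for the real download so the example runs without a network.

```python
import re


def iterate_pages(url_template, fetch, still_has_data, max_pages=100):
    """Yield page bodies until the callback reports there is no data left.

    url_template   -- URL containing a {page} placeholder (GET pagination)
    fetch          -- function taking a URL and returning the page body
    still_has_data -- callback run on each downloaded body; False stops iteration
    max_pages      -- hard cap so a misbehaving callback cannot loop forever
    """
    for page in range(1, max_pages + 1):
        body = fetch(url_template.format(page=page))
        if not still_has_data(body):
            return  # nothing left to scrape, stop moving the cursor on
        yield body


# Stand-in for a real HTTP client: three pages, the last one empty.
FAKE_SITE = {
    "https://example.com/items?page=1": "<li>a</li><li>b</li>",
    "https://example.com/items?page=2": "<li>c</li>",
    "https://example.com/items?page=3": "<p>No results</p>",
}


def fake_fetch(url):
    return FAKE_SITE.get(url, "")


def has_items(body):
    # The regex check: keep going only while the page still lists items.
    return re.search(r"<li>", body) is not None


pages = list(iterate_pages("https://example.com/items?page={page}", fake_fetch, has_items))
print(len(pages))
```

For POST pagination the same loop would apply, with the page number moved from the URL into the request body.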

This is an object-oriented wrapper for PHP's cURL support, designed to cut down on the amount of bloat that is required to deal with cURL in PHP. It supports multi-threading and has a built-in retry facility that will try to re-download a URL a given number of times if it receives an HTTP status code of 400 or higher.
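The retry behaviour boils down to a small loop. The sketch below is not the wrapper's code; it is a Python illustration, and `fetch_with_retry`, `max_retries`, `delay` and the `(status, body)` return shape are assumptions. A fake downloader that fails twice before succeeding is used so the example runs offline.

```python
import time


def fetch_with_retry(fetch, url, max_retries=3, delay=0.0):
    """Download `url`, retrying while the HTTP status code is 400 or higher.

    fetch       -- function taking a URL and returning (status_code, body)
    max_retries -- how many re-downloads are allowed after the first attempt
    delay       -- seconds to wait between attempts
    Returns (status_code, body, attempts_made).
    """
    attempts = 0
    while True:
        status, body = fetch(url)
        attempts += 1
        if status < 400 or attempts > max_retries:
            return status, body, attempts
        if delay:
            time.sleep(delay)


def make_flaky_fetch(statuses):
    """Build a fake downloader that returns the given status codes in order."""
    remaining = list(statuses)

    def fetch(url):
        # Once the list is down to one status, keep returning it.
        status = remaining.pop(0) if len(remaining) > 1 else remaining[0]
        return status, "ok" if status < 400 else "error"

    return fetch


status, body, attempts = fetch_with_retry(
    make_flaky_fetch([503, 503, 200]), "https://example.com/"
)
print(status, body, attempts)
```

Whether 4xx responses are worth retrying is debatable, since most of them will not change on a second attempt; the sketch follows the 400-and-above rule described above.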