SEM Labs

Handcrafted Pixels, Code & Title Tags

Page Iterator Class for PHP

A common task in SEO scripting and when dealing with APIs is downloading paged data, or page iteration. About a year ago I created a class to make this task a bit easier. It supports downloading paginated data that uses GET or POST to advance the cursor. To make sure it doesn't whir away when there is no data left, it has a callback function that is called after each URL has been downloaded. You can use this to run a regex (or whatever you like) on each page to make sure there is still data to be scraped.

This class requires my cURL class, which is included in the package. You'll need to require it before the PageIterator class will work.

The usage example builds a paginated page downloader in four lines. In this example http://www.google.co.uk/search is set as the base URL to be iterated. Some GET data is then set, in this case a query for 'sem labs'. Next, any necessary cURL options are set. Finally the iterate method is called, which downloads the data. Its arguments, in order:

  1. Sets the initial value of the iterator, in this case: http://www.google.co.uk/search?start=0
  2. Sets the amount that the iterator value should be increased for each page. So the second page will be: http://www.google.co.uk/search?start=10
  3. Sets the number of iterations you want to carry out
  4. Sets the name of the iterator, in this case: $_GET[start]
  5. Sets whether the iterator is GET or POST
  6. A callback function to be called for each downloaded URL
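
The six arguments can be sketched with a small stand-in class. This is not the packaged PageIterator: the names PageIteratorSketch, setGet and the injected $fetcher are hypothetical and exist only so the snippet runs on its own. A stub fetcher is used so that no request is actually sent. Skipping the page on which the callback fires, and sending the cursor alone in the POST body, are also assumptions.

```php
<?php
// Hypothetical stand-in, NOT the SEM Labs PageIterator class. It only mirrors
// the six iterate() arguments described in the list above.
class PageIteratorSketch
{
    private $baseUrl;
    private $get = array();
    private $fetcher;

    // $fetcher is an assumed injection point: callable($url, array $post): string
    public function __construct($baseUrl, callable $fetcher)
    {
        $this->baseUrl = $baseUrl;
        $this->fetcher = $fetcher;
    }

    // Fixed GET data sent with every page, e.g. the search query.
    public function setGet(array $get)
    {
        $this->get = $get;
    }

    public function iterate($start, $step, $iterations, $name, $method, callable $callback)
    {
        $pages = array();
        for ($i = 0; $i < $iterations; $i++) {
            // Cursor value for this page: start, start + step, start + 2 * step, ...
            $cursor = array($name => $start + $i * $step);
            $get  = ($method === 'GET') ? $this->get + $cursor : $this->get;
            $post = ($method === 'POST') ? $cursor : array();
            $url  = $this->baseUrl . ($get ? '?' . http_build_query($get) : '');

            $body = call_user_func($this->fetcher, $url, $post);
            if (call_user_func($callback, $body)) {
                break; // callback returned true: assume no data is left
            }
            $pages[$url] = $body;
        }
        return $pages;
    }
}

// A real fetcher could wrap PHP's cURL extension like this (not called below).
$curlFetcher = function ($url, array $post) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if ($post) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
    }
    $body = curl_exec($ch);
    return $body === false ? '' : $body;
};

// Stub fetcher for the demo: records each URL instead of downloading it.
$requested = array();
$stubFetcher = function ($url, array $post) use (&$requested) {
    $requested[] = $url;
    return '<html>results for ' . $url . '</html>';
};

$iterator = new PageIteratorSketch('http://www.google.co.uk/search', $stubFetcher);
$iterator->setGet(array('q' => 'sem labs'));
$pages = $iterator->iterate(0, 10, 3, 'start', 'GET', function ($body) {
    return false; // never stop early in this demo
});
print_r(array_keys($pages));
```

Swapping $stubFetcher for $curlFetcher would send real requests. The iteration count caps the run even if the callback never returns true.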

As an example of a callback function, I use one that checks there is still data to be downloaded from an HTML source.

When the callback function returns true it will kill the iterator, making sure you're not downloading duds.
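
A callback along these lines might look like the sketch below. The function name noResultsLeft, the single string argument holding the page HTML, and the `<li class="result">` marker are all assumptions made for illustration. The only behaviour taken from the class itself is that returning true stops the iterator.

```php
<?php
// Hypothetical callback: the one-argument signature (page HTML as a string)
// is an assumption about what the iterator passes in.
function noResultsLeft($html)
{
    // Assumed marker: each result on the page is wrapped in <li class="result">.
    $hasResults = preg_match('/<li\s+class="result"/i', $html) === 1;

    // Returning true tells the iterator to stop.
    return !$hasResults;
}

var_dump(noResultsLeft('<ul><li class="result">SEM Labs</li></ul>')); // bool(false)
var_dump(noResultsLeft('<p>No more results.</p>'));                    // bool(true)
```

The pattern would need adjusting to whatever markup the target pages actually use for a result row.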

This class is available under the MIT License.

Comments

Jez Replied at 10:19 PM on 27 Jan 2009

Cool script. I will be using this on my phone site to deal with the affiliate feeds. Was a bit of a pain dealing with them before. This should make it easier.

SJL Web Design Replied at 12:08 PM on 12 Mar 2009

Excellent script, another great resource. Keep up the good work.
