Friday, March 25, 2011

Crawling URLs - PHP

Hi,

In this post I will give the source code for crawling the URLs found on a particular webpage. The main thing to keep in mind is pattern matching with the preg_match() and preg_match_all() functions. To do this, I first fetch the entire content of the page at a given URL, and then parse it to extract the hyperlinks found in the page.

In this post I will only show how to extract the URLs. But you can use the same logic for many tasks, such as building a search engine that stores all the URLs found in a webpage along with its metadata, or creating a sitemap for your website. The more you think about it, the more ways you will find to put this code to use.
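To make the sitemap idea a little more concrete, here is a minimal sketch that turns an array of already crawled URLs into sitemap XML. The function name buildSitemap() is made up for this illustration, and it assumes the URLs are already absolute.

```php
<?php
// Hypothetical helper (not part of the crawler in this post):
// build a minimal sitemap document from a list of absolute URLs.
function buildSitemap(array $urls) {
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    // array_unique() drops URLs that were found more than once.
    foreach (array_unique($urls) as $url) {
        // Escape &, <, > and quotes so the XML stays well-formed.
        $xml .= "  <url><loc>" . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . "</loc></url>\n";
    }
    return $xml . "</urlset>\n";
}
```

You could write the returned string to a file such as sitemap.xml with file_put_contents().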

So the code is:


<?php
// Time how long it takes to fetch a URL, in seconds (two decimal places).
function reqtime($url){
    $start = microtime(true);
    @file_get_contents($url);   // errors are suppressed; a failed fetch is still timed
    $end = microtime(true);
    return number_format($end - $start, 2);
}

// Assumption: $base was never defined in the original listing.
// The root of the crawled site is used here.
$base = "http://www.blogger.com";

$body = file_get_contents("http://www.blogger.com/posts.g?blogID=3662517344220871310");
$out = array();
preg_match_all( "/(\<a.*?\>)/is", $body, $matches );
$count = 0;
foreach ( $matches[0] as $match )
{
    $count++;

    if ( preg_match( "/href=(.*?)[\s>]/i", $match, $href ) )
    {
        // Strip the surrounding quotes and spaces from the captured value.
        $href = trim( $href[1], "\"' " );

        if ( preg_match( "/^mailto:/", $href ) || preg_match( "/^javascript:/", $href ) || preg_match( "/^#/", $href ))
        {
            // Not a link to a page: skip it.
        }
        elseif ( preg_match( "/^https?:\/\//i", $href ) )
        {
            // Absolute URL: keep it only if it belongs to the same site.
            if ( strpos( $href, $base ) === 0 )
            {
                $out[] = $href . " (time: " . reqtime( $href ) . " secs)";
            }
        }
        else
        {
            // Relative URL: resolve it against the base.
            $full = rtrim( $base, "/" ) . "/" . ltrim( $href, "/" );
            $out[] = $full . " (time: " . reqtime( $full ) . " secs)";
        }
    }

    if ( $count == 5 ) { break; }   // only the first five anchor tags are examined
}
echo '<pre>';
print_r($out);
echo '</pre>';
?>


In the above code I first use pattern matching to pick out all the anchor tags:
preg_match_all( "/(\<a.*?\>)/is", $body, $matches );
Then I filter out href values that are not links to pages (mailto: addresses, javascript: calls and in-page # anchors) with:

if ( preg_match( "/^mailto:/", $href ) || preg_match( "/^javascript:/", $href ) || preg_match( "/^#/", $href ))

Links that pass these checks are stored in the array $out, each followed by the time taken to load it. Finally I display the array with print_r().
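Regular expressions can miss anchors with unusual spacing or quoting, so as an alternative here is a rough sketch of the same extract-and-filter step using PHP's built-in DOMDocument class instead of preg_match(). This is a different technique from the one in the listing above, and the function name extractLinks() is made up for this example.

```php
<?php
// Hypothetical helper: collect crawlable href values from an HTML string
// using the DOM parser rather than regular expressions.
function extractLinks($html) {
    $links = array();
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        // Same exclusions as in the post: mailto:, javascript: and in-page anchors.
        if ($href === '' || preg_match('/^(mailto:|javascript:|#)/i', $href)) {
            continue;
        }
        $links[] = $href;
    }
    return $links;
}
```

The returned values are still raw href strings, so relative links would need to be resolved against the site's base URL before they are fetched.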


I have used the following function to calculate the time needed to load a webpage:

function reqtime($url){
    $start=microtime(true);
    @file_get_contents($url);
    $end=microtime(true);
    return number_format($end-$start,2);
}
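Note that reqtime() returns a formatted string, and because errors are suppressed with @, a request that fails is timed just like one that succeeds. If you want to tell the two apart, a variant along these lines might help. The name reqtimeOrNull() is hypothetical, and this is only a sketch.

```php
<?php
// Hypothetical variant of reqtime(): returns the elapsed seconds as a float,
// or null when the content could not be fetched.
function reqtimeOrNull($url) {
    $start = microtime(true);
    $content = @file_get_contents($url);   // false on failure
    $elapsed = microtime(true) - $start;
    return $content === false ? null : round($elapsed, 2);
}
```

A caller can then skip or flag the links for which the function returns null instead of listing a misleading load time.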

Cheers!!
