Hi,
In this post I will give the source code for crawling the URLs found on a particular webpage. The main thing to keep in mind is pattern matching with the preg_match() function. To do this, I first fetch the entire content of the page at a given URL, and then parse it to extract the hyperlinks it contains.
In this post I will only show how to extract the URLs. But you can apply the same logic to many tasks, such as building a search engine that stores every URL found on a webpage along with its metadata, or generating a sitemap for your website. The more you think about it, the more ways you may find to put this code to use.
So the code is:
<?php
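// Return the time taken to fetch $url, in seconds, formatted to two decimals.
function reqtime($url){
    $start = microtime(true);
    @file_get_contents($url); // errors are suppressed; a failed request is still timed
    $end = microtime(true);
    return number_format($end - $start, 2);
}

$url = "http://www.blogger.com/posts.g?blogID=3662517344220871310";
// The original listing never defined $base; it is assumed here to be the site root of $url.
$parts = parse_url($url);
$base = $parts['scheme'] . "://" . $parts['host'];

$body = file_get_contents($url);
if ( $body === false ) { die("Could not fetch " . $url); }
$out = array();
// Collect every opening anchor tag in the page.
preg_match_all( "/(\<a.*?\>)/is", $body, $matches );
$count = 0;
foreach( $matches[0] as $match )
{
    $count++;
    // Pull out the href value; preg_match() returns 0 when the tag has no href.
    if ( preg_match( "/href=(.*?)[\s>]/i", $match, $href ) )
    {
        // Strip surrounding quotes and whitespace.
        $href = trim( $href[1], "\"' " );
        if ( preg_match( "/^mailto:/", $href ) || preg_match( "/^javascript:/", $href ) || preg_match( "/^#/", $href ))
        {
            // Not a page link: skip it.
        }
        elseif ( preg_match( "/^https?:\/\//i", $href ) )
        {
            // Absolute URL: keep it only if it starts with $base.
            // (The original '/^$base/' was single-quoted, so $base was never interpolated.)
            if ( strpos( $href, $base ) === 0 )
            {
                $out[] = $href . " (time: " . reqtime($href) . " secs)";
            }
        }
        else
        {
            // Relative URL: resolve it against the site root.
            $link = $base . "/" . ltrim( $href, "/" );
            $out[] = $link . " (time: " . reqtime($link) . " secs)";
        }
    }
    if ( $count == 5 ) { break; } // only the first five anchor tags are inspected
}
echo '<pre>';
print_r($out);
echo '</pre>';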
?>
In the above code I first use pattern matching to pick out all the anchor tags:
preg_match_all( "/(\<a.*?\>)/is", $body, $matches );
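To see what the anchor-tag pattern from the listing captures, here is a minimal sketch run against a small inline string instead of a fetched page. The sample markup and the `$html` and `$tags` names are made up for illustration.

```php
<?php
// Hypothetical sample markup, standing in for a fetched page.
$html = '<p><a href="/about">About</a> and <A HREF="http://example.com/x" class="ext">X</A></p>';

// Same pattern as in the listing: lazily match from "<a" up to the next ">".
// The "i" flag makes it case-insensitive; "s" lets "." match newlines too.
preg_match_all("/(\<a.*?\>)/is", $html, $matches);

// $matches[0] holds the full text of each matched opening tag.
$tags = $matches[0];
print_r($tags);
```

Only the opening tags are captured, not the link text or the closing `</a>`. Note that the pattern has no word boundary after `<a`, so it would also pick up tags such as `<abbr>`; a stricter pattern may be preferable for real pages.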
Then I check the value of each href attribute, skipping mailto:, javascript: and in-page (#) links:
if ( preg_match( "/^mailto:/", $href ) || preg_match( "/^javascript:/", $href ) || preg_match( "/^#/", $href ))
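The same check can be wrapped in a small function and tried on a few values. This is only a sketch: the `is_crawlable` name and the sample hrefs are hypothetical and do not appear in the listing.

```php
<?php
// Hypothetical helper: true when $href is not a mailto:, javascript: or "#" link.
function is_crawlable($href) {
    return !( preg_match("/^mailto:/", $href)
           || preg_match("/^javascript:/", $href)
           || preg_match("/^#/", $href) );
}

// Made-up sample values covering each rejected case plus two ordinary links.
$samples = array(
    "mailto:me@example.com",
    "javascript:void(0)",
    "#top",
    "/contact",
    "http://example.com/",
);

// array_filter() keeps the entries for which the callback returns true.
$kept = array_values(array_filter($samples, 'is_crawlable'));
print_r($kept);
```

Here only "/contact" and "http://example.com/" survive the filter. The three patterns are anchored with `^`, so a URL that merely contains "#" somewhere in the middle is still kept.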
Links that pass these checks are stored in the array $out. Finally, I print the array; each entry is a URL followed by the time taken to load it.
I have used the following function to calculate the time needed to load a webpage:
function reqtime($url){
    $start = microtime(true);
    @file_get_contents($url);
    $end = microtime(true);
    return number_format($end - $start, 2);
}
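Because reqtime() suppresses errors with @, a failed request is timed just like a successful one. A possible variant that also reports whether the fetch worked is sketched below; the `timed_fetch` name and its return shape are assumptions for illustration, and a local temporary file stands in for a URL so the sketch runs offline.

```php
<?php
// Hypothetical variant of reqtime(): also reports whether the fetch succeeded.
function timed_fetch($url) {
    $start = microtime(true);
    $body = @file_get_contents($url);      // returns false on failure
    $elapsed = microtime(true) - $start;
    return array(
        'ok'   => $body !== false,
        'secs' => number_format($elapsed, 2), // a string such as "0.00"
    );
}

// file_get_contents() also reads local paths, so a temp file can play the page.
$tmp = tempnam(sys_get_temp_dir(), 'crawl');
file_put_contents($tmp, '<a href="/x">x</a>');

$hit  = timed_fetch($tmp);               // existing file: fetch succeeds
$miss = timed_fetch($tmp . '.missing');  // nonexistent path: fetch fails

unlink($tmp);
print_r($hit);
print_r($miss);
```

The 'secs' value is a formatted string rather than a float, exactly as in reqtime(), so it is meant for display rather than further arithmetic.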
Cheers!!