Check for Broken Links with PHP Part 2: Capture Redirected Links

The link checker from the previous post was designed to report obviously broken links. There is, however, another type of link problem that it doesn't report. When a web page is renamed or moved, a redirect may be created so visitors to the old address are automatically sent to the new location. The link still works, but it points to an outdated address. To detect these redirected links, we'll need to make a few minor modifications.

Current Code

In the last post (Check for Broken Links with PHP), we built a script that accepts a website address through a GET variable. The script then extracts all the links from the page and checks whether each one leads to a web page.

<?php
//INITIALIZE VARIABLES
$pageToCheck    = $_GET['link'];
$badLinks       = array();
$goodLinks      = array();
$badStatusCodes = array('308', '404');
 
//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;
 
//IF THE PAGE BEING CHECKED LOADS
if(@$domDoc->loadHTMLFile($pageToCheck)) { //note that errors are suppressed so DOMDocument doesn't complain about XHTML
     //LOOP THROUGH ALL ANCHOR TAGS ON THE PAGE
     $pageLinks = $domDoc->getElementsByTagName('a');
     foreach($pageLinks as $currLink) {
          //LOOP THROUGH ATTRIBUTES FOR CURRENT LINK
          foreach($currLink->attributes as $attributeName=>$attributeValue) {
               //IF CURRENT ATTRIBUTE CONTAINS THE WEBSITE ADDRESS
               if($attributeName == 'href') {
                    //INITIALIZE CURL AND TEST THE LINK
                    $ch = curl_init($attributeValue->value);
                    curl_setopt($ch, CURLOPT_NOBODY, true);
                    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
                    curl_exec($ch);
                    $returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
                    curl_close($ch);
 
                    //TRACK THE RESPONSE
                    if(in_array($returnCode, $badStatusCodes)) {
                         $badLinks[]  = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                    } else {
                         $goodLinks[] = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                    }
               }
          }
     }
 
     //DISPLAY RESULTS
     print '<h2>Bad Links</h2>';
     print '<pre>' . print_r($badLinks, true) . '</pre>';
     print '<h2>Good Links</h2>';
     print '<pre>' . print_r($goodLinks, true) . '</pre>';
}
?>
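Because the script fetches whatever address arrives in the link variable, it may be worth validating that value before handing it to DOMDocument. The following is only a sketch of one way to do that. The getPageToCheck() helper is a hypothetical name that isn't part of the original script, and limiting the scheme to http and https is an assumption about what the checker should accept.

```php
<?php
//HYPOTHETICAL HELPER (NOT PART OF THE ORIGINAL SCRIPT)
//RETURNS THE ADDRESS TO CHECK, OR NULL IF IT'S MISSING OR DOESN'T LOOK LIKE AN HTTP(S) URL
function getPageToCheck(array $query) {
     //MAKE SURE THE VARIABLE WAS PASSED AS A STRING
     if(!isset($query['link']) || !is_string($query['link'])) {
          return null;
     }

     //MAKE SURE THE VALUE IS A WELL-FORMED URL
     $link = trim($query['link']);
     if(filter_var($link, FILTER_VALIDATE_URL) === false) {
          return null;
     }

     //ASSUMPTION: ONLY HTTP AND HTTPS PAGES SHOULD BE CHECKED
     $scheme = strtolower((string) parse_url($link, PHP_URL_SCHEME));
     if(!in_array($scheme, array('http', 'https'), true)) {
          return null;
     }

     return $link;
}

//USAGE
$pageToCheck = getPageToCheck($_GET);
if($pageToCheck === null) {
     print '<p>Please provide a valid address in the "link" variable.</p>';
}
?>
```

With a check like this in place, the rest of the script would only run when $pageToCheck isn't null.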

Detect Redirects

Like our good/bad links, we'll need a variable to store the links that changed.

<?php
//INITIALIZE VARIABLES
$pageToCheck    = $_GET['link'];
$badLinks       = array();
$goodLinks      = array();
$changedLinks   = array();
$badStatusCodes = array('308', '404');
 
//...
?>

Then we'll modify the cURL request to see if an address is redirected. This can be accomplished with curl_getinfo() and its option to get the last effective URL.

<?php
//...
 
//INITIALIZE CURL AND TEST THE LINK
$ch = curl_init($attributeValue->value);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
$returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$finalURL   = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
 
//...
?>
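If the request logic is needed in more than one place, it could be pulled into its own function. The sketch below is one possible way to do that and is not part of the original script: checkLink() is a hypothetical name, and the timeout and redirect limits are assumed values meant to keep a slow or looping address from stalling the script. It also reads CURLINFO_REDIRECT_COUNT, which reports how many redirects cURL followed.

```php
<?php
//HYPOTHETICAL HELPER (NOT PART OF THE ORIGINAL SCRIPT)
//REQUESTS AN ADDRESS AND RETURNS THE STATUS CODE, FINAL URL, AND NUMBER OF REDIRECTS FOLLOWED
function checkLink($url) {
     $ch = curl_init($url);
     curl_setopt($ch, CURLOPT_NOBODY, true);          //request headers only
     curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  //follow redirects
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  //don't print the response
     curl_setopt($ch, CURLOPT_MAXREDIRS, 10);         //assumed limit to avoid redirect loops
     curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     //assumed limit (seconds) for connecting
     curl_setopt($ch, CURLOPT_TIMEOUT, 10);           //assumed limit (seconds) for the whole request
     curl_exec($ch);

     //THE HTTP CODE IS 0 WHEN NO RESPONSE WAS RECEIVED (E.G. THE SERVER COULDN'T BE REACHED)
     return array(
          'returnCode'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
          'finalURL'      => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
          'redirectCount' => curl_getinfo($ch, CURLINFO_REDIRECT_COUNT),
     );
}
?>
```

Calling checkLink($attributeValue->value) inside the loop would then provide the values needed to test the response.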

Within the code block for testing the response, an extra test is added to compare the link used on the website to the effective URL returned from cURL.

<?php
//...
 
//TRACK THE RESPONSE
if(in_array($returnCode, $badStatusCodes)) {
     $badLinks[]     = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
} elseif($finalURL != $attributeValue->value) {
     $changedLinks[] = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value, 'newLink'=>$finalURL);
} else {
     $goodLinks[]    = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
}
 
//...
?>
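The same three-way test can be written as a small function, which makes it easier to try out with sample values. This is only a sketch: classifyLink() and its parameter names are hypothetical and don't appear in the original script.

```php
<?php
//HYPOTHETICAL HELPER (NOT PART OF THE ORIGINAL SCRIPT)
//RETURNS 'bad', 'changed', OR 'good' FOR A TESTED LINK
function classifyLink($returnCode, $originalURL, $finalURL, array $badStatusCodes) {
     //THE STATUS CODE FROM CURL IS AN INTEGER, SO IT'S CAST TO A STRING TO MATCH THE LIST
     if(in_array((string) $returnCode, $badStatusCodes, true)) {
          return 'bad';
     } elseif($finalURL !== $originalURL) {
          return 'changed';
     } else {
          return 'good';
     }
}

//SAMPLE VALUES
$badStatusCodes = array('308', '404');
print classifyLink(404, 'http://www.example.com/missing.html', 'http://www.example.com/missing.html', $badStatusCodes) . "\n"; //bad
print classifyLink(200, 'http://www.example.com/old.html', 'http://www.example.com/new.html', $badStatusCodes) . "\n";          //changed
print classifyLink(200, 'http://www.example.com/', 'http://www.example.com/', $badStatusCodes) . "\n";                        //good
?>
```

Note that the order of the tests matters. A redirected address that ends at a missing page is reported as bad rather than changed.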

Then we just display the results.

<?php
//...
 
//DISPLAY RESULTS
print '<h2>Bad Links</h2>';
print '<pre>' . print_r($badLinks, true) . '</pre>';
print '<h2>Changed Links</h2>';
print '<pre>' . print_r($changedLinks, true) . '</pre>';
print '<h2>Good Links</h2>';
print '<pre>' . print_r($goodLinks, true) . '</pre>';
 
//...
?>
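print_r() is fine for a quick look, but link text pulled from another page can contain characters that have special meaning in HTML. As a rough sketch of an alternative, the hypothetical renderLinks() function below (not part of the original script) builds an escaped HTML list from one of the result arrays.

```php
<?php
//HYPOTHETICAL HELPER (NOT PART OF THE ORIGINAL SCRIPT)
//BUILDS AN HTML LIST FROM AN ARRAY OF LINKS, ESCAPING EACH VALUE
function renderLinks($heading, array $links) {
     $html = '<h2>' . htmlspecialchars($heading, ENT_QUOTES, 'UTF-8') . '</h2>';
     if(empty($links)) {
          return $html . '<p>None found.</p>';
     }

     $html .= '<ul>';
     foreach($links as $currLink) {
          $html .= '<li>' . htmlspecialchars($currLink['name'], ENT_QUOTES, 'UTF-8');
          $html .= ' (' . htmlspecialchars($currLink['link'], ENT_QUOTES, 'UTF-8') . ')';
          //CHANGED LINKS ALSO HAVE THE ADDRESS THEY WERE REDIRECTED TO
          if(isset($currLink['newLink'])) {
               $html .= ' now points to ' . htmlspecialchars($currLink['newLink'], ENT_QUOTES, 'UTF-8');
          }
          $html .= '</li>';
     }
     return $html . '</ul>';
}

//SAMPLE VALUES
$changedLinks = array(
     array('name'=>'Old Page', 'link'=>'http://www.example.com/old.html', 'newLink'=>'http://www.example.com/new.html')
);
print renderLinks('Changed Links', $changedLinks);
?>
```

The same function could be called once for each of the three arrays, since the newLink value is only displayed when it exists.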

Final Code

To give a better sense of how the pieces fit together, here is the entire script:

<?php
//INITIALIZE VARIABLES
$pageToCheck    = $_GET['link'];
$badLinks       = array();
$goodLinks      = array();
$changedLinks   = array();
$badStatusCodes = array('308', '404');
 
//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;
 
//IF THE PAGE BEING CHECKED LOADS
if(@$domDoc->loadHTMLFile($pageToCheck)) { //note that errors are suppressed so DOMDocument doesn't complain about XHTML
     //LOOP THROUGH ALL ANCHOR TAGS ON THE PAGE
     $pageLinks = $domDoc->getElementsByTagName('a');
     foreach($pageLinks as $currLink) {
          //LOOP THROUGH ATTRIBUTES FOR CURRENT LINK
          foreach($currLink->attributes as $attributeName=>$attributeValue) {
               //IF CURRENT ATTRIBUTE CONTAINS THE WEBSITE ADDRESS
               if($attributeName == 'href') {
                    //INITIALIZE CURL AND TEST THE LINK
                    $ch = curl_init($attributeValue->value);
                    curl_setopt($ch, CURLOPT_NOBODY, true);
                    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
                    curl_exec($ch);
                    $returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
                    $finalURL   = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
                    curl_close($ch);
 
                    //TRACK THE RESPONSE
                    if(in_array($returnCode, $badStatusCodes)) {
                         $badLinks[]     = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                    } elseif($finalURL != $attributeValue->value) {
                         $changedLinks[] = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value, 'newLink'=>$finalURL);
                    } else {
                         $goodLinks[]    = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                    }
               }
          }
     }
 
     //DISPLAY RESULTS
     print '<h2>Bad Links</h2>';
     print '<pre>' . print_r($badLinks, true) . '</pre>';
     print '<h2>Changed Links</h2>';
     print '<pre>' . print_r($changedLinks, true) . '</pre>';
     print '<h2>Good Links</h2>';
     print '<pre>' . print_r($goodLinks, true) . '</pre>';
}
?>

Conclusion

As mentioned in the last post, the above code was mostly meant as an experiment. There are link-checking services available if you're not interested in building your own. This code just gave me the chance to dig further into cURL, which should be useful for future projects.
