Check for Broken Links with PHP Part 3: Targeted Search

Over the past few weeks, we've been building a broken-link checker that uses PHP's cURL library and DOMDocument. As the code stands, the script checks every link within a given page. That's great if we want to check every link, but what if we only want to target a specific section of a page? Let's take a look at how this can be accomplished.

Background

When writing the broken-link checker, I was mostly interested in checking a handful of pages which have hundreds of links in the main content area. That section of the website is enclosed by a <div> tag like the following:

<div id="mainContent">
     main content for the page goes here
</div>

Since the <div> tag has an id attribute, it can be targeted with DOMDocument's getElementById method. Using the code from the last post, we'll change the part that grabs all the anchor tags so it only gets the ones within the "mainContent" <div>.

<?php
//...
 
//IF THE PAGE BEING CHECKED LOADS
if(@$domDoc->loadHTMLFile($pageToCheck)) { //note that errors are suppressed so DOMDocument doesn't complain about XHTML
     //LOOP THROUGH ANCHOR TAGS IN THE MAIN CONTENT AREA
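     $mainContent = $domDoc->getElementById('mainContent');
     //getElementById() returns null if no element has that id, so fall back to an empty list
     $pageLinks   = ($mainContent !== null) ? $mainContent->getElementsByTagName('a') : array();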

     foreach($pageLinks as $currLink) {
 
//...
?>

From there, we just need to run the cURL requests as before.
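
To try the targeting step on its own, here is a minimal sketch. It assumes a made-up HTML string (the example.com addresses and the "nav" <div> are placeholders, not part of the original script) and uses loadHTML() on that string instead of loadHTMLFile(), so it runs without a live page. Only the anchors inside the "mainContent" <div> should be collected.

```php
<?php
//HYPOTHETICAL SAMPLE MARKUP (the real script loads a page with loadHTMLFile())
$html = '<html><body>'
      . '<div id="nav"><a href="http://example.com/nav">Nav</a></div>'
      . '<div id="mainContent">'
      . '<a href="http://example.com/one">One</a>'
      . '<a href="http://example.com/two">Two</a>'
      . '</div>'
      . '</body></html>';

$domDoc = new DOMDocument;
$hrefs  = array();

//IF THE SAMPLE MARKUP LOADS
if(@$domDoc->loadHTML($html)) {
     $mainContent = $domDoc->getElementById('mainContent');

     //getElementById() returns null when no element has the given id
     if($mainContent !== null) {
          //COLLECT THE HREF OF EACH ANCHOR INSIDE THE MAIN CONTENT AREA
          foreach($mainContent->getElementsByTagName('a') as $currLink) {
               $hrefs[] = $currLink->getAttribute('href');
          }
     }
}

print_r($hrefs);
```

The link in the "nav" <div> is skipped because the search starts from the "mainContent" element rather than from the whole document.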

Final Code

To give a better sense of how the pieces fit together, here is the entire script:

<?php
//INITIALIZE VARIABLES
$pageToCheck    = $_GET['link'];
$badLinks       = array();
$goodLinks      = array();
$changedLinks   = array();
$badStatusCodes = array('308', '404');
 
//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;
 
//IF THE PAGE BEING CHECKED LOADS
if(@$domDoc->loadHTMLFile($pageToCheck)) { //note that errors are suppressed so DOMDocument doesn't complain about XHTML
     //LOOP THROUGH ANCHOR TAGS IN THE MAIN CONTENT AREA
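     $mainContent = $domDoc->getElementById('mainContent');
     //getElementById() returns null if no element has that id, so fall back to an empty list
     $pageLinks   = ($mainContent !== null) ? $mainContent->getElementsByTagName('a') : array();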
     foreach($pageLinks as $currLink) {
          //LOOP THROUGH ATTRIBUTES FOR CURRENT LINK
          foreach($currLink->attributes as $attributeName=>$attributeValue) {
               //IF CURRENT ATTRIBUTE CONTAINS THE WEBSITE ADDRESS
               if($attributeName == 'href') {
                    //INITIALIZE CURL AND TEST THE LINK
                    $ch = curl_init($attributeValue->value);
                    curl_setopt($ch, CURLOPT_NOBODY, true);
                    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
                    curl_exec($ch);
                    $returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
                    $finalURL   = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
                    curl_close($ch);
 
                    //TRACK THE RESPONSE
                    if(in_array($returnCode, $badStatusCodes)) {
                         $badLinks[]     = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                    } elseif($finalURL != $attributeValue->value) {
                         $changedLinks[] = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value, 'newLink'=>$finalURL);
                    } else {
                         $goodLinks[]    = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                    }
               }
          }
     }
 
     //DISPLAY RESULTS
     print '<h2>Bad Links</h2>';
     print '<pre>' . print_r($badLinks, true) . '</pre>';
     print '<h2>Changed Links</h2>';
     print '<pre>' . print_r($changedLinks, true) . '</pre>';
     print '<h2>Good Links</h2>';
     print '<pre>' . print_r($goodLinks, true) . '</pre>';
}
?>
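
The "track the response" step in the script can be looked at separately from the cURL calls. The sketch below pulls that if/elseif/else into a function so the three outcomes are easy to see without making any requests. The function name classifyLink() and the example.com addresses are made up for illustration and are not part of the script above.

```php
<?php
//HYPOTHETICAL HELPER THAT MIRRORS THE IF/ELSEIF/ELSE IN THE SCRIPT
function classifyLink($returnCode, $finalURL, $originalURL, array $badStatusCodes) {
     if(in_array($returnCode, $badStatusCodes)) {
          return 'bad';      //status code is in the list of bad codes
     } elseif($finalURL != $originalURL) {
          return 'changed';  //cURL followed a redirect to a different address
     }
     return 'good';
}

$badStatusCodes = array('308', '404');

//SAMPLE VALUES STANDING IN FOR WHAT curl_getinfo() WOULD RETURN
print classifyLink(404, 'http://example.com/missing', 'http://example.com/missing', $badStatusCodes) . "\n";
print classifyLink(200, 'https://example.com/', 'http://example.com/', $badStatusCodes) . "\n";
print classifyLink(200, 'http://example.com/', 'http://example.com/', $badStatusCodes) . "\n";
```

Note that the bad-code check runs first, so a link that redirects and then ends on a 404 is reported as bad rather than changed.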

Conclusion

Of course, the script needs to be customized to meet your needs. Perhaps you're using a different ID for the main content area. If you don't use IDs, you could see if the other methods from the DOMDocument class work. Or you could use the code from the previous post, which checks all links. You'll just need to wait a little longer for the scan to complete.
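
If the section you want is marked with a class instead of an ID, one option is PHP's DOMXPath class, which works alongside DOMDocument. This is a different technique from the getElementById approach used in this post, so treat the following as a rough sketch. The class name "main", the sample markup, and the example.com addresses are all assumptions made for illustration.

```php
<?php
//HYPOTHETICAL SAMPLE MARKUP WITH A CLASS INSTEAD OF AN ID
$html = '<html><body>'
      . '<div class="sidebar"><a href="http://example.com/side">Side</a></div>'
      . '<div class="content main"><a href="http://example.com/a">A</a></div>'
      . '</body></html>';

$domDoc = new DOMDocument;
$hrefs  = array();

if(@$domDoc->loadHTML($html)) {
     $xpath = new DOMXPath($domDoc);

     //MATCH ANCHORS INSIDE ANY <div> WHOSE CLASS LIST INCLUDES "main"
     $query = '//div[contains(concat(" ", normalize-space(@class), " "), " main ")]//a[@href]';

     foreach($xpath->query($query) as $currLink) {
          $hrefs[] = $currLink->getAttribute('href');
     }
}

print_r($hrefs);
```

The concat() and normalize-space() wrapping keeps the query from matching class names that merely contain "main", such as "mainly".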

2 Comments

  • #1 atmiyadas on 09.16.14 at 1:50 pm

    hi,
    thanks for this script .
    i need check all page link check of website.
    currently check only single page link.

    if possible?
    then help me .

    thanks

  • #2 Patrick Nichols on 09.18.14 at 6:04 am

    @atmiyadas – There should be a way to modify the script so that it searches all pages of a website. I unfortunately haven't put enough thought into how that would be accomplished.

    If you're interested, the W3C has a free Link Checker here:
    http://validator.w3.org/checklink

    If the "Check linked documents recursively" option is selected and you indicate the recursion depth, it could be used to check all the links on your website.
