Check Your Error 404 Database for Broken Links with cURL

Some websites that I maintain have a database for tracking which pages have moved. The problem is that the links pointing to a page's new location can break over time. A visitor ends up with a 404 error saying the page has moved, follows the link to the new page, and is greeted with another 404 error. So let's write a script that looks through the database for broken links.

Background

For the sake of this example, we'll use the same database table as the previous post (Help Visitors Find Moved Pages with a Simple Error 404 Database).

id | date       | oldAddress               | message
1  | 2014-01-24 | /about/oldFile.pdf       | has been removed
2  | 2014-04-02 | /about/bio/johnsmith.php | has moved; <a href="/about/viewbio.php?id=1">view John Smith's bio</a>
3  | 2014-04-02 | /about/bio/jakebible.php | has moved; <a href="/about/viewbio.php?id=2">view Jake Bible's bio</a>
4  | 2014-05-01 | /resources/old_page.php  | has been removed

We're also going to leverage the code from the 3-part post titled "Check for Broken Links with PHP." Of course, the code will be modified to work with a database.

Checking for Broken Links

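The goal of this program is to check for broken links in the message field from the database. To do that, we'll need to connect to the database. Let's also create a few variables to be used later: three arrays for sorting the links into bad, changed, and good, plus a list of the status codes that count as bad. Finally, we'll create a DOMDocument object for parsing the messages.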

<?php
//CONNECT WITH DATABASE
require "{$_SERVER['DOCUMENT_ROOT']}/../database_connection.php";
$connect = new connect(true);
$mysqli  = $connect->databaseObject;
 
//INITIALIZE VARIABLES
$badLinks       = array();
$changedLinks   = array();
$goodLinks      = array();
$badStatusCodes = array('308', '404');
 
//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;
?>

Note that you can find more information about the above database connection script in the post titled "End PHP Scripts Gracefully After a Failed Database Connection." Next, we'll need the 404 error messages to loop through.

<?php
//...
 
//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;
 
//GET ERROR 404 ENTRIES
$sql = "SELECT id, message FROM error404";
$result = $mysqli->query($sql);
while($row = $result->fetch_assoc()) {
 
}

?>

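Since each message is a string of HTML, DOMDocument's loadHTML() method is used to parse it. The @ operator suppresses the warnings that loadHTML() can raise when the HTML is not a complete, well-formed document.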

<?php
//...
 
$result = $mysqli->query($sql);
while($row = $result->fetch_assoc()) {
     //IF ERROR 404 MESSAGE LOADS
     if(@$domDoc->loadHTML($row['message'])) {
 
     //ELSE...UNABLE TO LOAD MESSAGE FOR CHECKING
     } else {
          print '<div>DOMDocument failed</div>';
     }

}
?>

We can then loop through any anchor tag(s) embedded within the message looking for ones that have an "href" attribute.

<?php
//...
 
while($row = $result->fetch_assoc()) {
     //IF ERROR 404 MESSAGE LOADS
     if(@$domDoc->loadHTML($row['message'])) {
          //LOOP THROUGH ANCHOR TAGS IN THE ERROR 404 MESSAGE
          $messageLinks = $domDoc->getElementsByTagName('a');
          foreach($messageLinks as $currLink) {
               //LOOP THROUGH ATTRIBUTES FOR CURRENT ANCHOR TAG
               foreach($currLink->attributes as $attributeName=>$attributeValue) {
                    //IF CURRENT ATTRIBUTE CONTAINS A WEBSITE LINK
                    if($attributeName == 'href') {
 
                    }
               }
          }

 
     //ELSE...UNABLE TO LOAD MESSAGE FOR CHECKING
     } else {
          print '<div>DOMDocument failed</div>';
     }
}
?>
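If you'd like to try the link extraction on its own before wiring it to the database, here is a minimal sketch. The extractLinks() function is a hypothetical helper, not part of the script in this post. It assumes PHP 7 or later and swaps the @ operator for libxml_use_internal_errors(), and it uses hasAttribute() and getAttribute() instead of looping through every attribute.

```php
<?php
// Hypothetical helper: returns the link text and href of every anchor in an HTML fragment.
function extractLinks(string $html): array
{
    $links = array();

    // Nothing to parse
    if (trim($html) === '') {
        return $links;
    }

    $domDoc = new DOMDocument;
    $domDoc->preserveWhiteSpace = false;

    // Collect parser warnings internally instead of suppressing them with @
    $previous = libxml_use_internal_errors(true);
    $loaded   = $domDoc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return $links;
    }

    foreach ($domDoc->getElementsByTagName('a') as $anchor) {
        // Skip anchors that have no href attribute
        if ($anchor->hasAttribute('href')) {
            $links[] = array(
                'name' => $anchor->nodeValue,
                'link' => $anchor->getAttribute('href'),
            );
        }
    }

    return $links;
}

// Sample message taken from the table above
$message = 'has moved; <a href="/about/viewbio.php?id=1">view John Smith\'s bio</a>';
print_r(extractLinks($message));
```

A message without any anchor tags, such as "has been removed", simply produces an empty array.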

Since the database contains root-relative links, we'll need to convert them into absolute links.

<?php
//...
 
//IF CURRENT ATTRIBUTE CONTAINS A WEBSITE LINK
if($attributeName == 'href') {
     //IF LINK IS ROOT-RELATIVE, MAKE IT ABSOLUTE
     if(substr($attributeValue->value, 0, 1) == '/') {
          $attributeValue->value = 'http://www.yourwebsite.com' . $attributeValue->value;
     }

}
 
//...
?>
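The same conversion can be pulled into a small function, which makes it easier to test. The sketch below is only one way to do it. The makeAbsolute() name and the $baseUrl parameter are made up for this example, and the handling of protocol-relative links (ones that start with two slashes) is an extra that the script in this post does not include.

```php
<?php
// Hypothetical helper; $baseUrl is a placeholder for your own website address.
function makeAbsolute(string $href, string $baseUrl = 'http://www.yourwebsite.com'): string
{
    // Protocol-relative links (//example.com/page) already name a host,
    // so they only need a scheme. Borrow it from the base address.
    if (substr($href, 0, 2) === '//') {
        $scheme = parse_url($baseUrl, PHP_URL_SCHEME) ?: 'http';
        return $scheme . ':' . $href;
    }

    // Root-relative links get the base address prepended
    if (substr($href, 0, 1) === '/') {
        return rtrim($baseUrl, '/') . $href;
    }

    // Everything else (absolute URLs, mailto: links, etc.) is returned untouched
    return $href;
}

print makeAbsolute('/about/viewbio.php?id=1');
```

Note that links relative to the current folder, like "viewbio.php?id=1", are returned as-is. The example table only stores root-relative links, so that case is left out.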

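We're now ready to execute a cURL request to check if the link is still valid. CURLOPT_NOBODY asks for the headers only, and CURLOPT_FOLLOWLOCATION follows any redirects, so the final URL can be compared with the original. Each link is then added to one of the arrays created earlier.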

<?php
//...
 
//IF CURRENT ATTRIBUTE CONTAINS A WEBSITE LINK
if($attributeName == 'href') {
     //IF LINK IS ROOT-RELATIVE, MAKE IT ABSOLUTE
     if(substr($attributeValue->value, 0, 1) == '/') {
          $attributeValue->value = 'http://www.yourwebsite.com' . $attributeValue->value;
     }
 
     //RUN cURL TO CHECK THE LINK
     $ch = curl_init($attributeValue->value);
     curl_setopt($ch, CURLOPT_NOBODY, true);
     curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
     curl_exec($ch);
     $returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
     $finalURL   = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
     curl_close($ch);
 
     //PROCESS THE RETURN CODE
     if(in_array($returnCode, $badStatusCodes)) {
          $badLinks[]     = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
     } elseif($finalURL != $attributeValue->value) {
          $changedLinks[] = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value, 'newLink'=>$finalURL);
     } else {
          $goodLinks[]    = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
     }

}
 
//...
?>
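One case to watch for: when cURL gets no HTTP response at all (a DNS failure, a refused connection, etc.), CURLINFO_HTTP_CODE is 0. Since 0 isn't in $badStatusCodes, a link like that would not be reported as bad by the code above. Here is a rough sketch of the sorting logic as a standalone function with an extra "failed" bucket. The classifyLink() name and the "failed" label are assumptions made for this example.

```php
<?php
// Hypothetical helper: sorts one cURL result into 'failed', 'bad', 'changed', or 'good'.
function classifyLink(int $returnCode, string $requestedURL, string $finalURL, array $badStatusCodes): string
{
    // cURL reports 0 when no HTTP response was received
    if ($returnCode === 0) {
        return 'failed';
    }

    // Loose comparison, so 404 matches the string '404' used in $badStatusCodes
    if (in_array($returnCode, $badStatusCodes)) {
        return 'bad';
    }

    // A redirect was followed, so the link works but points to an outdated address
    if ($finalURL != $requestedURL) {
        return 'changed';
    }

    return 'good';
}

$badStatusCodes = array('308', '404');
print classifyLink(404, 'http://www.yourwebsite.com/a', 'http://www.yourwebsite.com/a', $badStatusCodes);
```

In the full script, the returned label could be used to pick which array the link is added to.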

All that's left to do is display the results.

<?php
//...
 
     //ELSE...UNABLE TO LOAD MESSAGE FOR CHECKING
     } else {
          print '<div>DOMDocument failed</div>';
     }
}
 
//DISPLAY RESULTS
print '<h2>Bad Links</h2>';
print '<pre>' . print_r($badLinks, true) . '</pre>';
print '<h2>Changed Links</h2>';
print '<pre>' . print_r($changedLinks, true) . '</pre>';
print '<h2>Good Links</h2>';
print '<pre>' . print_r($goodLinks, true) . '</pre>';

?>

Final Code

To give you a better sense of how the pieces fit together, here is the entire script:

<?php
//CONNECT WITH DATABASE
require "{$_SERVER['DOCUMENT_ROOT']}/../database_connection.php";
$connect = new connect(true);
$mysqli  = $connect->databaseObject;
 
//INITIALIZE VARIABLES
$badLinks       = array();
$changedLinks   = array();
$goodLinks      = array();
$badStatusCodes = array('308', '404');
 
//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;
 
//GET ERROR 404 ENTRIES
$sql = "SELECT id, message FROM error404";
$result = $mysqli->query($sql);
while($row = $result->fetch_assoc()) {
     //IF ERROR 404 MESSAGE LOADS
     if(@$domDoc->loadHTML($row['message'])) {
          //LOOP THROUGH ANCHOR TAGS IN THE ERROR 404 MESSAGE
          $messageLinks = $domDoc->getElementsByTagName('a');
          foreach($messageLinks as $currLink) {
               //LOOP THROUGH ATTRIBUTES FOR CURRENT ANCHOR TAG
               foreach($currLink->attributes as $attributeName=>$attributeValue) {
                    //IF CURRENT ATTRIBUTE CONTAINS A WEBSITE LINK
                    if($attributeName == 'href') {
                         //IF LINK IS ROOT-RELATIVE, MAKE IT ABSOLUTE
                         if(substr($attributeValue->value, 0, 1) == '/') {
                              $attributeValue->value = 'http://www.yourwebsite.com' . $attributeValue->value;
                         }
 
                         //RUN cURL TO CHECK THE LINK
                         $ch = curl_init($attributeValue->value);
                         curl_setopt($ch, CURLOPT_NOBODY, true);
                         curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
                         curl_exec($ch);
                         $returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
                         $finalURL   = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
                         curl_close($ch);
 
                         //PROCESS THE RETURN CODE
                         if(in_array($returnCode, $badStatusCodes)) {
                              $badLinks[]     = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                         } elseif($finalURL != $attributeValue->value) {
                              $changedLinks[] = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value, 'newLink'=>$finalURL);
                         } else {
                              $goodLinks[]    = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                         }
                    }
               }
          }
 
     //ELSE...UNABLE TO LOAD MESSAGE FOR CHECKING
     } else {
          print '<div>DOMDocument failed</div>';
     }
}
 
//DISPLAY RESULTS
print '<h2>Bad Links</h2>';
print '<pre>' . print_r($badLinks, true) . '</pre>';
print '<h2>Changed Links</h2>';
print '<pre>' . print_r($changedLinks, true) . '</pre>';
print '<h2>Good Links</h2>';
print '<pre>' . print_r($goodLinks, true) . '</pre>';
?>

Conclusion

Keep in mind that the script can be a little slow. After all, it needs to send a request to each website address referenced in the database and wait for the response. The more links you have, the longer it takes.

The script could be sped up by maintaining a list of links already checked. You would just need to check the website address being processed against the already checked addresses before running the cURL request.
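Here is a minimal sketch of that idea, assuming PHP 7 or later. The checkLinkOnce() function, the $checked array, and the stand-in checker are all hypothetical names for this example. In the real script, the callable would wrap the cURL request.

```php
<?php
// Hypothetical wrapper: $checker is any callable that takes a URL and returns a result.
// $checked is passed by reference so results are remembered between calls.
function checkLinkOnce(string $url, callable $checker, array &$checked)
{
    // Only run the (slow) check the first time an address is seen
    if (!array_key_exists($url, $checked)) {
        $checked[$url] = $checker($url);
    }

    return $checked[$url];
}

// Stand-in checker that counts its calls instead of running cURL
$calls       = 0;
$checked     = array();
$fakeChecker = function ($url) use (&$calls) {
    $calls++;
    return array('code' => 200, 'finalURL' => $url);
};

// The same address is requested twice, but the checker only runs once
checkLinkOnce('http://www.yourwebsite.com/about/viewbio.php?id=1', $fakeChecker, $checked);
checkLinkOnce('http://www.yourwebsite.com/about/viewbio.php?id=1', $fakeChecker, $checked);
print "Checker ran {$calls} time(s)";
```

Using the address as the array key keeps the lookup simple. The trade-off is that two addresses that differ only slightly (a trailing slash, for example) are treated as different links.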
