For this tutorial you will need Microsoft Excel, Xenu’s Link Sleuth and Notepad.
The search engine will look for valid web pages written in the XML Sitemap. And by valid pages I mean pages with the “OK 200″ http request answer. It will not crawl pages with Error 4xx or 5xx (Bad request 400, Unauthorized 401, Forbidden 403, Internal Error 500 etc.) or pages with redirection 3xx (Moved 301, Found 302 etc.)
So how can we make sure that our sitemap is good? If you are submitting a new sitemap through Webmasters Tools it will take a while until all the links will be crawled and checked by the search engine. To receive a feedback can take ages. Of course, you will receive an instant answer regarding the integrity of the code submitted, but what about those pages with 404 error or what about the permanently redirected 301 pages? They are not suppose to be in the sitemap and I will show you an easy way to detect them using Xenu’s Link Sleuth (Windows only freeware application) and Microsoft Excel.
Import your data from Sitemap to Microsoft Excel
1. Open an Excel file
2. Go to “Data” Menu Tab (Alt+A)
3. Load your XML “From Web” (presuming that the XML is already online at yourdomain.com/sitemap.xml).
4. Import Data into the Worksheet in a table (By default it will be named Table1)
5. Save it as XLS or XLSX
Preparing the file to be imported by Xenu’s Link Sleuth
1. Open any text editor (I recommend Notepad++ for many reasons, you can download it from here: Download Notepad++)
2. From the Excel file select the entire column with URLs and copy & paste it into your text document
3. Save it as .txt file (E.g.: links-to-be-checked.txt)
Checking the URL’s integrity
1. Open Xenu’s Link Sleuth (you can download the application from: snafu.de or cnet.com)
2. Click on Options > Preferences
3. Modify the default options with what you see in the print-screen. Very important to select “Maximum depth” field with 0 and check “Treat redirections as errors”
4. Load your html data by going File > Check URL List (Test)
5. In case you have a very large number of links, be patient. There is a big chance that some URLs will not be checked because of some hosting issues or the scanning to stop in the middle of the scanning process. You need to access File > Retry broken links (CTRL + R).
6. When the link spidering is finished, export your results by accessing File > Export to TAB separated file (E.g.: export-for-excel.txt)
Importing the results in Microsoft Excel
1. On a new Worksheet import the Xenu’s Link Sleuth’s file by accessing “Data” Menu Tab (Alt + A) and after click the “From text” button;
2. Follow the Text Import Wizard: “Delimited – Characters such as commas or tabs separate each field” checked, “My data has headers” checked. After press “Next” button and make sure you have “Delimiters: Tab” checked and press Finish.
3. Convert the results into a table by selecting the whole table and pressing “CTRL+L” (By default it will be named Table2)
Synchronising the results using VLOOKUP function in Microsoft Excel
1. Add another column to your first table (E.g.: “Status Code”)
2. On the first cell write the following:
=VLOOKUP(A7, Table2[[Address]:[Status-Code]], 2, FALSE)
3. Apply the formula to all the cells bellow
4. Finally, you have your results in one table, now you can take measures and remove or fix those links which have different “HTTP OK 200″ from the XML Sitemap. Use the filter from Excel, like in the image, to view your web pages with problems.
In case you are missing something, you can download the final Excel file with the Sitemap already checked from here: how-to-check-the-integrity-of-your-xml-sitemap.xlsx