I Need To Get Thousands of Pages Out of Google

This post is prompted by a client’s need to delete thousands of pages from Google’s search results. His NSFW web site has user profiles which were unintentionally allowed to be indexed quite some time ago. Although Google has a page to manually submit specific URLs to be removed from search results, there is apparently no batch function provided to submit thousands of URLs.

Although Google does provide a function to delete entire directories, these URLs reside in a directory which also has other files which need to remain in the search results. Hence the problem.

I have altered the site’s robots.txt file to tell Googlebot to apply ‘noindex’ and ‘noarchive’ to the file in question. The thousands of URLs are created when that file is accessed with a query string, e.g., sitename.com/member.php?u=01234
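One thing worth checking at this stage: a crawler can only react to a response header or a <META> tag on a page it is allowed to fetch, so the profile URLs need to stay crawlable under robots.txt. Here is a minimal sketch of that check using Python's standard robots.txt parser. The two rule sets are hypothetical examples, not the site's actual robots.txt, and the http:// scheme is assumed.

```python
from urllib.robotparser import RobotFileParser


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rule sets, for illustration only.
OPEN_RULES = "User-agent: Googlebot\nDisallow: /admin/\n"
BLOCKING_RULES = "User-agent: Googlebot\nDisallow: /member.php\n"

# URL pattern taken from the post; the scheme is an assumption.
PROFILE_URL = "http://sitename.com/member.php?u=1628"
```

With the first rule set the profile URL can be fetched; with the second it cannot, and the crawler would never see whatever member.php returns.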

I have also modified the member.php script so that when it is accessed by Googlebot (or some other spider) it returns an HTTP 404 (Not Found) status, telling the crawler that the page no longer exists.
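The logic behind that change can be sketched roughly as follows. This is Python rather than the site's PHP, purely for illustration, and it is not the actual member.php code. The function names and the list of crawler tokens are assumptions made for the example.

```python
# Illustrative, incomplete list of substrings found in crawler User-Agent headers.
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp", "duckduckbot")


def is_crawler(user_agent: str) -> bool:
    """Guess whether a request comes from a search-engine crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def status_for_request(user_agent: str) -> int:
    """Return 404 for crawlers and 200 for everyone else."""
    return 404 if is_crawler(user_agent) else 200
```

Matching on the User-Agent string is only a heuristic, since any client can send any header, but it is enough to tell well-behaved spiders apart from ordinary visitors.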

Finally, inside the HTML <HEAD> section of the page, I’ve included the following <META> tags:

<meta name="robots" content="noindex,nofollow">
<meta name="googlebot" content="noindex,nofollow,noarchive">
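To confirm that a page really emits these tags, something like the following could be run against its HTML. It is a small sketch built on Python's standard html.parser module; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from robots and googlebot <meta> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            # Split "noindex,nofollow" into a set of lowercase directives.
            self.directives[name] = {
                part.strip().lower() for part in content.split(",") if part.strip()
            }


def robots_directives(html: str) -> dict:
    """Return a mapping such as {"robots": {"noindex", "nofollow"}}."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

Feeding it the two tags above should yield one entry for robots and one for googlebot, with noarchive appearing only under googlebot.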

So, now I’m conducting a test. What I want is for Google to attempt to spider the URLs on this page. When it reaches the five sample URLs, it should get the ‘noindex’ and ‘noarchive’ instructions from robots.txt for each one and delete them from the search results. If that doesn’t work, I want the HTTP ‘404’ to do the job. If that doesn’t work, I want the <META> tags to do it. And if that doesn’t work…

We’ll cross that bridge when we come to it.

Sample From Google Search Results for Old-Format Profile URLs

181793

1628

5367

167446

161690
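While waiting for Google, the five IDs above can also be checked by hand to see what a spider would receive. The sketch below builds each profile URL from the member.php?u= pattern and fetches it with a chosen User-Agent. The http:// scheme, the constant names, and both function names are assumptions for illustration.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode

# The five member IDs listed in this post.
SAMPLE_IDS = [181793, 1628, 5367, 167446, 161690]

# URL pattern from the post; the scheme is an assumption.
BASE_URL = "http://sitename.com/member.php"


def profile_url(member_id, base_url=BASE_URL):
    """Build the old-format profile URL for one member ID."""
    return f"{base_url}?{urlencode({'u': member_id})}"


def fetch_status(url, user_agent):
    """Request url with the given User-Agent and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx and 5xx responses; the code is what we want.
        return error.code


# Example usage (performs real network requests):
# for member_id in SAMPLE_IDS:
#     print(member_id, fetch_status(profile_url(member_id), "Googlebot/2.1"))
```

If the script change works as intended, each request sent with a crawler-style User-Agent should come back as 404.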
