Friday, May 16, 2008

Gone

i like making legitimate use of a new http status code. a few months ago i discovered 503, which proved useful in telling greedy searchbots that they were overloading my site with their bot traffic and could you please try again later.
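a minimal sketch of the 503 idea, in python. the post doesn't say how the throttling was actually done, so everything here is an assumption: the crawler tokens in BOT_TOKENS, the `overloaded` flag, and the one-hour Retry-After value are all hypothetical. the only part taken from the http spec is that a 503 response may carry a Retry-After header telling the client when to come back.

```python
# Hypothetical sketch: answer known crawlers with 503 + Retry-After when busy.
# BOT_TOKENS is an assumed list for illustration, not taken from the post.
BOT_TOKENS = ("googlebot", "slurp", "msnbot")


def is_bot(user_agent):
    """Return True if the User-Agent string contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)


def throttle_response(user_agent, overloaded, retry_after_seconds=3600):
    """Return (status_line, headers) for a throttled bot, or None to serve normally.

    `overloaded` is whatever load signal the site has; how it is measured
    is left out of this sketch.
    """
    if overloaded and is_bot(user_agent):
        headers = [
            ("Retry-After", str(retry_after_seconds)),  # seconds until retry
            ("Content-Type", "text/plain"),
        ]
        return ("503 Service Unavailable", headers)
    return None
```

ordinary visitors, and bots when the site is not busy, get None back and are served as usual.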

this week i used 410.

last year i set up sorabji weather for myself. along with a dict server and a web-based universal whois lookup, the weather station is just one of those silly things i've always wanted to do.

i thought i would be clever and connect the weather station to my payphone project, and to my mailbox locator. it made sense, as a search for outdoor objects like mailboxes or phones might benefit from, or at least be pleasantly complemented by, knowing what the weather was like.

connecting weather to the mailbox site proved no problem. landing on my page for a mailbox at 1259 Sixth Avenue at 50th Street shows the current weather and offers links to New York Weather and weather 10019. this set-up is generally reliable because the quality of the database is generally GOOD.

the links from the payphone numbers, however, were another story. the quality of that database is, on its surface, very very poor. however, its use was not intended for detailed screening, and precise accuracy was never assumed. collected from numerous sources, including law enforcement, telephone companies, and individuals, the collection is more of a guide than an authoritative resource.

i knew, of course, when i published the entire database of about 750,000 payphone numbers that it was filled with mis-spellings and other seemingly ridiculous errors.

nevertheless, having had success with linking the mailbox locations i thought i'd try the same with the payphones, for whatever value it might bring if and when it worked.

the problem was apparent almost immediately, as my error logs showed all kinds of crazy stuff related to the weather site. the pages tried to pull in weather snapshots for mis-spelled cities like New Yirk, NY, New Yor, NY, and Washingto, DC among thousands of others. searchbots followed these links like anything else, landing on a FRIENDLY http status 200 page telling you that new yor was not found.

the problem was with the status 200 header. searchbots gobbled up these garbage pages like anything else, polluting their indexes with useless typo pages. well, they are not totally useless, but they don't belong in search indexes. i checked to see how many pages from my weather site were indexed and there were thousands. problem was, most of them were nothing pages like Washingto or WSHNGTN. Abbreviated spellings like these were perfectly acceptable in their original context but not too useful for my particular weather site.

i could build my own weather suite. it's not that hard. but i bought a third party app instead. it's proven a bit sour in its way, but it's all good i suppose.

anyway, to purge the search indexes of these garbage pages i still send up the error page asking if you spelled something wrong, but now i send it with an http status 410 header, which tells search engines the page is permanently gone and is supposed to get it dropped from their indexes. i like the wording of it, too: GONE! as in, WHOOSH!
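the change from a friendly 200 to a 410 can be sketched like this, in python. this is not the third party weather app's code: KNOWN_CITIES is a tiny stand-in for its real lookup, and the function name and message wording are made up for illustration. the point is only that the same "did you spell something wrong" page goes out, with a different status line.

```python
# Hypothetical sketch: serve the same "not found" page, but with 410 instead of 200.
# KNOWN_CITIES is a stand-in for the weather app's real place lookup.
KNOWN_CITIES = {("new york", "ny"), ("washington", "dc")}


def weather_page(city, state):
    """Return (status_line, body) for a city/state weather request."""
    key = (city.strip().lower(), state.strip().lower())
    if key in KNOWN_CITIES:
        # Real page: a crawler may index this.
        return ("200 OK", f"current weather for {city.strip()}, {state.strip()}")
    # Same friendly message as before, but the status tells crawlers
    # the URL is gone and should not be kept in an index.
    body = f"{city.strip()}, {state.strip()} was not found. did you spell something wrong?"
    return ("410 Gone", body)
```

a request for "New Yor", "NY" still gets a readable page, but the status line is now 410 Gone rather than 200 OK.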

i am interested to see if that works, and how long it takes. i never intended to litter the indexes this way, but i should have seen it coming.

fascinating, no?

no?
