Crawling from dynamic IPs can happen with or without cookies. Crawling without cookies is generally the work of a program written to crawl, while crawling with cookies is generally done manually.
When a program is doing the crawling, the number of requests reaching our website from its IP is far higher than normal.
This type of crawling can be handled by storing the IP of each incoming request to our website in a HASHTABLE along with a request count. If the count for a particular IP reaches a limit within a given time, we can either block that IP or show it a captcha image.
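The counting idea above could be sketched roughly as follows in Python, with a dict standing in for the HASHTABLE. The class name IpRateCounter, the limit of 100 requests, and the 60-second window are illustrative assumptions, not values from this post.

```python
import time


class IpRateCounter:
    """Counts requests per IP inside a fixed time window (illustrative sketch)."""

    def __init__(self, limit=100, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit                    # assumed threshold, tune per site
        self.window_seconds = window_seconds  # assumed window length
        self.clock = clock                    # injectable so it can be tested
        self.table = {}                       # ip -> [window_start, count]

    def register(self, ip):
        """Record one request; return True once the IP exceeds the limit."""
        now = self.clock()
        entry = self.table.get(ip)
        if entry is None or now - entry[0] >= self.window_seconds:
            # First request from this IP, or its previous window has expired.
            entry = [now, 0]
            self.table[ip] = entry
        entry[1] += 1
        return entry[1] > self.limit
```

A request handler would call register(ip) on every hit and, when it returns True, either block the request or show the captcha.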
The HASHTABLE can be cleared when the number of entries in it reaches a defined limit, so that the memory it uses stays bounded and does not slow down our website.
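A minimal sketch of that clearing rule, again using a Python dict as the HASHTABLE. The name BoundedIpCounts and the default cap of 100,000 entries are assumptions made for illustration.

```python
class BoundedIpCounts:
    """Per-IP counters that are wiped once the table holds too many IPs."""

    def __init__(self, max_entries=100_000):
        self.max_entries = max_entries  # assumed cap on distinct IPs
        self.counts = {}                # ip -> request count

    def increment(self, ip):
        """Add one request for this IP and return its current count."""
        if ip not in self.counts and len(self.counts) >= self.max_entries:
            # Table is full and a new IP arrived: drop everything to free memory.
            self.counts.clear()
        self.counts[ip] = self.counts.get(ip, 0) + 1
        return self.counts[ip]
```

Clearing the whole table is simple but also forgets the counts of IPs that are currently crawling, so the cap should be set well above the number of IPs seen in one counting period.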
Crawling with cookies generally happens when a group of people is manually copying the content of our website. When a captcha image is shown, they enter the captcha and keep crawling. When we block the IP, they obtain a new IP by rebooting the modem.
This type of crawling can be handled as follows
Though this cannot be 100% safe, at least we can make the crawler think for a while.
Also, the distinct security cookie thus created can be a normal HTML cookie. But the crawler can easily clear this cookie through the browser and restart crawling. Hence, to extend our security further, we can create a Flash cookie (a Local Shared Object), which is stored separately from the browser's HTML cookies and is not removed when the user simply clears those cookies.
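As a rough illustration of the normal HTML cookie variant only, the sketch below issues a distinct random token per visitor using Python's standard library. The cookie name sec_token and the function issue_security_cookie are hypothetical, and how the token is then counted or checked is an assumption left to the site.

```python
import secrets
from http.cookies import SimpleCookie


def issue_security_cookie(cookie_header, name="sec_token"):
    """Return (token, set_cookie_value); set_cookie_value is None if already set."""
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    if name in jar:
        # Visitor already carries a token: reuse it so requests can be tied together.
        return jar[name].value, None
    token = secrets.token_hex(16)  # distinct, hard-to-guess value per visitor
    out = SimpleCookie()
    out[name] = token
    out[name]["path"] = "/"
    out[name]["httponly"] = True
    # The returned string is the value for a Set-Cookie response header.
    return token, out[name].OutputString()
```

The token could then be used as the key for request counting instead of the IP, so that a visitor who reboots the modem is still recognised while the cookie survives.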
Great post, although some code examples would have been great to see to save us time :) Cheers, Doug