A web crawler, also known as a web spider or web robot, is a program or automated script that browses the World Wide Web in a methodical, automated manner. The process is known as web crawling. Web crawling services are typically hired to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also automate maintenance tasks on a website, such as checking links or validating HTML code, and they are sometimes used to collect specific kinds of information from web pages, such as harvested e-mail addresses (usually for spam).
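To make that "methodical and automated" browsing concrete, here is a minimal sketch of the fetch-parse-enqueue loop at the heart of every crawler, using only Python's standard library. The seed URL and page limit are illustrative assumptions, and a real crawler would also honor robots.txt, rate limits, and politeness rules.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl: fetch a page, queue its links, repeat."""
    queue, seen = [seed], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable or broken pages
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links against the current page, keep only web URLs.
        queue.extend(
            absolute for absolute in (urljoin(url, link) for link in parser.links)
            if absolute.startswith("http")
        )
        print(f"fetched {url} ({len(parser.links)} links)")
    return seen

if __name__ == "__main__":
    crawl("https://example.com")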
There are several uses for web crawlers, but at its core a web crawler is used to collect or mine data from the web. We use various open-source web crawlers to deliver our web crawling services.
Heritrix – The Internet Archive's open-source, web-scale, extensible, archival-quality web crawler project. Since our crawls aim to collect and preserve digital data for the benefit of future researchers and generations, this tool seemed appropriate.
Scrapy – An open-source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way. We build and run web spiders with it and deploy them to Scrapy Cloud.
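As an illustration, here is a minimal Scrapy spider sketch. The target site and CSS selectors are borrowed from Scrapy's official tutorial and stand in for whatever pages you actually need to mine.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link so the crawl continues site-wide.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, it can be run locally with "scrapy runspider quotes_spider.py -o quotes.json" before being deployed to the cloud.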
DataPark Search – This open-source engine's features include:
- Support for HTTP, HTTPS, NNTP, FTP, and news URL schemes
- An HTDB virtual URL scheme for indexing SQL databases
- Native indexing of text/xml, text/plain, text/html, audio/mpeg, and image/gif MIME types
HTTrack – It lets you download a website from the Internet to a local directory, recursively building all directories and fetching the HTML, images, and other files from the server to your computer. HTTrack preserves the original site's relative link structure: simply open a page of the "mirrored" website in your browser, and you can browse it from link to link as if you were viewing it online.
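HTTrack itself is a ready-made tool, but purely to illustrate the mirroring idea it automates, here is a toy Python sketch that saves one page and the images it references to a local folder, rewriting the image URLs so the saved copy works offline. The target URL and file names are assumptions, and real mirroring (recursion, stylesheets, scripts, cross-page link rewriting) is far more involved.

import os
import re
from urllib.parse import urljoin
from urllib.request import urlopen

def mirror_page(url, out_dir="mirror"):
    os.makedirs(out_dir, exist_ok=True)
    html = urlopen(url, timeout=10).read().decode("utf-8", "replace")

    def localize(match):
        # Download one referenced asset and point the tag at the local copy.
        src = match.group(1)
        absolute = urljoin(url, src)
        filename = os.path.basename(absolute.split("?")[0]) or "asset"
        try:
            data = urlopen(absolute, timeout=10).read()
        except OSError:
            return match.group(0)  # leave the original link in place
        with open(os.path.join(out_dir, filename), "wb") as f:
            f.write(data)
        return match.group(0).replace(src, filename)

    # Rewrite every <img src="..."> so the saved page loads images locally.
    html = re.sub(r'<img[^>]+src="([^"]+)"', localize, html)
    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as f:
        f.write(html)

if __name__ == "__main__":
    mirror_page("https://example.com")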
PHPCrawl – A framework for crawling websites, written in the PHP programming language; it can thus be called a web crawler engine for PHP.
