nutch
http://nutch.apache.org/
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:
Nutch 1.x: A well-matured, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are great for batch processing.
Nutch 2.x: An emerging alternative that takes direct inspiration from 1.x but differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistence mappings. This allows an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions.
Being pluggable and modular of course has its benefits: Nutch provides extensible interfaces such as Parse, Index, and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, and others.
Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.
scrapy
http://scrapy.org/
An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
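As a taste of the API, here is a minimal self-contained spider; the target site and the CSS selectors are illustrative placeholders, not something Scrapy prescribes:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # quotes.toscrape.com is a public sandbox site, used here as a placeholder.
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
        # Queue the next page, if any, to be parsed with this same method.
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Saved as quotes_spider.py, it can be run without a project scaffold via `scrapy runspider quotes_spider.py -o quotes.json`.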
gocrawl
https://github.com/PuerkitoBio/gocrawl/
gocrawl is a polite, slim and concurrent web crawler written in Go.
For a simpler yet more flexible web crawler, you may also want to take a look at fetchbot, a package that builds on the experience of gocrawl.
Features
- Full control over the URLs to visit, inspect and query (using a pre-initialized goquery document)
- Crawl delays applied per host
- Obedience to robots.txt rules (using the robotstxt.go library)
- Concurrent execution using goroutines
- Configurable logging
- Open, customizable design providing hooks into the execution logic
Heritrix
https://github.com/internetarchive/heritrix3/
Heritrix is the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.
pyspider
http://docs.pyspider.org/en/latest/
A powerful spider (web crawler) system in Python.
- Write scripts in Python
- Powerful WebUI with script editor, task monitor, project manager, and result viewer
- MySQL, MongoDB, Redis, SQLite, and PostgreSQL (via SQLAlchemy) as database backends
- RabbitMQ, Beanstalk, Redis, and Kombu as message queues
- Task priority, retry, periodic jobs, recrawl by age, …
- Distributed architecture, crawling of JavaScript pages, Python 2 & 3, …
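To give a feel for how scheduling and parsing fit together, here is a minimal handler script in the shape pyspider's documentation uses; the seed URL is a placeholder:

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Re-seed the crawl once a day (placeholder URL).
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # response.doc is a PyQuery object; follow every outbound link.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is written to the configured result backend.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```

The @every decorator drives periodic re-crawls, while @config(age=…) marks how long a fetched page stays fresh before it becomes eligible to be crawled again.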
portia
https://github.com/scrapinghub/portia
Visual scraping for Scrapy
Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will understand based on these annotations how to scrape data from similar pages.
pholcus
https://github.com/henrylee2cn/pholcus
Pholcus (幽灵蛛) is a heavyweight crawler written in pure Go, featuring a clean GUI, elegant crawl rules, controllable high concurrency, arbitrary batch tasks, multiple output formats, and a large number of demos. More importantly, it supports persistent socket connections with full-duplex concurrent distributed operation, both horizontal and vertical crawl modes, simulated login, task cancellation, and more.
webmagic
http://git.oschina.net/flashsword20/webmagic
WebMagic is a simple and flexible crawler framework. Based on WebMagic, you can quickly develop an efficient, easy-to-maintain crawler.
Features:
- Simple API, quick to get started with
- Modular structure, easy to extend
- Multi-threading and distributed support
neocrawler
http://git.oschina.net/dreamidea/neocrawler
NEOCrawler (Chinese name: 牛咖) is a crawler system built on Node.js, Redis, and PhantomJS. The code is fully open source, and it is well suited to vertical-domain data collection and to building customized crawlers on top of.
Key features:
- Implemented in Node.js. JavaScript is simple, efficient, and easy to learn, saving considerable time both when developing the crawler and when users customize it. Node.js runs on Google's V8 engine, so performance is solid, and because the language is non-blocking and asynchronous by nature, it excels at IO-heavy, CPU-light workloads like crawling: in rough comparisons with versions in other languages, the development effort is lower than C/C++/Java, and performance beats Java's multi-threaded implementations and Python's async and coroutine implementations.
- A scheduling center handles URL scheduling while crawler processes run distributed: the central scheduler decides which URLs to fetch within each time slice and coordinates the crawlers, so a single crawler failure does not affect the system as a whole.
- Pages are parsed into structured data at fetch time. The required fields are extracted and stored alongside the raw HTML, so crawled data is immediately usable and precise content deduplication at insertion time becomes possible.
- Integrated PhantomJS, a headless browser implementation requiring no graphical environment, which makes it possible to crawl pages whose content only appears after JavaScript executes. User actions can be driven via JS statements, e.g. filling in and submitting a form and then crawling the following page, or clicking a button and crawling the page it navigates to.
- Retry and fault-tolerance mechanisms. Every kind of HTTP request failure triggers a retry, with detailed failure records for manual inspection, and returned page content is validated to detect blank pages, incomplete pages, or pages hijacked by a proxy server.
- Cookies can be preset, solving the problem of content that is only available after login.
- Concurrency limits avoid having the source site block your IP for opening too many connections.
- Built-in proxy IP support for sites with anti-crawling measures (per-IP request limits, traffic limits, crawler detection). Given a pool of usable proxy IPs, the crawler autonomously picks proxies the source site still accepts, so the site cannot block the crawl.
- Productized design: the crawler framework and the business-specific part are architecturally separated, so the business part requires no coding and can be completed through configuration.
- Crawl rules are configured in a web UI and hot-reloaded to the distributed crawlers: saving a rule automatically pushes it to crawler processes on different machines, so rule changes need neither coding nor a restart.
spidr
http://spidr.rubyforge.org/
Spidr is a versatile Ruby web spidering library that can spider a single site, multiple domains, specific links, or crawl indefinitely. Spidr is designed to be fast and easy to use.
Ebot
http://www.redaelli.org/matteo-blog/projects/ebot/
Erlang Bot (Ebot) is an open source web crawler written on top of Erlang, a NoSQL database (Apache CouchDB or Riak), RabbitMQ, Webmachine (Mochiweb), RRDtool, and more. Using a NoSQL database instead of a relational one, Ebot can grow easily and cheaply. Ebot is a solid, highly scalable, distributed, and customizable web crawler.
cola
https://github.com/chineking/cola
Cola is a high-level distributed crawling framework used to crawl pages and extract structured data from websites. It provides a simple, fast, yet flexible way to achieve your data acquisition objectives. Users only need to write one piece of code, which can run in both local and distributed mode.
go_spider
https://github.com/hu17889/go_spider
An awesome concurrent crawler (spider) framework written in Go. The crawler is flexible and modular; it can easily be extended into an individualized crawler, or you can use just the default crawl components.
Features
- Concurrent
- Fit for vertical communities
- Flexible, Modular
- Native Go implementation
- Can be expanded to an individualized crawler easily
openwebspider
http://www.openwebspider.org/
Open source web spider and search engine.