Prepare the environment
- Install the JDK
- Install Ant
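Before going further, it may help to confirm that both tools are actually on the PATH. The following is a minimal sketch; the helper name require_cmd is hypothetical and not part of Nutch, the JDK, or Ant.

```shell
#!/bin/sh
# require_cmd is a hypothetical helper, not part of Nutch or Ant.
# It returns 0 when the named command is found on PATH, 1 otherwise.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
        return 0
    fi
    echo "missing: $1" >&2
    return 1
}

# Report each prerequisite without aborting the shell.
for tool in java ant; do
    require_cmd "$tool" || true
done
```

A "missing" line for either tool suggests the corresponding install step has not taken effect in the current shell yet.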
Install Nutch
wget http://www.us.apache.org/dist/nutch/1.10/apache-nutch-1.10-bin.tar.gz
tar -zxvf apache-nutch-1.10-bin.tar.gz
cd apache-nutch-1.10
bin/nutch
Configure Nutch
Configure the agent name
vim conf/nutch-site.xml
<property>
<name>http.agent.name</name>
<value>Test Nutch Spider</value>
</property>
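The property above has to sit inside the file's <configuration> root element. As a rough sketch, the whole file could be written in one step like this; write_nutch_site is a hypothetical helper name, and it overwrites whatever is at the target path, so it assumes no other custom properties need to be kept.

```shell
#!/bin/sh
# write_nutch_site is a hypothetical helper: it writes a complete
# nutch-site.xml to the path given as $1, overwriting any existing file.
write_nutch_site() {
    cat > "$1" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Test Nutch Spider</value>
  </property>
</configuration>
EOF
}
```

From the Nutch directory, the call would be: write_nutch_site conf/nutch-site.xml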
Configure the seed pages to crawl
mkdir -p urls
touch urls/seed.txt
echo 'http://stackoverflow.com/' > urls/seed.txt
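The seed file holds one URL per line, so more seeds can be appended later. A small sketch of doing that without creating duplicate lines follows; add_seed is a hypothetical helper, not a Nutch command.

```shell
#!/bin/sh
# add_seed is a hypothetical helper: append URL $2 to seed file $1
# unless that exact line is already present.
add_seed() {
    seed_file=$1
    url=$2
    mkdir -p "$(dirname "$seed_file")"
    touch "$seed_file"
    # -x matches whole lines only, -F treats the URL as a literal string.
    if ! grep -qxF -- "$url" "$seed_file"; then
        printf '%s\n' "$url" >> "$seed_file"
    fi
}
```

From the Nutch directory, the call would be: add_seed urls/seed.txt 'http://stackoverflow.com/'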
Configure the URL filter
vim conf/regex-urlfilter.txt
+^http://stackoverflow.com/
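In regex-urlfilter.txt, a leading + accepts URLs that match the regular expression and a leading - rejects them, and the first matching rule decides. The stock file usually ends with a catch-all +. rule, so that line most likely has to be replaced by the rule above for the restriction to take effect. To get a rough feel for what the rule accepts, it can be tried out with grep; this is only an approximation, since Nutch uses Java regular expressions, and url_allowed is a hypothetical helper.

```shell
#!/bin/sh
# url_allowed is a hypothetical helper that mimics a single "+" rule.
# grep -E only approximates the Java regex engine that Nutch uses.
RULE='^http://stackoverflow.com/'

url_allowed() {
    printf '%s\n' "$1" | grep -Eq "$RULE"
}

# Try the rule against one URL inside the site and one outside it.
for url in 'http://stackoverflow.com/questions' 'http://example.com/'; do
    if url_allowed "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

Two things stand out when experimenting this way: the rule does not match https:// URLs, and the unescaped dot matches any character, so writing it as stackoverflow\.com would be stricter.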