月度归档:2016年01月

nutch + hbase + elasticsearch

安装 hbase

下载

不能下载太新的版本,下载地址在: http://mirror.bit.edu.cn/apache/hbase/

cd /data/server
wget http://mirror.bit.edu.cn/apache/hbase/hbase-0.94.27/hbase-0.94.27.tar.gz
tar zxvf hbase-1.1.2-bin.tar.gz
cd hbase-0.94.27

修改配置

修改conf/hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///data/data/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>

启动

 bin/start-hbase.sh

运行

 bin/hbase shell

安装nutch

下载

下载地址: http://mirror.bit.edu.cn/apache/nutch/

cd /data/server
wget http://mirror.bit.edu.cn/apache/nutch/2.3/apache-nutch-2.3-src.tar.gz
tar -zxvf apache-nutch-2.3-src.tar.gz
cd apache-nutch-2.3

配置

ivy/ivy.xml

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />  

conf/gora.properties

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

build

 ant clean
 ant runtime

config

runtime/local/conf/nutch-site.xml

<configuration>

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
  </property>

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <!-- do not leave the seeded domains (optional) -->
  </property>

  <!-- elasticsearch index properties -->
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
    <description>The hostname to send documents to using TransportClient. Either host and port must be defined or cluster.
    </description>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
    <description>The cluster name to discover. Either host and potr must be defined or cluster.
    </description>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36</value>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>Programer's search</value>
  </property>

  <property>
    <name>http.robots.403.allow</name>
    <value>true</value>
  </property>

  <property>
    <name>http.agent.url</name>
    <value>http://hisearch.cn</value>
  </property>

  <property>
    <name>http.verbose</name>
    <value>true</value>
  </property>

  <property>
    <name>http.accept.language</name>
    <value>zh,zh-CN;q=0.8,en;q=0.6</value>
  </property>

  <property>
    <name>http.agent.version</name>
    <value>0.1</value>
  </property>

</configuration>

runtime/local/conf/hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///data/data/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>

run

cd runtime/local
mkdir seed
echo "http://www.cnblogs.com" > seed/urls.txt
bin/nutch inject seed/
bin/nutch generate -topN 10
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
bin/craw seed/ testCraw 3
bin/nutch elasticindex elasticsearch -all

python+logstash+elasticsearch+Kibana日志方案

python用的是python-logstash库https://github.com/vklochan/python-logstash

这一套用起来都比较方便.

logstash安装,配置,运行

下载并安装logstash

打开https://www.elastic.co/downloads/logstash,找到最新版下载链接,使用wget下载rpm,然后通过yum安装

wget https://download.elastic.co/logstash/logstash/packages/centos/logstash-2.1.1-1.noarch.rpm
yum install logstash-2.1.1-1.noarch.rpm

也可以直接通过repo安装:https://www.elastic.co/guide/en/logstash/current/package-repositories.html

配置logstash

vim /etc/logstash/conf.d/logstash.conf

input {  
  tcp {
    port => 5959
    codec => json
  }  
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}

运行logstash

chkconfig logstash on
/etc/init.d/logstash start
telnet 127.0.0.1 5959 #test

Python logstash

安装python包

pip install python-logstash

测试脚本 vim test.py

import logging
import logstash
import sys

host = 'localhost'

test_logger = logging.getLogger('python-logstash-logger')
test_logger.setLevel(logging.DEBUG)
test_logger.addHandler(logstash.TCPLogstashHandler(host, 5959, version=1))

test_logger.error('python-logstash: test logstash error message.')
test_logger.info('python-logstash: test logstash info message.')
test_logger.warning('python-logstash: test logstash warning message.')

extra = {
    'test_string': 'python version: ' + repr(sys.version_info),
    'test_boolean': True,
    'test_dict': {'a': 1, 'b': 'c'},
    'test_float': 1.23,
    'test_integer': 123,
    'test_list': [1, 2, '3'],
}
test_logger.info('python-logstash: test extra fields', extra=extra)

检查是否成功

curl http://127.0.0.1:9200/_search?pretty&q=logstash

kibana安装使用

https://www.elastic.co/downloads/kibana找到最新版本的kibana

wget https://download.elastic.co/kibana/kibana/kibana-4.3.1-linux-x64.tar.gz
tar -zxf kibana-4.3.1-linux-x64.tar.gz

vim config/kibana.yml,找到elasticsearch.url这行,根据情况决定是否要修改,如果修改记得去掉前面的注释符号

运行bin/kibana启动服务,访问http://127.0.0.1:5601/,点击创建即可

配置nginx访问

安装htpasswd工具,生成账号密码

 yum install httpd-tools
 htpasswd -b -c /data/kibana.htpasswd username password

配置nginx server

upstream kibana {
    server 127.0.0.1:5601 fail_timeout=0;
}

server {
    listen      80;
    server_name          kibana.domain.com;

    location / {
        auth_basic "Restricted";
        auth_basic_user_file /data/kibana.htpasswd;
        proxy_pass http://kibana;
   }
}

重启nginx

nginx -s reload

linux suspend wifi

lsmod | grep wifi

output

iwlwifi               200704  1 iwldvm
cfg80211              548864  3 iwlwifi,mac80211,iwldvm

sudo vim /etc/pm/config.d/config

SUSPEND_MODULES="iwldvm iwlwifi"

sudo service network-manager restart