Recently I have been doing search engine performance testing and needed some data to test with. Conveniently, Wikimedia provides XML dumps, and Python is the best tool for importing them.
https://dumps.wikimedia.org/zhwiki/
Don't download a version that is too new; the gora-hbase 0.5 backend used below targets the HBase 0.94 line. Download address: http://mirror.bit.edu.cn/apache/hbase/
cd /data/server
wget http://mirror.bit.edu.cn/apache/hbase/hbase-0.94.27/hbase-0.94.27.tar.gz
tar zxvf hbase-0.94.27.tar.gz
cd hbase-0.94.27
Edit conf/hbase-site.xml:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///data/data/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
bin/start-hbase.sh
bin/hbase shell
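Besides the shell, you can sanity-check that standalone HBase is actually up by hitting the master web UI, which listens on port 60010 by default in the 0.94 line. A minimal probe from Python (a sketch assuming the requests package is installed):

import requests

# HBase 0.94 serves its master status page on port 60010 by default.
resp = requests.get("http://127.0.0.1:60010/master-status")
print(resp.status_code)  # 200 means the master is running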
Nutch download address: http://mirror.bit.edu.cn/apache/nutch/
cd /data/server
wget http://mirror.bit.edu.cn/apache/nutch/2.3/apache-nutch-2.3-src.tar.gz
tar -zxvf apache-nutch-2.3-src.tar.gz
cd apache-nutch-2.3
Edit ivy/ivy.xml and enable (uncomment) the gora-hbase dependency:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
Edit conf/gora.properties and set HBase as the default datastore:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
ant clean
ant runtime
Edit runtime/local/conf/nutch-site.xml:
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <!-- do NOT enable the parse-html plugin if you want proper HTML parsing; use parse-tika instead -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <!-- do not leave the seeded domains (optional) -->
  </property>
  <!-- elasticsearch index properties -->
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
    <description>The hostname to send documents to using TransportClient. Either host and port must be defined or cluster.</description>
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
    <description>The cluster name to discover. Either host and port must be defined or cluster.</description>
  </property>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Programmer's search</value>
  </property>
  <property>
    <name>http.robots.403.allow</name>
    <value>true</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://hisearch.cn</value>
  </property>
  <property>
    <name>http.verbose</name>
    <value>true</value>
  </property>
  <property>
    <name>http.accept.language</name>
    <value>zh,zh-CN;q=0.8,en;q=0.6</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>0.1</value>
  </property>
</configuration>
Edit runtime/local/conf/hbase-site.xml so it matches HBase's own config:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///data/data/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
cd runtime/local
mkdir seed
echo "http://www.cnblogs.com" > seed/urls.txt
bin/nutch inject seed/
bin/nutch generate -topN 10
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
Alternatively, run the whole inject/generate/fetch/parse/updatedb cycle in one command:
bin/crawl seed/ testCrawl 3
bin/nutch elasticindex elasticsearch -all
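If indexing succeeded, the crawled pages are now searchable in Elasticsearch. A quick check from Python (a sketch assuming the requests package; the query term is just an example):

import requests

# Count everything Nutch pushed into Elasticsearch, across all indices.
resp = requests.get("http://localhost:9200/_count")
print(resp.json())

# Probe for one of the crawled pages.
resp = requests.get("http://localhost:9200/_search",
                    params={"q": "cnblogs", "pretty": "true"})
print(resp.text)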
On the Python side I use the python-logstash library: https://github.com/vklochan/python-logstash
The whole stack is quite convenient to use.
Download and install logstash
Open https://www.elastic.co/downloads/logstash, find the download link for the latest release, fetch the rpm with wget, then install it with yum:
wget https://download.elastic.co/logstash/logstash/packages/centos/logstash-2.1.1-1.noarch.rpm
yum install logstash-2.1.1-1.noarch.rpm
You can also install directly from the package repositories: https://www.elastic.co/guide/en/logstash/current/package-repositories.html
Configure logstash
vim /etc/logstash/conf.d/logstash.conf
input {
  tcp {
    port => 5959
    codec => json
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
Run logstash
chkconfig logstash on
/etc/init.d/logstash start
telnet 127.0.0.1 5959 #test
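Instead of telnet you can also push a real JSON event from Python over raw TCP, which exercises the json codec end to end (the field names below are arbitrary examples):

import json
import socket

# Send one newline-terminated JSON event to the logstash tcp input above.
event = {"message": "hello from raw tcp", "level": "INFO"}
sock = socket.create_connection(("127.0.0.1", 5959))
sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
sock.close()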
Install the Python package
pip install python-logstash
Test script: vim test.py
import logging
import logstash
import sys
host = 'localhost'
test_logger = logging.getLogger('python-logstash-logger')
test_logger.setLevel(logging.DEBUG)
test_logger.addHandler(logstash.TCPLogstashHandler(host, 5959, version=1))
test_logger.error('python-logstash: test logstash error message.')
test_logger.info('python-logstash: test logstash info message.')
test_logger.warning('python-logstash: test logstash warning message.')
extra = {
    'test_string': 'python version: ' + repr(sys.version_info),
    'test_boolean': True,
    'test_dict': {'a': 1, 'b': 'c'},
    'test_float': 1.23,
    'test_integer': 123,
    'test_list': [1, 2, '3'],
}
test_logger.info('python-logstash: test extra fields', extra=extra)
Check that the test events made it into Elasticsearch:
curl 'http://127.0.0.1:9200/_search?pretty&q=logstash'
Find the latest kibana release at https://www.elastic.co/downloads/kibana
wget https://download.elastic.co/kibana/kibana/kibana-4.3.1-linux-x64.tar.gz
tar -zxf kibana-4.3.1-linux-x64.tar.gz
vim config/kibana.yml, find the elasticsearch.url line, and decide whether it needs changing; if you change it, remember to remove the leading comment character.
Run bin/kibana to start the service, then visit http://127.0.0.1:5601/ and click Create.
Install the htpasswd tool and generate a username/password pair
yum install httpd-tools
htpasswd -b -c /data/kibana.htpasswd username password
Configure the nginx server
upstream kibana {
  server 127.0.0.1:5601 fail_timeout=0;
}
server {
  listen 80;
  server_name kibana.domain.com;
  location / {
    auth_basic "Restricted";
    auth_basic_user_file /data/kibana.htpasswd;
    proxy_pass http://kibana;
  }
}
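To confirm the auth setup, request Kibana through the proxy from Python (kibana.domain.com is the placeholder server_name above; the credentials are whatever you gave to htpasswd):

import requests

# Without credentials nginx should answer 401; with them, Kibana's 200.
print(requests.get("http://kibana.domain.com/").status_code)
print(requests.get("http://kibana.domain.com/",
                   auth=("username", "password")).status_code)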
Reload nginx
nginx -s reload
Install the system libraries needed to build common Python packages (libxml2/libxslt for lxml, the image libraries for Pillow):
sudo apt-get install libxslt-dev libxml2-dev libffi-dev libjpeg-dev libfreetype6-dev libpng-dev
To keep Intel wifi from breaking after suspend, first find the driver modules:
lsmod | grep wifi
Output:
iwlwifi 200704 1 iwldvm
cfg80211 548864 3 iwlwifi,mac80211,iwldvm
sudo vim /etc/pm/config.d/config
SUSPEND_MODULES="iwldvm iwlwifi"
sudo service network-manager restart
Create an index whose content field stores term vectors, so the fast vector highlighter can build snippets later without re-analyzing the source:
PUT /my_index
{
  "mappings": {
    "doc_type": {
      "properties": {
        "content": {
          "type": "string",
          "term_vector": "with_positions_offsets",
          "analyzer": "snowball"
        }
      }
    }
  }
}
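The same index can be created from Python (a sketch assuming the requests package; this just wraps the mapping above):

import json
import requests

mapping = {
    "mappings": {
        "doc_type": {
            "properties": {
                "content": {
                    "type": "string",
                    "term_vector": "with_positions_offsets",
                    "analyzer": "snowball",
                }
            }
        }
    }
}
resp = requests.put("http://127.0.0.1:9200/my_index", data=json.dumps(mapping))
print(resp.json())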
Query with a keyword filter and highlighting (in ES 2.x the filter belongs inside a bool query rather than at the top level of the search body):
POST /_search
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "公司",
          "type": "best_fields",
          "fields": ["title", "content"]
        }
      },
      "filter": {
        "term": {
          "site": "baidu.com"
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 100,
        "number_of_fragments": 2,
        "no_match_size": 100,
        "boundary_chars": " 。,?",
        "boundary_max_scan": 80,
        "force_source": true
      }
    }
  }
}
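The same search can be issued from Python and the highlight fragments pulled out of each hit (a sketch assuming the requests package and the mapping above):

import json
import requests

query = {
    "query": {
        "bool": {
            "must": {"multi_match": {"query": "公司",
                                     "fields": ["title", "content"]}},
            "filter": {"term": {"site": "baidu.com"}},
        }
    },
    "highlight": {"fields": {"content": {"fragment_size": 100,
                                         "number_of_fragments": 2}}},
}
resp = requests.post("http://127.0.0.1:9200/_search", data=json.dumps(query))
for hit in resp.json().get("hits", {}).get("hits", []):
    print(hit["_score"], hit.get("highlight", {}).get("content"))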
dnf install gnome-shell-extensions-alternative-status-menu
This XML file is over 30 GB; the key is to stream it with iterparse and release each element with e.clear().
from xml.etree.ElementTree import iterparse
from datetime import datetime
import html
import json
import re
import time
import traceback

import lxml.html
import redis


def main():
    redis_conn = redis.from_url("redis://localhost:6379/0")
    i = 0
    xmlfile = "/data/download/Posts.xml"
    # iterparse streams the file instead of loading all 30+ GB into memory.
    for event, e in iterparse(xmlfile):
        # PostTypeId == "1" marks question rows in the Stack Exchange dump.
        if e.tag == "row" and e.get("PostTypeId") == "1":
            try:
                data = {
                    "url": "http://stackoverflow.com/questions/" + e.get("Id"),
                    "title": html.escape(e.get("Title")),
                    "content": html.escape(
                        lxml.html.fromstring(e.get("Body")).text_content()),
                    "tags": html.escape(
                        ",".join(re.findall("<([^>]+)>", e.get("Tags")))),
                    "site": "stackoverflow",
                    "timestamp": datetime.now().isoformat(),
                }
                redis_conn.lpush("ResultQueue", json.dumps(data))
                i += 1
                if i % 1000 == 0:
                    time.sleep(5)  # throttle so the consumer can keep up
                    print(i)
            except Exception:
                traceback.print_exc()
                print(e.attrib)
        # Clearing each element is what keeps memory usage flat.
        e.clear()


if __name__ == "__main__":
    main()
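The script above only pushes JSON records onto the Redis list, so something still has to drain ResultQueue into Elasticsearch. A minimal consumer sketch (the posts/post index and type names are hypothetical, not from the original setup):

import json

import redis
import requests


def consume():
    conn = redis.from_url("redis://localhost:6379/0")
    while True:
        # BRPOP blocks until a record arrives, or gives up after 30s.
        item = conn.brpop("ResultQueue", timeout=30)
        if item is None:
            break
        _key, payload = item
        # Validate the JSON before handing it to Elasticsearch.
        doc = json.loads(payload)
        requests.post("http://127.0.0.1:9200/posts/post", data=json.dumps(doc))


if __name__ == "__main__":
    consume()

Since the producer uses lpush, consuming with brpop preserves FIFO order.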
download and install via https://www.elastic.co/downloads/
yum install https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.1.0/elasticsearch-2.1.0.rpm
make data and logs dir
mkdir -p /data/elastic/data
mkdir -p /data/elastic/logs
chown -R elasticsearch:elasticsearch /data/elastic/
edit config /etc/elasticsearch/elasticsearch.yml
path.data: /data/elastic/data
path.logs: /data/elastic/logs
network.host: 127.0.0.1
edit start script /etc/init.d/elasticsearch
LOG_DIR="/data/elastic/logs"
DATA_DIR="/data/elastic/data"
install java jdk
yum install java-1.8.0-openjdk
start
systemctl enable elasticsearch
/etc/init.d/elasticsearch start
test
/etc/init.d/elasticsearch status
curl http://127.0.0.1:9200/
make data dir for redis
mkdir /data/redis
chown -R redis:redis /data/redis
modify config /etc/redis.conf
daemonize yes
dir /data/redis/
appendonly yes
requirepass mypassword
restart
systemctl enable redis
systemctl start redis
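With requirepass in place, clients must authenticate. A quick connectivity check from Python (the password matches the requirepass value above):

import redis

# The password segment of the URL corresponds to requirepass.
conn = redis.from_url("redis://:mypassword@localhost:6379/0")
conn.ping()  # raises an exception if auth or the connection fails
print(conn.config_get("appendonly"))  # expect {'appendonly': 'yes'}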