Python REST: Web browsable APIs

Django REST framework

Django REST framework is a powerful and flexible toolkit for building Web APIs.

Some reasons you might want to use REST framework:

The Web browsable API is a huge usability win for your developers.
Authentication policies including packages for OAuth1a and OAuth2.
Serialization that supports both ORM and non-ORM data sources.
Customizable all the way down – just use regular function-based views if you don’t need the more powerful features.
Extensive documentation, and great community support.
Used and trusted by large companies such as Mozilla and Eventbrite.

flask-restful-swagger

flask-restful-swagger is a wrapper for flask-restful which enables swagger support.

In essence, you just need to wrap the Api instance and add a few Python decorators to get full Swagger support.

Flask API

Flask API is an implementation of the same web browsable APIs that Django REST framework provides.

It gives you properly content negotiated responses and smart request parsing.

It is currently a work in progress, but the fundamentals are in place and you can already start building kick-ass browsable Web APIs with it. If you want to start using Flask API right now go ahead and do so, but be sure to follow the release notes of new versions carefully.

wikimedia dump export

I recently needed some data for search engine performance testing, and conveniently discovered that Wikimedia provides XML dumps for export; Python is a great tool for importing them.

https://dumps.wikimedia.org/zhwiki/
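As a minimal sketch of streaming such a dump with Python's standard library (the namespace below matches the MediaWiki export schema, though the exact schema version varies by dump; the inline sample stands in for a real multi-GB dump file), page titles can be pulled out with iterparse without loading the whole file into memory:

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Tiny stand-in for a real dump such as zhwiki-latest-pages-articles.xml;
# the export-0.10 namespace is an assumption and may differ per dump.
NS = "http://www.mediawiki.org/xml/export-0.10/"
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Python</title></page>
  <page><title>Elasticsearch</title></page>
</mediawiki>"""

def iter_titles(fileobj):
    """Stream <page> titles from a MediaWiki dump without building the full tree."""
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the namespace prefix
        if tag == "page":
            yield elem.find("{%s}title" % NS).text
            elem.clear()  # free memory -- crucial for multi-GB dumps

titles = list(iter_titles(BytesIO(SAMPLE)))
print(titles)
```

The same pattern (iterparse plus clear) is what the Posts.xml importer later in these notes relies on.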

nutch + hbase + elasticsearch

install hbase

download

Don't download a version that is too new. Download from: http://mirror.bit.edu.cn/apache/hbase/

cd /data/server
wget http://mirror.bit.edu.cn/apache/hbase/hbase-0.94.27/hbase-0.94.27.tar.gz
tar zxvf hbase-0.94.27.tar.gz
cd hbase-0.94.27

edit configuration

Edit conf/hbase-site.xml:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///data/data/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>

start

 bin/start-hbase.sh

run

 bin/hbase shell

install nutch

download

Download from: http://mirror.bit.edu.cn/apache/nutch/

cd /data/server
wget http://mirror.bit.edu.cn/apache/nutch/2.3/apache-nutch-2.3-src.tar.gz
tar -zxvf apache-nutch-2.3-src.tar.gz
cd apache-nutch-2.3

configuration

ivy/ivy.xml

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />  

conf/gora.properties

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

build

 ant clean
 ant runtime

config

runtime/local/conf/nutch-site.xml

<configuration>

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
  </property>

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <!-- do not leave the seeded domains (optional) -->
  </property>

  <!-- elasticsearch index properties -->
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
    <description>The hostname to send documents to using TransportClient. Either host and port must be defined or cluster.
    </description>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
    <description>The cluster name to discover. Either host and port must be defined or cluster.
    </description>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36</value>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>Programmer's search</value>
  </property>

  <property>
    <name>http.robots.403.allow</name>
    <value>true</value>
  </property>

  <property>
    <name>http.agent.url</name>
    <value>http://hisearch.cn</value>
  </property>

  <property>
    <name>http.verbose</name>
    <value>true</value>
  </property>

  <property>
    <name>http.accept.language</name>
    <value>zh,zh-CN;q=0.8,en;q=0.6</value>
  </property>

  <property>
    <name>http.agent.version</name>
    <value>0.1</value>
  </property>

</configuration>

runtime/local/conf/hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///data/data/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>

run

cd runtime/local
mkdir seed
echo "http://www.cnblogs.com" > seed/urls.txt
bin/nutch inject seed/
bin/nutch generate -topN 10
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
bin/crawl seed/ testCrawl 3
bin/nutch elasticindex elasticsearch -all

python+logstash+elasticsearch+Kibana logging solution

On the Python side I use the python-logstash library: https://github.com/vklochan/python-logstash

The whole stack is quite convenient to use.

Install, configure, and run logstash

Download and install logstash

Open https://www.elastic.co/downloads/logstash, find the download link for the latest version, fetch the rpm with wget, then install it with yum:

wget https://download.elastic.co/logstash/logstash/packages/centos/logstash-2.1.1-1.noarch.rpm
yum install logstash-2.1.1-1.noarch.rpm

You can also install directly from the package repo: https://www.elastic.co/guide/en/logstash/current/package-repositories.html

Configure logstash

vim /etc/logstash/conf.d/logstash.conf

input {  
  tcp {
    port => 5959
    codec => json
  }  
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}

Run logstash

chkconfig logstash on
/etc/init.d/logstash start
telnet 127.0.0.1 5959 #test
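The telnet check only proves the port is open. What python-logstash will do later can be sketched with the standard library alone: one newline-terminated JSON document sent over TCP, matching the tcp input with codec => json configured above (the event field names here are just examples):

```python
import json
import socket

def send_event(host, port, event):
    """Send one JSON-encoded, newline-terminated event over TCP --
    the shape a tcp input with `codec => json` will decode."""
    payload = json.dumps(event).encode("utf-8") + b"\n"
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)

# Usage against the logstash instance configured above:
# send_event("127.0.0.1", 5959, {"message": "hello", "level": "INFO"})
```

This is only a sanity-check sketch; the real handler in the test script below also adds the @version/@timestamp fields that logstash expects.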

Python logstash

Install the Python package

pip install python-logstash

Test script: vim test.py

import logging
import logstash
import sys

host = 'localhost'

test_logger = logging.getLogger('python-logstash-logger')
test_logger.setLevel(logging.DEBUG)
test_logger.addHandler(logstash.TCPLogstashHandler(host, 5959, version=1))

test_logger.error('python-logstash: test logstash error message.')
test_logger.info('python-logstash: test logstash info message.')
test_logger.warning('python-logstash: test logstash warning message.')

extra = {
    'test_string': 'python version: ' + repr(sys.version_info),
    'test_boolean': True,
    'test_dict': {'a': 1, 'b': 'c'},
    'test_float': 1.23,
    'test_integer': 123,
    'test_list': [1, 2, '3'],
}
test_logger.info('python-logstash: test extra fields', extra=extra)

Check whether it worked

curl "http://127.0.0.1:9200/_search?pretty&q=logstash"

Install and use Kibana

Find the latest version of Kibana at https://www.elastic.co/downloads/kibana:

wget https://download.elastic.co/kibana/kibana/kibana-4.3.1-linux-x64.tar.gz
tar -zxf kibana-4.3.1-linux-x64.tar.gz

Edit config/kibana.yml, find the elasticsearch.url line, and decide whether it needs changing for your setup; if you change it, remember to remove the leading comment character.

Run bin/kibana to start the service, open http://127.0.0.1:5601/, and click Create.

Configure nginx access

Install the htpasswd tool and generate a username and password:

 yum install httpd-tools
 htpasswd -b -c /data/kibana.htpasswd username password

Configure the nginx server:

upstream kibana {
    server 127.0.0.1:5601 fail_timeout=0;
}

server {
    listen      80;
    server_name          kibana.domain.com;

    location / {
        auth_basic "Restricted";
        auth_basic_user_file /data/kibana.htpasswd;
        proxy_pass http://kibana;
   }
}

Reload nginx

nginx -s reload

linux suspend wifi

lsmod | grep wifi

output

iwlwifi               200704  1 iwldvm
cfg80211              548864  3 iwlwifi,mac80211,iwldvm

sudo vim /etc/pm/config.d/config

SUSPEND_MODULES="iwldvm iwlwifi"

sudo service network-manager restart

elasticsearch highlighting

PUT /my_index

{
  "mappings": {
    "doc_type": {
      "properties": {
        "content": {
          "type": "string",
          "term_vector": "with_positions_offsets",
          "analyzer": "snowball"
        }
      }
    }
  }
}

POST /_search

{
  "query": {
    "multi_match": {
      "query": "公司",
      "type": "best_fields",
      "fields": [
        "title",
        "content"
      ]
    }
  },
  "filter": {
    "term": {
      "site": "baidu.com"
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 100,
        "number_of_fragments": 2,
        "no_match_size": 100,
        "term_vector": "with_positions_offsets",
        "boundary_chars": " 。,?",
        "max_boundary_size": 80,
        "force_source": true
      }
    }
  }
}
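For scripted searches, the same body can be assembled in Python and serialized with json.dumps before POSTing to /_search (a sketch; the field names mirror the index above, and the highlight options are trimmed to the basic ones):

```python
import json

def build_search(query_text, site, fragment_size=100, fragments=2):
    """Assemble a multi_match query with a term filter and highlighting on `content`."""
    return {
        "query": {
            "multi_match": {
                "query": query_text,
                "type": "best_fields",
                "fields": ["title", "content"],
            }
        },
        "filter": {"term": {"site": site}},
        "highlight": {
            "fields": {
                "content": {
                    "fragment_size": fragment_size,
                    "number_of_fragments": fragments,
                    "no_match_size": fragment_size,
                }
            }
        },
    }

# ensure_ascii=False keeps CJK query terms readable in the request body
body = json.dumps(build_search("公司", "baidu.com"), ensure_ascii=False)
```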

Processing a large XML file with ElementTree

This XML file is over 30 GB; the keys are iterparse and e.clear().

from xml.etree.cElementTree import iterparse
from datetime import datetime
import redis
import json
import lxml.html
import re
import traceback
import time
import cgi


def main():
    redisConn = redis.from_url("redis://localhost:6379/0")
    i = 0
    xmlfile = "/data/download/Posts.xml"
    for event, e in iterparse(xmlfile):
        if e.tag == "row" and e.get("PostTypeId") == "1":
            try:
                data = {
                    "url": "http://stackoverflow.com/questions/" + e.get("Id"),
                    "title": cgi.escape(e.get("Title")),
                    "content": cgi.escape(lxml.html.fromstring(e.get("Body")).text_content()),
                    "tags": cgi.escape(",".join(re.findall("<([^>]+)>", e.get("Tags")))),
                    'site': 'stackoverflow',
                    "timestamp": datetime.now().isoformat()
                }
                redisConn.lpush('ResultQueue', json.dumps(data))
            except:
                traceback.print_exc()
                print e.attrib
                continue
        i += 1
        if i % 1000 == 0:
            time.sleep(5)  # brief pause every 1000 rows
            print i
        e.clear()  # release the element so the 30 GB tree never accumulates in memory


if __name__ == '__main__':
    main()

centos elasticsearch install

download and install via https://www.elastic.co/downloads/

yum install https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.1.0/elasticsearch-2.1.0.rpm

make data and logs dir

mkdir -p /data/elastic/data
mkdir -p /data/elastic/logs
chown -R elasticsearch:elasticsearch /data/elastic/

edit config /etc/elasticsearch/elasticsearch.yml

path.data: /data/elastic/data
path.logs: /data/elastic/logs
network.host: 127.0.0.1

edit start script /etc/init.d/elasticsearch

LOG_DIR="/data/elastic/logs"
DATA_DIR="/data/elastic/data"

install java jdk

yum install java-1.8.0-openjdk

start

systemctl enable elasticsearch
/etc/init.d/elasticsearch start

test

/etc/init.d/elasticsearch status
curl http://127.0.0.1:9200/
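A small stdlib-only sketch of the same check from Python (the URL assumes the network.host setting above; the parsing relies on the standard /_cluster/health JSON shape):

```python
import json
from urllib.request import urlopen

def cluster_status(health_json):
    """Pull the status colour (green/yellow/red) out of a /_cluster/health body."""
    return json.loads(health_json).get("status", "unknown")

def check(base_url="http://127.0.0.1:9200"):
    """Query a running node; only works once elasticsearch is up."""
    with urlopen(base_url + "/_cluster/health") as resp:
        return cluster_status(resp.read().decode("utf-8"))

# Offline demonstration with a canned response body:
print(cluster_status('{"cluster_name": "elasticsearch", "status": "green"}'))
# → green
```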