Category archive: Python

Scrapy crawler resources

Web Scraping and Crawling With Scrapy and MongoDB

django-dynamic-scraper

Django Dynamic Scraper (DDS) is a Django app built on top of the scraping framework Scrapy. While preserving many of Scrapy's features, it lets you dynamically create and manage spiders via the Django admin interface.

Indexing web sites in Solr with Python

Scrapy at a glance

Building a Web Crawler with Scrapy

pypy+pip+virtualenv

Install PyPy

yum install pypy*

Install pip

wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py 
pypy get-pip.py 

Add an alias for PyPy's pip

alias pypy_pip='/usr/lib64/pypy-2.2.1/bin/pip'

Create a virtualenv

virtualenv ~/pypy -p `which pypy` 
source ~/pypy/bin/activate 
python --version 
deactivate
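
`python --version` shows the version string, but it can also help to confirm from inside Python that the activated virtualenv really runs PyPy rather than CPython. A minimal sketch using only the standard library (the helper name `interpreter_summary` is made up for illustration):

```python
import platform
import sys


def interpreter_summary():
    """Return (implementation, version, executable) for the running interpreter."""
    return (
        platform.python_implementation(),  # "PyPy" under PyPy, "CPython" otherwise
        platform.python_version(),
        sys.executable,
    )


if __name__ == "__main__":
    impl, version, exe = interpreter_summary()
    print("%s %s at %s" % (impl, version, exe))
```

Run it once with the virtualenv activated and once after `deactivate` to compare the two interpreters.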

Setting up a Python environment on Amazon Linux (AWS)

Allow remote password login

Set a password for root and allow remote login with that password:

chmod 400 xxx.pem
ssh -i xxx.pem ec2-user@host-ip    
sudo passwd root
sudo su -
vim /etc/ssh/sshd_config 

Set the following options in /etc/ssh/sshd_config:

PermitRootLogin yes
PasswordAuthentication yes
UsePAM yes

Then reload sshd:

service sshd reload
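
Before reloading, it can be worth checking that the three options really are the effective ones, because sshd uses the first value it reads for each keyword. The following is a rough sketch, not a full sshd_config parser: it ignores `Match` blocks, and it reports a missing keyword as a mismatch even though sshd would fall back to its built-in default. The names `REQUIRED`, `effective_options`, and `mismatches` are hypothetical.

```python
REQUIRED = {
    "PermitRootLogin": "yes",
    "PasswordAuthentication": "yes",
    "UsePAM": "yes",
}


def effective_options(text):
    """Map lowercase keyword -> first value found, skipping comments and blanks."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Keywords are case-insensitive; the first occurrence wins.
            options.setdefault(parts[0].lower(), parts[1].strip())
    return options


def mismatches(text, required=REQUIRED):
    """Return {keyword: actual value or None} for settings that differ."""
    actual = effective_options(text)
    problems = {}
    for keyword, wanted in required.items():
        found = actual.get(keyword.lower())
        if found is None or found.lower() != wanted:
            problems[keyword] = found
    return problems


if __name__ == "__main__":
    with open("/etc/ssh/sshd_config") as fh:
        print(mismatches(fh.read()) or "all required options are set")
```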

Enable EPEL

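Running Red Hat on AWS incurs an extra license fee. The current official AWS image (2014.03) is based on CentOS 6 and ships fairly old versions of many packages; the EPEL repository can be used to install newer ones.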

yum-config-manager --enable epel
yum update

Install the basic build environment

yum groupinstall "Development tools" -y
yum install openssl-devel libxslt-devel libxml2-devel libffi-devel -y

Install the MySQL client

vim /etc/yum.repos.d/MariaDB.repo

# MariaDB 5.5 CentOS repository list - created 2014-09-13 05:43 UTC
# http://mariadb.org/mariadb/repositories/
[mariadb]
name = MariaDB
baseurl = http://yum.mariadb.org/5.5/centos6-amd64
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck=1

Install the MariaDB client

yum install MariaDB-client mysql-devel -y

Install the Redis client

rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
rpm -Uvh http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
yum-config-manager --enable remi
yum install redis -y

Install the MongoDB client

vim /etc/yum.repos.d/mongodb.repo

[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
gpgcheck=0
enabled=1
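
Yum `.repo` files use INI syntax, so a hand-written one can be sanity-checked before running yum. A small sketch, assuming Python 3's `configparser`; the constant `MONGODB_REPO` simply repeats the file above, and `read_repo` is a hypothetical helper:

```python
import configparser

MONGODB_REPO = """\
[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
gpgcheck=0
enabled=1
"""


def read_repo(text):
    """Parse yum repo text into {section: {option: value}}."""
    # interpolation=None keeps characters such as '%' in URLs untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    for name, options in read_repo(MONGODB_REPO).items():
        print(name, options["baseurl"], "gpgcheck=" + options["gpgcheck"])
```

Note that `gpgcheck=0` tells yum to skip package signature verification for this repository.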

Then install the MongoDB shell:

yum update
yum install mongodb-org-shell -y

Install the Python environment

# yum install centos-release-SCL (aliyun only)
yum install python27*
yum install freetype-devel libjpeg-devel libpng-devel
# scl enable python27 bash (aliyun only)
virtualenv --no-site-packages /data/pyenv
source /data/pyenv/bin/activate
pip install redis cryptography sqlalchemy flask simplejson mongoengine python-amazon-product-api scrapy mysql-python gunicorn gevent
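
After the `pip install` finishes, a quick import check shows whether every package actually built and installed. This is a sketch: `missing_modules` is a hypothetical helper, and the import names listed are assumptions where they differ from the package names (for example, mysql-python is imported as `MySQLdb`). python-amazon-product-api is left out because its import name is not given here.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    wanted = [
        "redis", "cryptography", "sqlalchemy", "flask", "simplejson",
        "mongoengine", "scrapy", "MySQLdb", "gunicorn", "gevent",
    ]
    print("missing:", missing_modules(wanted) or "none")
```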

Install the GlusterFS client

#install glusterfs repo
wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/EPEL.repo/glusterfs-epel.repo
#fix it for amazon linux
sed -i 's/$releasever/6/g' /etc/yum.repos.d/glusterfs-epel.repo 

#install glusterfs
yum install -y glusterfs-fuse

#setup fstab
echo "172.31.42.77:/pcvol /data_pcvol glusterfs defaults,noatime 0 0" >> /etc/fstab

#mount
mkdir /data_pcvol
mount -a
ls /data_pcvol

Mount the log disk

lsblk
mkfs -t ext4 /dev/xvdf
echo '/dev/xvdf       /data_log   ext4    defaults,nofail        0       2' >> /etc/fstab
mkdir /data_log
mount -a
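
Both fstab entries in this post follow the usual six-field layout (device, mount point, filesystem type, options, dump, pass). Since a malformed line can affect booting, a small parser is a cheap way to double-check a line before appending it. A sketch with hypothetical names (`FstabEntry`, `parse_fstab_line`):

```python
from collections import namedtuple

FstabEntry = namedtuple(
    "FstabEntry", "device mountpoint fstype options dump passno"
)


def parse_fstab_line(line):
    """Parse one fstab line; return None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields, got %d" % len(fields))
    # The fifth and sixth fields are optional and default to 0.
    dump = int(fields[4]) if len(fields) > 4 else 0
    passno = int(fields[5]) if len(fields) > 5 else 0
    return FstabEntry(
        fields[0], fields[1], fields[2], fields[3].split(","), dump, passno
    )


if __name__ == "__main__":
    print(parse_fstab_line(
        "/dev/xvdf       /data_log   ext4    defaults,nofail        0       2"
    ))
```

The `nofail` option in the log-disk entry lets the instance boot even if the volume is not attached.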

A bug in the UPYUN Python SDK

The UPYUN Python SDK has a bug that causes some file uploads to fail. Test image.

Cause

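Some image-processing software leaves the image without a file descriptor, and Python's fileno() raises an exception when the stream has no underlying descriptor. The SDK checks whether the method exists but does not handle the exception, so the upload fails. https://docs.python.org/3/library/io.html#io.IOBase.fileno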
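
The failure mode is easy to reproduce with an in-memory stream. The sketch below uses `io.BytesIO` as a stand-in for whatever object the image software returns (that substitution is an assumption), and `fileno_or_none` is a hypothetical helper:

```python
import io

# An in-memory stream: it has a fileno attribute but no real file descriptor.
buf = io.BytesIO(b"fake image bytes")


def fileno_or_none(stream):
    """Return the stream's file descriptor, or None if it has none."""
    try:
        return stream.fileno()
    except (AttributeError, OSError):
        # io.UnsupportedOperation is a subclass of OSError.
        return None


if __name__ == "__main__":
    print(hasattr(buf, "fileno"))   # the hasattr check alone passes
    print(fileno_or_none(buf))      # yet no descriptor is available
```

This is why a `hasattr(value, 'fileno')` test is not enough to decide that `os.fstat(value.fileno())` is safe to call.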

Fix

Open upyun.py in the SDK and, around line 173, replace the first block below with the second:

    if hasattr(value, 'fileno'):
        length = os.fstat(value.fileno()).st_size
    elif hasattr(value, '__len__'):
        length = len(value)
        headers['Content-Length'] = length
    elif value is not None:
        raise UpYunClientException('object type error')

    if hasattr(value, 'getvalue'):
        length = len(value.getvalue())
        headers['Content-Length'] = length
    elif hasattr(value, '__len__'):
        length = len(value)
        headers['Content-Length'] = length
    elif hasattr(value, 'fileno'):
        length = os.fstat(value.fileno()).st_size
        headers['Content-Length'] = length
    elif value is not None:
        raise UpYunClientException('object type error')
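
The ordering in the replacement block can be exercised outside the SDK. The sketch below mirrors that logic in a standalone helper; `content_length` is a hypothetical name, and `TypeError` stands in for the SDK's `UpYunClientException` so that the example has no dependency on upyun.

```python
import io
import os
import tempfile


def content_length(value):
    """Return the body length, checking in-memory buffers before fileno()."""
    if hasattr(value, "getvalue"):
        # In-memory streams such as io.BytesIO: measure the buffered data.
        return len(value.getvalue())
    if hasattr(value, "__len__"):
        # Plain bytes or str payloads.
        return len(value)
    if hasattr(value, "fileno"):
        # Real files: ask the OS for the size.
        return os.fstat(value.fileno()).st_size
    if value is None:
        return None
    raise TypeError("object type error")


if __name__ == "__main__":
    print(content_length(io.BytesIO(b"abcd")))
    with tempfile.TemporaryFile() as fh:
        fh.write(b"12345")
        fh.flush()
        print(content_length(fh))
```

Because `getvalue` is tested first, an in-memory buffer never reaches the `fileno()` call that raised in the original code.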

Python interview questions

  • Describe the Python framework you know best: its main features, its key libraries, the development workflow, and the project structure.

  • Write a piece of Python code that handles exceptions and logging.

  • Write a piece of Python code that reads from and writes to a database.

  • Explain, in words or in code, list comprehensions, decorators, anonymous functions, and functional programming in Python.

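  • Design a news database holding 1,000,000 records across 50 categories. Each category listing page must show the 20 most recent news items in that category. Design the news table structure and its indexes, and write the CREATE TABLE statement and the query SQL.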

Speeding up pip with a mirror

With the default configuration, installing Python packages with pip can be very slow. In that case, consider using a mirror. http://www.pypi-mirrors.org/

vim ~/.pip/pip.conf

[global]
trusted-host = pypi.doubanio.com
index-url = https://pypi.doubanio.com/simple/

From the command line

pip config --global set global.index-url https://pypi.doubanio.com/simple/
pip config --global set global.trusted-host pypi.doubanio.com
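
When provisioning several machines, the same pip.conf can be generated from a script instead of being edited by hand. A minimal sketch, assuming Python 3's `configparser`; `render_pip_conf` is a hypothetical helper, and writing the result to `~/.pip/pip.conf` is left to the caller:

```python
import configparser
import io


def render_pip_conf(index_url, trusted_host):
    """Return pip.conf text with a [global] section for the given mirror."""
    # interpolation=None so that '%' in a URL is stored literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser["global"] = {
        "index-url": index_url,
        "trusted-host": trusted_host,
    }
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(render_pip_conf(
        "https://pypi.doubanio.com/simple/", "pypi.doubanio.com"
    ))
```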

More detailed pip.conf configuration options are documented at:

http://www.pip-installer.org/en/latest/user_guide.html#configuration