The async MQ dilemma

2018-06-29

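In the internal crawler framework we are currently developing, the part that pulls tasks from the message queue has run into some problems, as shown in the figure below.

Our whole model is async, built on asyncio. For a message queue client the options are pika and kombu. kombu is a good choice: compared with pika's callback style it is a higher-level wrapper and more convenient to write, and it officially works with eventlet (for example, OpenStack's nova uses eventlet with kombu underneath). However, kombu does not currently support asyncio. Official support is planned for version 5.0, which I am looking forward to, but that is a digression.

So for now pika seems to be the only option. Starting from the official example, and lacking the skill to do better, I could only add code on top of the callback. There are two versions, shown below.

Pushing to a thread pool / process pool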


class AsyncPikaConsumer(object):
    def on_message(self, unused_channel, basic_deliver, properties, body):
        """Invoked by pika when a message is delivered from RabbitMQ. The
        channel is passed for your convenience. The basic_deliver object that
        is passed in carries the exchange, routing key, delivery tag and
        a redelivered flag for the message. The properties passed in is an
        instance of BasicProperties with the message properties and the body
        is the message that was sent.

        :param pika.channel.Channel unused_channel: The channel object
        :param pika.Spec.Basic.Deliver: basic_deliver method
        :param pika.Spec.BasicProperties: properties
        :param str|unicode body: The message body

        """
        LOGGER.info('Received message # %s from %s: %s',
                    basic_deliver.delivery_tag, properties.app_id, body)

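        # Hand the received message to a thread pool / process pool.
        # run_in_executor takes the executor first (None means the loop's default executor).
        self._connection.ioloop.loop.run_in_executor(
            None, crawl.dispatch, unused_channel, basic_deliver, properties, body)
        # Acknowledge right away; the message is never rejected, so it is never requeued.
        self.acknowledge_message(basic_deliver.delivery_tag)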

class Crawl(object):
    def dispatch(self, unused_channel, basic_deliver, properties, body):
        try:
            if body["type"] == "list":
                return self.get_list(body)
            return self.get_data(body)
        except Exception as e:
            # todo something.
            pass

    def get_list(self, task):
        pass

    def get_data(self, task):
        pass

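Tasks fetched asynchronously are thrown to a thread pool / process pool, and the application author does the real work in get_list and get_data. This is a big problem: a perfectly good async model has been turned into something that feels synchronous. If I want to run an async task in the business layer, I find that the current worker thread has no event loop to get, which is awkward. The Tornado documentation hints at this where it describes how to run synchronous code. Doesn't this look very similar?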
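One way out of the "no event loop in this thread" situation is to hand the coroutine back to the loop that owns the consumer. The sketch below is only an illustration under assumptions: `fetch` and `dispatch` are hypothetical stand-ins for business code, and the loop is passed to the worker explicitly.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch(task):
    # Stand-in for real asynchronous business logic (hypothetical).
    await asyncio.sleep(0.01)
    return "done:" + task


def dispatch(loop, task):
    # Runs in a worker thread, which has no running event loop of its own.
    # Schedule the coroutine on the main loop and block this thread on the result.
    future = asyncio.run_coroutine_threadsafe(fetch(task), loop)
    return future.result(timeout=5)


async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(2) as pool:
        # The main loop stays free while the worker thread waits.
        return await loop.run_in_executor(pool, dispatch, loop, "list")


result = asyncio.run(main())
print(result)
```

This works, but every task now crosses between a thread and the loop twice, which is part of why the thread pool version feels clumsy.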


executor = concurrent.futures.ThreadPoolExecutor(8)

class ThreadPoolHandler(RequestHandler):
    @gen.coroutine
    def get(self):
        for i in range(5):
            print(i)
            yield executor.submit(time.sleep, 1)

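Asynchronous handling

The version above is hard to live with, so we still have to find a way to make this asynchronous.
Earlier, when reading how nameko handles consumers, I saw that it uses eventlet.spawn to start a new green thread for the work so that the current loop is not blocked. asyncio must have an equivalent, and it does: asyncio.async and asyncio.ensure_future. They are really the same function, and asyncio.async is deprecated, so only ensure_future is covered here.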


class AsyncPikaConsumer(object):
    def on_message(self, unused_channel, basic_deliver, properties, body):
        """Invoked by pika when a message is delivered from RabbitMQ. The
        channel is passed for your convenience. The basic_deliver object that
        is passed in carries the exchange, routing key, delivery tag and
        a redelivered flag for the message. The properties passed in is an
        instance of BasicProperties with the message properties and the body
        is the message that was sent.

        :param pika.channel.Channel unused_channel: The channel object
        :param pika.Spec.Basic.Deliver: basic_deliver method
        :param pika.Spec.BasicProperties: properties
        :param str|unicode body: The message body

        """
        LOGGER.info('Received message # %s from %s: %s',
                    basic_deliver.delivery_tag, properties.app_id, body)

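        # This is the part that changed: schedule a coroutine instead of blocking.
        asyncio.ensure_future(self.deal_message(unused_channel, basic_deliver, properties, body))
        # Acknowledge right away; the message is never rejected, so it is never requeued.
        self.acknowledge_message(basic_deliver.delivery_tag)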

 
class Crawl(object):
    config = {"CONCURRENT": 2}

    async def dispatch(self, u, b, p, body):
        # Simplified; the point is the flow.
        await self.get_list(json.loads(body.decode()))

    async def get_list(self, body):
        # time.sleep(3)

        try:
            # ClientSession is an async context manager, so it needs "async with".
            async with aiohttp.ClientSession() as session:
                resp = await session.get(body["url"], timeout=10)
                body = await resp.text()
                print("handled body asynchronously:", resp.status)
        except Exception:
            print("request failed or timed out")

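The pika connection adapter in use is the asyncio one, and according to the pika documentation a new task is only delivered once the current one has completed.
If the code a user writes under get_list above is synchronous, throughput will be very poor, so this design forces business code into the async model. That seems a bit aggressive, so this change is on hold for now.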
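To make the scheduling behavior concrete, here is a minimal sketch with no pika involved. `on_message` and `deal_message` are hypothetical stand-ins: the synchronous callback only schedules the coroutine and returns at once, and the work happens later on the loop.

```python
import asyncio

handled = []


async def deal_message(body):
    # Stand-in for the real asynchronous handler (hypothetical).
    await asyncio.sleep(0.01)
    handled.append(body)


def on_message(body):
    # A plain synchronous callback: schedule the coroutine, do not wait for it.
    return asyncio.ensure_future(deal_message(body))


async def main():
    tasks = [on_message(b) for b in ("a", "b", "c")]
    snapshot = list(handled)  # nothing has run yet at this point
    await asyncio.gather(*tasks)
    return snapshot


snapshot = asyncio.run(main())
print(snapshot, sorted(handled))
```

If deal_message called blocking code such as time.sleep, all three tasks would run one after another and the loop would stall, which is exactly the concern raised above.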

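Synchronous handling

It has been a few days since the last update. After some thought, we are going with a multi-thread / multi-process model for now. Why not async? I think it comes down to the following points: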

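As for message queue clients, there is no client with proper async support, and kazoo is the same. Both do support gevent and eventlet, so why not use those? From reading the OpenStack forums and watching where asyncio is heading, it seems better to follow the direction the technology is going. If kombu gains asyncio support next year, this part can be reimplemented, and the overall architecture will be more mature than it is now.
The number of proxy tunnels is limited. This should not really be a framework-level concern, because it can be handled by limiting concurrency (currently done with a semaphore). We have not reached the point where async is required to gain speed or reduce resource consumption.
It meets our current needs. Also, with synchronous code the crawler team can more easily use the tools they already know, such as headless Chrome, selenium, and splash. With async, those would have to go through run_in_executor, with another wrapper layered over the synchronous handling, which is also a hassle.
Real use surfaced other problems. Because list and data tasks are pushed to the same message queue, list tasks pile up and data tasks rarely get consumed, and the backlog keeps the newest messages from being consumed promptly. The queue is FIFO and production outpaces consumption, so data results take a long time to show up. The plan is to split this into two queues, a list queue and a data queue, and give the data queue higher priority so results arrive sooner.
In addition, since list tasks are produced so quickly, how can they be processed faster? Several clients can consume at the same time, and because there is deduplication, this raises the consumption rate.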
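The semaphore-based concurrency limit mentioned above can be sketched as follows. This is an illustration only: `MAX_TUNNELS`, `crawl`, and the sleep that stands in for a request are all assumptions, not the framework's code.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_TUNNELS = 2  # hypothetical number of available proxy tunnels
tunnels = threading.Semaphore(MAX_TUNNELS)
lock = threading.Lock()
active = 0
peak = 0


def crawl(url):
    global active, peak
    with tunnels:  # at most MAX_TUNNELS threads get past this line at once
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for the real request
        with lock:
            active -= 1
    return url


with ThreadPoolExecutor(8) as pool:
    results = list(pool.map(crawl, ["u%d" % i for i in range(10)]))

print(peak)
```

Eight threads exist, but the peak number of simultaneous "requests" never exceeds the tunnel count.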

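The first version needs to be made stable first, because there are still some problems. This is a distributed crawler framework, and when a program exits (abnormally or not) the server side is slow to see the client go offline. The cause still has to be tracked down and the handling hardened. Timers and other features will be added once it is stable.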

Decorators in Python: a topic you cannot skip

2018-06-19

The common form you see everywhere


from functools import wraps

def cal_time(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

Passing arguments to a decorator


from functools import wraps

def cal_time(max_time):
    def wrapper_func(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        return wrapper
    return wrapper_func
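The skeleton above never actually uses max_time. As a hedged illustration of what the extra layer is for, the sketch below records the elapsed time and compares it with max_time; the `elapsed` and `too_slow` attributes are made up for this example.

```python
import time
from functools import wraps


def cal_time(max_time):
    # Outer layer: receives the decorator's own argument.
    def wrapper_func(func):
        # Middle layer: receives the decorated function.
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            wrapper.elapsed = time.perf_counter() - start
            # Hypothetical attribute: flag calls slower than max_time seconds.
            wrapper.too_slow = wrapper.elapsed > max_time
            return result
        return wrapper
    return wrapper_func


@cal_time(max_time=1.0)
def add(a, b):
    return a + b


value = add(1, 2)
print(value, add.__name__, add.too_slow)
```

Thanks to functools.wraps, the decorated function keeps its original name and docstring.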

Today, though, I want to introduce two other styles, both of which are fairly common in libraries.

wrapper callback

class App(object):
    def __init__(self):
        self.exc = None

    def on_server_error(self, callback):
        # Other work can be done here.
        self.exc = callback
        # Return the callback so the decorated name is not rebound to None.
        return callback


if __name__ == "__main__":
    app = App()

    @app.on_server_error
    def deal_server_error():
        print("deal server error")

    try:
        raise ValueError("service error")
    except Exception:
        app.exc()

flask-like decorator

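Why call it flask-like? Partly because I could not think of a better name, and partly because this style shows up in Flask. It does not wrap the call in a func(*args, **kwargs) layer; it only registers func and passes arguments to it later.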


import sys

class App(object):
    def __init__(self):
        self.exc = []

    def errorhandler(self, code):
        def decorator(f):
            self.exc.append(f)
            return f

        return decorator


if __name__ == '__main__':
    app = App()


    @app.errorhandler(403)
    def deal_error(e):
        print("403 error.")


    def deal_exception(e):
        exc_type, exc_value, tb = sys.exc_info()
        assert exc_value is e
        print(exc_type, exc_value, tb)
        for deco in app.exc:
            deco(exc_type)


    try:
        raise ValueError("dddd")
    except Exception as e:
        deal_exception(e)
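In the example above the code argument of errorhandler is accepted but ignored. A small variation, offered only as a sketch, keys the registered handlers by that code; the `handlers` dict and the `handle` method are hypothetical names, not Flask's API.

```python
class App(object):
    def __init__(self):
        self.handlers = {}

    def errorhandler(self, code):
        def decorator(f):
            # Register only; the function itself is returned unwrapped.
            self.handlers[code] = f
            return f
        return decorator

    def handle(self, code):
        # Hypothetical dispatch helper: look up the handler for this code.
        handler = self.handlers.get(code)
        if handler is None:
            return "unhandled %d" % code
        return handler(code)


app = App()


@app.errorhandler(403)
def forbidden(code):
    return "forbidden"


print(app.handle(403), app.handle(500))
```

Because the decorator returns f unchanged, forbidden can still be called directly like any other function.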

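Understanding and using cached_property

2018-05-25

Today, on a colleague's recommendation, I looked at the code of Shanbay's sea.
I did not read it closely, but I came across its cached_property, which reminded me of the cached_property in our internal crawler framework. When I wrote that code, the goal was roughly this: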

class A(object):
    def __init__(self):
        pass

    def random(self):
        return random.random()


class B(object):

    @cached_property
    def a(self):
        return A()

A is a class, and B is something like a superset of A. In normal use you instantiate B and then call A's methods through it. The basic flow is:

>>> b = B()
>>> b.a.random()
>>> b.a.other_method()

In the original design I had not thought of using property to turn a method into an attribute. I figured two initialization steps would do, like this:


>>> b = B()
>>> a = b.a()
>>> a.random()
>>> a.other_method()

Later I realized this design was a bit crude: colleagues calling it might not use it this way, and B will also hold other things like A, for example:

class B(object):

    @cached_property
    def a(self):
        return A()

    @cached_property
    def c(self):
        return C()

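Do we really have to fetch the c instance by hand every time? That is a bit silly, so I followed Flask's cached_property implementation and switched to the flow shown earlier. Now to the main topic: what does cached_property actually do?

First, Flask's implementation: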



class _Missing(object):

    def __repr__(self):
        return 'no value'

    def __reduce__(self):
        return '_missing'


_missing = _Missing()


class cached_property(property):
    def __init__(self, func, name=None, doc=None):
        self.__name__ = name or func.__name__
        self.__module__ = func.__module__
        self.__doc__ = doc or func.__doc__
        self.func = func

    def __set__(self, obj, value):
        obj.__dict__[self.__name__] = value

    def __get__(self, obj, type=None):
        if obj is None:
            return self
        value = obj.__dict__.get(self.__name__, _missing)
        if value is _missing:
            value = self.func(obj)
            obj.__dict__[self.__name__] = value
        return value

Now sea's implementation:


class cached_property:
    """ thread safe cached property """

    def __init__(self, func, name=None):
        self.func = func
        self.__doc__ = getattr(func, '__doc__')
        self.name = name or func.__name__
        self.lock = Lock()

    def __get__(self, instance, cls=None):
        with self.lock:
            if instance is None:
                return self
            try:
                return instance.__dict__[self.name]
            except KeyError:
                res = instance.__dict__[self.name] = self.func(instance)
                return res

The only real difference is probably the added lock. Setting the lock aside, let us look at how cached_property itself works.


class B(object):
    @cached_property
    def b(self):
        pass

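Step 1

Anyone who has learned Python should be familiar with decorators. As soon as cached_property is applied to b, an instance of the cached_property class has already been created (see the class-based timing decorator at the end of this post). So how does the function get passed in?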



class cached_property:
    """ thread safe cached property """

    def __init__(self, func, name=None):
        self.func = func  # func is the method b defined above
        self.__doc__ = getattr(func, '__doc__')  # None
        self.name = name or func.__name__  # func.__name__ is "b"

Step 2

What happens on attribute access? See the pure-Python equivalent of property from the Python documentation:

class Property(object):
    "Emulate PyProperty_Type() in Objects/descrobject.c"

    def __init__(self, fget=None, fset=None, fdel=None, doc=None):
        self.fget = fget
        self.fset = fset
        self.fdel = fdel
        if doc is None and fget is not None:
            doc = fget.__doc__
        self.__doc__ = doc

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if self.fget is None:
            raise AttributeError("unreadable attribute")
        return self.fget(obj)

    def __set__(self, obj, value):
        if self.fset is None:
            raise AttributeError("can't set attribute")
        self.fset(obj, value)

    def __delete__(self, obj):
        if self.fdel is None:
            raise AttributeError("can't delete attribute")
        self.fdel(obj)

    def getter(self, fget):
        return type(self)(fget, self.fset, self.fdel, self.__doc__)

    def setter(self, fset):
        return type(self)(self.fget, fset, self.fdel, self.__doc__)

    def deleter(self, fdel):
        return type(self)(self.fget, self.fset, fdel, self.__doc__)

def __get__(self, instance, cls=None):
    # instance is the instance of B
    with self.lock:
        if instance is None:
            return self
        try:
            return instance.__dict__[self.name]
        except KeyError:
            # If self.name (the method name "b") is not yet in the instance's __dict__,
            # call self.func (the method b) with the instance and store the result
            # in the instance's __dict__.
            res = instance.__dict__[self.name] = self.func(instance)
            return res

Summary

class B(object):

    @cached_property
    def a(self):
        return A()

    @cached_property
    def c(self):
        return C()


>>> b = B()
>>> for i in range(3):
        print(b.__dict__)
        b.a.random()

The rest is left for you to work out.
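For readers who want to see the caching happen, here is a minimal self-contained sketch. It uses a stripped-down cached_property without the lock (an assumption made to keep the example short) and a `calls` counter that is added purely for illustration.

```python
import random


class cached_property(object):
    # Stripped-down version without the lock, for illustration only.
    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, instance, cls=None):
        if instance is None:
            return self
        # Store the result on the instance; later lookups find it in
        # instance.__dict__ first and never reach this descriptor again.
        value = instance.__dict__[self.name] = self.func(instance)
        return value


class A(object):
    def random(self):
        return random.random()


class B(object):
    calls = 0  # hypothetical counter, only here to observe the caching

    @cached_property
    def a(self):
        B.calls += 1
        return A()


b = B()
before = dict(b.__dict__)
first = b.a
second = b.a
after = dict(b.__dict__)
print(before, B.calls, first is second)
```

This version defines only __get__, so it is a non-data descriptor and the instance __dict__ takes precedence after the first access. The Flask version subclasses property, which also defines __set__, so its __get__ runs on every access and does the __dict__ lookup itself.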

# Class-based timing decorator
class cal_time(object):
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        s = time.time()
        resp = self.func(*args, **kwargs)
        e = time.time()
        print("elapsed: {}".format(e - s))
        return resp

SQLAlchemy example: Adjacency List (singly linked list)

2018-04-20

The Adjacency List discussed here is the most common way to represent a tree in a relational database, the parent-column approach:

id	name	parent
1	一	null
2	二	1
3	三	2

The data above represents this structure:
一
|- 二
   |- 三
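To make the parent-column encoding concrete, the following plain-Python sketch rebuilds the tree from the three rows of the table. It has nothing to do with SQLAlchemy; `render_tree` is a hypothetical helper, and None stands in for the table's null parent.

```python
rows = [(1, "一", None), (2, "二", 1), (3, "三", 2)]  # (id, name, parent)


def render_tree(rows):
    names = {node_id: name for node_id, name, _ in rows}
    # Group child ids under their parent id.
    children = {}
    for node_id, _, parent in rows:
        children.setdefault(parent, []).append(node_id)

    lines = []

    def walk(node_id, depth):
        lines.append("  " * depth + names[node_id])
        for child in children.get(node_id, []):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return lines


tree = render_tree(rows)
print("\n".join(tree))
```

Each row knows only its parent, so finding children always means grouping or querying by the parent column, which is what the children relationship does later on.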
The model definition is straightforward:

# -*- coding: utf-8 -*-

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, ForeignKey
from sqlalchemy.types import Integer, Unicode
from sqlalchemy.orm import relationship, sessionmaker, joinedload

BaseModel = declarative_base()
Engine = create_engine('sqlite://', echo=True)
Session = sessionmaker(Engine)

class Node(BaseModel):
    __tablename__ = 'node'

    id = Column(Integer, autoincrement=True, primary_key=True)
    name = Column(Unicode(32), nullable=False, server_default='')
    parent = Column(Integer, ForeignKey('node.id'), index=True,
                    nullable=False, server_default='0')

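Here the parent column is not allowed to be null; 0 is used instead.
This example has one awkward spot in its relationships. The node table is self-referential, so when we want both a children and a parent_obj relationship:

children = relationship('Node')
parent_obj = relationship('Node')

Well, that is awkward.
With two tables, SQLAlchemy can tell the direction of the relationship from which table holds the foreign key: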

class Blog(BaseModel):
    ...
    user = Column(Integer, ForeignKey('user.id'))
    user_obj = relationship('User')

class User(BaseModel):
    ...
    blog_list = relationship('Blog')

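Because the foreign key is in Blog, Blog -> User (user_obj) is an N -> 1 relationship.
Conversely, User -> Blog (blog_list) is a 1 -> N relationship.
For the self-referential Node the direction cannot be determined directly, so SQLAlchemy treats it as 1 -> N. Then:

children = relationship('Node')
parent_obj = relationship('Node')

Of these two, children is correct and is what we want. To define parent_obj, the direction has to be stated explicitly through a relationship argument:

parent_obj = relationship('Node', remote_side=[id])

This defines an N -> 1 relationship "to id".
The complete model definition is now: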

class Node(BaseModel):
    __tablename__ = 'node'

    id = Column(Integer, autoincrement=True, primary_key=True)
    name = Column(Unicode(32), nullable=False, server_default='')
    parent = Column(Integer, ForeignKey('node.id'), index=True,
                    nullable=False, server_default='0')

    children = relationship('Node') # 1 -> N
    parent_obj = relationship('Node', remote_side=[id])

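There is nothing special about querying, except that I found the lazy option has no effect on self-referential relationships:

children = relationship('Node', lazy="joined")
parent_obj = relationship('Node', remote_side=[id], lazy="joined")

Neither works. It only takes effect when options() is applied by hand at query time:

n = session.query(Node).filter(Node.name==u'一')\
           .options(joinedload('parent_obj')).first()

To load several levels of children in one query: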

n = session.query(Node).filter(Node.name==u'一')\
           .options(joinedload('children').joinedload('children')).first()
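print(n.name, n.children, n.children[0].children)

Several chained joinedload() calls can be combined with joinedload_all() (deprecated in later SQLAlchemy releases in favor of chained joinedload() calls):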

from sqlalchemy.orm import joinedload_all

n = session.query(Node).filter(Node.name==u'一')\
           .options(joinedload_all('children', 'children')).first()

As for modification: with cascade configured, deleting a parent node automatically deletes its children:

children = relationship('Node', lazy='joined', cascade='all') # 1 -> N
node = session.query(Node).filter(Node.name == u'一').first()
session.delete(node)
session.commit()

To delete only the child nodes, the delete-orphan option is handy:

children = relationship('Node', lazy='joined', cascade='all, delete-orphan') # 1 -> N
node = session.query(Node).filter(Node.name == u'一').first()
node.children = []
session.commit()

About mock in Python

2018-04-17

Example

I can never remember how to use mock, and today was no different. This time, though, something unexpected came up. First, an example:

from mock import Mock, patch
import requests


class User(object):
    def __init__(self):
        """

        """


class Admin(User):
    table = "admin"

    def __init__(self):
        super(Admin, self).__init__()

    def list(self):
        return []

# Mock a class
@patch("__main__.Admin")
def do(cls):
    mock = Mock(table="Admin")
    mock.list.return_value = ["A", ]
    cls.return_value = mock # __init__

    a = Admin()
    return a.list()

# Mock a function
def send_request():
    return requests.request('GET', "http://www.baidu.com/")


@patch("__main__.send_request")
def do2(method):
    mock = Mock(status_code=200)
    mock.json.return_value = {}

    method.return_value = mock

    a = send_request()
    return a.status_code, a.json()


if __name__ == '__main__':
    l = do()
    status_code, json = do2()
    print(l)
    print(status_code, json)

where to patch

Nothing above looks wrong, but there is one issue worth discussing: where to patch.

In short:

a.py

    -> Defines SomeClass

b.py

    -> from a import SomeClass
    -> some_function instantiates SomeClass

As shown, a.py defines SomeClass, b.py imports the class and instantiates it somewhere. If we want to patch SomeClass, should we patch it in a.py or in b.py?
For example:


# b.py

from a import SomeClass

def func():
    s = SomeClass()
    s.do_something()


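# test.py

from unittest import TestCase
from mock import patch

class Test1(TestCase):
    # Is this one right?
    @patch("a.SomeClass")
    def do1(self, cls):
        pass  # todo

    # Or is this one right?
    @patch("b.SomeClass")
    def do2(self, cls):
        pass  # todo

Which of the two actually mocks the class?

Quoting the original documentation: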

Now we want to test some_function but we want to mock out SomeClass using patch. The problem is that when we import module b, which we will have to do then it imports SomeClass from module a. If we use patch to mock out a.SomeClass then it will have no effect on our test; module b already has a reference to the real SomeClass and it looks like our patching had no effect.

The key is to patch out SomeClass where it is used (or where it is looked up ). In this case some_function will actually look up SomeClass in module b, where we have imported it. The patching should look like:

@patch('b.SomeClass')

However, consider the alternative scenario where instead of from a import SomeClass module b does import a and some_function uses a.SomeClass. Both of these import forms are common. In this case the class we want to patch is being looked up on the a module and so we have to patch a.SomeClass instead:

@patch('a.SomeClass')

In short: patch the name where it is looked up. With "from a import SomeClass" in b.py, module b holds its own reference to the real class, so patch b.SomeClass. With "import a" and a call to a.SomeClass(), the lookup goes through module a, so patch a.SomeClass.
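The difference is easy to check with two throwaway in-memory modules. This is only a sketch: the module names a_demo and b_demo are made up, and the modules are built with types.ModuleType so the example stays in one file. It uses unittest.mock from the standard library, which provides the same patch as the mock package.

```python
import sys
import types
from unittest.mock import patch

A_SOURCE = """
class SomeClass(object):
    def do_something(self):
        return 'real'
"""

B_SOURCE = """
from a_demo import SomeClass

def func():
    return SomeClass().do_something()
"""

# Build the two hypothetical modules and register them so imports can find them.
a_demo = types.ModuleType("a_demo")
exec(A_SOURCE, a_demo.__dict__)
sys.modules["a_demo"] = a_demo

b_demo = types.ModuleType("b_demo")
exec(B_SOURCE, b_demo.__dict__)
sys.modules["b_demo"] = b_demo

# Patching the defining module: b_demo still holds the real class.
with patch("a_demo.SomeClass") as fake:
    fake.return_value.do_something.return_value = "mocked"
    via_a = b_demo.func()

# Patching where the name is looked up: func() now sees the mock.
with patch("b_demo.SomeClass") as fake:
    fake.return_value.do_something.return_value = "mocked"
    via_b = b_demo.func()

print(via_a, via_b)
```

Only the second patch changes what func() returns, because func resolves SomeClass in b_demo's namespace.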

Some common mocks

mock class staticmethod


class Payroll(object):
    @staticmethod
    def hello():
        pass

@patch("where.Payroll.hello")
def test_hello(self, mock_hello):
    mock_hello.return_value = lambda x: x # example

How to mock an exception

with patch(
    "addons.hcm.models.payroll.grpc_service.call"
) as mock_ar:
    mock_ar.side_effect = Exception("i wanted.")

mock class method


class UserResource(object):
    def _add(self):
        pass


@mock.patch.object(UserResource, "_add")
def test_mock_add(self, mock_add):
    mock_add.return_value = {} # anything u want.


bloomfilter

2018-04-16

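Concept

Create a bit array (bitmap) of m bits and initialize every bit to 0. Then choose k different hash functions. The result of the i-th hash function applied to a string str is written h(i, str), and h(i, str) must fall in the range 0 to m-1. To add a string, set the bits at all k positions to 1, as shown in the figure below.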

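How do we check whether a string exists?
The string is run through h(1, str), h(2, str), h(3, str), ... and each mapped position in the m-bit array is checked. If the bits are not all 1, the string is definitely absent. If they are all 1, we still cannot say it is definitely present, because other strings may have set those same bits, so there is an error rate. For more on why, see: BloomFilters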
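The add and lookup steps can be sketched in a few lines. This is an illustration under assumptions, not a production filter: the k hash functions are simulated by salting SHA-256 with the index i, and the class and method names are made up.

```python
import hashlib


class BloomFilter(object):
    def __init__(self, m, k):
        self.m = m  # number of bits
        self.k = k  # number of hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, s):
        # h(i, str): salt the input with i to get k different hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(("%d:%s" % (i, s)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, s):
        for pos in self._positions(s):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, s):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(s))


bf = BloomFilter(m=1024, k=3)
bf.add("http://example.com/1")
set_bits = sum(bin(b).count("1") for b in bf.bits)
print("http://example.com/1" in bf, set_bits)
```

An added string always tests as present. A string that was never added usually hits at least one 0 bit, but it can collide on all k positions, which is the false positive case.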

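About the false positive rate

Since membership cannot be asserted with certainty, how is the error rate calculated? There are three parameters: k, m, and n.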

Parameter	Meaning
k	number of hash functions
m	size of the bit array
n	number of strings

The figure below is a table of false positive rates for combinations of m/n and the number of hash functions k.

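As a simple example, take a bit array of size 2 and insert a single string, so m/n is 2. In that row of the table, 1.39 is the optimal k, and the false positive rates for k = 1 and k = 2 are 0.393 and 0.400. So what does it take to store 100 million strings?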

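A quick calculation: if the required false positive rate is 5.73e-06, then m/n = 32 (with k = 8). With n = 100 million, m is 3.2 billion bits, and 3.2 billion / 8 / 1024 / 1024 = 381.4697265625 MB of memory.
That is quite acceptable. For more on false positive rates, see BloomFilters.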
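These numbers can be reproduced with the standard approximation for the false positive rate, p ≈ (1 - e^(-k·n/m))^k, and the optimal k = ln 2 · m/n. The function names below are made up for this sketch.

```python
import math


def false_positive_rate(m_over_n, k):
    # Standard approximation: p = (1 - e^(-k / (m/n)))^k
    return (1.0 - math.exp(-k / m_over_n)) ** k


def optimal_k(m_over_n):
    # The k that minimises the approximation above.
    return math.log(2) * m_over_n


def megabytes(m_over_n, n):
    # m bits -> bytes -> KB -> MB
    return m_over_n * n / 8 / 1024 / 1024


print(optimal_k(2), false_positive_rate(2, 1), false_positive_rate(2, 2))
print(false_positive_rate(32, 8), megabytes(32, 10 ** 8))
```

The first line matches the m/n = 2 row (about 1.39, 0.393, 0.400), and the second matches the 5.73e-06 and 381.47 MB figures for 100 million strings.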

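There is an online calculator for this.

Follow-up

Choosing a package to use
Final results

SQLAlchemy's hybrid attribute mechanism

2018-04-14

Direct behavior

Hybrid attributes (called Hybrid Attributes in the official documentation) are attributes whose behavior differs between the class level and the instance level. The difference matters because of the gap between the Python context and the SQL context.
At the class level, an attribute is usually part of a SQL query, so it targets the SQL context. An instance is a result that has already been fetched or created, so it targets the Python context.
The Column() used to define a model is a typical hybrid attribute. As an instance attribute it gives access to a concrete value; as a class attribute it builds SQL expressions.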
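The class-versus-instance split is ordinary descriptor behavior. The toy below is not SQLAlchemy code and not how SQLAlchemy is implemented; `Expr`, `Col`, and `hybrid` are hypothetical stand-ins meant only to show one attribute returning a value on an instance and an expression on the class.

```python
class Expr(object):
    # A tiny stand-in for a SQL expression (hypothetical).
    def __init__(self, text):
        self.text = text

    def __sub__(self, other):
        return Expr("(%s - %s)" % (self.text, other.text))

    def __gt__(self, other):
        return Expr("%s > %r" % (self.text, other))


class Col(object):
    def __init__(self, name):
        self.name = name

    def __get__(self, instance, owner):
        if instance is None:
            return Expr(self.name)           # class access: build an expression
        return instance.__dict__[self.name]  # instance access: plain value

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value


class hybrid(object):
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        # Run the same function body against the class or the instance.
        return self.func(owner if instance is None else instance)


class Interval(object):
    start = Col("start")
    end = Col("end")

    def __init__(self, start, end):
        self.start = start
        self.end = end

    @hybrid
    def length(self):
        return self.end - self.start


instance_value = Interval(0, 100).length
class_expr = (Interval.length > 10).text
print(instance_value, class_expr)
```

The same function body, self.end - self.start, yields the number 100 on an instance and the text of a comparison on the class.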

class Interval(BaseModel):
    __tablename__ = 'interval'

    id = Column(Integer, autoincrement=True, primary_key=True)
    start = Column(Integer)
    end = Column(Integer)

session.add(Interval(start=0, end=100))
session.commit()

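Instance behavior:

ins = session.query(Interval).first()
print(ins.end - ins.start)

Class behavior:

ins = session.query(Interval).filter(Interval.end - Interval.start > 10).first()

This mechanism is in use all the time, but people may not have noticed the difference between an attribute on the class and on the instance.
If an attribute needs further wrapping, Hybrid Attributes have to be declared explicitly: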

from sqlalchemy.ext.hybrid import hybrid_property, hybrid_method

class Interval(BaseModel):
    __tablename__ = 'interval'

    id = Column(Integer, autoincrement=True, primary_key=True)
    start = Column(Integer)
    end = Column(Integer)

    @hybrid_property
    def length(self):
        return self.end - self.start

    @hybrid_method
    def bigger(self, i):
        return self.length > i


session.add(Interval(start=0, end=100))
session.commit()

ins = session.query(Interval).filter(Interval.length > 10).first()
ins = session.query(Interval).filter(Interval.bigger(10)).first()
print(ins.bigger(1))

A setter is defined with the corresponding decorator:

class Interval(BaseModel):
    __tablename__ = 'interval'

    id = Column(Integer, autoincrement=True, primary_key=True)
    start = Column(Integer)
    end = Column(Integer)

    @hybrid_property
    def length(self):
        return abs(self.end - self.start)

    @length.setter
    def length(self, l):
        self.end = self.start + l

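Expression behavior

As noted above, attributes behave differently on the class and on the instance, and the class-level behavior is really the behavior when generating a SQL expression. The earlier examples are simple arithmetic, where SQLAlchemy handles the difference between Python and SQL automatically. For more specialized SQL functions it has to be specified by hand. The instance behavior is then an ordinary Python call, while the class behavior generates the expression for the SQL function.
Taking the earlier example again, the definition of length should, strictly speaking, take the absolute value.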

class Interval(BaseModel):
    __tablename__ = 'interval'

    id = Column(Integer, autoincrement=True, primary_key=True)
    start = Column(Integer)
    end = Column(Integer)

    @hybrid_property
    def length(self):
        return abs(self.end - self.start)

However, with Python's abs() function, the SQL expression apparently cannot be generated. So it has to be defined by hand:

from sqlalchemy import func

class Interval(BaseModel):
    __tablename__ = 'interval'

    id = Column(Integer, autoincrement=True, primary_key=True)
    start = Column(Integer)
    end = Column(Integer)

    @hybrid_property
    def length(self):
        return abs(self.end - self.start)

    @length.expression
    def length(self):
        return func.abs(self.end - self.start)

Then the query can use it directly:

ins = session.query(Interval).filter(Interval.length > 1).first()

The corresponding SQL:

SELECT *
FROM interval
WHERE abs(interval."end" - interval.start) > ?
 LIMIT ? OFFSET ?

Applied to relationships

Nothing special overall:

class Account(BaseModel):
    __tablename__ = 'account'

    id = Column(Integer, autoincrement=True, primary_key=True)
    user = Column(Integer, ForeignKey('user.id'), index=True)
    balance = Column(Integer, server_default='0')


class User(BaseModel):
    __tablename__ = 'user'

    id = Column(Integer, autoincrement=True, primary_key=True)
    name = Column(Unicode(32), nullable=False, server_default='')

    accounts = relationship('Account')
    #balance = association_proxy('accounts', 'balance')

    @hybrid_property
    def balance(self):
        return sum(x.balance for x in self.accounts)

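At query time:

user = session.query(User).first()
print(user.balance)

Everything involved here is plain Python, including the sum() function; it has nothing to do with SQL.
If the goal is to use SQL's sum() function to fetch the total balance of a given user, balance has to be written as an expression: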

from sqlalchemy import select

@hybrid_property
def balance(self):
    return select([func.sum(Account.balance)]).where(Account.user == self.id).label('balance_v')
    #return func.sum(Account.balance)

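In this form, User.balance is purely an expression. Name it as a column when querying:

user = session.query(User, User.balance).first()
print(user.balance)

Note that if it is written as:

session.query(User.balance).first()

the meaning is no longer "get the total balance of the first user" but "get the first of the total balances". Quite a trap.
With the change above, the balance attribute can no longer be used at the instance level. So, as introduced earlier, the expression can be handled separately: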

@hybrid_property
def balance(self):
    return sum(x.balance for x in self.accounts)

@balance.expression
def balance(self):
    return select([func.sum(Account.balance)]).where(Account.user == self.id).label('balance_v')

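Once balance has an expression defined, it can of course also be used in query conditions:

user = session.query(User).filter(User.balance > 1).first()

Python multithreading tutorial: concurrency and parallelism

2018-04-11

Discussions criticizing Python often talk about how difficult it is to use Python for multithreaded work, pointing fingers at what is known as the global interpreter lock (GIL), which prevents multiple threads of Python code from running simultaneously. Because of this, the threading module does not quite behave the way you would expect if you are not a Python developer and are coming from other languages such as C++ or Java. It must be made clear, though, that you can still write Python code that runs concurrently or in parallel and performs well, as long as certain things are taken into consideration. If you have not read it yet, I suggest taking a look at Eqbal Quran's article on concurrency and parallelism in Ruby on the Toptal blog.

In this Python concurrency tutorial, we will write a small Python script to download the top popular images from Imgur. We will start with a version that downloads images sequentially, one at a time. As a prerequisite, you will have to register an application on Imgur. If you do not have an Imgur account already, please create one first.

The scripts in this tutorial have been tested with Python 3.4.2. With small changes they should also run on Python 2; urllib is what changed the most between the two versions.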

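Getting started with Python multithreading

Let us start by creating a Python module named "download.py". This file will contain all the functions needed to fetch the list of images and download them. We will split this functionality into three separate functions: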

get_links
download_link
setup_download_dir

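The third function, "setup_download_dir", creates the download directory if it does not already exist.

Imgur's API requires HTTP requests to carry an "Authorization" header with the client ID. You can find the client ID on the dashboard of the application you registered on Imgur. The response is JSON encoded, and we can use Python's standard JSON library to decode it. Downloading an image is even simpler: fetch the image by its URL and write it to a file.

The script looks like this: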

import json
import logging
import os
from pathlib import Path
from urllib.request import urlopen, Request

logger = logging.getLogger(__name__)

def get_links(client_id):
   headers = {'Authorization': 'Client-ID {}'.format(client_id)}
   req = Request('https://api.imgur.com/3/gallery/', headers=headers, method='GET')
   with urlopen(req) as resp:
       data = json.loads(resp.read().decode('utf-8'))
   return map(lambda item: item['link'], data['data'])

def download_link(directory, link):
   logger.info('Downloading %s', link)
   download_path = directory / os.path.basename(link)
   with urlopen(link) as image, download_path.open('wb') as f:
       f.write(image.read())

def setup_download_dir():
   download_dir = Path('images')
   if not download_dir.exists():
       download_dir.mkdir()
   return download_dir

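Next, we will write a module that uses these functions to download the images one by one. We will name it "single.py". It contains the main function of our first, naive version of the Imgur image downloader. The module reads the Imgur client ID from the environment variable "IMGUR_CLIENT_ID". It calls "setup_download_dir" to create the download destination directory. Finally, it fetches a list of images with the "get_links" function, filters out all GIF and album URLs, and then uses "download_link" to download and save each image to disk. Here is what "single.py" looks like: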

import logging
import os
from time import time

from download import setup_download_dir, get_links, download_link

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logging.getLogger('requests').setLevel(logging.CRITICAL)
logger = logging.getLogger(__name__)

def main():
   ts = time()
   client_id = os.getenv('IMGUR_CLIENT_ID')
   if not client_id:
       raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
   download_dir = setup_download_dir()
   links = [l for l in get_links(client_id) if l.endswith('.jpg')]
   for link in links:
       download_link(download_dir, link)
   print('Took {}s'.format(time() - ts))

if __name__ == '__main__':
   main()

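On my laptop, this script took 19.4 seconds to download 91 images. Note that these numbers will vary with your network. But what if we want to download more images, say 900 instead of 90? At an average of 0.2 seconds per image, 900 images would take about 3 minutes, and 9000 images would take 30 minutes. The good news is that by introducing concurrency or parallelism, we can speed this up dramatically.

All subsequent code examples show only the import statements that are new and specific to those examples. For convenience, all of these Python scripts can be found in this GitHub repository.

Using threads for concurrency and parallelism

Threading is one of the best-known approaches to concurrency and parallelism in Python. Threading is a feature usually provided by the operating system. Threads are lighter than processes and share the same memory space.

In this Python threading example, we will write a new module to replace "single.py". It creates a pool of eight threads, making nine threads in total including the main thread. I chose eight worker threads because my computer has eight CPU cores, and one worker thread per core seemed a good number for how many threads to run at once. In practice, this number should be chosen much more carefully, based on other factors such as the other applications and services running on the same machine.

This is almost the same as the previous script, except that we now have a new class, DownloadWorker, a subclass of Thread whose run method is overridden. It runs an infinite loop. On every iteration it calls "self.queue.get()" to try to fetch a URL from a thread-safe queue, blocking until there is an item in the queue for the worker to process. Once the worker receives an item, it calls the same "download_link" function used in the previous script to download the image into the images directory. After the download finishes, the worker signals the queue that the task is done. This is very important, because the queue keeps track of how many tasks were enqueued. The call to "queue.join()" would block the main thread forever if the workers did not signal that they had completed a task.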

from queue import Queue
from threading import Thread

class DownloadWorker(Thread):
   def __init__(self, queue):
       Thread.__init__(self)
       self.queue = queue

   def run(self):
       while True:
           # Get the work from the queue and expand the tuple
           directory, link = self.queue.get()
           download_link(directory, link)
           self.queue.task_done()

def main():
   ts = time()
   client_id = os.getenv('IMGUR_CLIENT_ID')
   if not client_id:
       raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
   download_dir = setup_download_dir()
   links = [l for l in get_links(client_id) if l.endswith('.jpg')]
   # Create a queue to communicate with the worker threads
   queue = Queue()
   # Create 8 worker threads
   for x in range(8):
       worker = DownloadWorker(queue)
       # Setting daemon to True will let the main thread exit even though the workers are blocking
       worker.daemon = True
       worker.start()
   # Put the tasks into the queue as a tuple
   for link in links:
       logger.info('Queueing {}'.format(link))
       queue.put((download_dir, link))
   # Causes the main thread to wait for the queue to finish processing all the tasks
   queue.join()
   print('Took {}'.format(time() - ts))

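Running this script on the same machine gives a much faster result: just 4.1 seconds, 4.7 times faster than before. While this is much faster, it is worth mentioning that because of the GIL only one thread was executing at any given time. The code is therefore concurrent but not parallel. It is still faster because this is an IO-bound task: the processor hardly breaks a sweat while downloading these images, and most of the time is spent waiting on the network. This is why threading can provide a large speed increase. The processor can switch between threads whenever one of them is ready to do some work. Using the threading module in Python, or in any other interpreted language with a GIL, can actually reduce performance if your code is CPU bound, such as decompressing files; threads will make it slower. For CPU-bound tasks and truly parallel execution, we can use the multiprocessing module.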
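The IO-bound effect can be tried without an Imgur account. The sketch below swaps in concurrent.futures.ThreadPoolExecutor in place of the Queue and Thread classes used above, and `fake_download` is a hypothetical stand-in that only sleeps, which releases the GIL much like waiting on the network does.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(link):
    # Stand-in for a network download: sleeping releases the GIL.
    time.sleep(0.05)
    return link.upper()


links = ["img%d.jpg" % i for i in range(16)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map returns results in input order, even though the calls overlap.
    results = list(pool.map(fake_download, links))
elapsed = time.perf_counter() - start

print(len(results), "downloads in", round(elapsed, 2), "seconds")
```

Run sequentially, sixteen 0.05 second waits add up to about 0.8 seconds; with eight workers the waits overlap, so the total should be much lower, although the exact figure depends on the machine.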

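While the reference Python implementation, CPython, has a GIL, this is not true of all Python implementations. For example, IronPython, a Python implementation using the .NET framework, does not have a GIL, and neither does Jython, the Java-based implementation. You can find a list of Python implementations here.

Using multiple processes

The multiprocessing module is easier to drop in than the threading module, because we do not need to add a class as in the threading example. The only changes are in the main function.

To use multiple processes, we create a multiprocessing Pool. With the map method it provides, we pass the list of URLs to the pool, which in turn spawns eight new processes and uses each one to download the images in parallel. This is true parallelism, but it comes with a cost: the entire memory of the script is copied into each subprocess that is spawned. In this simple example that is not a big deal, but it can easily become serious overhead.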

from functools import partial
from multiprocessing.pool import Pool

def main():
   ts = time()
   client_id = os.getenv('IMGUR_CLIENT_ID')
   if not client_id:
       raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
   download_dir = setup_download_dir()
   links = [l for l in get_links(client_id) if l.endswith('.jpg')]
   download = partial(download_link, download_dir)
   with Pool(8) as p:
       p.map(download, links)
   print('Took {}s'.format(time() - ts))

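Distributing to multiple workers

While the threading and multiprocessing scripts are great for running on your personal computer, what should you do if you want the work done on a different machine, or need to scale beyond what the CPU of one machine can handle? A great use case is long-running back-end tasks for web applications. If you have long-running tasks, you do not want to spin up a bunch of sub-processes or threads on the same machine that has to run the rest of your application code. That would degrade the performance of the application for all of your users. It is better to run these jobs on another machine, or on many other machines.

A great Python library for this task is RQ, a very simple yet powerful library. You first enqueue a function and its arguments. This pickles the function call representation, which is then appended to a Redis list. Enqueueing the job is the first step, but it does not do anything yet; we also need at least one worker listening on that job queue.

The first step is to install and run a Redis server on your computer, or to have access to a running Redis server. After that, only a few small changes to the existing code are needed. We first create an RQ Queue instance and pass it a Redis server connection from the redis-py library. Then, instead of calling "download_link" directly, we call "q.enqueue(download_link, download_dir, link)". The enqueue method takes a function as its first argument; any other arguments or keyword arguments are passed along to that function when the job is actually executed.

The last step is to start some workers. RQ provides a handy script to run workers on the default queue. Just run "rqworker" in a terminal window and it will start a worker listening on the default queue. Make sure your current working directory is the one where the scripts reside. If you want to listen on a different queue, run "rqworker queue_name" and it will listen on the named queue. The great thing about RQ is that as long as you can connect to Redis, you can run as many workers as you like on as many machines as you like, so it is very easy to scale. Here is the RQ version: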

from redis import Redis
from rq import Queue

def main():
   client_id = os.getenv('IMGUR_CLIENT_ID')
   if not client_id:
       raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
   download_dir = setup_download_dir()
   links = [l for l in get_links(client_id) if l.endswith('.jpg')]
   q = Queue(connection=Redis(host='localhost', port=6379))
   for link in links:
       q.enqueue(download_link, download_dir, link)

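RQ is not the only Python job queue solution, however. RQ is easy to use and covers simple use cases very well; if more advanced features are required, other job queues such as Celery can be used.

Conclusion

If your code is IO bound, both multiprocessing and multithreading will work for you. Multiprocessing is easier to drop in than threading but has a higher memory overhead. If your code is CPU bound, multiprocessing is most likely the better choice, especially if the target machine has multiple cores or CPUs. For web applications, and when you need to scale the work across multiple machines, RQ is the better choice.

This article is translated from: beginners-guide-to-concurrency-and-parallelism-in-python.

Average time difference between multiple operations per user

2018-04-10

For example, given a table: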

id	uid	login_time
1	1	2018-04-10 18:19:14
2	1	2019-04-10 18:19:14
3	1	2020-04-10 18:19:14
4	2	2019-04-10 18:19:14

Then the final result looks like:

uid	average
1	(2020-04-10 18:19:14 - 2018-04-10 18:19:14) / 2



select uid, timestampdiff(day, min(login_time), max(login_time)) / (count(*) - 1) from table_name group by uid having count(*) > 1;
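Why (max - min) / (count - 1) is enough: the gaps between consecutive logins telescope, so their sum is just the last login minus the first, and there are count - 1 gaps. The sketch below checks this on the sample rows with the standard library's sqlite3. SQLite has no timestampdiff, so julianday is swapped in, and the table name login_log is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table login_log (id integer primary key, uid integer, login_time text)"
)
conn.executemany(
    "insert into login_log (uid, login_time) values (?, ?)",
    [
        (1, "2018-04-10 18:19:14"),
        (1, "2019-04-10 18:19:14"),
        (1, "2020-04-10 18:19:14"),
        (2, "2019-04-10 18:19:14"),
    ],
)

# julianday() returns days as a float, so the division is not truncated.
rows = conn.execute(
    "select uid, "
    "(julianday(max(login_time)) - julianday(min(login_time))) / (count(*) - 1) "
    "from login_log group by uid having count(*) > 1"
).fetchall()

print(rows)
conn.close()
```

User 1 has gaps of 365 and 366 days, so the average is 365.5 days; user 2 has a single login and is filtered out by the having clause.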

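Time to buy a book on advanced queries -_-!

SQLAlchemy field types

2018-04-10

Basic types

A field type is the type declared for each Column when defining a model. Different field types differ in their input and output and in the operations they support.
Only the types in sqlalchemy.types.* are covered here. For SQL standard types, whatever you write is what ends up in the generated DDL, for example BIGINT or BLOB, but these types are not necessarily supported by every database. Beyond that, SQLAlchemy also supports database-specific types, which have to be imported from the relevant dialects implementation.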

Integer/BigInteger/SmallInteger

Integer types.

Boolean

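Boolean type. Represented as True/False in Python; in the database it is BOOLEAN or SMALLINT depending on what the backend supports. On instantiation you can specify whether to create a constraint (created by default in SQLAlchemy releases before 1.4).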

Date/DateTime/Time (timezone=False)

Date and time types. Time and DateTime can be told on instantiation whether to carry timezone information.

Interval

Time interval type. Represented as datetime.timedelta() in Python; if the database does not support the type natively, it is stored as a date.

Enum (*enums, **kw)

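Enumeration type. Depending on database support, SQLAlchemy uses the native type or a VARCHAR with an added constraint. Native support involves creating a new type; the details are controlled on instantiation.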

Float

Floating-point numbers.

Numeric (precision=None, scale=None, decimal_return_scale=None, …)

Fixed-point decimals; represented as Decimal in Python.

LargeBinary (length=None)

Binary data. Depending on the database implementation, a size may need to be specified on instantiation.

PickleType

Serialized (pickled) Python objects.

String (length=None, collation=None, …)

String type. Represented as Unicode in Python and VARCHAR in the database; a length usually has to be specified.

Unicode

Similar to String. On some database backends it explicitly signals support for non-ASCII characters. Input and output are also forced to be Unicode.

Text

Long text type. Represented as Unicode in Python and TEXT in the database.

UnicodeText

See Unicode.
