
Utilities for sorting with limited memory.

Project description

Experimental in-memory Pythonic MapReduce inspired by Spotify's luigi framework.

Build status: https://travis-ci.org/geowurster/tinymr.svg?branch=master
Coverage: https://coveralls.io/repos/geowurster/tinymr/badge.svg?branch=master

Word count example

Currently the only MapReduce implementation is in-memory and serial, but implementations that parallelize the map and reduce phases will be added.

from collections import Counter
import json
import re

from tinymr.memory import MemMapReduce


class WordCount(MemMapReduce):

    """
    The go-to MapReduce example.

    Don't worry, a better example is on its way:
    https://github.com/geowurster/tinymr/issues/11
    """

    # Counting word occurrence does not benefit from sorting post-map or
    # post-reduce and our `final_reducer()` doesn't care about key order
    # so we can disable sorting for a speed boost.
    sort_map = False
    sort_reduce = False
    sort_final_reduce = False

    def __init__(self):

        """
        Stash a regex to strip off punctuation so we can grab it later.
        """

        self.pattern = re.compile(r'[\W_]+')

    def mapper(self, line):

        """
        Take a line of text from the input file and figure out how many
        times each word appears.

        An alternative, simpler implementation would be:

            def mapper(self, item):
                for word in item.split():
                    word = self.pattern.sub('', word)
                    if word:
                        yield word.lower(), 1

        This simpler example is more straightforward, but holds more data
        in memory.  The implementation below does more work per line but
        potentially has a smaller memory footprint.  Like anything
        MapReduce, the implementation benefits greatly from knowing a lot
        about the input data.
        """

        # Normalize all words to lowercase
        line = line.lower().strip()

        # Split into words and strip off punctuation
        line = [self.pattern.sub('', word) for word in line.split()]

        # Filter out empty strings
        line = filter(lambda x: x != '', line)

        # Return an iterable of `(word, count)` pairs
        return Counter(line).items()

    def reducer(self, key, values):

        """
        Just sum the number of times each word appears in the entire
        dataset.

        At this phase `key` is a word and `values` is an iterable producing
        all of the values for that word from the map phase.  Something like:

            key = 'the'
            values = (1, 1, 2, 2, 1)

        The word `the` appeared once on 3 lines and twice on two lines for
        a total of `7` occurrences, so we yield:

            ('the', 7)
        """

        yield key, sum(values)

    def output(self, pairs):

        """
        Normally this phase is where the final dataset is written to disk,
        but since we're operating in-memory we just want to re-structure as
        a dictionary.

        `pairs` is an iterator producing `(key, iter(values))` tuples from
        the reduce phase, and since we know that we only produced one key
        from each reduce we want to extract it for easier access later.
        """

        return {k: tuple(v)[0] for k, v in pairs}


wc = WordCount()
with open('LICENSE.txt') as f:
    out = wc(f)
    print(json.dumps(out, indent=4, sort_keys=True))

Truncated output:

{
    "a": 1,
    "above": 2,
    "advised": 1,
    "all": 1,
    "and": 8,
    "andor": 1
}

Word count workflow

Internally, the workflow looks like this:

Input data:

$ head -10 LICENSE.txt

New BSD License

Copyright (c) 2015, Kevin D. Wurster
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

Map

Count the occurrence of each word on each line.

# Input line
line = 'Copyright (c) 2015, Kevin D. Wurster'

# Lowercased and sanitized words
words = ['copyright', 'c', '2015', 'kevin', 'd', 'wurster']

# Return tuples with word as the first element and count as the second
pairs = [('copyright', 1), ('c', 1), ('2015', 1), ('kevin', 1), ('d', 1), ('wurster', 1)]

Partition

All of the (word, count) pairs are organized by word. The word key is preserved at this point in case the data is sorted: sorting grabs the second-to-last element of each tuple, so the data can be partitioned on one key and sorted on another by producing (word, sort, count) tuples. Since the second-to-last element is what gets sorted on, the keys below match the word only because an optional sort key was not given.

Words that appear on multiple lines of the input text get multiple (word, count) pairs. A count of 2 would indicate a word that appeared twice on a single line, but our input data doesn't contain that condition. Truncated output below; the dictionary values are lists of tuples to allow for a sort key, which is explained elsewhere. A small sketch of this grouping follows the output.

{
    '2015': [(1,)],
    'above': [(1,)],
    'all': [(1,)],
    'and': [(1,), (1,), (1,)],
    'are': [(1,), (1,)],
    'binary': [(1,)],
    'bsd': [(1,)],
    'c': [(1,)],
    'code': [(1,)],
}
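
For illustration, here is a minimal sketch of the kind of grouping the partition step performs. This is not tinymr's actual internals; partition() is a hypothetical helper that groups tuples by their first element and keeps the rest, which is why the values above are lists of tuples:

from collections import defaultdict

def partition(pairs):
    # Hypothetical helper: group tuples by their first element, keeping
    # the remaining elements so an optional sort key survives
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair[0]].append(pair[1:])
    return dict(grouped)

# Without a sort key the stored tuples contain only the count
partition([('and', 1), ('and', 1), ('code', 1)])
# {'and': [(1,), (1,)], 'code': [(1,)]}

# With (word, sort, count) tuples the second-to-last element is
# available for sorting within each partition
partition([('and', 2, 1), ('and', 1, 1)])
# {'and': [(2, 1), (1, 1)]}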

Reduce

Sum the total number of occurrences of each word.

# The ``reducer()`` receives a key and an iterator of values
key = 'the'
values = (1, 1, 1)
def reducer(key, values):
    yield key, sum(values)

Partition

The reducer does not necessarily produce the same key it received, so the data is partitioned by key again, which is superfluous for the word count example. The keys are again preserved in case the data is sorted, and they match the word only because an optional sort key was not given. Truncated output below.

{
    '2015': [(1,)],
    'above': [(1,)],
    'all': [(1,)],
    'and': [(3,)],
    'are': [(2,)],
    'binary': [(1,)],
    'bsd': [(1,)],
    'c': [(1,)],
    'code': [(1,)],
}
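
To see why this second partition exists at all, consider a hypothetical reducer that emits a different key than the one it received; the reduce output must then be regrouped before the final reduce:

# Hypothetical reducer that re-keys words by their first letter
# instead of re-emitting the word itself
def reducer(key, values):
    yield key[0], sum(values)

# The reduced values for 'code' and 'copyright' both come out keyed
# 'c', so the post-reduce partition merges them:
# {'c': [(1,), (1,)], ...}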

Output

The default implementation is for final_reducer() to return (key, iter(values)) pairs, which would look like:

values = [
    ('the', (3,)),
    ('in', (1,)),
]

but a dictionary is more useful. We know that the reduce phase produced only a single value for each word, so we can extract that value and produce a dictionary:

return {k: tuple(v)[0] for k, v in values}

The tuple() call is included because the values are always iterable objects but may be iterators. Calling tuple() expands the iterable, which lets us grab the first element.
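
A quick illustration, assuming the values arrive as a plain iterator:

values = iter([7])   # iterators do not support indexing
# values[0]          # TypeError: 'list_iterator' object is not subscriptable
tuple(values)[0]     # expanding first gives us 7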

Developing

$ git clone https://github.com/geowurster/tinymr.git
$ cd tinymr
$ pip install -e .\[dev\]
$ py.test tests --cov tinymr --cov-report term-missing

License

LICENSE.txt

Changelog

CHANGES.md
