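A UNIX command-line tool for line-based stream processing in Python
Project description
Author: Pahaz White
Repository: https://github.com/pahaz/py3line/
Py3line is a UNIX command-line tool for bash one-liner scripts. It is a Python line-oriented alternative to grep, sed, and awk.
This project is inspired by: pyfil, piep, pysed, pyline, pyp, and the Jacob+Mark recipe
Why did I make it?
Sometimes I have to use sed / awk / grep, usually for simple text processing: finding a pattern in a text file with a Python regular expression, or commenting/uncommenting a few config lines with a bash one-liner.
I always forget the necessary options and the sed / awk DSL. But I like Python, and I want to use it for these simple bash tasks. The default python -c is not enough to write readable bash one-liners.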
- Why not pyline?
  - It does not support Python 3
  - It has too many options
  - It does not support command chaining
 
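Principles
Bash one-liners that are as simple and readable as possible
As few script arguments as possible
As easy to install as possible (container friendly?)
A code base that is as small as possible (less than 500 LOC)
As lazy and efficient as possible
Installation
py3line is on PyPI, so simply run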
pip install py3line
or
sudo curl -L "https://61-63976011-gh.circle-artifacts.com/0/py3line-$(uname -s)-$(uname -m)" -o /usr/local/bin/py3line
sudo chmod +x /usr/local/bin/py3line
to have it installed in your environment.
To install from source, clone the repository and run
python setup.py install
Tutorial
Let's start with an example: we want to count the number of words in each line
$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "x = len(line.split(' ')); print(x, line)"
2 Here are
1 some
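3 words for you.
Py3line processes the input stream line by line.
echo -e "Here are\nsome\nwords for you." - creates an input stream consisting of three lines
| - pipes the input stream to py3line
"x = len(line.split(' ')); print(x, line)" - defines 2 actions: "x = len(line.split(' '))" counts the words in each line, then "print(x, line)" prints the result. The actions are applied to the input stream one after another.
The example above can be represented as the following Python code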
import sys
def process(stream):
    for line in stream:
        x = len(line.split(' '))  # action 1
        print(x, line)            # action 2
        yield line
stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = process(stream)
for line in stream: pass
You can also get the executed Python code with the --pycode argument.
$ ./py3line.py "x = len(line.split(' ')); print(x, line)" --pycode  #skipbashtest
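...
Stream transform
Let's try a more complex example: we want to count the number of words in the whole file. This value is easy to compute if the input stream is converted from a stream of lines into a stream of per-line word counts. Just override the line variable
$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "line = len(line.split()); print(sum(stream))"
6
Here we have a stream transform action: "print(sum(stream))".
The example above can be represented as the following Python code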
import sys
def process(stream):
    for line in stream:
        line = len(line.split())  # action 1
        yield line
def transform(stream):
    print(sum(stream))            # action 2
    return stream
stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = transform(process(stream))
for line in stream: pass
You can also get the executed Python code with the --pycode argument.
$ ./py3line.py "line = len(line.split()); print(sum(stream))" --pycode  #skipbashtest
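...
Lazy as possible
Py3line uses Python generators to compute values only when they are needed. This means the input stream is never loaded into memory in full, so you can easily process more data than fits in RAM.
But this also limits what you can do with the data stream. You cannot apply several aggregate functions to the same stream. For example, suppose we want to compute both the maximum number of words in a line and the total number of words in the whole file.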
$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "line = len(line.split()); print(sum(stream)); print(max(stream))"  #skipbashtest
6
2019-05-05 14:55:09,353 | ERROR   | Traceback (most recent call last):
  File "<string>", line 15, in <module>
    stream = transform2(process1(stream))
  File "<string>", line 10, in transform2
    print(max(stream))
ValueError: max() arg is an empty sequence
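We can see the empty sequence error. It is raised because sum(stream) has already exhausted the stream generator, and there is no maximum to find in an empty stream.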
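The same behaviour can be reproduced in plain Python without py3line. The snippet below is only an illustrative sketch: the names lines, word_counts, total and longest are made up for this example and do not come from the py3line source.

```python
# Sketch: a generator can be consumed only once.
lines = ["Here are", "some", "words for you."]
word_counts = (len(line.split()) for line in lines)

total = sum(word_counts)          # walks the generator to its end
try:
    longest = max(word_counts)    # nothing is left to iterate over
except ValueError:
    longest = None                # max() rejects an empty iterable

print(total, longest)
```

The first aggregate gets every value, and the second one sees an empty stream, which is exactly the situation in the traceback above.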
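Storing the stream
We can solve this problem by converting the stream generator into an in-memory list of values with Python's list(stream) function.
$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "line = len(line.split()); stream = list(stream); print(sum(stream), max(stream))"
6 3
The example above can be represented as the following Python code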
import sys
def process(stream):
    for line in stream:
        line = len(line.split())     # action 1
        yield line
def transform(stream):
    stream = list(stream)            # action 2
    print(sum(stream), max(stream))  # action 3
    return stream
stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = transform(process(stream))
for line in stream: pass
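Evaluating on the fly
We can also solve it without loading the stream into memory. Just use auxiliary variables that accumulate the results while the stream is being processed.
$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "s = 0; m = 0; num_of_words = len(line.split()); s += num_of_words; m = max(m, num_of_words); print(s, m)"
2 2
3 2
6 3
The example above can be represented as the following Python code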
import sys
def process(stream):
    s = 0                                 # action 1
    m = 0                                 # action 2
    for line in stream:
        num_of_words = len(line.split())  # action 3
        s += num_of_words                 # action 4
        m = max(m, num_of_words)          # action 5
        print(s, m)                       # action 6
        yield line
stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = process(stream)
for line in stream: pass
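But we only want to see the final result, not the intermediate ones. To get that, iterate over all elements of the stream with for line in stream: pass before printing. Don't worry, this loop adds no extra computation, because the stream is a Python generator. The loop simply forces the stream to be consumed before the print function is called.
$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "s = 0; m = 0; num_of_words = len(line.split()); s += num_of_words; m = max(m, num_of_words); for line in stream: pass; print(s, m)"
6 3
The example above can be represented as the following Python code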
import sys
def process(stream):
    global s, m
    s = 0                                 # action 1
    m = 0                                 # action 2
    for line in stream:
        num_of_words = len(line.split())  # action 3
        s += num_of_words                 # action 4
        m = max(m, num_of_words)          # action 5
        yield line
def transform(stream):
    global s, m
    for line in stream: pass              # action 6
    print(s, m)                           # action 7
    return stream
stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = transform(process(stream))
for line in stream: pass
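To make the order of evaluation easier to see, here is a small standalone sketch of the same idea. It is not the code py3line generates: count_words and the totals dictionary are hypothetical names chosen for this illustration, and collections.deque(..., maxlen=0) is used as one common idiom for draining an iterator in place of the explicit for loop.

```python
from collections import deque

def count_words(lines, totals):
    """Yield lines unchanged while accumulating word statistics."""
    for line in lines:
        n = len(line.split())
        totals["sum"] += n
        totals["max"] = max(totals["max"], n)
        yield line

totals = {"sum": 0, "max": 0}
stream = count_words(["Here are", "some", "words for you."], totals)

before = dict(totals)      # the generator has not run yet, so nothing is counted
deque(stream, maxlen=0)    # drain the generator and discard its items
print(before, totals)
```

The totals only become meaningful once the stream has been drained, which is why the draining step has to come before the final print.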
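Python generator laziness
Let's check the laziness of Python generators. Just run for line in stream: print(1); twice in a row
$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "for line in stream: print(1); for line in stream: print(1)"
1
1
1
As we can see, the items of a Python generator can be iterated only once. Any subsequent iteration works with an exhausted generator, which is equivalent to iterating over an empty list.
The example above can be represented as the following Python code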
import sys
def transform(stream):
    for line in stream: print(1)          # action 1 (3 iterations)
    for line in stream: print(1)          # action 2 (0 iterations)
    return stream
stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = transform(stream)
for line in stream: pass                  # (0 iterations)
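Processing part of a stream
TODO …
Details
Let's define some terminology. py3line "action1; action2; action3"
We have actions: action1, action2, and action3. Each action has a type: either element processing or stream transformation.
We can tell the type of an action from the variables it uses. There are two variables: line and stream. They are the markers that determine the action type.
Let's look at the types of some actions from the examples above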
x = line.split() -- element processing
print(x, line) -- element processing
print(sum(stream)) -- stream transformation
stream = list(stream) -- stream transformation
print(sum(stream), max(stream)) -- stream transformation
s = 0 -- unidentified
m = 0 -- unidentified
num_of_words = len(line.split()) -- element processing
s += num_of_words -- unidentified
m = max(m, num_of_words) -- unidentified
print(s, m) -- unidentified
for line in stream: pass -- stream transformation
[rule1] If the type of an action is unidentified, it inherits the type of the previous action.
[rule2] If there is no previous action, the action is treated as a stream transformation.
Example
s = 0 -- stream transformation (because of [rule2])
num_of_words = len(line.split()) -- element processing (because of `line` marker)
s += num_of_words -- element processing (because of [rule1])
print(s) -- element processing (because of [rule1])
If we want to print only at the end, we need a stream marker in some action before the print.
s = 0 -- stream transformation (because of [rule2])
num_of_words = len(line.split()) -- element processing (because of `line` marker)
s += num_of_words -- element processing (because of [rule1])
stream -- stream transformation (because of `stream` marker)
print(s) -- stream transformation (because of [rule1])
Unfortunately, this is not obvious to anyone unfamiliar with the implementation. So it is better to use an action that is more explicit for the reader, such as for line in stream: pass.
s = 0 -- stream transformation (because of [rule2])
num_of_words = len(line.split()) -- element processing (because of `line` marker)
s += num_of_words -- element processing (because of [rule1])
for line in stream: pass -- stream transformation (because of `stream` marker)
print(s) -- stream transformation (because of [rule1])
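The marker rules described in this section can be sketched in a few lines of Python. This is a hypothetical illustration, not the py3line implementation: classify_actions is an invented helper, it splits naively on ";" (so it would mishandle a ";" inside a string literal), it looks for the markers with a simple identifier regex, and it assumes that the stream marker wins when both markers appear, as in for line in stream: pass.

```python
import re

ELEMENT = "element processing"
STREAM = "stream transformation"

def classify_actions(script):
    """Assign a type to every ';'-separated action (hypothetical helper)."""
    result = []
    previous = None
    for action in (part.strip() for part in script.split(";")):
        if not action:
            continue
        names = set(re.findall(r"[A-Za-z_]\w*", action))
        if "stream" in names:          # `stream` marker (assumed to take priority)
            kind = STREAM
        elif "line" in names:          # `line` marker
            kind = ELEMENT
        elif previous is not None:     # rule 1: inherit from the previous action
            kind = previous
        else:                          # rule 2: no previous action
            kind = STREAM
        result.append((action, kind))
        previous = kind
    return result

script = "s = 0; num_of_words = len(line.split()); s += num_of_words; for line in stream: pass; print(s)"
for action, kind in classify_actions(script):
    print(f"{action} -- {kind}")
```

Run on the last listing above, the sketch labels the actions in the same way as the annotations: the first and the last two actions come out as stream transformations, the two in between as element processing.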
Some examples
# Print every line (null transform)
$ cat ./testsuit/test.txt | ./py3line.py "print(line)"
This is my cat,
 whose name is Betty.
This is my dog,
 whose name is Frank.
This is my fish,
 whose name is George.
This is my goat,
 whose name is Adam.
# Number every line
$ cat ./testsuit/test.txt | ./py3line.py "stream = enumerate(stream); print(line)"
(0, 'This is my cat,')
(1, ' whose name is Betty.')
(2, 'This is my dog,')
(3, ' whose name is Frank.')
(4, 'This is my fish,')
(5, ' whose name is George.')
(6, 'This is my goat,')
(7, ' whose name is Adam.')
# Number every line
$ cat ./testsuit/test.txt | ./py3line.py "stream = enumerate(stream); print(line[0], line[1])"
0 This is my cat,
1  whose name is Betty.
2 This is my dog,
3  whose name is Frank.
4 This is my fish,
5  whose name is George.
6 This is my goat,
7  whose name is Adam.
Or simply: cat ./testsuit/test.txt | ./py3line.py "stream = enumerate(stream); print(*line)"
# Print every first and last word
$ cat ./testsuit/test.txt | ./py3line.py "s = line.split(); print(s[0], s[-1])"
This cat,
whose Betty.
This dog,
whose Frank.
This fish,
whose George.
This goat,
whose Adam.
# Split into words and print as a list (strip all non-word chars such as comma, dot, etc.)
$ cat ./testsuit/test.txt | ./py3line.py "print(re.findall(r'\w+', line))"
['This', 'is', 'my', 'cat']
['whose', 'name', 'is', 'Betty']
['This', 'is', 'my', 'dog']
['whose', 'name', 'is', 'Frank']
['This', 'is', 'my', 'fish']
['whose', 'name', 'is', 'George']
['This', 'is', 'my', 'goat']
['whose', 'name', 'is', 'Adam']
# Split into words (strip all non-word chars such as comma, dot, etc.)
$ cat ./testsuit/test.txt | ./py3line.py "print(*re.findall(r'\w+', line))"
This is my cat
whose name is Betty
This is my dog
whose name is Frank
This is my fish
whose name is George
This is my goat
whose name is Adam
# Find all three letter words
$ cat ./testsuit/test.txt | ./py3line.py "print(re.findall(r'\b\w\w\w\b', line))"
['cat']
[]
['dog']
[]
[]
[]
[]
[]
# Find all three letter words + skip empty lists
$ cat ./testsuit/test.txt | ./py3line.py "line = re.findall(r'\b\w\w\w\b', line); if not line: continue; print(line)"
['cat']
['dog']
# Regex matching with groups
$ cat ./testsuit/test.txt | ./py3line.py "line = re.findall(r' is ([A-Z]\w*)', line); if not line: continue; print(*line)"
Betty
Frank
George
Adam
# cat ./testsuit/test.txt | ./py3line.py "line = re.search(r' is ([A-Z]\w*)', line); if not line: continue; line.group(1)"
$ cat ./testsuit/test.txt | ./py3line.py "rgx = re.compile(r' is ([A-Z]\w*)'); line = rgx.search(line); if not line: continue; print(line.group(1))"
Betty
Frank
George
Adam
# head -n 2
# cat ./testsuit/test.txt | ./py3line.py "stream = enumerate(stream); if line[0] >= 2: break; print(line[1])"
$ cat ./testsuit/test.txt | ./py3line.py "stream = list(stream)[:2]; print(line)"
This is my cat,
 whose name is Betty.
# Print just the URLs in the access log
$ cat ./testsuit/nginx.log | ./py3line.py "print(shlex.split(line)[13])"
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
GET /admin/moktoring/session/add/ HTTP/1.1
GET /admin/jsi18n/ HTTP/1.1
GET /static/admin/img/icon-calendar.svg HTTP/1.1
GET /static/admin/img/icon-clock.svg HTTP/1.1
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
GET /logout/?reason=startApplication HTTP/1.1
GET / HTTP/1.1
GET /login/?next=/ HTTP/1.1
POST /admin/customauth/user/?q=%D0%9F%D0%B0%D1%81%D0%B5%D1%87%D0%BD%D0%B8%D0%BA HTTP/1.1
# Print the most commonly accessed URLs, keeping those accessed at least 5 times
$ cat ./testsuit/nginx.log | ./py3line.py "line = shlex.split(line)[13]; stream = collections.Counter(stream).most_common(); if line[1] < 5: continue; print(line)"
('HEAD / HTTP/1.0', 10)
Complex examples
# create directory tree
echo -e "y1\nx2\nz3" | ./py3line.py "pathlib.Path('/DATA/' + line +'/db-backup/').mkdir(parents=True, exist_ok=True)"
group by 3 lines ... (https://askubuntu.com/questions/1052622/separate-log-text-according-to-paragraph)
Help
$ ./py3line.py --help
usage: py3line.py [-h] [-v] [-q] [--version] [--pycode]
                  [expression [expression ...]]
Py3line is a UNIX command-line tool for a simple text stream processing by the
Python one-liner scripts. Like grep, sed and awk.
positional arguments:
  expression     python comma separated expressions
optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose
  -q, --quiet
  --version      print the version string
  --pycode       show generated python code
$ ./py3line.py --version
0.3.1
Project details
Hashes for py3line-0.3.1.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1f133d2ccbbcb286602494c3d3f548ec80b780de6cd90754f028d7c2c3970715 |
| MD5 | d5a50c27fb7164b2f0f8cc747cd9451d |
| BLAKE2b-256 | c2b76424925d630d647c751d2feeb3c917a8d97213857d6270edd2ee56539727 |