datafactory generates test data.

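Overview

datafactory creates flexible data according to the rules you give it.

Its functionality is divided into fields, models, containers, and formatters. In database terms, a field is a column, a model is a record, and a container is a table.

The strength of datafactory is its flexibility in how types are specified. Containers can also be nested.

Formatters handle data formatting and file output.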
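The field/model/container analogy can be illustrated with the standard library alone. The following is a minimal sketch, not datafactory's implementation: the helper names (increment_field, cycle_field, make_record, make_table) are hypothetical and exist only to show how a column, a record, and a table relate.

```python
import itertools

# Hypothetical stand-ins, not datafactory's classes: a "field" is any
# iterator that yields one column value per record.
def increment_field(start=1, step=1):
    return itertools.count(start, step)

def cycle_field(values):
    return itertools.cycle(values)

def make_record(fields):
    """A "model": pull one value from every field to build one record."""
    return {name: next(field) for name, field in fields.items()}

def make_table(fields, length):
    """A "container": repeat the model to build a table of records."""
    return [make_record(fields) for _ in range(length)]

fields = {"id": increment_field(), "x": cycle_field(["a", "b", "c"])}
table = make_table(fields, 5)
print(table)
```

Each field keeps its own state between records, which is why the id column keeps counting while the x column wraps around.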

Requirements

  • Python 3.5 or later.

Installation

$ pip install datafactory

Usage

Basic example

In [1]: import datafactory

In [2]: model = datafactory.Model({
   ...:     'id': datafactory.IncrementField(),
   ...:     'x': datafactory.CycleField(['a', 'b', 'c']),
   ...:     # A field that yields BLANK is omitted from the record.
   ...:     'option': datafactory.ChoiceField([True, False, datafactory.BLANK]),
   ...: })

In [3]: container = datafactory.Container(model, 5, render=True)

In [4]: container
Out[4]:
[{'id': 1, 'x': 'a'},
 {'id': 2, 'x': 'b', 'option': False},
 {'id': 3, 'x': 'c', 'option': True},
 {'id': 4, 'x': 'a'},
 {'id': 5, 'x': 'b'}]

# Specify rewrite=True if the file already exists.
In [5]: datafactory.JsonFormatter(container).write('/tmp/test.json', rewrite=True)

In [6]: !cat /tmp/test.json
[
 {
  "x": "a",
  "id": 1
 },
 {
  "x": "b",
  "id": 2,
  "option": false
 },
 {
  "x": "c",
  "id": 3,
  "option": true
 },
 {
  "x": "a",
  "id": 4
 },
 {
  "x": "b",
  "id": 5
 }
]
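The way BLANK entries disappear from the output can be imitated with a sentinel object. This is a sketch under stated assumptions: BLANK below is a local stand-in for datafactory.BLANK, and drop_blank is a hypothetical helper, not part of the library.

```python
import json

BLANK = object()  # hypothetical sentinel, standing in for datafactory.BLANK

def drop_blank(record):
    """Return a copy of the record without keys whose value is BLANK."""
    return {key: value for key, value in record.items() if value is not BLANK}

raw = [
    {"id": 1, "x": "a", "option": BLANK},
    {"id": 2, "x": "b", "option": False},
]
cleaned = [drop_blank(record) for record in raw]
text = json.dumps(cleaned, indent=1)
print(text)
```

Comparing with `is not BLANK` rather than truthiness matters here, because False is a legitimate value that must survive.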

TSV example

In [1]: import datafactory

In [2]: model = datafactory.ListModel([
   ...:     datafactory.IncrementField(start=10, step=5),
   ...:     datafactory.HashOfField(2, 'md5'),  # MD5 hash of the value in the third column.
   ...:     datafactory.ChoiceField(['foo', 'bar', 'baz']),
   ...:     datafactory.CycleField(range(0, 30, 10)),
   ...: ]).ordering(2)  # render index 2 (the third column) first

# IterContainer saves memory because it generates one element at a time.
In [3]: container = datafactory.IterContainer(model, 10)  # repeat 10 times.

In [4]: datafactory.CsvFormatter(
   ...:     container,
   ...:     delimiter='\t',
   ...:     header=['id', 'hash-of-name', 'name', 'value']
   ...: ).write('/tmp/test.csv', rewrite=True)

In [5]: !cat /tmp/test.csv
id    hash-of-name    name    value
10    acbd18db4cc2f85cedef654fccc4a4d8        foo     0
15    acbd18db4cc2f85cedef654fccc4a4d8        foo     10
20    73feffa4b7f6bb68e44cf984c85f6e88        baz     20
25    acbd18db4cc2f85cedef654fccc4a4d8        foo     0
30    acbd18db4cc2f85cedef654fccc4a4d8        foo     10
35    73feffa4b7f6bb68e44cf984c85f6e88        baz     20
40    73feffa4b7f6bb68e44cf984c85f6e88        baz     0
45    73feffa4b7f6bb68e44cf984c85f6e88        baz     10
50    37b51d194a7513e45b56f6524f2d51f2        bar     20
55    37b51d194a7513e45b56f6524f2d51f2        bar     0

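The hash-of-name column can be checked independently with hashlib. The sketch below recomputes the MD5 digest of each name and compares it with the digest shown in the output above; md5_hex is an illustrative helper, and UTF-8 encoding of the name is an assumption.

```python
import hashlib

def md5_hex(text):
    """Return the hexadecimal MD5 digest of a UTF-8 encoded string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# (hash-of-name, name) pairs copied from the TSV output above.
rows = [
    ("acbd18db4cc2f85cedef654fccc4a4d8", "foo"),
    ("73feffa4b7f6bb68e44cf984c85f6e88", "baz"),
    ("37b51d194a7513e45b56f6524f2d51f2", "bar"),
]
mismatches = [name for digest, name in rows if md5_hex(name) != digest]
print(mismatches)
```

An empty list means every digest in the file matches its name, which is also why the name column has to be rendered before the hash column.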
Custom example

If an object is callable, the result of calling it is stored.

Model

In [1]: import datafactory

In [2]: def square(k, i):
   ...:     return k * i
   ...:

In [3]: container = datafactory.DictContainer(square)

In [4]: container(['a', 'b', 'c', 'd', 'e'])
Out[4]: {'a': '', 'b': 'b', 'c': 'cc', 'd': 'ddd', 'e': 'eeee'}

Field

In [1]: import datafactory

In [2]: model = datafactory.Model({
   ...:    'col1': (lambda r, i: i),
   ...:    'col2': (lambda r: r['col1'] + 1),
   ...:    'col3': (lambda r: r['col2'] * 2),
   ...:    'col4': 100,  # fixed value
   ...: }).ordering('col1', 'col2', 'col3')

In [3]: container = datafactory.ListContainer(model)

In [4]: container(4)
Out[4]:
[{'col1': 0, 'col2': 1, 'col3': 2, 'col4': 100},
 {'col1': 1, 'col2': 2, 'col3': 4, 'col4': 100},
 {'col1': 2, 'col2': 3, 'col3': 6, 'col4': 100},
 {'col1': 3, 'col2': 4, 'col3': 8, 'col4': 100}]
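Why the evaluation order matters can be seen in a small standard-library sketch. It assumes, as the example above suggests, that a callable receives the record built so far and optionally the row index; render_record and its parameter names are hypothetical, and datafactory's internals may differ.

```python
import inspect

def render_record(spec, order, index):
    """Build one record; `spec`, `order`, and `index` are illustrative names."""
    record = {}
    # Evaluate the explicitly ordered keys first, then the remaining ones.
    keys = list(order) + [k for k in spec if k not in order]
    for key in keys:
        value = spec[key]
        if callable(value):
            # Pass (record, index) or just (record), depending on arity.
            arity = len(inspect.signature(value).parameters)
            value = value(record, index) if arity >= 2 else value(record)
        record[key] = value
    return record

spec = {
    "col1": lambda r, i: i,
    "col2": lambda r: r["col1"] + 1,
    "col3": lambda r: r["col2"] * 2,
    "col4": 100,  # fixed value, stored as-is
}
records = [render_record(spec, ["col1", "col2", "col3"], i) for i in range(4)]
print(records)
```

If col2 were evaluated before col1, the lookup r['col1'] would fail, so dependent columns need to come later in the ordering.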

Element count limit example

In [1]: import datafactory

In [2]: model = datafactory.Model({
   ...:     # x: 'a' can be picked once, 'b' twice, and 'c' three times.
   ...:     'x': datafactory.PickoutField({'a': 1, 'b': 2, 'c': 3}, missing=None),
   ...:     # y: 'a' can be picked twice; 'b' and 'c' once each.
   ...:     'y': datafactory.PickoutField(['a', 'a', 'b', 'c'], missing='*'),
   ...:     # z: 'a' and 'b' cannot be selected; 'c' can be picked five times.
   ...:     'z': datafactory.PickoutField(['c']*5, missing=None),
   ...: })

In [3]: container = datafactory.ListContainer(model)

In [4]: container(6)
Out[4]:
[{'x': 'a', 'y': 'a', 'z': 'c'},
 {'x': 'c', 'y': 'b', 'z': 'c'},
 {'x': 'c', 'y': 'a', 'z': 'c'},
 {'x': 'b', 'y': 'c', 'z': 'c'},
 {'x': 'c', 'y': '*', 'z': 'c'},
 {'x': 'b', 'y': '*', 'z': None}]
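The limiting behaviour shown above resembles drawing without replacement from a pool. The following sketch is one way to reproduce it with the standard library; pickout and its parameters are hypothetical names, and it is not a description of how PickoutField is implemented.

```python
import random

def pickout(counts, length, missing=None, rng=random):
    """Pick `length` values, using each value at most counts[value] times."""
    # Expand the limits into a pool, e.g. {'a': 1, 'b': 2} -> ['a', 'b', 'b'].
    pool = [value for value, limit in counts.items() for _ in range(limit)]
    rng.shuffle(pool)
    # Once the pool is exhausted, pad with the `missing` value.
    return [pool.pop() if pool else missing for _ in range(length)]

picked = pickout({"a": 1, "b": 2, "c": 3}, 8, missing="*", rng=random.Random(0))
print(picked)
```

The order of the first six values depends on the random seed, but the per-value counts and the trailing missing markers do not.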

Combination example

To generate test data that combines multiple elements, use the repeat argument of CycleField and SequenceField.

In [1]: import datafactory

In [2]: l0 = ['a', 'b']

In [3]: l1 = ['a', 'b', 'c']

In [4]: l2 = ['a', 'b', 'c', 'd']

In [5]: model = datafactory.ListModel([
   ...:     datafactory.SequenceField(l0, repeat=len(l1)*len(l2), missing=datafactory.ESCAPE),
   ...:     datafactory.CycleField(l1, repeat=len(l2)),
   ...:     datafactory.CycleField(l2),
   ...: ])

In [6]: container = datafactory.Container(model)

# With missing=ESCAPE, the container detects the end of the elements
# and stops before reaching 10000.
In [7]: container(10000)
Out[7]:
[['a', 'a', 'a'],
 ['a', 'a', 'b'],
 ['a', 'a', 'c'],
 ['a', 'a', 'd'],
 ['a', 'b', 'a'],
 ['a', 'b', 'b'],
 ['a', 'b', 'c'],
 ['a', 'b', 'd'],
 ['a', 'c', 'a'],
 ['a', 'c', 'b'],
 ['a', 'c', 'c'],
 ['a', 'c', 'd'],
 ['b', 'a', 'a'],
 ['b', 'a', 'b'],
 ['b', 'a', 'c'],
 ['b', 'a', 'd'],
 ['b', 'b', 'a'],
 ['b', 'b', 'b'],
 ['b', 'b', 'c'],
 ['b', 'b', 'd'],
 ['b', 'c', 'a'],
 ['b', 'c', 'b'],
 ['b', 'c', 'c'],
 ['b', 'c', 'd']]
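The rows above are the Cartesian product of the three lists. For comparison, the same 24 rows can be produced with itertools.product from the standard library; this is a cross-check, not a replacement for the repeat-based approach.

```python
import itertools

l0 = ["a", "b"]
l1 = ["a", "b", "c"]
l2 = ["a", "b", "c", "d"]

# product varies the last list fastest, matching the row order above.
rows = [list(combo) for combo in itertools.product(l0, l1, l2)]
print(len(rows))
print(rows[:5])
```

The repeat values in the model follow the same logic: each column holds its value for as many rows as the columns to its right need to run through all of their combinations.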

Nested example

In [1]: import datafactory

In [2]: model = datafactory.Model({
   ...:     'a': datafactory.ListModel([
   ...:         datafactory.CycleField(['b', 'c']),
   ...:         datafactory.CycleField(['d', 'e']),
   ...:     ]),
   ...:     datafactory.ChoiceField(['f', 'g', 'h']): datafactory.DictContainer(lambda x: x * 2, 5)
   ...: })

In [3]: datafactory.Container(model, 10, render=True)
Out[3]:
[{'a': ['b', 'd'], 'h': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['c', 'e'], 'f': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['b', 'd'], 'f': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['c', 'e'], 'g': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['b', 'd'], 'f': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['c', 'e'], 'h': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['b', 'd'], 'g': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['c', 'e'], 'h': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['b', 'd'], 'h': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['c', 'e'], 'h': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}}]

datetime utilities

choice

Chooses a datetime at random between start and end.

In [1]: from datafactory.utils.datetime import choice


In [2]: choice(1988, '2015-11-11T11:11:11.111111')
Out[2]: datetime.datetime(2009, 11, 30, 23, 25, 43, 240031)

# tuple: datetime(*tuple), dict: datetime(**dict)
In [3]: choice((1988, 5, 22), {'year': 2015, 'month': 11, 'day': 11})
Out[3]: datetime.datetime(1996, 7, 1, 11, 14, 59, 314809)

In [4]: from datetime import datetime, date

In [5]: choice(date(1988, 5, 22), datetime(2015, 11, 11, 11, 11, 11))
Out[5]: datetime.datetime(2011, 3, 23, 19, 39, 14, 476901)

generator

A generator that yields datetime objects at regular intervals.

In [1]: from datetime import timedelta
In [2]: from datafactory.utils.datetime import generator

# If you omit the end argument, it generates objects indefinitely.
In [3]: g = generator(start=2015, interval=timedelta(days=1, hours=12))

In [4]: next(g)
Out[4]: datetime.datetime(2015, 1, 1, 0, 0)

In [5]: next(g)
Out[5]: datetime.datetime(2015, 1, 2, 12, 0)

In [6]: next(g)
Out[6]: datetime.datetime(2015, 1, 4, 0, 0)

In [7]: next(g)
Out[7]: datetime.datetime(2015, 1, 5, 12, 0)
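The stepping behaviour can be sketched with the standard library. The function below, datetime_generator, is a hypothetical stand-in written for illustration; it takes a ready-made datetime as start and does not cover the argument conversion or noise that the real generator offers.

```python
from datetime import datetime, timedelta

def datetime_generator(start, interval, end=None):
    """Yield datetimes from `start`, stepping by `interval`.

    Without `end` the generator never stops; a negative interval counts down.
    """
    if interval == timedelta(0):
        raise ValueError("interval must be non-zero")
    current = start
    forward = interval > timedelta(0)
    while end is None or (current <= end if forward else current >= end):
        yield current
        current += interval

g = datetime_generator(datetime(2015, 1, 1), timedelta(days=1, hours=12))
first_four = [next(g) for _ in range(4)]
print(first_four)
```

Because values are produced lazily, an endless generator is safe as long as the caller only asks for as many values as it needs.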

range

Returns a list of datetime objects generated at regular intervals.

In [1]: from datetime import timedelta
In [2]: from datafactory.utils.datetime import range

In [3]: range(2015, '2015/2/1')
Out[3]:
[datetime.datetime(2015, 1, 1, 0, 0),
 datetime.datetime(2015, 1, 2, 0, 0),
 datetime.datetime(2015, 1, 3, 0, 0),
 datetime.datetime(2015, 1, 4, 0, 0),
 datetime.datetime(2015, 1, 5, 0, 0),
 datetime.datetime(2015, 1, 6, 0, 0),
 datetime.datetime(2015, 1, 7, 0, 0),
 datetime.datetime(2015, 1, 8, 0, 0),
 datetime.datetime(2015, 1, 9, 0, 0),
 datetime.datetime(2015, 1, 10, 0, 0),
 datetime.datetime(2015, 1, 11, 0, 0),
 datetime.datetime(2015, 1, 12, 0, 0),
 datetime.datetime(2015, 1, 13, 0, 0),
 datetime.datetime(2015, 1, 14, 0, 0),
 datetime.datetime(2015, 1, 15, 0, 0),
 datetime.datetime(2015, 1, 16, 0, 0),
 datetime.datetime(2015, 1, 17, 0, 0),
 datetime.datetime(2015, 1, 18, 0, 0),
 datetime.datetime(2015, 1, 19, 0, 0),
 datetime.datetime(2015, 1, 20, 0, 0),
 datetime.datetime(2015, 1, 21, 0, 0),
 datetime.datetime(2015, 1, 22, 0, 0),
 datetime.datetime(2015, 1, 23, 0, 0),
 datetime.datetime(2015, 1, 24, 0, 0),
 datetime.datetime(2015, 1, 25, 0, 0),
 datetime.datetime(2015, 1, 26, 0, 0),
 datetime.datetime(2015, 1, 27, 0, 0),
 datetime.datetime(2015, 1, 28, 0, 0),
 datetime.datetime(2015, 1, 29, 0, 0),
 datetime.datetime(2015, 1, 30, 0, 0),
 datetime.datetime(2015, 1, 31, 0, 0),
 datetime.datetime(2015, 2, 1, 0, 0)]

# ±3 hours of noise, and 0 to +5 minutes of noise
In [4]: range(2015, '2015-01-15', hours=3, minutes=(0, 5))
Out[4]:
[datetime.datetime(2015, 1, 1, 3, 1),
 datetime.datetime(2015, 1, 2, 0, 3),
 datetime.datetime(2015, 1, 3, 2, 0),
 datetime.datetime(2015, 1, 3, 22, 2),
 datetime.datetime(2015, 1, 4, 22, 3),
 datetime.datetime(2015, 1, 6, 0, 2),
 datetime.datetime(2015, 1, 7, 0, 4),
 datetime.datetime(2015, 1, 8, 0, 4),
 datetime.datetime(2015, 1, 8, 21, 3),
 datetime.datetime(2015, 1, 9, 22, 0),
 datetime.datetime(2015, 1, 11, 0, 0),
 datetime.datetime(2015, 1, 11, 22, 1),
 datetime.datetime(2015, 1, 12, 22, 5),
 datetime.datetime(2015, 1, 14, 3, 0),
 datetime.datetime(2015, 1, 15, 2, 5)]

# A negative interval is also allowed.
In [5]: range(start='2015-5-22', end='2015-04-22', interval=timedelta(days=-1))
Out[5]:
[datetime.datetime(2015, 5, 22, 0, 0),
 datetime.datetime(2015, 5, 21, 0, 0),
 datetime.datetime(2015, 5, 20, 0, 0),
 datetime.datetime(2015, 5, 19, 0, 0),
 datetime.datetime(2015, 5, 18, 0, 0),
 datetime.datetime(2015, 5, 17, 0, 0),
 datetime.datetime(2015, 5, 16, 0, 0),
 datetime.datetime(2015, 5, 15, 0, 0),
 datetime.datetime(2015, 5, 14, 0, 0),
 datetime.datetime(2015, 5, 13, 0, 0),
 datetime.datetime(2015, 5, 12, 0, 0),
 datetime.datetime(2015, 5, 11, 0, 0),
 datetime.datetime(2015, 5, 10, 0, 0),
 datetime.datetime(2015, 5, 9, 0, 0),
 datetime.datetime(2015, 5, 8, 0, 0),
 datetime.datetime(2015, 5, 7, 0, 0),
 datetime.datetime(2015, 5, 6, 0, 0),
 datetime.datetime(2015, 5, 5, 0, 0),
 datetime.datetime(2015, 5, 4, 0, 0),
 datetime.datetime(2015, 5, 3, 0, 0),
 datetime.datetime(2015, 5, 2, 0, 0),
 datetime.datetime(2015, 5, 1, 0, 0),
 datetime.datetime(2015, 4, 30, 0, 0),
 datetime.datetime(2015, 4, 29, 0, 0),
 datetime.datetime(2015, 4, 28, 0, 0),
 datetime.datetime(2015, 4, 27, 0, 0),
 datetime.datetime(2015, 4, 26, 0, 0),
 datetime.datetime(2015, 4, 25, 0, 0),
 datetime.datetime(2015, 4, 24, 0, 0),
 datetime.datetime(2015, 4, 23, 0, 0),
 datetime.datetime(2015, 4, 22, 0, 0)]

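Common

noise

Specifying noise arguments shifts each generated time away from its exact position by a random offset. The functions that accept noise arguments are "datetimes.generator" and "datetimes.range".

**noise is specified as keyword arguments and is optional.

The available keys are the same as the timedelta arguments.

  • days

  • hours

  • minutes

  • microseconds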
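A rough sketch of how such keyword arguments could turn into a random offset is shown below. The semantics are assumptions inferred from the range example (a single number n read as the range from -n to +n, a pair read as low and high, whole-number offsets); noise_delta is a hypothetical helper, not a datafactory function.

```python
import random
from datetime import datetime, timedelta

def noise_delta(rng=random, **noise):
    """Build a random timedelta from noise kwargs (illustrative semantics)."""
    offsets = {}
    for unit, spec in noise.items():
        # Assumed: a number n means the range [-n, +n]; a pair is (low, high).
        low, high = spec if isinstance(spec, tuple) else (-spec, spec)
        offsets[unit] = rng.randint(low, high)
    return timedelta(**offsets)

base = datetime(2015, 1, 1)
noisy = base + noise_delta(rng=random.Random(0), hours=3, minutes=(0, 5))
print(noisy)
```

Since the keys are passed straight to timedelta, any key that timedelta does not accept raises a TypeError.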

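argtype

Besides datetime, the following argument types are accepted.

int:

It is evaluated as a year.

str:

It is parsed into a datetime from the numeric parts of the string.

tuple:

It is passed to datetime as positional arguments.

dict:

It is passed to datetime as keyword arguments.

date:

It is converted to datetime.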
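The conversions listed above can be sketched as a single normalizing function. This is an approximation built from the examples in this document, for instance an int start of 2015 yielding 2015-01-01 00:00; to_datetime is a hypothetical name, and the real parser may treat edge cases differently.

```python
import re
from datetime import date, datetime

def to_datetime(value):
    """Normalize the argument types listed above into a datetime."""
    if isinstance(value, datetime):  # check before date: datetime is a date
        return value
    if isinstance(value, date):
        return datetime(value.year, value.month, value.day)
    if isinstance(value, int):
        return datetime(value, 1, 1)  # an int is read as a year
    if isinstance(value, str):
        # Use only the digit runs: year, month, day, hour, ...
        parts = [int(p) for p in re.findall(r"\d+", value)]
        # Fill in month and day if the string gives fewer than three numbers.
        parts += [1] * (3 - len(parts))
        return datetime(*parts)
    if isinstance(value, tuple):
        return datetime(*value)
    if isinstance(value, dict):
        return datetime(**value)
    raise TypeError("unsupported type: %r" % type(value))

print(to_datetime("2015/2/1"))
print(to_datetime((1988, 5, 22)))
```

Reading only the digit runs is what lets separators such as "-", "/", and "T" be used interchangeably in the string form.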

History

1.0.x

Initial release.
