wildgram · PyPI · Python Package Index

wildgram tokenizes text based on its natural language breaks and splits the tokens into ngrams (phrases) of varying size.

Project description

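Wildgram tokenizes English text into "wild"-grams (tokens of varying word count) that closely match the natural pauses of conversation. I originally built it as the first step in an abstraction pipeline for medical language: since medical concepts tend to be phrases of varying length, bag-of-words or bigrams don't really fit.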

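Wildgram works by measuring the size of the noise (stopwords, punctuation, and whitespace) and breaking the text at noise of a certain size (the threshold varies slightly depending on the kind of noise).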
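The idea of breaking text at runs of noise can be illustrated with a much-simplified sketch. This is not wildgram's implementation: the function name `split_on_noise`, the tiny `DEMO_STOPWORDS` set, and the rule that every run of noise is a break (regardless of its size) are all assumptions made for illustration, and the sketch does not build multi-word phrases.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# wildgram ships its own importable STOPWORDS list.
DEMO_STOPWORDS = {"and", "was", "the", "a", "of"}

def split_on_noise(text, stopwords=DEMO_STOPWORDS):
    """Return (start, end, kind) spans, merging adjacent noise into one span."""
    spans = []
    # Words, whitespace runs, and single punctuation marks are separate pieces.
    for match in re.finditer(r"\w+|\s+|[^\w\s]", text):
        piece = match.group()
        is_noise = (not piece[0].isalnum()) or piece.lower() in stopwords
        kind = "noise" if is_noise else "token"
        if spans and kind == "noise" and spans[-1][2] == "noise":
            # Extend the previous noise span instead of starting a new one.
            spans[-1] = (spans[-1][0], match.end(), "noise")
        else:
            spans.append((match.start(), match.end(), kind))
    return spans

demo_text = "and was beautiful day"
demo_spans = split_on_noise(demo_text)
for start, end, kind in demo_spans:
    print(start, end, kind, repr(demo_text[start:end]))
```

Here the leading "and was " collapses into a single noise span, and "beautiful" and "day" come out as separate tokens.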

Parameters

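text
Required: Yes
Default: none
What it is: the text you want to wildgram.

stopwords
Required: No
Default: the STOPWORDS list (importable, mostly based on NLTK's stop word list)
What it is: a list of stop words that you want to mark as noise; they act as breaks between tokens.
Custom override: a list of strings that you want to split on.

topicwords
Required: No
Default: the TOPICWORDS list (importable)
What it is: a list of stop words that you want to mark as tokens because they carry meaning, but that often serve to break up larger pieces of text. Examples include numbers and negation words such as "won't". Words that start with a number and end with a non-space, non-digit string are split up, on the assumption that the parts are meaningfully distinct, e.g. "123mg" -> "123", "mg".
Custom override: a list of strings that you want to split on. You can also pass a mixed list of dictionaries and strings; for example, by default any negation stop word (like "no") has a tokenType of "negation". If no tokenType is set, the type is "token".

include1gram
Required: No
Default: True
What it is: when set to True, wildgram returns every individual word or token it finds as well as any phrases.
Custom override: Boolean (False). When set to False, wildgram returns only the phrases it finds, not the 1-grams.

joinerwords
Required: No
Default: the JOINERWORDS list (importable; words like "of")
What it is: a list of stop words that join two phrases together (if overridden, they must also be in the stopwords list). Example: "shortness of breath" -> "shortness", "breath", "shortness of breath".
Custom override: a list of strings that you want to join on. A word must be in the stopwords list to work. The assumption is that you would not want a joiner word that is also a topic word.

returnNoise
Required: No
Default: True
What it is: when set to True, wildgram also returns each individual noise token it created to find the phrases.
Custom override: Boolean (False). When set to False, it does not return the noise tokens.

includeParent
Required: No
Default: False
Note: this is being deprecated, because I didn't find it useful during topic organization.
What it is: when set to True, wildgram also returns the "parent" of each token in a pseudo-dependency tree. The tree is generated from a ranked list of the punctuation that precedes the token in the text, to approximate the relationships between tokens. Noise tokens act as branch nodes, while normal tokens can only be leaf nodes, so in practice this determines the "uncle" of a token. This can be useful, for example, for linking list elements under a larger heading, or for working out the unit of a number from context (which may be on the same line). Since noise tokens are the branch nodes, returnNoise must be set to True if includeParent is True.
Custom override: Boolean (True). When set to True, it returns the parent.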
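The number-splitting rule described under topicwords ("123mg" -> "123", "mg") can be sketched with a regular expression. This is an illustration under assumptions, not the library's code: `split_leading_number` is a hypothetical helper, and the exact pattern wildgram uses is not given in this description.

```python
import re

def split_leading_number(word):
    """Split a word such as '123mg' into its numeric prefix and the remainder.

    Only words that start with digits and end with a run of
    non-space, non-digit characters are split; anything else is returned whole.
    """
    match = re.fullmatch(r"(\d+)([^\d\s]+)", word)
    if match is None:
        return [word]
    return [match.group(1), match.group(2)]

print(split_leading_number("123mg"))  # ['123', 'mg']
print(split_leading_number("123"))    # ['123']
```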

Returns: a list of dictionaries, each in the following form

example = {
"startIndex": 0,
"endIndex": 5,
"token": "hello",
"tokenType": "token",  # if noise, token type is "noise"
"index": 0
}

The list is sorted in ascending (min -> max) order by startIndex, then by endIndex.
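The ordering described above (startIndex first, endIndex as the tie-breaker) is the same as sorting on a two-part key. The dictionaries below are hand-written stand-ins in the documented shape, not output captured from the library.

```python
# Hand-written spans in the documented dictionary shape, deliberately out of order.
spans = [
    {"startIndex": 8, "endIndex": 21, "token": "beautiful day", "tokenType": "token"},
    {"startIndex": 0, "endIndex": 8, "token": "and was ", "tokenType": "noise"},
    {"startIndex": 8, "endIndex": 17, "token": "beautiful", "tokenType": "token"},
]

# Sort by startIndex, then by endIndex when two spans start at the same place.
ordered = sorted(spans, key=lambda s: (s["startIndex"], s["endIndex"]))
print([s["token"] for s in ordered])  # ['and was ', 'beautiful', 'beautiful day']
```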

Example code

from wildgram import wildgram
ranges = wildgram("and was beautiful", returnNoise=False)

#[{
#"startIndex": 8,
#"endIndex": 17,
#"token": "beautiful",
#"tokenType": "token",
# "index": 0
#}]

from wildgram import wildgram
ranges = wildgram("and was beautiful day")
print(ranges)
'''
[{
  "startIndex": 0,
  "endIndex": 8,
  "token": "and was ",
  "tokenType": "noise",
  "index": 0
},
{
  "startIndex": 8,
  "endIndex": 17,
  "token": "beautiful",
  "tokenType": "token",
  "index": 1
},
{
  "startIndex": 8,
  "endIndex": 21,
  "token": "beautiful day",
  "tokenType": "token",
  "index": 2
},
{
  "startIndex": 17,
  "endIndex": 18,
  "token": " ",
  "tokenType": "noise",
  "index": 3
},
{
  "startIndex": 18,
  "endIndex": 21,
  "token": "day",
  "tokenType": "token",
  "index": 4
}
]
'''
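A quick way to read this output is to check that each startIndex/endIndex pair slices the original text back to the token, and to drop the noise entries when only phrases are wanted. The list below is copied by hand from the output shown above so that the snippet runs without the library installed.

```python
text = "and was beautiful day"

# Copied from the example output above.
ranges = [
    {"startIndex": 0, "endIndex": 8, "token": "and was ", "tokenType": "noise", "index": 0},
    {"startIndex": 8, "endIndex": 17, "token": "beautiful", "tokenType": "token", "index": 1},
    {"startIndex": 8, "endIndex": 21, "token": "beautiful day", "tokenType": "token", "index": 2},
    {"startIndex": 17, "endIndex": 18, "token": " ", "tokenType": "noise", "index": 3},
    {"startIndex": 18, "endIndex": 21, "token": "day", "tokenType": "token", "index": 4},
]

# Each span's indices are offsets into the original text (end is exclusive).
offsets_ok = all(text[r["startIndex"]:r["endIndex"]] == r["token"] for r in ranges)

# Keep only the non-noise entries.
phrases = [r["token"] for r in ranges if r["tokenType"] != "noise"]

print(offsets_ok)  # True
print(phrases)     # ['beautiful', 'beautiful day', 'day']
```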

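In versions >= 0.2.9, there is also the WildRules class. It applies a set of rules to the tokenized wildgrams, creating a rule-based classifier. It will be optimized for speed and so on in future versions. In later versions, it also lets you specify nearby phrases.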

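In versions >= 0.4.1, there is also the WildForm class. It lets you combine the output of WildRules into forms that may overlap or be incomplete. In later versions, we will add examples of additional validation functions.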

from wildgram import WildRules, WildForm

test= WildRules([{"unit": "TEST", "value": "unknown", "spans": ["testing", "test"], "spanType": "token", "nearby": [{"spanType": "token", "spans": ["1234"]}]}, {"unit": "Dosage", "value": {"asType": "float", "spanType": "token"}, "spans": ["numeric"], "spanType": "tokenType"}])
ret = test.apply("testing test 123")
# note the unit for testing test is unknown, because it is missing 1234 in the general area
# note it can do basic parsing for values, say numbers.
[{'unit': 'unknown', 'value': "unknown", 'token': 'testing test', 'startIndex': 0, 'endIndex': 12}, {'unit': 'Dosage', "value": 123.0, 'token': '123', 'startIndex': 13, 'endIndex': 16}]

ret = test.apply("testing test 1234")
## returns the unit TEST, since 1234 is in the area
[{'unit': 'TEST', 'value': "unknown", 'token': 'testing test', 'startIndex': 0, 'endIndex': 12}, {'unit': 'Dosage', "value": 1234.0, 'token': '1234', 'startIndex': 13, 'endIndex': 17}]

forms = WildForm()
## let's add a basic form, with one "question" (i.e. a unit-value pair where the value is "")
forms.add_form({"unit": "test", "value": "testing", "children": [{"unit": "test", "value": "", "children": []}]})


## let's add a second form, with two "questions"
forms.add_form({"unit": "test", "value": "testing", "children": [{"unit": "test", "value": "", "children": []}, {"unit": "Dosage", "value": "", "children": []}]})
## let's apply the following rules to this phrase:
rules = WildRules([{"unit": "test", "value": "unknown", "spans": ["testing", "test"], "spanType": "token"}, {"unit": "Dosage", 'value': {"spanType": "token", "asType": "float"}, "spans": ["numeric"], "spanType": "tokenType"}])

ret = rules.apply("testing, can anyone hear me? testing 1234")
## output:
[{'unit': 'test', 'value': 'unknown', 'token': 'testing', 'startIndex': 0, 'endIndex': 7}, {'unit': 'unknown', 'value': 'unknown', 'token': 'anyone hear me', 'startIndex': 13, 'endIndex': 27}, {'unit': 'test', 'value': 'unknown', 'token': 'testing', 'startIndex': 29, 'endIndex': 36}, {'unit': 'Dosage', 'value': 1234.0, 'token': '1234', 'startIndex': 37, 'endIndex': 41}]

forms.apply(ret)
## returns
## note: returns four forms: 2 filled out copies of the first form (for each instance of "testing", note start/endIndex)
## 2 copies of the second form: note that 1 copy has a missing value for dosage, since in 1 instance of testing there
## is no value of dosage that is not nearer to the previous
## so inter-form overlap is possible, but not intra-form overlap
## tokens are assigned right to left, so if there is a conflict the value belongs to the stuff on the left, and then the
## new question gets to start its own form even if the other form is incomplete
## it keeps track of the closest token (from rules) and if there are >= 3 tokens between the closest token in the form
## and the current one it also creates a new form, since it assumes the information will be close together
## this assumption may be modified, or overridden in time. I haven't decided yet, but it holds up pretty well for the things
## i want to pull from notes.
[{'unit': 'test', 'value': 'testing', 'children': [{'unit': 'test', 'value': 'unknown', 'children': [], 'startIndex': 0, 'endIndex': 7, 'token': 'testing'}]}, {'unit': 'test', 'value': 'testing', 'children': [{'unit': 'test', 'value': 'unknown', 'children': [], 'startIndex': 29, 'endIndex': 36, 'token': 'testing'}]}, {'unit': 'test', 'value': 'testing', 'children': [{'unit': 'test', 'value': 'unknown', 'children': [], 'startIndex': 0, 'endIndex': 7, 'token': 'testing'}, {'unit': 'Dosage', 'value': '', 'children': []}]}, {'unit': 'test', 'value': 'testing', 'children': [{'unit': 'test', 'value': 'unknown', 'children': [], 'startIndex': 29, 'endIndex': 36, 'token': 'testing'}, {'unit': 'Dosage', 'value': 1234.0, 'children': [], 'startIndex': 37, 'endIndex': 41, 'token': '1234'}]}]
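To see which questions a form still has open, one option is to look for children whose value is still the empty string. The sketch below uses the two copies of the second form, copied by hand from the output above; `unanswered` is a hypothetical helper, not part of wildgram.

```python
# The two copies of the second form, copied from the output above.
second_form_copies = [
    {"unit": "test", "value": "testing", "children": [
        {"unit": "test", "value": "unknown", "children": [],
         "startIndex": 0, "endIndex": 7, "token": "testing"},
        {"unit": "Dosage", "value": "", "children": []},
    ]},
    {"unit": "test", "value": "testing", "children": [
        {"unit": "test", "value": "unknown", "children": [],
         "startIndex": 29, "endIndex": 36, "token": "testing"},
        {"unit": "Dosage", "value": 1234.0, "children": [],
         "startIndex": 37, "endIndex": 41, "token": "1234"},
    ]},
]

def unanswered(form):
    """Return the units of child questions whose value is still ''."""
    return [child["unit"] for child in form["children"] if child["value"] == ""]

missing = [unanswered(form) for form in second_form_copies]
print(missing)  # [['Dosage'], []]
```

The first copy, built around the "testing" at index 0, is the one with no Dosage filled in.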

Handling form meta-information

On the data element you want it to apply to, add a child with the unit "EHR" and the value "META-FORM".

{your data element "children": [{'unit': 'EHR', 'value': 'META-FORM', 'children': []}]}

Meta-information can be added, in any order, as children of the EHR:META-FORM pair.

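Available arguments: "EHR": "INTRA-FORM-SHARABLE". If added, it allows the same data element to be added to multiple copies of the same form. By default, elements are assumed not to be shareable across copies of the same form. For example, a sentence like "3 weeks of nausea, diarrhea, and vomiting" would associate the element weeks:3 with nausea AND diarrhea AND vomiting. Note that the reverse is not true: for "nausea, diarrhea, and vomiting for 3 weeks", weeks:3 is associated only with vomiting, because the meaning is unclear (did they have nausea/diarrhea/vomiting with only the vomiting lasting 3 weeks, or do the 3 weeks apply to all three?).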
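Putting the two descriptions above together, a data element carrying the INTRA-FORM-SHARABLE flag might be built as follows. This is a sketch under assumptions: `with_form_meta` is a hypothetical helper, and the exact nesting (the flag as a unit "EHR" child of the EHR:META-FORM pair) is inferred from the wording above rather than confirmed against the library.

```python
def with_form_meta(element, *flags):
    """Return a copy of `element` with an EHR:META-FORM child holding `flags`.

    Assumed layout: each flag becomes a {'unit': 'EHR', 'value': flag} child
    of the META-FORM pair.
    """
    meta = {
        "unit": "EHR",
        "value": "META-FORM",
        "children": [{"unit": "EHR", "value": flag, "children": []} for flag in flags],
    }
    # Copy the children list so the original element is left untouched.
    return {**element, "children": list(element.get("children", [])) + [meta]}

weeks_question = with_form_meta(
    {"unit": "weeks", "value": "", "children": []},
    "INTRA-FORM-SHARABLE",
)
print(weeks_question)
```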

That's it!

Project details

Release history

0.5.7  2021-09-09  (this version)
0.5.6  2021-09-09
0.5.5  2021-09-09
0.5.4  2021-09-09
0.5.3  2021-09-09
0.5.2  2021-09-08
0.5.1  2021-09-02
0.5.0  2021-09-02
0.4.9  2021-09-02
0.4.8  2021-09-01
0.4.7  2021-09-01
0.4.6  2021-09-01
0.4.5  2021-08-31
0.4.4  2021-08-03
0.4.3  2021-08-03
0.4.2  2021-08-03
0.4.1  2021-08-02
0.4.0  2021-08-02
0.3.9  2021-07-24
0.3.8  2021-07-23
0.3.7  2021-07-23
0.3.6  2021-07-15
0.3.5  2021-07-15
0.3.4  2021-07-15
0.3.3  2021-07-14
0.3.2  2021-07-13
0.3.1  2021-07-13
0.3.0  2021-06-29
0.2.9  2021-06-29
0.2.8  2021-06-23
0.2.7  2021-05-04
0.2.6  2021-05-04
0.2.5  2021-05-04
0.2.4  2021-05-03
0.2.3  2021-03-10
0.2.2  2021-03-10
0.2.1  2021-03-10
0.2.0  2021-03-09
0.1.519  2021-02-23
0.1.518  2021-02-23
0.1.517  2021-02-23
0.1.516  2021-02-23
0.1.515  2021-02-23
0.1.514  2021-02-23
0.1.513  2021-02-23
0.1.512  2021-02-23
0.1.511  2021-02-23
0.1.52  2021-03-09
0.1.51  2021-02-23
0.1.6  2021-03-09
0.1.5  2021-02-22
0.1.0  2021-02-22
0.0.95  2021-02-18
0.0.94  2021-02-12
0.0.93  2021-02-12
0.0.92  2021-01-29
0.0.91  2021-01-26
0.0.9  2021-01-26
0.0.8  2021-01-26
0.0.7  2021-01-26
0.0.6  2021-01-22
0.0.5  2021-01-19
0.0.4  2021-01-13
0.0.3  2021-01-12
0.0.2  2021-01-08
0.0.1  2021-01-08

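Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source distribution

wildgram-0.5.7.tar.gz (17.8 kB)

Uploaded 2021-09-09, source

Built distribution

wildgram-0.5.7-py3-none-any.whl (14.6 kB)

Uploaded 2021-09-09, Python 3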

Hashes for wildgram-0.5.7.tar.gz

Algorithm	Hash digest
SHA256	`4ba5d8cce21593b21d4107e4fbe37ed06f59beaad6845d616c1ff23787cfce8c`
MD5	`ef34a15108aff89370898a4c38101862`
BLAKE2b-256	`63cf685586c93cf20f1e26383cf5c6466e44a082342d616fc1594fed59d910bb`

Hashes for wildgram-0.5.7-py3-none-any.whl

Algorithm	Hash digest
SHA256	`fe6fc149d813da566548c6c21be75324f14dac14afccc2c17f514b613bfbe37f`
MD5	`70001a56a0d3c6f8ad0be5c287e5c16e`
BLAKE2b-256	`a93bf3e8ace4c1bcfc8b2c82d5e8d038c464eaa485cd23650a5a16a2a9fecb9d`
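A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib module. The helper names `sha256_of` and `matches_expected` are made up for this sketch, and the file path in the usage note assumes the archive was saved to the current directory.

```python
import hashlib

# SHA256 digest listed above for wildgram-0.5.7.tar.gz.
EXPECTED_SHA256 = "4ba5d8cce21593b21d4107e4fbe37ed06f59beaad6845d616c1ff23787cfce8c"

def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected digest."""
    return sha256_of(path) == expected
```

For example, `matches_expected("wildgram-0.5.7.tar.gz")` should return True for an intact copy of the source distribution.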
