A parser for Wikipedia citation templates
wikiciteparser

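This Python library wraps Wikipedia's citation processing code (written in Lua) to parse citation templates. For instance, there are many different ways to specify the authors of a citation: this code maps all of them to the same representation.
Distributed under the MIT license.
Built on the following Python libraries: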
lupa
mwparserfromhell
mwxml
mwtypes
To install lupa from pypi.org, you will need libpython and liblua (libpython-dev and liblua5.2-dev on Debian-based Linux systems).
Install this package with pip (for example, in a virtualenv):
pip install wikiciteparser
Example
Let us parse the reference section of this article:
import mwparserfromhell
from wikiciteparser.parser import parse_citation_template
mwtext = """
===Articles===
* {{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | last2=Moser | first2=L. | title=Inverse and Complementary Sequences of Natural Numbers| doi=10.2307/2308078 | mr=0062777 | journal=[[American Mathematical Monthly|The American Mathematical Monthly]] | issn=0002-9890 | volume=61 | issue=7 | pages=454–458 | year=1954 | jstor=2308078 | publisher=The American Mathematical Monthly, Vol. 61, No. 7}}
* {{Citation | last1=Lambek | first1=J. | author1-link=Joachim Lambek | title=The Mathematics of Sentence Structure | year=1958 | journal=[[American Mathematical Monthly|The American Mathematical Monthly]] | issn=0002-9890 | volume=65 | pages=154–170 | doi=10.2307/2310058 | issue=3 | publisher=The American Mathematical Monthly, Vol. 65, No. 3 | jstor=1480361}}
*{{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | title=Bicommutators of nice injectives | doi=10.1016/0021-8693(72)90034-8 | mr=0301052 | year=1972 | journal=Journal of Algebra | issn=0021-8693 | volume=21 | pages=60–73}}
*{{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | title=Localization and completion | doi=10.1016/0022-4049(72)90011-4 | mr=0320047 | year=1972 | journal=Journal of Pure and Applied Algebra | issn=0022-4049 | volume=2 | pages=343–370 | issue=4}}
*{{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | title=A mathematician looks at Latin conjugation | mr=589163 | year=1979 | journal=Theoretical Linguistics | issn=0301-4428 | volume=6 | issue=2 | pages=221–234 | doi=10.1515/thli.1979.6.1-3.221}}
"""
wikicode = mwparserfromhell.parse(mwtext)
for tpl in wikicode.filter_templates():
    parsed = parse_citation_template(tpl)
    print(parsed)
Here is what you will get:
{'PublisherName': 'The American Mathematical Monthly, Vol. 61, No. 7', 'Title': 'Inverse and Complementary Sequences of Natural Numbers', 'ID_list': {'DOI': '10.2307/2308078', 'ISSN': '0002-9890', 'MR': '0062777', 'JSTOR': '2308078'}, 'Periodical': 'The American Mathematical Monthly', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}, {'last': 'Moser', 'first': 'L.'}], 'Date': '1954', 'Pages': '454-458'}
{'PublisherName': 'The American Mathematical Monthly, Vol. 65, No. 3', 'Title': 'The Mathematics of Sentence Structure', 'ID_list': {'DOI': '10.2307/2310058', 'ISSN': '0002-9890', 'JSTOR': '1480361'}, 'Periodical': 'The American Mathematical Monthly', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'J.'}], 'Date': '1958', 'Pages': '154-170'}
{'Title': 'Bicommutators of nice injectives', 'ID_list': {'DOI': '10.1016/0021-8693(72)90034-8', 'ISSN': '0021-8693', 'MR': '0301052'}, 'Periodical': 'Journal of Algebra', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}], 'Date': '1972', 'Pages': '60-73'}
{'Title': 'Localization and completion', 'ID_list': {'DOI': '10.1016/0022-4049(72)90011-4', 'ISSN': '0022-4049', 'MR': '0320047'}, 'Periodical': 'Journal of Pure and Applied Algebra', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}], 'Date': '1972', 'Pages': '343-370'}
{'Title': 'A mathematician looks at Latin conjugation', 'ID_list': {'DOI': '10.1515/thli.1979.6.1-3.221', 'ISSN': '0301-4428', 'MR': '589163'}, 'Periodical': 'Theoretical Linguistics', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}], 'Date': '1979', 'Pages': '221-234'}
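The dictionaries above are plain Python data, so they can be post-processed without any further dependency. The following is a minimal sketch, assuming the key names shown in the output above (`Authors`, `ID_list`, `first`, `last`); the helper names `format_authors` and `get_identifier` are hypothetical and are not part of wikiciteparser.

```python
def format_authors(citation):
    """Return a list of 'First Last' strings, skipping missing name parts."""
    names = []
    for author in citation.get("Authors", []):
        parts = [author.get("first"), author.get("last")]
        names.append(" ".join(part for part in parts if part))
    return names


def get_identifier(citation, scheme):
    """Return the identifier for a scheme such as 'DOI', or None if absent."""
    return citation.get("ID_list", {}).get(scheme)


# The first parsed citation from the output above, abbreviated to the
# fields used here.
parsed = {
    "Title": "Inverse and Complementary Sequences of Natural Numbers",
    "ID_list": {"DOI": "10.2307/2308078", "ISSN": "0002-9890",
                "MR": "0062777", "JSTOR": "2308078"},
    "Authors": [
        {"link": "Joachim Lambek", "last": "Lambek", "first": "Joachim"},
        {"last": "Moser", "first": "L."},
    ],
    "Date": "1954",
}

print(format_authors(parsed))         # ['Joachim Lambek', 'L. Moser']
print(get_identifier(parsed, "DOI"))  # 10.2307/2308078
print(get_identifier(parsed, "PMID")) # None
```

Using `dict.get` throughout keeps the helpers tolerant of citations that lack authors or identifiers, which the output above suggests is common (not every entry has `PublisherName` or `JSTOR`).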
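Bulk processing
There is also a bulk processing mode, which works with the bulk XML dump files from dumps.wikimedia.org.
With wikiciteparser installed (or available on the current Python path), run a command like the following from a UNIX shell: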
python3 -m wikiciteparser.bulk enwiki-20210801-pages-articles1.xml-p1p41242.bz2 | pv -l | gzip > enwiki-20210801-pages-articles1.xml-p1p41242.citations.json.gz
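The compressed file produced by the command above can be read back with the standard library. This is a sketch only: it assumes the output contains one JSON document per line (suggested by the line-counting `pv -l` in the pipeline, but not stated here), and it makes no assumption about the fields inside each document. The function name `iter_json_lines` is hypothetical.

```python
import gzip
import json


def iter_json_lines(path):
    """Yield one decoded JSON document per non-empty line of a gzip file.

    Assumption: the file is line-delimited JSON.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `sum(1 for _ in iter_json_lines("enwiki-20210801-pages-articles1.xml-p1p41242.citations.json.gz"))` would count the records without loading the whole file into memory.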
GNU parallel can be used to process multiple files concurrently on a single machine.
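Where GNU parallel is not available, a similar effect can be approximated from Python. The sketch below is a stand-in for GNU parallel, not a use of it: it launches one `wikiciteparser.bulk` process per dump file and gzips each process's standard output, mirroring the shell pipeline above. It assumes the bulk module writes its results to standard output, as that pipeline implies. All function names here are hypothetical.

```python
import gzip
import shutil
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def output_path(dump_path):
    """Map 'X.bz2' to 'X.citations.json.gz' in the same directory."""
    dump_path = Path(dump_path)
    stem = dump_path.name
    if stem.endswith(".bz2"):
        stem = stem[: -len(".bz2")]
    return dump_path.with_name(stem + ".citations.json.gz")


def build_command(dump_path):
    """Build the bulk-mode command line for one dump file."""
    return [sys.executable, "-m", "wikiciteparser.bulk", str(dump_path)]


def process_dump(dump_path):
    """Run bulk mode on one dump and stream its stdout into a gzip file."""
    target = output_path(dump_path)
    with subprocess.Popen(build_command(dump_path), stdout=subprocess.PIPE) as proc:
        with gzip.open(target, "wb") as compressed:
            shutil.copyfileobj(proc.stdout, compressed)
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, proc.args)
    return target


def process_all(dump_paths, workers=4):
    """Process several dumps concurrently; threads only wait on subprocesses."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_dump, dump_paths))
```

Threads are sufficient here because the heavy work happens in the child processes; `workers` should roughly match the number of CPU cores available.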