
Topic modeling with latent Dirichlet allocation

Project description


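Note: This package is in maintenance mode. Critical bugs will be fixed. No new features will be added.

lda implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. lda is fast and is tested on Linux, OS X, and Windows.

You can read more about lda in the documentation.

Installation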

pip install lda

Getting started

lda.LDA implements latent Dirichlet allocation (LDA). The interface follows conventions found in scikit-learn.
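The model is fit on a matrix of word counts with one row per document and one column per vocabulary word. As a minimal sketch of how such a matrix could be built by hand, the snippet below counts words in a tiny tokenized corpus using only NumPy. The corpus and the names docs and word_to_col are hypothetical and are not part of lda.

```python
import numpy as np

# Hypothetical toy corpus, already tokenized; not part of the lda package.
docs = [
    ["pope", "vatican", "rome", "pope"],
    ["church", "rome", "church"],
    ["music", "tour", "music", "opera"],
]

# Sort the vocabulary so the column order is deterministic.
vocab = sorted({word for doc in docs for word in doc})
word_to_col = {word: j for j, word in enumerate(vocab)}

# X[i, j] counts how often vocab[j] occurs in document i.
X = np.zeros((len(docs), len(vocab)), dtype=np.int64)
for i, doc in enumerate(docs):
    for word in doc:
        X[i, word_to_col[word]] += 1

print(vocab)
print(X)
```

For a real corpus most entries are zero, which is why a sparse matrix is usually the more practical container.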

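The following demonstrates how to inspect a model of a subset of the Reuters news dataset. The input below, X, is a document-term matrix (sparse matrices are accepted).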

>>> import numpy as np
>>> import lda
>>> import lda.datasets
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X)  # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_  # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: british churchill sale million major letters west britain
Topic 1: church government political country state people party against
Topic 2: elvis king fans presley life concert young death
Topic 3: yeltsin russian russia president kremlin moscow michael operation
Topic 4: pope vatican paul john surgery hospital pontiff rome
Topic 5: family funeral police miami versace cunanan city service
Topic 6: simpson former years court president wife south church
Topic 7: order mother successor election nuns church nirmala head
Topic 8: charles prince diana royal king queen parker bowles
Topic 9: film french france against bardot paris poster animal
Topic 10: germany german war nazi letter christian book jews
Topic 11: east peace prize award timor quebec belo leader
Topic 12: n't life show told very love television father
Topic 13: years year time last church world people say
Topic 14: mother teresa heart calcutta charity nun hospital missionaries
Topic 15: city salonika capital buddhist cultural vietnam byzantine show
Topic 16: music tour opera singer israel people film israeli
Topic 17: church catholic bernardin cardinal bishop wright death cancer
Topic 18: harriman clinton u.s ambassador paris president churchill france
Topic 19: city museum art exhibition century million churches set
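The indexing expression used above to pick the top words is compact but easy to misread. The sketch below unpacks it on a hypothetical five-word vocabulary with made-up probabilities, and shows an equivalent spelling that reverses the sort order first.

```python
import numpy as np

# Hypothetical vocabulary and word probabilities for a single topic.
vocab = ("art", "church", "city", "museum", "pope")
topic_dist = np.array([0.30, 0.05, 0.40, 0.20, 0.05])
n_top_words = 3

# argsort is ascending, so the most probable words sit at the end;
# the slice [:-(n + 1):-1] walks backwards over the last n entries.
order = np.argsort(topic_dist)
top_words = np.array(vocab)[order][:-(n_top_words + 1):-1]

# Equivalent spelling: reverse the order first, then take the first n.
top_words_alt = np.array(vocab)[order[::-1][:n_top_words]]

print(' '.join(top_words))
```

Both forms return the words from most to least probable, which matches the ordering of the topic listing above.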

The document-topic distributions are available in model.doc_topic_.

>>> doc_topic = model.doc_topic_
>>> for i in range(10):
...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 8)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 13)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 14)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 8)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 14)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 14)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 14)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 14)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 14)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 8)
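Since each row of the document-topic matrix is a distribution over topics, the same argmax idea extends to grouping documents by their dominant topic. The sketch below uses a small hypothetical matrix in place of model.doc_topic_ so that it runs without fitting a model.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 4 topics, rows sum to 1.
doc_topic = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.25, 0.25, 0.45, 0.05],
])

top_topic = doc_topic.argmax(axis=1)  # most probable topic per document
top_prob = doc_topic.max(axis=1)      # probability of that topic

# Map each dominant topic to the indices of the documents it leads.
docs_by_topic = {
    int(k): np.flatnonzero(top_topic == k).tolist()
    for k in np.unique(top_topic)
}

print(top_topic)
print(docs_by_topic)
```

Looking at top_prob alongside top_topic helps spot documents whose dominant topic is only weakly dominant, such as the third row here.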

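Requirements

Python ≥3.10 and NumPy.

Caveat

lda aims for simplicity. (It happens to be fast, as essential parts are written in C via Cython.) If you are working with a very large corpus you may wish to use more sophisticated topic models such as those implemented in hca and MALLET. hca is written entirely in C and MALLET is written in Java. Unlike lda, hca can use more than one processor at a time. Both MALLET and hca implement topic models known to be more robust than standard latent Dirichlet allocation.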

Notes

Latent Dirichlet allocation is described in Blei et al. (2003) and Pritchard et al. (2000). Inference using collapsed Gibbs sampling is described in Griffiths and Steyvers (2004).
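To make the inference method concrete, here is a toy collapsed Gibbs sampler written in plain NumPy. It is an illustrative sketch of the general technique, not the Cython implementation shipped with lda. The function name collapsed_gibbs_lda, the hyperparameter values and the test matrix are all assumptions made for this example. Each token's topic is resampled from a conditional proportional to (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta), where the counts exclude the token being resampled.

```python
import numpy as np

def collapsed_gibbs_lda(X, n_topics, n_iter=50, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA on a dense integer count matrix X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=np.int64)
    n_docs, n_vocab = X.shape

    # Expand the count matrix into one (document, word) pair per token.
    rows, cols = np.nonzero(X)
    counts = X[rows, cols]
    doc_ids = np.repeat(rows, counts)
    word_ids = np.repeat(cols, counts)
    n_tokens = doc_ids.size

    # Random initial topic assignments and the count tables they imply.
    z = rng.integers(n_topics, size=n_tokens)
    ndz = np.zeros((n_docs, n_topics))   # tokens in doc d assigned to topic k
    nzw = np.zeros((n_topics, n_vocab))  # tokens of word w assigned to topic k
    nz = np.zeros(n_topics)              # all tokens assigned to topic k
    np.add.at(ndz, (doc_ids, z), 1)
    np.add.at(nzw, (z, word_ids), 1)
    np.add.at(nz, z, 1)

    for _ in range(n_iter):
        for i in range(n_tokens):
            d, w, k = doc_ids[i], word_ids[i], z[i]
            # Remove token i from the counts.
            ndz[d, k] -= 1
            nzw[k, w] -= 1
            nz[k] -= 1
            # Conditional over topics given all other assignments (unnormalized).
            p = (ndz[d] + alpha) * (nzw[:, w] + eta) / (nz + n_vocab * eta)
            k = rng.choice(n_topics, p=p / p.sum())
            # Add token i back under its newly sampled topic.
            z[i] = k
            ndz[d, k] += 1
            nzw[k, w] += 1
            nz[k] += 1

    # Point estimates of the topic-word and document-topic distributions.
    topic_word = (nzw + eta) / (nzw + eta).sum(axis=1, keepdims=True)
    doc_topic = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)
    return topic_word, doc_topic

# Hypothetical corpus: two documents use words 0-1, two use words 2-3.
X = np.array([[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 4], [0, 0, 5, 3]])
topic_word, doc_topic = collapsed_gibbs_lda(X, n_topics=2)
print(np.round(doc_topic, 2))
```

The per-token Python loop is what makes this sketch slow. Compiling that inner loop is the usual way to make such a sampler practical.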

Other implementations

License

lda is licensed under Version 2.0 of the Mozilla Public License.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source distribution

lda-3.0.2.tar.gz (165.7 kB): Source

Built distributions

lda-3.0.2-cp312-cp312-win_amd64.whl (380.2 kB): CPython 3.12, Windows x86-64

lda-3.0.2-cp312-cp312-musllinux_1_2_x86_64.whl (375.2 kB): CPython 3.12, musllinux: musl 1.2+ x86-64

lda-3.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (368.1 kB): CPython 3.12, manylinux: glibc 2.17+ x86-64

lda-3.0.2-cp312-cp312-macosx_14_0_arm64.whl (274.5 kB): CPython 3.12, macOS 14.0+ ARM64

lda-3.0.2-cp311-cp311-win_amd64.whl (355.9 kB): CPython 3.11, Windows x86-64

lda-3.0.2-cp311-cp311-musllinux_1_2_x86_64.whl (352.6 kB): CPython 3.11, musllinux: musl 1.2+ x86-64

lda-3.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (347.7 kB): CPython 3.11, manylinux: glibc 2.17+ x86-64

lda-3.0.2-cp311-cp311-macosx_14_0_arm64.whl (267.9 kB): CPython 3.11, macOS 14.0+ ARM64

lda-3.0.2-cp310-cp310-win_amd64.whl (356.4 kB): CPython 3.10, Windows x86-64

lda-3.0.2-cp310-cp310-musllinux_1_2_x86_64.whl (354.3 kB): CPython 3.10, musllinux: musl 1.2+ x86-64

lda-3.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (350.0 kB): CPython 3.10, manylinux: glibc 2.17+ x86-64

lda-3.0.2-cp310-cp310-macosx_14_0_arm64.whl (270.0 kB): CPython 3.10, macOS 14.0+ ARM64
