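Topic modeling with latent Dirichlet allocation
Project description
NOTE: This package is in maintenance mode. Critical bugs will be fixed. No new features will be added.
lda implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. lda is fast and is tested on Linux, OS X, and Windows.
You can read more about lda in the documentation.
Installation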
pip install lda
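Getting started
lda.LDA implements latent Dirichlet allocation (LDA). The interface follows conventions found in scikit-learn.
The following demonstrates how to inspect a model fitted to a subset of the Reuters news dataset. The input below, X, is a document-term matrix (sparse matrices are accepted).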
>>> import numpy as np
>>> import lda
>>> import lda.datasets
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X) # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_ # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: british churchill sale million major letters west britain
Topic 1: church government political country state people party against
Topic 2: elvis king fans presley life concert young death
Topic 3: yeltsin russian russia president kremlin moscow michael operation
Topic 4: pope vatican paul john surgery hospital pontiff rome
Topic 5: family funeral police miami versace cunanan city service
Topic 6: simpson former years court president wife south church
Topic 7: order mother successor election nuns church nirmala head
Topic 8: charles prince diana royal king queen parker bowles
Topic 9: film french france against bardot paris poster animal
Topic 10: germany german war nazi letter christian book jews
Topic 11: east peace prize award timor quebec belo leader
Topic 12: n't life show told very love television father
Topic 13: years year time last church world people say
Topic 14: mother teresa heart calcutta charity nun hospital missionaries
Topic 15: city salonika capital buddhist cultural vietnam byzantine show
Topic 16: music tour opera singer israel people film israeli
Topic 17: church catholic bernardin cardinal bishop wright death cancer
Topic 18: harriman clinton u.s ambassador paris president churchill france
Topic 19: city museum art exhibition century million churches set
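The slice `[:-(n_top_words+1):-1]` in the loop above takes the last n_top_words entries of the ascending argsort and reverses them, so the highest-probability words come first. Below is a minimal pure-Python sketch of the same selection. The helper name `top_words`, the five-word vocabulary, and the distribution are made up for illustration and are not values from the Reuters model.

```python
def top_words(topic_dist, vocab, n_top_words):
    """Return the n_top_words vocabulary entries with the highest probability."""
    # Indices sorted by ascending probability, like np.argsort(topic_dist).
    ascending = sorted(range(len(topic_dist)), key=lambda j: topic_dist[j])
    # Same slice as in the example above: walk backwards from the end.
    top = ascending[:-(n_top_words + 1):-1]
    return [vocab[j] for j in top]


# Hypothetical toy data, for illustration only.
toy_vocab = ["church", "pope", "film", "music", "city"]
toy_dist = [0.10, 0.40, 0.05, 0.15, 0.30]

print(top_words(toy_dist, toy_vocab, 3))  # ['pope', 'city', 'music']
```

With tied probabilities the order among the tied words may differ from what np.argsort produces, so the sketch avoids ties.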
The document-topic distributions are available in model.doc_topic_.
>>> doc_topic = model.doc_topic_
>>> for i in range(10):
...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 8)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 13)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 14)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 8)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 14)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 14)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 14)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 14)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 14)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 8)
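The input X used above is a document-term matrix: row i holds the word counts for document i, and column j corresponds to the j-th vocabulary entry. For a corpus that is not bundled with lda, such a matrix could be built roughly as follows. This is only a sketch: the helper name `build_doc_term_matrix` and the toy documents are hypothetical, and the result is a plain list of lists that would still need converting to an integer NumPy array (for example with `np.array(rows)`) before fitting.

```python
from collections import Counter


def build_doc_term_matrix(tokenized_docs):
    """Return (vocab, rows) where rows[i][j] counts vocab[j] in document i."""
    # A sorted vocabulary gives a stable column order.
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Counter returns 0 for words that do not occur in this document.
        rows.append([counts[word] for word in vocab])
    return vocab, rows


# Hypothetical toy corpus, already tokenized.
docs = [
    ["pope", "vatican", "pope"],
    ["film", "paris", "film", "film"],
    ["pope", "paris"],
]
vocab, rows = build_doc_term_matrix(docs)
print(vocab)                # ['film', 'paris', 'pope', 'vatican']
print(rows)                 # [[0, 0, 2, 1], [3, 1, 0, 0], [0, 1, 1, 0]]
print(sum(map(sum, rows)))  # 9, the total token count (analogous to X.sum())
```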
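Requirements
Python ≥3.10 and NumPy.
Caveat
lda aims for simplicity. (It also happens to be fast, because essential parts are written in C via Cython.) If you are working with a very large corpus, you may wish to use more sophisticated topic models such as those implemented in hca and MALLET. hca is written entirely in C and MALLET is written in Java. Unlike lda, hca can use more than one processor at a time. Both MALLET and hca implement topic models known to be more robust than standard latent Dirichlet allocation.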
Notes
Latent Dirichlet allocation is described in Blei et al. (2003) and Pritchard et al. (2000). Inference using collapsed Gibbs sampling is described in Griffiths and Steyvers (2004).
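To make the inference step concrete, here is a minimal pure-Python sketch of collapsed Gibbs sampling for LDA, in the style described by Griffiths and Steyvers (2004). It is not lda's actual Cython implementation. The function name `collapsed_gibbs_lda`, the hyperparameter values for `alpha` and `eta`, and the toy corpus are assumptions chosen for illustration.

```python
import random


def collapsed_gibbs_lda(docs, n_vocab, n_topics, n_iter=200,
                        alpha=0.1, eta=0.01, seed=1):
    """Toy collapsed Gibbs sampler for LDA (illustrative sketch only).

    docs is a list of documents, each a list of integer word ids.
    Returns (doc_topic, topic_word) as row-normalized lists of lists.
    """
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]            # topic counts per document
    nkw = [[0] * n_vocab for _ in range(n_topics)]  # word counts per topic
    nk = [0] * n_topics                             # total tokens per topic
    z = []                                          # topic of every token

    # Start from a random topic assignment for each token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Conditional probability of each topic given all other
                # assignments, up to a normalizing constant.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + eta)
                    / (nk[t] + n_vocab * eta)
                    for t in range(n_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1

    # Smoothed point estimates from the final counts.
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in ndk[d]]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(c + eta) / (nk[k] + n_vocab * eta) for c in nkw[k]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word


# Hypothetical toy corpus: word ids 0-1 and 2-3 tend to co-occur.
toy_docs = [[0, 0, 1], [2, 3, 3], [0, 1, 1], [3, 2, 2]]
doc_topic, topic_word = collapsed_gibbs_lda(toy_docs, n_vocab=4, n_topics=2)
for row in doc_topic:
    print([round(p, 3) for p in row])
```

Each row of doc_topic and topic_word sums to 1 by construction. The sketch returns estimates from the final sample only and makes no attempt at convergence checks or speed.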
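Other implementations
scikit-learn's LatentDirichletAllocation (uses online variational inference)
gensim (uses online variational inference)
License
lda is licensed under Version 2.0 of the Mozilla Public License.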
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source distribution
lda-3.0.2.tar.gz (165.7 kB)
Built distributions
lda-3.0.2-cp312-cp312-win_amd64.whl (380.2 kB)
lda-3.0.2-cp312-cp312-macosx_14_0_arm64.whl (274.5 kB)
lda-3.0.2-cp311-cp311-win_amd64.whl (355.9 kB)
lda-3.0.2-cp311-cp311-macosx_14_0_arm64.whl (267.9 kB)
lda-3.0.2-cp310-cp310-win_amd64.whl (356.4 kB)
lda-3.0.2-cp310-cp310-macosx_14_0_arm64.whl (270.0 kB)