

EasyNMT - Easy to use, state-of-the-art Neural Machine Translation

This package provides easy-to-use, state-of-the-art machine translation for more than 100 languages. The highlights of this package are:

  • Easy installation and usage: Use state-of-the-art machine translation with 3 lines of code
  • Automatic download of pre-trained machine translation models
  • Translation between 150+ languages
  • Automatic language detection for 170+ languages
  • Sentence and document translation
  • Multi-GPU and multi-process translation

Currently, we provide the models listed in the Available Models section below.

Examples

Docker & REST-API

We provide a ready-to-use Docker image that wraps EasyNMT in a REST API:

docker run -p 24080:80 easynmt/api:2.0-cpu

Call the REST API:

http://localhost:24080/translate?target_lang=en&text=Hallo%20Welt
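
For example, you can query this endpoint from Python (a minimal sketch; it assumes the Docker container from above is running locally and that the endpoint returns its result as JSON):

import requests

# Query the local EasyNMT REST API started with the Docker command above.
# We print the raw JSON response, since the exact response fields may vary between API versions.
response = requests.get('http://localhost:24080/translate',
                        params={'target_lang': 'en', 'text': 'Hallo Welt'})
print(response.json())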

For more information on the different Docker images and the REST API endpoints, see docker/.

Also have a look at our EasyNMT Google Colab REST API hosting example to see how to host a translation API with Google Colab and a free GPU.

Python Installation

You can install the package via:

pip install -U easynmt

The models are based on PyTorch. If you have a GPU available, see how to install PyTorch with GPU support. If you use Windows and run into issues during installation, see this issue for a possible solution.
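
If you want to verify that PyTorch can see your GPU before loading a model, you can run the following quick check (a minimal sketch that uses PyTorch directly, not the EasyNMT API):

import torch

# True means EasyNMT can run its translation models on the GPU.
print(torch.cuda.is_available())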

Usage

Usage is simple:

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

#Translate a single sentence to German
print(model.translate('This is a sentence we want to translate to German', target_lang='de'))

#Translate several sentences to German
sentences = ['You can define a list with sentences.',
             'All sentences are translated to your target language.',
             'Note, you could also mix the languages of the sentences.']
print(model.translate(sentences, target_lang='de'))

Document Translation

The available models are based on the Transformer architecture, which provides state-of-the-art translation quality. However, the input length is limited to 512 word pieces for the opus-mt model and to 1024 word pieces for the M2M models.

The translate() function performs automatic sentence splitting so that longer documents can be translated:

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

document = """Berlin is the capital and largest city of Germany by both area and population.[6][7] Its 3,769,495 inhabitants as of 31 December 2019[2] make it the most-populous city of the European Union, according to population within city limits.[8] The city is also one of Germany's 16 federal states. It is surrounded by the state of Brandenburg, and contiguous with Potsdam, Brandenburg's capital. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants and an area of more than 30,000 km2,[9] Germany's third-largest metropolitan region after the Rhine-Ruhr and Rhine-Main regions. Berlin straddles the banks of the River Spree, which flows into the River Havel (a tributary of the River Elbe) in the western borough of Spandau. Among the city's main topographical features are the many lakes in the western and southeastern boroughs formed by the Spree, Havel, and Dahme rivers (the largest of which is Lake Müggelsee). Due to its location in the European Plain, Berlin is influenced by a temperate seasonal climate. About one-third of the city's area is composed of forests, parks, gardens, rivers, canals and lakes.[10] The city lies in the Central German dialect area, the Berlin dialect being a variant of the Lusatian-New Marchian dialects.

First documented in the 13th century and at the crossing of two important historic trade routes,[11] Berlin became the capital of the Margraviate of Brandenburg (1417–1701), the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–1933), and the Third Reich (1933–1945).[12] Berlin in the 1920s was the third-largest municipality in the world.[13] After World War II and its subsequent occupation by the victorious countries, the city was divided; West Berlin became a de facto West German exclave, surrounded by the Berlin Wall (1961–1989) and East German territory.[14] East Berlin was declared capital of East Germany, while Bonn became the West German capital. Following German reunification in 1990, Berlin once again became the capital of all of Germany.

Berlin is a world city of culture, politics, media and science.[15][16][17][18] Its economy is based on high-tech firms and the service sector, encompassing a diverse range of creative industries, research facilities, media corporations and convention venues.[19][20] Berlin serves as a continental hub for air and rail traffic and has a highly complex public transportation network. The metropolis is a popular tourist destination.[21] Significant industries also include IT, pharmaceuticals, biomedical engineering, clean tech, biotechnology, construction and electronics."""

#Translate the document to German
print(model.translate(document, target_lang='de'))

The function breaks the document down into sentences and then translates these sentences one by one with the specified model.
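
To make this behavior more concrete, the sketch below reuses the document from the example above, splits it manually, and passes the resulting list to translate(). The naive split on periods is purely illustrative; EasyNMT's built-in sentence splitting is more robust.

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

# Purely illustrative: split the document (defined in the example above) on periods
# and translate the sentences as a list, similar to what translate() does internally
# with a proper sentence splitter.
sentences = [s.strip() for s in document.split('.') if s.strip()]
translations = model.translate(sentences, target_lang='de')
print(' '.join(translations))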

Automatic Language Detection

You can set source_lang for the translate method to define the source language. If source_lang is not set, fastText automatically determines the source language. This also allows you to pass a list of sentences / documents in various languages:

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

#Translate several sentences to English
sentences = ['Dies ist ein Satz in Deutsch.',   #This is a German sentence
             '这是一个中文句子',    #This is a Chinese sentence
             'Esta es una oración en español.'] #This is a Spanish sentence
print(model.translate(sentences, target_lang='en'))
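
If you already know the source language, you can pass source_lang explicitly and skip the fastText detection step (a small sketch using the documented source_lang parameter):

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

# State explicitly that the input is German, so no automatic language detection is needed.
print(model.translate('Dies ist ein Satz in Deutsch.', source_lang='de', target_lang='en'))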

Available Models

The following models are currently available. They provide translations between more than 150 languages.

Model          | Reference         | #Languages | Size   | GPU speed (sentences/sec on V100) | CPU speed (sentences/sec) | Comment
opus-mt        | Helsinki-NLP      | 186        | 300 MB | 50                                | 6                         | Individual model (~300 MB) per translation direction
mbart50_m2m    | Facebook Research | 52         | 2.3 GB | 25                                | -                         |
mbart50_m2en   | Facebook Research | 52         | 2.3 GB | 25                                | -                         | Can only translate from the other languages to English.
mbart50_en2m   | Facebook Research | 52         | 2.3 GB | 25                                | -                         | Can only translate from English to the other languages.
m2m_100_418M   | Facebook Research | 100        | 1.8 GB | 22                                | -                         |
m2m_100_1.2B   | Facebook Research | 100        | 5.0 GB | 13                                | -                         |

Translation Quality

A comparison of the translation quality of the models will be added here soon. So far, my personal, subjective impression is that opus-mt and m2m_100_1.2B produce the best translations.

Opus-MT

We provide a wrapper for the pre-trained models from Opus-MT.

Opus-MT provides more than 1,200 different translation models, each of which translates one direction (e.g., from German to English). Each model is about 300 MB in size.

Supported languages: aav, aed, af, alv, am, ar, art, ase, az, bat, bcl, be, bem, ber, bg, bi, bn, bnt, bzs, ca, cau, ccs, ceb, cel, chk, cpf, crs, cs, csg, csn, cus, cy, da, de, dra, ee, efi, el, en, eo, es, et, eu, euq, fi, fj, fr, fse, ga, gaa, gil, gl, grk, guw, gv, ha, he, hi, hil, ho, hr, ht, hu, hy, id, ig, ilo, is, iso, it, ja, jap, ka, kab, kg, kj, kl, ko, kqn, kwn, kwy, lg, ln, loz, lt, lu, lua, lue, lun, luo, lus, lv, map, mfe, mfs, mg, mh, mk, mkh, ml, mos, mr, ms, mt, mul, ng, nic, niu, nl, no, nso, ny, nyk, om, pa, pag, pap, phi, pis, pl, pon, poz, pqe, pqw, prl, pt, rn, rnd, ro, roa, ru, run, rw, sal, sg, sh, sit, sk, sl, sm, sn, sq, srn, ss, ssp, st, sv, sw, swc, taw, tdt, th, ti, tiv, tl, tll, tn, to, toi, tpi, tr, trk, ts, tum, tut, tvl, tw, ty, tzo, uk, umb, ur, ve, vi, vsl, wa, wal, war, wls, xh, yap, yo, yua, zai, zh, zne

Usage

from easynmt import EasyNMT
model = EasyNMT('opus-mt', max_loaded_models=10)

The system automatically detects a suitable Opus-MT model and loads it. With the optional parameter max_loaded_models you can specify the maximum number of models that are loaded at the same time. If you translate a language direction that has not been seen before, the oldest model is unloaded and the new model is loaded.
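
For example, translating into several directions with one instance loads (and, once max_loaded_models is exceeded, unloads) the individual Opus-MT models on demand (a small sketch based on the translate() call shown above):

from easynmt import EasyNMT
model = EasyNMT('opus-mt', max_loaded_models=10)

# Each new translation direction causes the corresponding Opus-MT model to be downloaded and loaded.
print(model.translate('This is a sentence we want to translate.', target_lang='de'))
print(model.translate('This is a sentence we want to translate.', target_lang='fr'))
print(model.translate('This is a sentence we want to translate.', target_lang='es'))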

mBART_50

We provide a wrapper for Facebook's mBART50 model, which can translate between any pair of 50+ languages. There are also models for translating from English to these languages, or vice versa (see the example after the language list below).

Usage

from easynmt import EasyNMT
model = EasyNMT('mbart50_m2m')

Supported languages: af, ar, az, bn, cs, de, en, es, et, fa, fi, fr, gl, gu, he, hi, hr, id, it, ja, ka, kk, km, ko, lt, lv, mk, ml, mn, mr, my, ne, nl, pl, ps, pt, ro, ru, si, sl, sv, sw, ta, te, th, tl, tr, uk, ur, vi, xh, zh
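
The English-only variants from the model table above are loaded in the same way (a small sketch; mbart50_en2m only translates from English, mbart50_m2en only translates to English):

from easynmt import EasyNMT

# Only translates from English into the other languages.
en2m_model = EasyNMT('mbart50_en2m')
print(en2m_model.translate('Hello world', source_lang='en', target_lang='de'))

# Only translates from the other languages into English.
m2en_model = EasyNMT('mbart50_m2en')
print(m2en_model.translate('Hallo Welt', target_lang='en'))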

M2M_100

We provide a wrapper for Facebook's M2M 100 model, which can translate between any pair of 100 languages.

Supported languages: af, am, ar, ast, az, ba, be, bg, bn, br, bs, ca, ceb, cs, cy, da, de, el, en, es, et, fa, ff, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, ht, hu, hy, id, ig, ilo, is, it, ja, jv, ka, kk, km, kn, ko, lb, lg, ln, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, ns, oc, or, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, so, sq, sr, ss, su, sv, sw, ta, th, tl, tn, tr, uk, ur, uz, vi, wo, xh, yi, yo, zh, zu

Currently, we provide wrappers for two M2M 100 models:

  • m2m_100_418M: M2M model with 418 million parameters (1.8 GB)
  • m2m_100_1.2B: M2M model with 1.2 billion parameters (5.0 GB)

Usage

from easynmt import EasyNMT
model = EasyNMT('m2m_100_418M')   #or: EasyNMT('m2m_100_1.2B') 

You can find more information here. Note: the 12 billion parameter (12B) M2M model is currently not supported.

As soon as you call EasyNMT('m2m_100_418M') / EasyNMT('m2m_100_1.2B'), the respective model is downloaded and cached locally.

Author

Contact person: [Nils Reimers](https://www.nils-reimers.de); [info@nils-reimers.de](mailto:info@nils-reimers.de)

https://www.ukp.tu-darmstadt.de/

If you have any questions or run into any problems (which should not happen), please don't hesitate to send us an email or open an issue.

This repository contains experimental software to encourage future research.

