Kana kanji simple inversion library

Pykakasi

Overview

pykakasi is a Python Natural Language Processing (NLP) library to transliterate hiragana, katakana and kanji (Japanese text) into rōmaji (Latin/Roman alphabet). It can handle characters in NFC form.
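Text from some sources arrives in decomposed form, where a base kana and a combining voiced sound mark are separate code points. The following is a minimal sketch, using only the standard library, of normalizing such text to NFC before conversion. The helper name `to_nfc` is hypothetical and not part of pykakasi, and whether pykakasi requires this step is an assumption.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text normalized to Unicode NFC (composed) form."""
    return unicodedata.normalize("NFC", text)


# "か" (U+304B) followed by the combining voiced sound mark (U+3099)
# is the decomposed form of "が" (U+304C).
decomposed = "\u304b\u3099"
composed = to_nfc(decomposed)
print(len(decomposed), len(composed))  # prints: 2 1
```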

Its algorithms are based on the kakasi library, which is written in C.

Supported Python versions

  • pykakasi supports Python 3.6, 3.7, 3.8, 3.9, and PyPy3

Usage

Transliterate Japanese text to kana, hiragana and romaji:

import pykakasi
kks = pykakasi.kakasi()
text = "かな漢字"
result = kks.convert(text)
for item in result:
    print("{}: kana '{}', hiragana: '{}', romaji: '{}'".format(item['orig'], item['kana'], item['hira'], item['hepburn']))

かな: kana 'カナ', hiragana: 'かな', romaji: 'kana'
漢字: kana 'カンジ', hiragana: 'かんじ', romaji: 'kanji'
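Since each converted segment is a dictionary, the per-segment readings can be joined into a single romanized string. The sketch below assumes the result has the shape shown in the output above; `join_romaji` is a hypothetical helper, and the sample list is copied by hand from that output so the snippet runs without pykakasi installed.

```python
def join_romaji(items, sep=" "):
    """Join the 'hepburn' field of each converted segment into one string."""
    return sep.join(item["hepburn"] for item in items)


# Hand-written sample mirroring the output shown above; in practice this
# would be the list returned by kks.convert(text).
sample = [
    {"orig": "かな", "kana": "カナ", "hira": "かな", "hepburn": "kana"},
    {"orig": "漢字", "kana": "カンジ", "hira": "かんじ", "hepburn": "kanji"},
]
print(join_romaji(sample))  # prints: kana kanji
```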

The following example produces output similar to furigana mode:

import pykakasi
kks = pykakasi.kakasi()
text = "かな漢字交じり文"
result = kks.convert(text)
for item in result:
    print("{}[{}] ".format(item['orig'], item['hepburn'].capitalize()), end='')
print()

かな[Kana] 漢字[Kanji] 交じり[Majiri] 文[Bun]
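The example above annotates every segment, including ones that are already kana. As a variation, the sketch below attaches a hiragana reading only to segments that contain kanji. Everything here is an illustrative assumption: `has_kanji` and `annotate` are hypothetical helpers, the kanji check covers only the basic CJK Unified Ideographs block, and the sample list is hand-written in the shape of a convert() result rather than produced by pykakasi.

```python
def has_kanji(text):
    """Return True if text has a character in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def annotate(items):
    """Append a [hiragana] reading only to segments that contain kanji."""
    parts = []
    for item in items:
        if has_kanji(item["orig"]):
            parts.append("{}[{}]".format(item["orig"], item["hira"]))
        else:
            parts.append(item["orig"])
    return " ".join(parts)


# Hand-written sample in the shape of a convert() result (assumed values).
sample = [
    {"orig": "かな", "hira": "かな"},
    {"orig": "漢字", "hira": "かんじ"},
    {"orig": "交じり", "hira": "まじり"},
    {"orig": "文", "hira": "ぶん"},
]
print(annotate(sample))  # prints: かな 漢字[かんじ] 交じり[まじり] 文[ぶん]
```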

Benchmark results

Benchmark results for various versions and platforms are available at https://github.com/miurahr/pykakasi/issues/123

PyKakasi ChangeLog

All notable changes to this project will be documented in this file.

Unreleased

Added

  • dictionary: add nouns and adjectives from UniDic (#140)

Changed

Fixed

  • Fix segmentation (wakati) of text that combines Katakana and Hiragana (#142)

Deprecated

Removed

Security

v2.1.1 (16, May 2021)

Added

  • Provide Kakasi.normalize(text) class method

  • Add unidic data into data (not used yet), and add parse utility.

Fixed

  • Put type hint stub into package

  • Copyright notifications

Changed

  • Expand all cletter into dictionary (#139)

  • Change primary kanwadict index from str to int

  • test: gather all legacy tests into test_pykakasi_legacy.py file.

v2.1.0 (6, May 2021)

Added

  • Deprecation warning when using old api(#124)

  • Add type hint file(pyi) (#124)

  • Benchmark test codes(#122)

Changed

  • Cache internal results, improving performance by about 30-40 times. (#128)

  • Use standard pickle for database file(#128)

  • Exceptions module is now pykakasi, not pykakasi.exceptions

Removed

  • Dependency for klepto(#128)

v2.0.8 (4, May 2021)

Added

  • test: Benchmark and profiling (#122)

Changed

  • Performance: avoid ord() when checking long-mark, speed up about 6%

  • Reformat code by black(#121)

v2.0.7 (26, Feb. 2021)

Fixed

  • Infinite loop after running for a while, handle independent HW VOICED SOUND MARK (#115, #118)

v2.0.6 (7, Feb. 2021)

Fixed

  • Hiragana for age counters (#116, #117)

v2.0.5 (5, Feb. 2021)

Changed

  • CLI: use argparse for option parsing (#113)

Fixed

  • Handle 思った、言った、行った properly.(#114)

  • CI: fix coveralls error

Deprecated

  • CI: drop travis-ci test and badge

v2.0.4 (26, Nov. 2020)

Fixed

  • CLI: Fix -v and -h option crash on python 3.7 and before (#108).

v2.0.3 (25, Nov. 2020)

Fixed

  • CLI: Fix -v and -h option crash (#108).

v2.0.2 (23, Jul. 2020)

Fixed

  • Fix convert() to handle Katakana correctly.(#103)

v2.0.1 (23, Jul. 2020)

Changed

  • Update setup.py, setup.cfg, tox.ini(#102)

Fixed

  • Fix convert() missing the last part of a text (#99, #100)

  • Fix CI, coverage, and coveralls configurations(#101)

v2.0.0 (31, May. 2020)
