Corpora

From sona pona, the Toki Pona wiki

This is a list of Toki Pona corpora, along with other sources that can readily be used as corpora, for linguistic analysis or for learning (machine or otherwise).

List

| Name | Size (MB) | Era | Authors | Fluency | Type | Parallel? | License | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| davidar's nltk-tp collection | <4.68[a] | ?–2017 | Many | Varies | Varies | Some English | None | Contains: jan Kipo's corpus (3.3 MB); Matthew Dean Martin's corpus (0.1 MB) |
| Tatoeba | 2.35 | 2010– | Many | Varies | Translated sentences | English, German, many others | CC BY 2.0 FR | Data quality marked in reviews |
| lipu tenpo | ~0.7 | 2021– | Some | High | Articles, poems | No | CC BY-SA 4.0 | Text can be extracted using the command-line utility pdftotext, though with imperfect formatting. Some articles are available in webpage form on the lipu tenpo website. |
| lipu kule | ? | 2021– | Some | High | Articles | No | CC BY-SA 4.0 | |
| ante toki pona | ? | 2021– | Few | High | Translated fiction | English | CC BY 4.0 | |
| Mozilla Common Voice | 0.24 | 2021– | Few | Varies | Sentences | No | CC0 1.0 | Contains: |
| kalama sin (transcripts) | ? | 2021– | Some | Varies | Spontaneous and scripted speech | No | CC BY-SA 4.0 | |
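
The lipu tenpo entry mentions extracting text with pdftotext. As a rough sketch of what that could look like from Python, the snippet below calls the pdftotext command-line tool (assumed to be installed and on the PATH) and then applies a light cleanup pass for the imperfect formatting. The function names and the file name in the usage note are hypothetical, and the cleanup rules are only one possible choice, not a method endorsed by the corpus maintainers.

```python
import re
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    command = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to preserve the physical page layout.
        command.append("-layout")
    command += [str(pdf_path), "-"]
    return command


def extract_text(pdf_path, layout=False):
    """Run pdftotext on one PDF and return its text (assumes UTF-8 output)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def tidy(text):
    """Illustrative cleanup: drop page breaks, squeeze spaces and blank lines."""
    # pdftotext separates pages with form feed characters.
    text = text.replace("\f", "\n")
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    kept = []
    for line in lines:
        # Keep a blank line only if the previous kept line was not blank.
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()
```

With a downloaded issue saved under a hypothetical name such as `issue.pdf`, `tidy(extract_text("issue.pdf"))` would return the cleaned text. Multi-column pages may still come out interleaved, so some manual checking is likely needed before using the result as corpus data.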

Notes

  a. This size figure includes non-Toki Pona data.