Corpora

From sona pona, the Toki Pona wiki

The following is a list of Toki Pona corpora (or sources that can be readily used as such), useful for linguistic analysis or learning (machine or otherwise).

| Name | Size (MB) | Era | Authors | Fluency | Type | Parallel? | License | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| davidar's nltk-tp collection | <4.68[a] | ?-2017 | Many | Varies | Varies | Some English | None | Contains: jan Kipo's corpus (3.3 MB), Matthew Dean Martin's corpus (0.1 MB) |
| Tatoeba | 2.35 | 2010- | Many | Varies | Translated sentences | English, German, many others | CC BY 2.0 FR | Data quality is marked in "Users' sentence reviews". |
| lipu tenpo | ~0.7 | 2021- | Some | High | Articles, poems | No | CC BY-SA 4.0 | Text can be extracted using the command-line utility pdftotext, though with imperfect formatting. Some articles are also available in webpage form. |
| lipu kule | ? | 2021- | Some | High | Articles | No | CC BY-SA 4.0 | |
| ante toki pona | ? | 2021- | Few | High | Translated fiction | English | CC BY 4.0 | |
| Mozilla Common Voice | 0.24 | 2021- | Few | Varies | Sentences, some translated | No | CC0 1.0 | Contains: tu kuntu, jan Sitata, jan Kita's toki Ramble, some of toki soweli (see ante toki pona). |
| Transcripts of kalama sin | ? | 2021- | Some | Varies | Spontaneous and scripted speech | No | CC BY-SA 4.0 | |
  a. This size includes non-Toki Pona data.
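
The lipu tenpo entry notes that text can be pulled out of the PDFs with pdftotext, with imperfect formatting. A minimal sketch of that step in Python might look like the following. It assumes the pdftotext utility is installed and on the PATH. The function names and the file name `lipu-tenpo.pdf` are hypothetical, chosen only for illustration, and the cleanup rules are a rough guess at what "imperfect formatting" needs, not a tested recipe for this particular corpus.

```python
import re
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for the pdftotext command-line utility."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical page layout (e.g. columns).
        cmd.append("-layout")
    # A text-file argument of "-" sends the extracted text to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def clean_extracted_text(raw):
    """Tidy raw pdftotext output (assumed cleanup rules, adjust as needed)."""
    # pdftotext separates pages with form feeds; treat them as paragraph breaks.
    text = raw.replace("\f", "\n\n")
    # Collapse runs of spaces and tabs, and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def extract_pdf_text(pdf_path):
    """Run pdftotext on one PDF and return the cleaned text."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return clean_extracted_text(result.stdout.decode("utf-8", errors="replace"))


if __name__ == "__main__":
    # Hypothetical file name; replace with a downloaded issue.
    print(extract_pdf_text("lipu-tenpo.pdf")[:500])
```

Keeping the command builder and the cleanup separate from the subprocess call means the cleanup can be tuned and checked on sample strings without needing a PDF at hand.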