Corpora
The following is a list of Toki Pona corpora, along with sources that can readily be used as corpora. They are useful for linguistic analysis or for learning (machine or otherwise).
Name | Size (MB) | Era | Authors | Fluency | Type | Parallel? | License | Notes |
---|---|---|---|---|---|---|---|---|
davidar's nltk-tp collection | <4.68[a] | ?-2017 | Many | Varies | Varies | some English | None | Contains: jan Kipo's corpus (3.3 MB), Matthew Dean Martin's corpus (0.1 MB) |
Tatoeba | 2.35 | 2010- | Many | Varies | Translated sentences | English, German, many others | CC BY 2.0 FR | Data quality is marked in "Users' sentence reviews". |
lipu tenpo | ~0.7 | 2021- | Some | High | Articles, poems | No | CC BY-SA 4.0 | Text can be extracted from the PDF issues using the command-line utility pdftotext, though with imperfect formatting. Some articles are also available in webpage form. |
lipu kule | ? | 2021- | Some | High | Articles | No | CC BY-SA 4.0 | |
ante toki pona | ? | 2021- | Few | High | Translated fiction | English | CC BY 4.0 | |
Mozilla Common Voice | 0.24 | 2021- | Few | Varies | Sentences, some translated | No | CC0 1.0 | Contains: tu kuntu, jan Sitata, jan Kita's toki Ramble, some of toki soweli (see ante toki pona). |
Transcripts of kalama sin | ? | 2021- | Some | Varies | Spontaneous and scripted speech | No | CC BY-SA 4.0 | |
[a] This includes non-Toki Pona data.
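
The lipu tenpo entry notes that text can be extracted with pdftotext, with imperfect formatting. The following is a minimal sketch of that step in Python, assuming the pdftotext utility (from Poppler or Xpdf) is installed and on the PATH. The function names are illustrative and not part of any existing tool, and the cleanup rules are only one plausible way to tidy the output, not an established procedure.

```python
import re
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Return the argument list for a pdftotext call (illustrative helper)."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def clean_extracted_text(raw):
    """Tidy extracted text: drop page breaks, trim lines, collapse blank runs."""
    # pdftotext separates pages with form-feed characters.
    text = raw.replace("\f", "\n")
    lines = [line.strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Collapse three or more consecutive newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def extract_issue(pdf_path, txt_path):
    """Run pdftotext on one PDF and overwrite the output with cleaned text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    raw = Path(txt_path).read_text(encoding="utf-8")
    cleaned = clean_extracted_text(raw)
    Path(txt_path).write_text(cleaned, encoding="utf-8")
    return cleaned
```

For example, calling `extract_issue` with a downloaded issue PDF and an output path (both file names are up to the user) writes a plain-text file that can then be tokenized. Multi-column pages and poems may still need manual correction, since the cleanup above only handles whitespace and page breaks.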