Corpora
This is a list of Toki Pona corpora, along with other sources that can readily be used as corpora, for linguistic analysis or for learning (machine or otherwise).
List
Name | Size (MB) | Era | Authors | Fluency | Type | Parallel? | License | Notes |
---|---|---|---|---|---|---|---|---|
davidar's nltk-tp collection | <4.68[a] | ?–2017 | Many | Varies | Varies | Some English | None | Contains: |
Tatoeba | 2.35 | 2010– | Many | Varies | Translated sentences | English, German, many others | CC BY 2.0 FR | Data quality marked in reviews |
lipu tenpo | ~0.7 | 2021– | Some | High | Articles, poems | No | CC BY-SA 4.0 | Text can be extracted with the command-line utility pdftotext, though the formatting comes out imperfect. Some articles are also available as web pages on the lipu tenpo website. |
lipu kule | ? | 2021– | Some | High | Articles | No | CC BY-SA 4.0 | |
ante toki pona | ? | 2021– | Few | High | Translated fiction | English | CC BY 4.0 | |
Mozilla Common Voice | 0.24 | 2021– | Few | Varies | Sentences | No | CC0 1.0 | Contains: |
kalama sin (transcripts) | ? | 2021– | Some | Varies | Spontaneous and scripted speech | No | CC BY-SA 4.0 | |
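
The pdftotext step mentioned in the lipu tenpo notes could be scripted roughly as follows. This is a minimal sketch, assuming pdftotext is installed and on the PATH; the function names and the file name `issue.pdf` are hypothetical, and the cleanup in `tidy` is only one possible way to handle the imperfect formatting.

```python
import re
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Return the argument list for pdftotext, sending the text to stdout ("-")."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        command.append("-layout")
    command += [str(pdf_path), "-"]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext on one PDF and return the extracted text."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def tidy(text):
    """Light cleanup: drop page-break form feeds, trim lines, collapse blank runs."""
    text = text.replace("\f", "\n")
    lines = [line.strip() for line in text.splitlines()]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


if __name__ == "__main__":
    # "issue.pdf" is a placeholder for a downloaded issue.
    print(tidy(extract_text("issue.pdf")))
```

Multi-column pages may still come out interleaved, so the output is worth checking by hand before it is used as corpus data.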
Notes
- [a] This includes non-Toki Pona data.