Corpora
davidar’s metacorpus
- type: varied
- quality: varied (i’ve heard jan Kipo tampered with his corpus to conform to his idea of toki pona grammar)
- dialect: mostly old/pu
- size: 4675k but a bunch of it is in english or duplicated, gotta count it properly eventually
- license: none, some under various cc licenses (247k, mostly by jan Kipo)
- preprocessing: probably a lot
- where: https://github.com/davidar/nltk-tp/tree/master/Corpus
note that this corpus contains the entirety of:
- the TokiPonaTools corpus (104k)
- the jan Kipo corpus [with english] (3281k)
- the Little Prince translation [with english if you wanna look for it] (54k)
tatoeba [with english]
- type: single sentences, some with translations
- quality: varied (can be filtered with Users' sentence reviews)
- dialect: mostly pu
- size: 1896k (as of 2021-06-12)
- license: cc-by 2.0 fr
- preprocessing: extract last column of tsv file, replace Ton and Mali/Mewi/Mawi with random names
- where: https://tatoeba.org/en/downloads
audio recordings sorta available (under cc-by-sa 4.0) but you’ll have to scrape them yourself
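a rough sketch of the preprocessing described above, in python. the replacement pool `NAMES` and the function name `extract_sentences` are made up for illustration, and the only thing assumed about the export is that it's tab-separated with the sentence in the last column:

```python
import csv
import io
import random
import re

# placeholder names that show up over and over in tatoeba sentences
PLACEHOLDERS = re.compile(r"\b(Ton|Mali|Mewi|Mawi)\b")

# hypothetical replacement pool; swap in whatever name list you like
NAMES = ["Ali", "Kiwen", "Lina", "Sewi"]


def extract_sentences(tsv_text, rng=None):
    """Yield the last column of each TSV row with placeholder names swapped."""
    rng = rng or random.Random()
    # QUOTE_NONE so quote marks inside sentences are kept as plain text
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        sentence = row[-1].strip()
        yield PLACEHOLDERS.sub(lambda m: rng.choice(NAMES), sentence)
```

passing a seeded `random.Random(0)` makes the replacement reproducible between runs.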
wikipesija
- type: articles, some definitions, a bit of discussion in talk pages
- quality: varied
- dialect: varied
- size: 1084k raw, probably half? when stripped (as of 2021-06-13)
- license: cc-by-sa 3.0 (some 4.0)
- preprocessing: strip templates and markup
- where:
  - https://archive.org/download/wiki-wikipesija.org
  - https://wikipesija.org/wiki/ilo:Export (for chosen categories)
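"strip templates and markup" could look roughly like the regex pass below. this is a quick sketch and not a real wikitext parser: the function name `strip_wikitext` is made up, and tables, files, and other fancier constructs are not handled. a dedicated parser such as mwparserfromhell would likely be more robust.

```python
import re


def strip_wikitext(text):
    """Very rough wikitext cleanup: templates, refs, tags, links, emphasis."""
    # remove templates innermost first, so nested {{...}} also disappear
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # drop self-closing refs, then <ref>...</ref> blocks, then leftover tags
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # [https://example.org label] -> label
    text = re.sub(r"\[https?://\S+\s+([^\]]*)\]", r"\1", text)
    # bold/italic quote runs
    text = re.sub(r"'{2,}", "", text)
    # == heading == -> heading
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    return text.strip()
```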
ma pona screenplay contest
- type: screenplay
- quality: good, though grammar funnies abound
- dialect: ma ponian
- size: 54.2k
  - toki ala o e toki Inli - 1.5k
  - nanpa pi kipisi ala - 0.5k (not even really toki pona, not worth it)
  - tu kuntu - 35.6k
  - mijomi telo - 10.somethingk (can’t copy first page for some reason)
  - wi lon - 7.1k
- license: none
- preprocessing: extract 3 and 4 (tu kuntu, mijomi telo) from pdf, uppercase names in 4, probably move names from header to somewhere else
- where:
lipu kule
- type: essays
- quality: good
- dialect: ma ponian
- size: ~75k as of 2022-06-25
- license: cc-by-sa 4.0
- preprocessing: remove frontmatter, remove markdown
- where: https://github.com/lipukule/site/tree/main/content/tok/post
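one way the two preprocessing steps could go, as a sketch. it assumes the posts start with a frontmatter block fenced by `---` or `+++` lines (not checked against the repo), and the markdown stripping only covers the common inline stuff. `strip_frontmatter` and `strip_markdown` are made-up names.

```python
import re


def strip_frontmatter(text):
    """Drop a leading frontmatter block fenced by --- or +++ lines, if any."""
    match = re.match(
        r"\A(---|\+\+\+)[ \t]*\n.*?\n\1[ \t]*(?:\n|\Z)", text, flags=re.DOTALL
    )
    return text[match.end():] if match else text


def strip_markdown(text):
    """Remove common markdown syntax, keeping the visible text."""
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)  # images -> alt text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)   # links -> label
    # heading, blockquote, and bullet markers at the start of a line
    text = re.sub(r"^[ \t]*(#{1,6}|>|[-*+])[ \t]+", "", text, flags=re.MULTILINE)
    # emphasis and inline code markers (this also eats literal underscores)
    text = re.sub(r"(\*\*|__|\*|_|`)", "", text)
    return text.strip()
```

usage would be something like `strip_markdown(strip_frontmatter(raw_post))` per file.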
lipu tenpo
- type: essays
- quality: good
- dialect: ma ponian
- size: ~400k as of 2022-06-25 (assuming all are similar to lipu tenpo nanpa pan (~25k per issue))
- license: cc-by-sa 4.0
- preprocessing: surprisingly easy thanks to pdftotext, some manual checking still required though (e.g. nanpa pan has a sitelen pona thing)
- where: https://liputenpo.org/ (also: https://wikisource.org/wiki/Category:Lipu_tenpo)
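a minimal sketch of the pdftotext route, assuming the `pdftotext` command line tool (from poppler or xpdf) is installed and on PATH. the function names are made up, and the manual checking mentioned above is still on you.

```python
import shutil
import subprocess


def split_pages(raw):
    """Split pdftotext output into pages; it puts a form feed between pages."""
    pages = [page.strip() for page in raw.split("\f")]
    return [page for page in pages if page]


def pdf_to_pages(pdf_path):
    """Run pdftotext on one file and return a list of page texts."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # "-" as the output file sends the text to stdout
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return split_pages(result.stdout.decode("utf-8"))
```

keeping the pages separate makes it easier to skip the ones that need manual handling (e.g. a sitelen pona page).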
jan Kita’s toki Ramble
- type: essays with topic
- quality: depends on how much you detest my (edit: outdated) style
- dialect: very freeform ma ponian, some of it detailed here
- size: 19.5k as of 2021-08-13
- license: cc0
- preprocessing: none, though maybe remove toki nasa pi ike mi.txt
- where: https://github.com/Sobsz/toki-ramble
ante toki pona [with english]
- type: prose, dialogue
- quality: varied
- dialect: varied
- size: ~650k when raw as of 2021-06-29, far less once the empty parts are filtered out
- license: cc-by 4.0
- where: http://antetokipona.infinityfreeapp.com/csv/
includes pepper & carrot
kalama sin (transcribed by kon Itan)
- type: spoken dialogue
- quality: as good as spontaneous speech can be
- dialect: varied, mostly ma ponian
- size: ~97k as of 2022-01-30 (assuming 50% formatting overhead)
- license: none that i know of
- preprocessing: convert from .srt to .txt (remove the first 2 lines after each blank line), remove speaker labels, deal with square brackets
- where: https://drive.google.com/drive/folders/12H2xY06Wtwh4V6zoPOhDnE_R4aZ5sVJp
- clean version: https://cdn.discordapp.com/attachments/773267909903908894/1094680735069257798/file
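the .srt steps above could be sketched like this. two things are guesses: speaker labels are assumed to look like `jan Name: ` at the start of a line, and "deal with square brackets" is read here as dropping bracketed annotations (pass `keep_brackets=True` to leave them in). `srt_to_text` is a made-up name.

```python
import re


def srt_to_text(srt, keep_brackets=False):
    """Convert .srt subtitle text to plain lines of speech."""
    lines = []
    # cues are separated by blank lines; the first two lines of each cue
    # are the cue number and the timestamp range
    for cue in re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip()):
        for line in cue.split("\n")[2:]:
            # assumed label format: short prefix ending in a colon
            line = re.sub(r"^[^:\[\]]{1,30}:\s*", "", line)
            if not keep_brackets:
                line = re.sub(r"\[[^\]]*\]", "", line)
            line = " ".join(line.split())  # collapse leftover whitespace
            if line:
                lines.append(line)
    return "\n".join(lines)
```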
[todo] library
https://docs.google.com/document/d/1IdMucmhPCzvoUF94Gp25XCwocWOl4PfQ_wfOkiU8cu8/edit?usp=sharing
[todo] jan Telakoman’s blog [with English]
- type: prose
- quality: good
- dialect:
- size:
- license:
- where: https://joelthomastr.github.io/tokipona/README_si; https://github.com/joelthomastr/tokipona
the english is a literal translation of the toki pona, with all the awkwardness that implies; this may or may not be desired
mozilla common voice
- type: sentences
- quality: varied
- dialect: varied
- size: 230k as of 2022-06-27
- license: cc0
- where: https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt
note that this corpus largely consists of these already listed sources:
- jan Kita’s toki Ramble
- ante toki pona (toki soweli)
- tu kuntu
- jan Sitata
audio recordings (of varying quality) available here: https://commonvoice.mozilla.org/en/datasets
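since most of this corpus overlaps with sources listed elsewhere on this page, it probably makes sense to drop sentences you already have. a sketch of one way to do that, with made-up function names and a naive sentence splitter (punctuation followed by whitespace, or a newline):

```python
import re


def normalize(sentence):
    """Casefold and keep only the words, so punctuation and spacing don't matter."""
    return " ".join(re.findall(r"\w+", sentence.casefold()))


def novel_sentences(candidates, known_texts):
    """Return candidates that don't match any sentence in known_texts."""
    seen = set()
    for text in known_texts:
        # naive split: after . ! ? : followed by whitespace, or at newlines
        for sentence in re.split(r"(?<=[.!?:])\s+|\n+", text):
            key = normalize(sentence)
            if key:
                seen.add(key)
    kept = []
    for candidate in candidates:
        key = normalize(candidate)
        if key and key not in seen:
            seen.add(key)  # also dedupes within the candidates themselves
            kept.append(candidate)
    return kept
```

here `candidates` would be the lines of sentence-collector.txt and `known_texts` the already-cleaned texts of the other corpora.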
jan Sitata [with english]
- type: book
- quality: good
- dialect: pu
- size: ~60k
- license: cc0, explicit consent for machine learning by jan Sonja (ma pona)
- where: https://tokipona.org/sitata/
english: https://www.gutenberg.org/files/2500/2500-h/2500-h.htm
[todo] davidar’s translations [with english]
https://github.com/davidar/toki-ante
storyweaver [with english]
- type: translations of children's books
- quality: varied
- dialect: varied
- size: 17 books as of 2022-11-23, rough estimate: ~10k
- license: cc-by 4.0
- preprocessing: good luck
- where: https://storyweaver.org.in/stories?language=Toki%20Pona