Corpora

From sona pona, the Toki Pona wiki

davidar’s metacorpus

  • type: varied
  • quality: varied (i’ve heard jan Kipo tampered with his corpus to conform to his idea of toki pona grammar)
  • dialect: mostly old/pu
  • size: 4675k, but a bunch of it is in english or duplicated; still needs a proper count
  • license: mostly none; some parts are under various cc licenses (247k, mostly by jan Kipo)
  • preprocessing: probably a lot
  • where: https://github.com/davidar/nltk-tp/tree/master/Corpus

note that this corpus contains the entirety of:

  • the TokiPonaTools corpus (104k)
  • the jan Kipo corpus [with english] (3281k)
  • the Little Prince translation [with english if you wanna look for it] (54k)
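
since the size above is inflated by duplicates, here is a minimal sketch of one way to get a deduplicated count. it assumes size is measured in characters (the page does not say what "k" counts) and that a duplicate means an identical line after trimming whitespace; the function name `unique_line_chars` is made up for this example. it does nothing about the english parts.

```python
def unique_line_chars(texts):
    """Count characters across all texts, counting each distinct line once."""
    seen = set()   # lines already counted
    total = 0
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            # skip empty lines and lines seen before (in this or another text)
            if line and line not in seen:
                seen.add(line)
                total += len(line)
    return total
```

to run it over a local checkout you could pass it something like `(p.read_text(encoding="utf-8") for p in pathlib.Path("Corpus").rglob("*.txt"))`; the directory name and the file extension there are guesses, so adjust them to whatever the repo actually contains.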

tatoeba [with english]

  • type: single sentences, some with translations
  • quality: varied (can be filtered with Users' sentence reviews)
  • dialect: mostly pu
  • size: 1896k (as of 2021-06-12)
  • license: cc-by 2.5 fr
  • preprocessing: extract last column of tsv file, replace Ton and Mali/Mewi/Mawi with random names
  • where: https://tatoeba.org/en/downloads
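
the preprocessing step above could look roughly like this. it is a sketch, not the script anyone actually used: the replacement names in `REPLACEMENT_NAMES` are invented for illustration (the page only says "random names"), and the sample rows in the usage note are made up rather than taken from tatoeba.

```python
import random
import re

# hypothetical pool of replacement names; pick your own
REPLACEMENT_NAMES = ["Sonja", "Kapesi", "Lina", "Pali"]

# the names the preprocessing note says to replace, matched as whole words
NAME_PATTERN = re.compile(r"\b(?:Ton|Mali|Mewi|Mawi)\b")


def extract_sentences(tsv_text, rng=None):
    """Return the last tab-separated column of each line, with names swapped."""
    rng = rng or random.Random()
    sentences = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        sentence = line.split("\t")[-1]
        # each occurrence gets its own randomly chosen name
        sentence = NAME_PATTERN.sub(lambda m: rng.choice(REPLACEMENT_NAMES), sentence)
        sentences.append(sentence)
    return sentences
```

for example, `extract_sentences("1\ttok\tjan Ton li moku.\n")` returns a one-item list where `Ton` has been swapped for one of the names in the pool. note that each occurrence is replaced independently, so the same person may end up with different names in different sentences; keep a per-name mapping if that matters for your use.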

audio recordings sorta available (under cc-by-sa 4.0) but you’ll have to scrape them yourself

wikipesija

ma pona screenplay contest

  • type: screenplay
  • quality: good, though grammar funnies abound
  • dialect: ma ponian
  • size: 54.2k
    1. toki ala o e toki Inli - 1.5k
    2. nanpa pi kipisi ala - 0.5k (not even really toki pona, not worth it)
    3. tu kuntu - 35.6k
    4. mijomi telo - 10.somethingk (can’t copy first page for some reason)
    5. wi lon - 7.1k
  • license: none
  • preprocessing: extract 3 and 4 from pdf, uppercase names in 4, probably move names from header to somewhere else
  • where:
    1. toki ala o e toki Inli
    2. nanpa pi kipisi ala
    3. tu kuntu
    4. mijomi telo
    5. wi lon

lipu kule

lipu tenpo

jan Kita’s toki Ramble

ante toki pona [with english]

includes pepper & carrot

kalama sin (transcribed by kon Itan)

[todo] library

https://docs.google.com/document/d/1IdMucmhPCzvoUF94Gp25XCwocWOl4PfQ_wfOkiU8cu8/edit?usp=sharing

[todo] jan Telakoman’s blog [with english]

the english is a literal translation of the toki pona, with all the awkwardness that implies; this may or may not be desired

mozilla common voice

note that this corpus largely consists of these already listed sources:

  • jan Kita’s toki Ramble
  • ante toki pona (toki soweli)
  • tu kuntu
  • jan Sitata

audio recordings (of varying quality) available here: https://commonvoice.mozilla.org/en/datasets

jan Sitata [with english]

english: https://www.gutenberg.org/files/2500/2500-h/2500-h.htm

[todo] davidar’s translations [with english]

https://github.com/davidar/toki-ante

storyweaver [with english]