Corpora

    From sona pona

    davidar’s metacorpus

    • type: varied
    • quality: varied (i’ve heard jan Kipo tampered with his corpus to conform to his idea of toki pona grammar)
    • dialect: mostly old/pu
    • size: 4675k but a bunch of it is in english or duplicated, gotta count it properly eventually
    • license: none for most of it; some parts under various cc licenses (247k, mostly by jan Kipo)
    • preprocessing: probably a lot
    • where: https://github.com/davidar/nltk-tp/tree/master/Corpus

    note that this corpus contains the entirety of:

    • the TokiPonaTools corpus (104k)
    • the jan Kipo corpus [with english] (3281k)
    • the Little Prince translation [with english if you wanna look for it] (54k)
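
    a rough sketch for the "count it properly" part of the size note above: it walks a local checkout of the corpus and counts characters twice, once for everything and once with byte-identical files counted only once. assumptions: the "k" figures are thousands of characters, the files are utf-8 text, and `unique_text_size` is a made-up helper name. it only catches exact duplicate files and does nothing about the english portions.

```python
import hashlib
from pathlib import Path


def unique_text_size(root: str) -> tuple[int, int]:
    """Return (total_chars, unique_chars) over all files under root.

    Files whose bytes are identical are counted once in unique_chars.
    """
    seen = set()  # sha256 digests of file contents already counted
    total = unique = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        # Assumption: files are UTF-8; undecodable bytes become U+FFFD.
        size = len(data.decode("utf-8", errors="replace"))
        total += size
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique += size
    return total, unique
```

    calling `unique_text_size("Corpus")` on a clone of the repository would give both numbers; near-duplicates (same text, different whitespace or markup) would still need a smarter comparison.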

    tatoeba [with english]

    • type: single sentences, some with translations
    • quality: varied (can be filtered with Users' sentence reviews)
    • dialect: mostly pu
    • size: 1896k (as of 2021-06-12)
    • license: cc-by 2.5 fr
    • preprocessing: extract last column of tsv file, replace Ton and Mali/Mewi/Mawi with random names
    • where: https://tatoeba.org/en/downloads

    audio recordings sorta available (under cc-by-sa 4.0) but you’ll have to scrape them yourself
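
    the preprocessing step listed above (take the last tsv column, swap Ton and Mali/Mewi/Mawi for random names) could look roughly like this. it is only a sketch: `NAME_POOL` is a hypothetical list of replacement names, the function names are invented here, and it assumes the sentence text really is the last tab-separated field of each line.

```python
import random
import re

# Placeholder names mentioned in the preprocessing note.
PLACEHOLDERS = re.compile(r"\b(Ton|Mali|Mewi|Mawi)\b")

# Hypothetical replacement pool; swap in any list of tokiponized names.
NAME_POOL = ["Sonja", "Lope", "Kapesi", "Tepani", "Alisa", "Pawo"]


def last_column(line: str) -> str:
    """Return the last tab-separated field of a TSV line."""
    return line.rstrip("\n").split("\t")[-1]


def replace_names(sentence: str, rng: random.Random) -> str:
    """Swap each placeholder for a random name, consistently within one sentence."""
    chosen = {}  # placeholder -> replacement picked for this sentence

    def pick(match):
        name = match.group(1)
        if name not in chosen:
            chosen[name] = rng.choice(NAME_POOL)
        return chosen[name]

    return PLACEHOLDERS.sub(pick, sentence)


def preprocess(lines, seed=0):
    """Yield cleaned sentences from an iterable of TSV lines, skipping blank lines."""
    rng = random.Random(seed)
    for line in lines:
        if line.strip():
            yield replace_names(last_column(line), rng)
```

    the seed is there so a rerun gives the same corpus. two different placeholders in one sentence can still end up with the same replacement name; drawing without replacement per sentence would avoid that.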

    wikipesija

    ma pona screenplay contest

    • type: screenplay
    • quality: good, though grammar funnies abound
    • dialect: ma ponian
    • size: 54.2k
      1. toki ala o e toki Inli - 1.5k
      2. nanpa pi kipisi ala - 0.5k (not even really toki pona, not worth it)
      3. tu kuntu - 35.6k
      4. mijomi telo - 10.somethingk (can’t copy first page for some reason)
      5. wi lon - 7.1k
    • license: none
    • preprocessing: extract 3 and 4 from pdf, uppercase names in 4, probably move names from header to somewhere else
    • where:
      1. toki ala o e toki Inli
      2. nanpa pi kipisi ala
      3. tu kuntu
      4. mijomi telo
      5. wi lon

    lipu kule

    lipu tenpo

    jan Kita’s toki Ramble

    ante toki pona [with english]

    includes pepper & carrot

    kalama sin (transcribed by kon Itan)

    [todo] library

    https://docs.google.com/document/d/1IdMucmhPCzvoUF94Gp25XCwocWOl4PfQ_wfOkiU8cu8/edit?usp=sharing

    [todo] jan Telakoman’s blog [with English]

    the english is a literal translation of the toki pona, with all the awkwardness that implies; this may or may not be desired

    mozilla common voice

    note that this corpus largely consists of these already listed sources:

    • jan Kita’s toki Ramble
    • ante toki pona (toki soweli)
    • tu kuntu
    • jan Sitata

    audio recordings (of varying quality) available here: https://commonvoice.mozilla.org/en/datasets

    jan Sitata [with english]

    english: https://www.gutenberg.org/files/2500/2500-h/2500-h.htm

    [todo] davidar’s translations [with english]

    https://github.com/davidar/toki-ante

    storyweaver [with english]