Corpora: Difference between revisions

From sona pona, the Toki Pona wiki
m (Reverted edits by Mods are asleep qazw (talk) to last revision by Jan Ke Tami)
Tag: Rollback
(added a clean version of the kalama sin transcript)
Line 113:
* preprocessing: convert from .srt to .txt (remove the first 2 lines after each blank line), remove speaker labels, deal with square brackets
* where: https://drive.google.com/drive/folders/12H2xY06Wtwh4V6zoPOhDnE_R4aZ5sVJp
* clean version: https://cdn.discordapp.com/attachments/773267909903908894/1094680735069257798/file
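
The preprocessing steps listed above for the kalama sin transcript can be sketched roughly as follows. This is only a minimal sketch under some assumptions that the page does not confirm: speaker labels are taken to look like `name: text` at the start of a line, and "deal with square brackets" is read as "delete bracketed annotations" (pass `strip_brackets=False` to keep them). The function name `srt_to_text` is made up for illustration.

```python
import re


def srt_to_text(srt: str, strip_brackets: bool = True) -> str:
    """Convert SRT subtitle text to plain text, one output line per subtitle line."""
    out = []
    # Normalise line endings, then split into subtitle blocks on blank lines.
    blocks = re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        # The first two lines of each block are the counter and the timestamp.
        for line in lines[2:]:
            # Assumption: a speaker label is a short prefix ending in ": ".
            line = re.sub(r"^[^:]{1,30}:\s+", "", line)
            if strip_brackets:
                # Assumption: bracketed annotations such as [laughs] are dropped.
                line = re.sub(r"\[[^\]]*\]", "", line)
            # Collapse leftover runs of whitespace.
            line = " ".join(line.split())
            if line:
                out.append(line)
    return "\n".join(out)
```

The label pattern is deliberately loose, so any line that starts with a short phrase followed by a colon will lose that phrase; it is worth spot-checking the output against the original file.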


== [todo] library ==

Revision as of 18:09, 9 April 2023

davidar’s metacorpus

  • type: varied
  • quality: varied (i’ve heard jan Kipo tampered with his corpus to conform to his idea of toki pona grammar)
  • dialect: mostly old/pu
  • size: 4675k but a bunch of it is in english or duplicated, gotta count it properly eventually
  • license: none overall; some texts are under various cc licenses (247k, mostly by jan Kipo)
  • preprocessing: probably a lot
  • where: https://github.com/davidar/nltk-tp/tree/master/Corpus

note that this corpus contains the entirety of:

  • the TokiPonaTools corpus (104k)
  • the jan Kipo corpus [with english] (3281k)
  • the Little Prince translation [with english if you wanna look for it] (54k)

tatoeba [with english]

  • type: single sentences, some with translations
  • quality: varied (can be filtered with Users' sentence reviews)
  • dialect: mostly pu
  • size: 1896k (as of 2021-06-12)
  • license: cc-by 2.5 fr
  • preprocessing: extract last column of tsv file, replace Ton and Mali/Mewi/Mawi with random names
  • where: https://tatoeba.org/en/downloads

audio recordings sorta available (under cc-by-sa 4.0) but you’ll have to scrape them yourself

wikipesija

ma pona screenplay contest

  • type: screenplay
  • quality: good, though grammar funnies abound
  • dialect: ma ponian
  • size: 54.2k
    1. toki ala o e toki Inli - 1.5k
    2. nanpa pi kipisi ala - 0.5k (not even really toki pona, not worth it)
    3. tu kuntu - 35.6k
    4. mijomi telo - 10.somethingk (can’t copy first page for some reason)
    5. wi lon - 7.1k
  • license: none
  • preprocessing: extract 3 and 4 from pdf, uppercase names in 4, probably move names from header to somewhere else
  • where:
    1. toki ala o e toki Inli
    2. nanpa pi kipisi ala
    3. tu kuntu
    4. mijomi telo
    5. wi lon

lipu kule

lipu tenpo

jan Kita’s toki Ramble

ante toki pona [with english]

includes pepper & carrot

kalama sin (transcribed by kon Itan)

[todo] library

https://docs.google.com/document/d/1IdMucmhPCzvoUF94Gp25XCwocWOl4PfQ_wfOkiU8cu8/edit?usp=sharing

[todo] jan Telakoman’s blog [with english]

the english is a literal translation of the toki pona, with all the awkwardness that implies; this may or may not be desired

mozilla common voice

note that this corpus largely consists of these already listed sources:

  • jan Kita’s toki Ramble
  • ante toki pona (toki soweli)
  • tu kuntu
  • jan Sitata

audio recordings (of varying quality) available here: https://commonvoice.mozilla.org/en/datasets

jan Sitata [with english]

english: https://www.gutenberg.org/files/2500/2500-h/2500-h.htm

[todo] davidar’s translations [with english]

https://github.com/davidar/toki-ante

storyweaver [with english]