Corpora: Difference between revisions

From sona pona, the Toki Pona wiki
{{Needs work|Complete metadata}}
{{Extra license|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]|it mostly consists of copyrighted data}} <!-- feel free to remove, i don't mind -->
This is a list of Toki Pona '''corpora''' and other sources that can be readily used as such, useful for linguistic analysis or learning (machine or otherwise).


==List==
{{Hatnote|This page was previously located at [https://pad.snopyta.org/lDb2EfOZQpmleu-ZktbDzg pad.snopyta.org].}}
{| class="wikitable sortable"
|+
!Name
!Size (MB)
!Era
!Authors
!Fluency
!Type
!Parallel?
!License
!Notes
|-
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection]
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref>
|?–2017
|Many
|Varies
|Varies
|some English
|None
|Contains:
* {{tok|jan Kipo}}'s corpus (3.3 MB)
* Matthew Dean Martin's corpus (0.1 MB)
|-
|[[Tatoeba]]
|2.35
|2010–
|Many
|Varies
|Translated sentences
|English, German, many others
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR]
|Data quality marked in reviews
|-
|{{tp|[[lipu tenpo]]}}
|~0.7
|2021–
|Some
|High
|Articles, poems
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ on their website].
|-
|{{tp|[[lipu kule]]}}
|{{N/A|?}}
|2021–
|Some
|High
|Articles
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|
|-
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}]
|{{N/A|?}}
|2021–
|Few
|High
|Translated fiction
|English
|[https://creativecommons.org/licenses/by/4.0/ CC BY 4.0]
|
|-
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice]
|0.24
|2021–
|Few
|Varies
|Sentences
|No
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]
|Contains:
* {{tp|[[tu kuntu]]}}
* {{tp|[[jan Sitata]]}}
* {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble]
* Parts of {{tp|toki soweli}} (see also {{tp|ante toki pona}}).
|-
|{{tp|[[kalama sin]]}} ([[oldwikisource:kalama sin|transcripts]])
|{{N/A|?}}
|2021–
|Some
|Varies
|Spontaneous and scripted speech
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|
|}


==Notes==
<references group="lower-alpha" />

{{Media}}
[[Category:Literature| ]]

== davidar’s metacorpus ==

* type: varied
* quality: varied (i’ve heard jan Kipo tampered with his corpus to conform to his idea of toki pona grammar)
* dialect: mostly old/pu
* size: 4675k ''but'' a bunch of it is in english or duplicated, gotta count it properly eventually
* license: none, some under various cc licenses (247k, mostly by jan Kipo)
* preprocessing: probably a lot
* where: https://github.com/davidar/nltk-tp/tree/master/Corpus

note that this corpus contains the entirety of:

* the [https://github.com/matthewdeanmartin/tokipona.parser/tree/master/TokiPonaTools/TokiPona/corpus/forums TokiPonaTools] corpus (104k)
* the jan Kipo corpus [with english] (3281k)
* the Little Prince translation [with english if you wanna look for it] (54k)

== tatoeba [with english] ==

* type: single sentences, some with translations
* quality: varied (can be filtered with <code>Users' sentence reviews</code>)
* dialect: mostly pu
* size: 1896k (as of 2021-06-12)
* license: cc-by 2.0 fr
* preprocessing: extract last column of tsv file, replace <code>Ton</code> and <code>Mali</code>/<code>Mewi</code>/<code>Mawi</code> with random names
* where: https://tatoeba.org/en/downloads
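
The preprocessing step above could look roughly like this in Python. This is a minimal sketch, assuming the export is tab-separated with the sentence text in the last column; the function name <code>clean_tatoeba</code> and the <code>NAMES</code> list are made up for illustration and not part of any Tatoeba tooling.

```python
import random
import re

# Hypothetical replacement names; swap in any list you like.
NAMES = ["Ali", "Kili", "Sonja", "Pape", "Lisa"]

# The overused placeholder names mentioned in the preprocessing note.
PLACEHOLDER = re.compile(r"\b(Ton|Mali|Mewi|Mawi)\b")


def clean_tatoeba(lines, rng=None):
    """Yield the last TSV column of each line with placeholder names replaced."""
    rng = rng or random.Random()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        sentence = line.split("\t")[-1]
        # Pick a fresh random name for every occurrence.
        yield PLACEHOLDER.sub(lambda m: rng.choice(NAMES), sentence)
```

Passing a seeded <code>random.Random</code> makes the output reproducible. Note that this also rewrites <code>Mali</code> where it refers to the country, which may or may not be wanted.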

audio recordings sorta available (under cc-by-sa 4.0) but you’ll have to scrape them yourself

== wikipesija ==

* type: articles, some definitions, a bit of discussion in talk pages
* quality: varied
* dialect: varied
* size: 1084k raw, probably about half that when stripped (as of 2021-06-13)
* license: cc-by-sa 3.0 (some 4.0)
* preprocessing: strip templates and markup
* where:
** https://archive.org/download/wiki-wikipesija.org
** https://wikipesija.org/wiki/ilo:Export (for chosen categories)
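
Stripping templates and markup can be approximated with a handful of regular expressions. The sketch below is deliberately rough and only covers common constructs (comments, refs, templates, links, bold/italic, headings); the function name <code>strip_wikitext</code> is hypothetical, and a real wikitext parser would be more robust.

```python
import re


def strip_wikitext(text):
    """Very rough wikitext-to-plain-text conversion using regular expressions."""
    # Remove HTML comments and <ref> footnotes (self-closing and paired).
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>]*?/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove templates, innermost first, until none are left.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # [https://example.org label] -> label.
    text = re.sub(r"\[https?://\S+\s+([^\]]*)\]", r"\1", text)
    # Drop bold/italic quote runs and any remaining HTML tags.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"<[^>]+>", "", text)
    # Heading markers: "== title ==" -> "title".
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    return text.strip()
```

File links with several pipe-separated options and wiki tables are not handled and would need extra rules.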

== ma pona screenplay contest ==

* type: screenplay
* quality: good, though grammar funnies abound
* dialect: ma ponian
* size: 54.2k
*# toki ala o e toki Inli - 1.5k
*# nanpa pi kipisi ala - 0.5k (not even really toki pona, not worth it)
*# tu kuntu - 35.6k
*# mijomi telo - roughly 10k (exact size unknown, the first page can’t be copied for some reason)
*# wi lon - 7.1k
* license: none
* preprocessing: extract 3 and 4 from pdf, uppercase names in 4, probably move names from header to somewhere else
* where:
*# [https://docs.google.com/document/d/1W21rSjx2eyYLjcipFGcmLEa-nQenge7wzLk87Tq-CuE/edit toki ala o e toki Inli]
*# [https://docs.google.com/document/d/1DXcXoUm8vSAGsAtXuhhiMG36jAGgbLGXG6h4b9QrcrY/edit nanpa pi kipisi ala]
*# [https://drive.google.com/file/d/1fwvben0Uo3ddmhWBZEarWHt80ax9LQiK/view tu kuntu]
*# [https://drive.google.com/file/d/1wGSEiI3XlJ32YKeFRmp6U-HMKW96Ac_4/view mijomi telo]
*# [https://docs.google.com/document/d/1xl5osTAdUfP96ILzYaHpEnSDcxdDVKZ4t01Y8j9ul7w/edit wi lon]

== lipu kule ==

* type: essays
* quality: good
* dialect: ma ponian
* size: ~75k as of 2022-06-25
* license: cc-by-sa 4.0
* preprocessing: remove frontmatter, remove markdown
* where: https://github.com/lipukule/site/tree/main/content/tok/post
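
Removing frontmatter and markdown might be done along these lines. This is a sketch under assumptions: it guesses that each post starts with a front matter block fenced by <code>---</code> or <code>+++</code> lines (check the actual files), and both function names are made up. The markdown stripping only handles a few common constructs.

```python
import re


def strip_frontmatter(text):
    """Drop a leading front matter block fenced by '---' or '+++' lines."""
    lines = text.splitlines()
    if lines and lines[0].strip() in ("---", "+++"):
        fence = lines[0].strip()
        for i in range(1, len(lines)):
            if lines[i].strip() == fence:
                return "\n".join(lines[i + 1:])
    return text  # no front matter found, leave the text alone


def strip_markdown(text):
    """Remove a few common Markdown constructs; this is not a full parser."""
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)  # images -> alt text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)   # links -> label
    text = re.sub(r"^#{1,6}[ \t]+", "", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"^>[ \t]?", "", text, flags=re.MULTILINE)       # block quotes
    text = re.sub(r"(\*\*|__|\*|_|`)", "", text)  # emphasis and code markers
    return text.strip()
```

The last rule deletes every underscore and asterisk, which is fine for plain prose but would mangle anything that uses them literally.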

== lipu tenpo ==

* type: essays
* quality: good
* dialect: ma ponian
* size: ~400k as of 2022-06-25, assuming every issue is about the size of lipu tenpo nanpa pan (~25k)
* license: cc-by-sa 4.0
* preprocessing: surprisingly easy thanks to [https://www.xpdfreader.com/pdftotext-man.html pdftotext], some manual checking still required though (e.g. nanpa pan has a sitelen pona thing)
* where: https://liputenpo.org/ (also: https://wikisource.org/wiki/Category:Lipu_tenpo)
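
A small wrapper around pdftotext could look like the following. It is a sketch that assumes the <code>pdftotext</code> binary is installed and on the PATH; the two function names are hypothetical. It relies on pdftotext writing to standard output when the output file is <code>-</code> and on its default of separating pages with form feed characters.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext on one PDF and return the extracted text."""
    # "-" as the output file makes pdftotext write to standard output.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")


def split_pages(text):
    """Split pdftotext output into pages on form feeds, dropping empty pages."""
    pages = [page.strip() for page in text.split("\f")]
    return [page for page in pages if page]
```

Splitting by page makes the manual check easier, since pages that are mostly sitelen pona or layout debris can be spotted and dropped one at a time.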

== jan Kita’s toki Ramble ==

* type: essays with topic
* quality: depends on how much you detest my (edit: outdated) style
* dialect: very freeform ma ponian, [https://github.com/Sobsz/toki-pona/blob/master/kita.md some of it detailed here]
* size: 19.5k as of 2021-08-13
* license: cc0
* preprocessing: none, though maybe remove <code>toki nasa pi ike mi.txt</code>
* where: https://github.com/Sobsz/toki-ramble

== ante toki pona [with english] ==

* type: prose, dialogue
* quality: varied
* dialect: varied
* size: ~650k when raw as of 2021-06-29, far less once the empty parts are filtered out
* license: cc-by 4.0
* where: http://antetokipona.infinityfreeapp.com/csv/

includes pepper & carrot
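
Filtering out the empty parts could be sketched like this. The column layout is an assumption (toki pona in the first column, english in the second), so adjust <code>tok_col</code> and <code>eng_col</code> to the real files; the function name is made up.

```python
import csv
import io


def nonempty_pairs(csv_text, tok_col=0, eng_col=1):
    """Return (toki pona, english) pairs, skipping rows with no toki pona text."""
    pairs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) <= max(tok_col, eng_col):
            continue  # row is too short to hold both columns
        tok, eng = row[tok_col].strip(), row[eng_col].strip()
        if tok:
            pairs.append((tok, eng))
    return pairs
```

Using the <code>csv</code> module rather than splitting on commas keeps quoted cells with commas in them intact.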

== kalama sin (transcribed by kon Itan) ==

* type: spoken dialogue
* quality: as good as spontaneous speech can be
* dialect: varied, mostly ma ponian
* size: ~97k as of 2022-01-30 (assuming 50% formatting overhead)
* license: none that i know of
* preprocessing: convert from .srt to .txt (remove the first 2 lines after each blank line), remove speaker labels, deal with square brackets
* where: https://drive.google.com/drive/folders/12H2xY06Wtwh4V6zoPOhDnE_R4aZ5sVJp
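
The .srt conversion described above can be sketched as follows. Two things here are assumptions, not facts about the transcripts: the speaker label format (a lowercase word, a capitalised name, then a colon, e.g. <code>jan Kita:</code>) and the choice to simply delete anything in square brackets. The function name is made up.

```python
import re

# Assumed speaker label format: "jan Kita: ..." at the start of a line.
SPEAKER = re.compile(r"^[a-z]+ [A-Z][a-z]+:\s*")


def srt_to_text(srt, drop_brackets=True):
    """Convert SRT subtitle text to plain lines of speech."""
    out = []
    # Subtitle blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip()):
        # The first two lines of a block are the counter and the timestamps.
        for line in block.split("\n")[2:]:
            line = SPEAKER.sub("", line)
            if drop_brackets:
                line = re.sub(r"\[[^\]]*\]", "", line)
            line = " ".join(line.split())  # collapse leftover whitespace
            if line:
                out.append(line)
    return "\n".join(out)
```

The label pattern is kept narrow on purpose so that ordinary sentences containing <code>ni:</code> are left alone.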

== [todo] library ==

https://docs.google.com/document/d/1IdMucmhPCzvoUF94Gp25XCwocWOl4PfQ_wfOkiU8cu8/edit?usp=sharing

== [todo] jan Telakoman’s blog [with English] ==

* type: prose
* quality: good
* dialect:
* size:
* license:
* where: https://joelthomastr.github.io/tokipona/README_si; https://github.com/joelthomastr/tokipona

the english is a literal translation of the toki pona, with all the awkwardness that implies; this may or may not be desired

== mozilla common voice ==

* type: sentences
* quality: varied
* dialect: varied
* size: 230k as of 2022-06-27
* license: cc0
* where: https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt

note that this corpus largely consists of these already listed sources:
* jan Kita’s toki Ramble
* ante toki pona (toki soweli)
* tu kuntu
* jan Sitata
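
Because of this overlap, combining Common Voice with the sources it draws from will count some sentences twice. A minimal deduplication sketch, assuming each corpus has already been reduced to a list of sentences (the function name is hypothetical):

```python
def merge_without_duplicates(*corpora):
    """Concatenate sentence lists, keeping the first copy of each sentence."""
    seen = set()
    merged = []
    for corpus in corpora:
        for sentence in corpus:
            # Compare sentences ignoring case and spacing differences.
            key = " ".join(sentence.lower().split())
            if key and key not in seen:
                seen.add(key)
                merged.append(sentence)
    return merged
```

Exact matching after normalisation only catches verbatim copies; sentences that were edited before submission will still slip through.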

audio recordings (of varying quality) available here: https://commonvoice.mozilla.org/en/datasets

== jan Sitata [with english] ==

* type: book
* quality: good
* dialect: pu
* size: ~60k
* license: cc0, explicit consent for machine learning by jan Sonja ([https://discord.com/channels/301377942062366741/912286596517220363/991845727095509002 ma pona])
* where: https://tokipona.org/sitata/

english: https://www.gutenberg.org/files/2500/2500-h/2500-h.htm

== [todo] davidar’s translations [with english] ==

https://github.com/davidar/toki-ante

== storyweaver [with english] ==

* type: translations of children's books
* quality: varied
* dialect: varied
* size: 17 books as of 2022-11-23, rough estimate: ~10k
* license: cc-by 4.0
* preprocessing: good luck
* where: https://storyweaver.org.in/stories?language=Toki%20Pona
[[Category:Literature]]

Latest revision as of 19:07, 18 May 2024
