Corpora: Difference between revisions

(18 intermediate revisions by 6 users not shown)
{{Needs work|Complete metadata}}
{{Extra license|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]|it mostly consists of uncopyrightable data}} <!-- feel free to remove, i don't mind -->
This is a list of Toki Pona '''corpora''' and other sources that can be readily used as such, useful for linguistic analysis or learning (machine or otherwise).


==List==
:This page was previously located at [https://pad.snopyta.org/lDb2EfOZQpmleu-ZktbDzg pad.snopyta.org].
{| class="wikitable sortable"
|+
!Name
!Size (MB)
!Era
!Authors
!Fluency
!Type
!Parallel?
!License
!Notes
|-
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection]
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref>
|?–2017
|Many
|Varies
|Varies
|Some English
|None
|Contains:
* {{tok|jan Kipo}}'s corpus (3.3 MB)
* Matthew Dean Martin's corpus (0.1 MB)
|-
|[[Tatoeba]]
|2.35
|2010–
|Many
|Varies
|Translated sentences
|English, German, many others
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR]
|Sentence quality is marked in user reviews, which can be used for filtering
|-
|{{tp|[[lipu tenpo]]}}
|~0.7
|2021–
|Some
|High
|Articles, poems
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ on their website].
|-
|{{tp|[[lipu kule]]}}
|{{N/A|?}}
|2021–
|Some
|High
|Articles
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|
|-
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}]
|{{N/A|?}}
|2021–
|Few
|High
|Translated fiction
|English
|[https://creativecommons.org/licenses/by/4.0/ CC BY 4.0]
|
|-
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice]
|0.24
|2021–
|Few
|Varies
|Sentences
|No
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]
|Contains:
* {{tp|[[tu kuntu]]}}
* {{tp|[[jan Sitata]]}}
* {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble]
* Parts of {{tp|toki soweli}} (see also {{tp|ante toki pona}}).
|-
|{{tp|[[kalama sin]]}} ([[oldwikisource:kalama sin|transcripts]])
|{{N/A|?}}
|2021–
|Some
|Varies
|Spontaneous and scripted speech
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|
|}


==Notes==
<references group="lower-alpha" />

{{Media}}
[[Category:Literature| ]]

== davidar’s metacorpus ==

* type: varied
* quality: varied (i’ve heard jan Kipo tampered with his corpus to conform to his idea of toki pona grammar)
* dialect: mostly old/pu
* size: 4675k ''but'' a bunch of it is in english or duplicated, gotta count it properly eventually
* license: none, some under various cc licenses (247k, mostly by jan Kipo)
* preprocessing: probably a lot
* where: https://github.com/davidar/nltk-tp/tree/master/Corpus

note that this corpus contains the entirety of:

* the [https://github.com/matthewdeanmartin/tokipona.parser/tree/master/TokiPonaTools/TokiPona/corpus/forums TokiPonaTools] corpus (104k)
* the jan Kipo corpus [with english] (3281k)
* the Little Prince translation [with english if you wanna look for it] (54k)

== tatoeba [with english] ==

* type: single sentences, some with translations
* quality: varied (can be filtered with <code>Users' sentence reviews</code>)
* dialect: mostly pu
* size: 1896k (as of 2021-06-12)
* license: cc-by 2.0 fr
* preprocessing: extract last column of tsv file, replace <code>Ton</code> and <code>Mali</code>/<code>Mewi</code>/<code>Mawi</code> with random names
* where: https://tatoeba.org/en/downloads

audio recordings sorta available (under cc-by-sa 4.0) but you’ll have to scrape them yourself
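
The preprocessing noted above could be sketched roughly like this. It assumes the download is tab-separated with the sentence text in the last column; the replacement name pool is purely hypothetical, and any list of names would do.

```python
import random
import re

# Placeholder names listed in the notes above.
PLACEHOLDERS = re.compile(r"\b(Ton|Mali|Mewi|Mawi)\b")

# Hypothetical replacement pool (an assumption, not part of the original notes).
NAME_POOL = ["Sonja", "Lisa", "Pawe", "Kasi", "Nowa"]


def last_column(tsv_line: str) -> str:
    """Return the last tab-separated field of one line (assumed to be the sentence)."""
    return tsv_line.rstrip("\n").split("\t")[-1]


def randomize_names(sentence: str, rng: random.Random) -> str:
    """Swap each placeholder name for a random name from NAME_POOL."""
    return PLACEHOLDERS.sub(lambda _match: rng.choice(NAME_POOL), sentence)


def preprocess(lines, seed=0):
    """Yield cleaned sentences from an iterable of TSV lines, skipping blank lines."""
    rng = random.Random(seed)
    for line in lines:
        if line.strip():
            yield randomize_names(last_column(line), rng)
```

Passing an open file object to `preprocess` works, since it only needs an iterable of lines. The fixed seed keeps runs reproducible.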

== wikipesija ==

* type: articles, some definitions, a bit of discussion in talk pages
* quality: varied
* dialect: varied
* size: 1084k raw, probably half? when stripped (as of 2021-06-13)
* license: cc-by-sa 3.0 (some 4.0)
* preprocessing: strip templates and markup
* where:
** https://archive.org/download/wiki-wikipesija.org
** https://wikipesija.org/wiki/ilo:Export (for chosen categories)
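
A rough sketch of the "strip templates and markup" step, using only regular expressions. It is an approximation under the assumption that the dump contains ordinary wikitext; tables, namespaced links and other edge cases are not handled, and a dedicated wikitext parser would likely be more robust.

```python
import re


def strip_wikitext(text: str) -> str:
    """Rough wikitext-to-plain-text conversion using regular expressions."""
    # Remove innermost {{templates}} repeatedly so nested ones disappear too.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # Drop <ref>...</ref> footnotes, then any remaining HTML-like tags.
    text = re.sub(r"<ref\b[^>/]*>.*?</ref>", "", text, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # [https://example.org label] -> label; a bare [url] is removed.
    text = re.sub(r"\[https?://[^\s\]]+\s*([^\]]*)\]", r"\1", text)
    # Remove bold/italic quote runs and heading markers.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    # Collapse runs of spaces left behind by the removals.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()
```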

== ma pona screenplay contest ==

* type: screenplay
* quality: good, though grammar funnies abound
* dialect: ma ponian
* size: 54.2k
*# toki ala o e toki Inli - 1.5k
*# nanpa pi kipisi ala - 0.5k (not even really toki pona, not worth it)
*# tu kuntu - 35.6k
*# mijomi telo - 10.somethingk (can’t copy first page for some reason)
*# wi lon - 7.1k
* license: none
* preprocessing: extract 3 and 4 from pdf, uppercase names in 4, probably move names from header to somewhere else
* where:
*# [https://docs.google.com/document/d/1W21rSjx2eyYLjcipFGcmLEa-nQenge7wzLk87Tq-CuE/edit toki ala o e toki Inli]
*# [https://docs.google.com/document/d/1DXcXoUm8vSAGsAtXuhhiMG36jAGgbLGXG6h4b9QrcrY/edit nanpa pi kipisi ala]
*# [https://drive.google.com/file/d/1fwvben0Uo3ddmhWBZEarWHt80ax9LQiK/view tu kuntu]
*# [https://drive.google.com/file/d/1wGSEiI3XlJ32YKeFRmp6U-HMKW96Ac_4/view mijomi telo]
*# [https://docs.google.com/document/d/1xl5osTAdUfP96ILzYaHpEnSDcxdDVKZ4t01Y8j9ul7w/edit wi lon]

== lipu kule ==

* type: essays
* quality: good
* dialect: ma ponian
* size: ~75k as of 2022-06-25
* license: cc-by-sa 4.0
* preprocessing: remove frontmatter, remove markdown
* where: https://github.com/lipukule/site/tree/main/content/tok/post
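
One possible way to do the preprocessing, as a minimal sketch. It assumes each post starts with a front matter block fenced by `---` or `+++` lines and uses only basic Markdown; both are assumptions about the repository, not something confirmed here.

```python
import re


def strip_front_matter(text: str) -> str:
    """Drop a leading front matter block fenced by '---' or '+++' lines."""
    match = re.match(
        r"\A(---|\+\+\+)[ \t]*\n.*?\n\1[ \t]*(?:\n|\Z)", text, flags=re.DOTALL
    )
    return text[match.end():] if match else text


def strip_markdown(text: str) -> str:
    """Remove common Markdown markers, keeping the words."""
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)       # images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)   # links -> label
    # Heading, blockquote and bullet markers at the start of a line.
    text = re.sub(r"^[ \t]{0,3}(#{1,6}|>|[-*+])[ \t]+", "", text, flags=re.MULTILINE)
    # Emphasis and inline-code marks (this also removes underscores inside words).
    text = re.sub(r"(\*\*|__|\*|_|`)", "", text)
    return text.strip()
```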

== lipu tenpo ==

* type: essays
* quality: good
* dialect: ma ponian
* size: ~325k as of 2022-06-25 (assuming all are similar to lipu tenpo nanpa pan (~25k per issue))
* license: cc-by-sa 4.0
* preprocessing: copy text, replace newlines with spaces, normalize punctuation, remove asterisks, some manual checking (nanpa pan has a sitelen pona thing)
* where: https://liputenpo.org/ (also: https://wikisource.org/wiki/Category:Lipu_tenpo)
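
The mechanical part of this preprocessing might look like the sketch below. The punctuation mapping is an assumed choice (curly quotes, dashes and ellipses to ASCII); the manual checking mentioned above still has to be done by hand.

```python
import re

# Typographic characters mapped to plain ASCII equivalents (an assumed choice).
PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
})


def clean_article(text: str) -> str:
    """Flatten copied article text into a single normalized line."""
    text = text.replace("*", "")         # remove asterisks
    text = text.translate(PUNCTUATION)   # normalize punctuation
    text = re.sub(r"\s+", " ", text)     # newlines and runs of spaces -> one space
    return text.strip()
```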

== jan Kita’s toki Ramble ==

* type: essays with topic
* quality: depends on how much you detest my (edit: outdated) style
* dialect: very freeform ma ponian, [https://github.com/Sobsz/toki-pona/blob/master/kita.md some of it detailed here]
* size: 19.5k as of 2021-08-13
* license: cc0
* preprocessing: none, though maybe remove <code>toki nasa pi ike mi.txt</code>
* where: https://github.com/Sobsz/toki-ramble

== ante toki pona [with english] ==

* type: prose, dialogue
* quality: varied
* dialect: varied
* size: ~650k when raw as of 2021-06-29, far less once the empty parts are filtered out
* license: cc-by 4.0
* where: http://antetokipona.infinityfreeapp.com/csv/

includes pepper & carrot
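
Filtering out the empty parts could be sketched as below. The column layout of the CSV files is not described here, so the column name is passed in as a parameter, and the `tok` name used in the usage example is hypothetical.

```python
import csv
import io


def non_empty_rows(csv_text: str, column: str):
    """Yield rows whose `column` cell contains something other than whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if (row.get(column) or "").strip():
            yield row


# Usage with made-up data; real files may use different headers.
example = "en,tok\nHello,toki\nGoodbye,\nYes,lon\n"
kept = [row["tok"] for row in non_empty_rows(example, "tok")]
```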

== kalama sin (transcribed by kon Itan) ==

* type: spoken dialogue
* quality: as good as spontaneous speech can be
* dialect: varied, mostly ma ponian
* size: ~97k as of 2022-01-30 (assuming 50% formatting overhead)
* license: none that i know of
* preprocessing: convert from .srt to .txt (remove the first 2 lines after each blank line), remove speaker labels, deal with square brackets
* where: https://drive.google.com/drive/folders/12H2xY06Wtwh4V6zoPOhDnE_R4aZ5sVJp
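
A minimal sketch of the .srt conversion. Two things are assumptions: speaker labels are taken to look like `name: text`, and square-bracketed notes are simply dropped. A sentence that itself contains a colon near its start would be clipped by this, so the output deserves a manual look.

```python
import re
from typing import List


def srt_to_text(srt: str) -> List[str]:
    """Extract spoken lines from SubRip text, dropping cue numbers and timestamps."""
    lines = []
    srt = srt.replace("\r\n", "\n")
    # Subtitle blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        # The first two lines of each block are the cue number and the timestamp.
        for line in block.splitlines()[2:]:
            line = re.sub(r"\[[^\]]*\]", "", line)        # assumed: bracketed notes are dropped
            line = re.sub(r"^[^:]{1,30}:\s*", "", line)   # assumed: "speaker: text" labels
            line = line.strip()
            if line:
                lines.append(line)
    return lines
```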

== [todo] library ==

https://docs.google.com/document/d/1IdMucmhPCzvoUF94Gp25XCwocWOl4PfQ_wfOkiU8cu8/edit?usp=sharing

== [todo] jan Telakoman’s blog [with English] ==

* type: prose
* quality: good
* dialect:
* size:
* license:
* where: https://joelthomastr.github.io/tokipona/README_si; https://github.com/joelthomastr/tokipona

the english is a literal translation of the toki pona, with all the awkwardness that implies; this may or may not be desired

== mozilla common voice ==

* type: sentences
* quality: varied
* dialect: varied
* size: 230k as of 2022-06-27
* license: cc0
* where: https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt

note that this corpus largely consists of these already listed sources:
* jan Kita’s toki Ramble
* ante toki pona (toki soweli)
* tu kuntu
* jan Sitata

audio recordings (of varying quality) available here: https://commonvoice.mozilla.org/en/datasets
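
Because of this overlap, combining Common Voice with the other sources would count some sentences twice. A minimal sketch of dropping sentences that are already known follows; matching on case and whitespace only is an assumption, and stricter or looser matching may suit a given use better.

```python
def normalize(sentence: str) -> str:
    """Comparison key: lowercase with whitespace collapsed."""
    return " ".join(sentence.lower().split())


def drop_known(sentences, known):
    """Yield sentences not present in `known` and not already yielded."""
    seen = {normalize(s) for s in known}
    for sentence in sentences:
        key = normalize(sentence)
        if key and key not in seen:
            seen.add(key)
            yield sentence
```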

== jan Sitata [with english] ==

* type: book
* quality: good
* dialect: pu
* size: ~60k
* license: cc0, explicit consent for machine learning by jan Sonja ([https://discord.com/channels/301377942062366741/912286596517220363/991845727095509002 ma pona])
* where: https://tokipona.org/sitata/

english: https://www.gutenberg.org/files/2500/2500-h/2500-h.htm

== [todo] davidar’s translations [with english] ==

https://github.com/davidar/toki-ante

Latest revision as of 19:07, 18 May 2024
