Corpora: Difference between revisions

From sona pona, the Toki Pona wiki
No edit summary
(tableification, trimming)
Tag: 2017 source edit
{{Needs work|complete metadata}}
{{Extra license|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]|it mostly consists of uncopyrightable data}} <!-- feel free to remove, i don't mind -->
The following is a list of Toki Pona '''corpora''' (or sources that can be readily used as such), useful for linguistic analysis or learning (machine or otherwise).

{{Hatnote|This page was previously located at [https://pad.snopyta.org/lDb2EfOZQpmleu-ZktbDzg pad.snopyta.org].}}

{| class="wikitable sortable"
|+
!Name
!Size (MB)
!Era
!Authors
!Fluency
!Type
!Parallel?
!License
!Notes
|-
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection]
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref>
|?-2017
|Many
|Varies
|Varies
|some English
|None
|Contains: {{tok|jan Kipo}}'s corpus (3.3 MB), Matthew Dean Martin's corpus (0.1 MB)
|-
|[[Tatoeba]]
|2.35
|2010-
|Many
|Varies
|Translated sentences
|English, German, many others
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR]
|Data quality is marked in "Users' sentence reviews".
|-
|{{tp|[[lipu tenpo]]}}
|~0.7
|2021-
|Some
|High
|Articles, poems
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ here].
|-
|{{tp|[[lipu kule]]}}
|?
|2021-
|Some
|High
|Articles
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|
|-
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}]
|?
|2021-
|Few
|High
|Translated fiction
|English
|[https://creativecommons.org/licenses/by/4.0/ CC BY 4.0]
|
|-
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice]
|0.24
|2021-
|Few
|Varies
|Sentences, some translated
|No
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]
|Contains: {{tp|[[tu kuntu]]}}, {{tp|[[jan Sitata]]}}, {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble], some of {{tp|toki soweli}} (see {{tp|ante toki pona}}).
|-
|[https://drive.google.com/drive/folders/12H2xY06Wtwh4V6zoPOhDnE_R4aZ5sVJp Transcripts] of {{tp|[[kalama sin]]}}
|?
|2021-
|Some
|Varies
|Spontaneous and scripted speech
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|
|}

<references group="lower-alpha" />

== davidar’s metacorpus ==

* type: varied
* quality: varied (i’ve heard jan Kipo tampered with his corpus to conform to his idea of toki pona grammar)
* dialect: mostly old/pu
* size: 4675k ''but'' a bunch of it is in english or duplicated, gotta count it properly eventually
* license: none, some under various cc licenses (247k, mostly by jan Kipo)
* preprocessing: probably a lot
* where: https://github.com/davidar/nltk-tp/tree/master/Corpus

note that this corpus contains the entirety of:

* the [https://github.com/matthewdeanmartin/tokipona.parser/tree/master/TokiPonaTools/TokiPona/corpus/forums TokiPonaTools] corpus (104k)
* the jan Kipo corpus [with english] (3281k)
* the Little Prince translation [with english if you wanna look for it] (54k)

== tatoeba [with english] ==

* type: single sentences, some with translations
* quality: varied (can be filtered with <code>Users' sentence reviews</code>)
* dialect: mostly pu
* size: 1896k (as of 2021-06-12)
* license: cc-by 2.0 fr
* preprocessing: extract last column of tsv file, replace <code>Ton</code> and <code>Mali</code>/<code>Mewi</code>/<code>Mawi</code> with random names
* where: https://tatoeba.org/en/downloads

audio recordings sorta available (under cc-by-sa 4.0) but you’ll have to scrape them yourself

== wikipesija ==

* type: articles, some definitions, a bit of discussion in talk pages
* quality: varied
* dialect: varied
* size: 1084k raw, probably half? when stripped (as of 2021-06-13)
* license: cc-by-sa 3.0 (some 4.0)
* preprocessing: strip templates and markup
* where:
** https://archive.org/download/wiki-wikipesija.org
** https://wikipesija.org/wiki/ilo:Export (for chosen categories)

== ma pona screenplay contest ==

* type: screenplay
* quality: good, though grammar funnies abound
* dialect: ma ponian
* size: 54.2k
*# toki ala o e toki Inli - 1.5k
*# nanpa pi kipisi ala - 0.5k (not even really toki pona, not worth it)
*# tu kuntu - 35.6k
*# mijomi telo - 10.somethingk (can’t copy first page for some reason)
*# wi lon - 7.1k
* license: none
* preprocessing: extract 3 and 4 from pdf, uppercase names in 4, probably move names from header to somewhere else
* where:
*# [https://docs.google.com/document/d/1W21rSjx2eyYLjcipFGcmLEa-nQenge7wzLk87Tq-CuE/edit toki ala o e toki Inli]
*# [https://docs.google.com/document/d/1DXcXoUm8vSAGsAtXuhhiMG36jAGgbLGXG6h4b9QrcrY/edit nanpa pi kipisi ala]
*# [https://drive.google.com/file/d/1fwvben0Uo3ddmhWBZEarWHt80ax9LQiK/view tu kuntu]
*# [https://drive.google.com/file/d/1wGSEiI3XlJ32YKeFRmp6U-HMKW96Ac_4/view mijomi telo]
*# [https://docs.google.com/document/d/1xl5osTAdUfP96ILzYaHpEnSDcxdDVKZ4t01Y8j9ul7w/edit wi lon]

== lipu kule ==

* type: essays
* quality: good
* dialect: ma ponian
* size: ~75k as of 2022-06-25
* license: cc-by-sa 4.0
* preprocessing: remove frontmatter, remove markdown
* where: https://github.com/lipukule/site/tree/main/content/tok/post

== lipu tenpo ==

* type: essays
* quality: good
* dialect: ma ponian
* size: ~400k as of 2022-06-25 (assuming all are similar to lipu tenpo nanpa pan (~25k per issue))
* license: cc-by-sa 4.0
* preprocessing: surprisingly easy thanks to [https://www.xpdfreader.com/pdftotext-man.html pdftotext], some manual checking still required though (e.g. nanpa pan has a sitelen pona thing)
* where: https://liputenpo.org/ (also: https://wikisource.org/wiki/Category:Lipu_tenpo)

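The pdftotext step noted for lipu tenpo could be scripted roughly as below. This is only a sketch: it assumes <code>pdftotext</code> is installed and on the PATH, and the helper names and file paths are made up for illustration. The <code>-enc</code> and <code>-layout</code> options come from the pdftotext man page; the output still needs the manual checking mentioned above.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the pdftotext call (hypothetical helper, not part of any tool)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical page layout (columns etc.)
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_issue(pdf_path):
    """Write the text of one PDF issue next to it as a .txt file."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling <code>extract_issue("nanpa-pan.pdf")</code> (a hypothetical file name) would write <code>nanpa-pan.txt</code> beside the PDF.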
== jan Kita’s toki Ramble ==

* type: essays with topic
* quality: depends on how much you detest my (edit: outdated) style
* dialect: very freeform ma ponian, [https://github.com/Sobsz/toki-pona/blob/master/kita.md some of it detailed here]
* size: 19.5k as of 2021-08-13
* license: cc0
* preprocessing: none, though maybe remove <code>toki nasa pi ike mi.txt</code>
* where: https://github.com/Sobsz/toki-ramble

== ante toki pona [with english] ==

* type: prose, dialogue
* quality: varied
* dialect: varied
* size: ~650k when raw as of 2021-06-29, far less once the empty parts are filtered out
* license: cc-by 4.0
* where: http://antetokipona.infinityfreeapp.com/csv/

includes pepper & carrot

== kalama sin (transcribed by kon Itan) ==

* type: spoken dialogue
* quality: as good as spontaneous speech can be
* dialect: varied, mostly ma ponian
* size: ~97k as of 2022-01-30 (assuming 50% formatting overhead)
* license: none that i know of
* preprocessing: convert from .srt to .txt (remove the first 2 lines after each blank line), remove speaker labels, deal with square brackets
* where: https://drive.google.com/drive/folders/12H2xY06Wtwh4V6zoPOhDnE_R4aZ5sVJp
* clean version: https://cdn.discordapp.com/attachments/773267909903908894/1094680735069257798/file {{dead link}}
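
The preprocessing steps listed for kalama sin could look roughly like this. It is a minimal sketch under two assumptions that the list does not spell out: speaker labels are taken to be a short <code>name:</code> prefix at the start of a cue, and "deal with square brackets" is read as dropping the bracketed text entirely. The function name is made up.

```python
import re

# Assumption: a speaker label is a short "name:" prefix at the start of a cue.
SPEAKER_LABEL = re.compile(r"^[^:\[\]]{1,40}:\s*")
# Assumption: square brackets hold annotations that can simply be dropped.
BRACKETED = re.compile(r"\[[^\]]*\]")


def srt_to_text(srt: str) -> str:
    """Strip SRT numbering, timestamps, speaker labels and bracketed notes."""
    lines_out = []
    # Cues are separated by blank lines; normalise Windows line endings first.
    for block in re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip()):
        # The first 2 lines of each cue are the index and the timestamp.
        cue_lines = block.split("\n")[2:]
        text = " ".join(line.strip() for line in cue_lines)
        text = SPEAKER_LABEL.sub("", text)
        text = BRACKETED.sub("", text)
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            lines_out.append(text)
    return "\n".join(lines_out)
```

The label pattern also strips any ordinary sentence that happens to start with a short phrase followed by a colon, so it is worth checking the output against the transcripts before relying on it.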

== [todo] library ==

https://docs.google.com/document/d/1IdMucmhPCzvoUF94Gp25XCwocWOl4PfQ_wfOkiU8cu8/edit?usp=sharing

== [todo] jan Telakoman’s blog [with English] ==

* type: prose
* quality: good
* dialect:
* size:
* license:
* where: https://joelthomastr.github.io/tokipona/README_si; https://github.com/joelthomastr/tokipona

the english is a literal translation of the toki pona, with all the awkwardness that implies; this may or may not be desired

== mozilla common voice ==

* type: sentences
* quality: varied
* dialect: varied
* size: 230k as of 2022-06-27
* license: cc0
* where: https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt

note that this corpus largely consists of these already listed sources:
* jan Kita’s toki Ramble
* ante toki pona (toki soweli)
* tu kuntu
* jan Sitata
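
Because of this overlap, combining common voice with the sources above double-counts sentences. One way to drop the repeats is sketched here; it assumes each corpus has already been loaded as a list of sentences, and it only catches exact matches after case and whitespace normalisation. The function names are made up.

```python
def normalise(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())


def drop_overlap(candidate, already_listed):
    """Return sentences from candidate that are not in already_listed.

    Repeats inside candidate itself are also dropped; the first one is kept.
    """
    seen = {normalise(s) for s in already_listed}
    kept = []
    for sentence in candidate:
        key = normalise(sentence)
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return kept
```

Sentences that were edited on the way into common voice (punctuation changes, renamed people) will not match and stay in.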

audio recordings (of varying quality) available here: https://commonvoice.mozilla.org/en/datasets

== jan Sitata [with english] ==

* type: book
* quality: good
* dialect: pu
* size: ~60k
* license: cc0, explicit consent for machine learning by jan Sonja ([https://discord.com/channels/301377942062366741/912286596517220363/991845727095509002 ma pona])
* where: https://tokipona.org/sitata/

english: https://www.gutenberg.org/files/2500/2500-h/2500-h.htm

== [todo] davidar’s translations [with english] ==

https://github.com/davidar/toki-ante

== storyweaver [with english] ==

* type: translations of children's books
* quality: varied
* dialect: varied
* size: 17 books as of 2022-11-23, rough estimate: ~10k
* license: cc-by 4.0
* preprocessing: good luck
* where: https://storyweaver.org.in/stories?language=Toki%20Pona
{{Media}}
[[Category:Literature| ]]

Revision as of 20:21, 4 April 2024

This article needs work: complete metadata.

The following is a list of Toki Pona corpora (or sources that can be readily used as such), useful for linguistic analysis or learning (machine or otherwise).

{| class="wikitable sortable"
!Name !!Size (MB) !!Era !!Authors !!Fluency !!Type !!Parallel? !!License !!Notes
|-
|davidar's nltk-tp collection ||<4.68[a] ||?-2017 ||Many ||Varies ||Varies ||some English ||None ||Contains: jan Kipo's corpus (3.3 MB), Matthew Dean Martin's corpus (0.1 MB)
|-
|Tatoeba ||2.35 ||2010- ||Many ||Varies ||Translated sentences ||English, German, many others ||CC BY 2.0 FR ||Data quality is marked in "Users' sentence reviews".
|-
|lipu tenpo ||~0.7 ||2021- ||Some ||High ||Articles, poems ||No ||CC BY-SA 4.0 ||Text can be extracted using the command-line utility pdftotext, though with imperfect formatting. Some articles are available in webpage form at https://liputenpo.org/toki/.
|-
|lipu kule ||? ||2021- ||Some ||High ||Articles ||No ||CC BY-SA 4.0 ||
|-
|ante toki pona ||? ||2021- ||Few ||High ||Translated fiction ||English ||CC BY 4.0 ||
|-
|Mozilla Common Voice ||0.24 ||2021- ||Few ||Varies ||Sentences, some translated ||No ||CC0 1.0 ||Contains: tu kuntu, jan Sitata, jan Kita's toki Ramble, some of toki soweli (see ante toki pona).
|-
|Transcripts of kalama sin ||? ||2021- ||Some ||Varies ||Spontaneous and scripted speech ||No ||CC BY-SA 4.0 ||
|}

a. This includes non-Toki Pona data.