Corpora: Difference between revisions
(tableification, trimming) Tag: 2017 source edit |
{{Needs work|complete metadata}} |
|||
{{Extra license|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]|it mostly consists of uncopyrightable data}} <!-- feel free to remove, i don't mind --> |
|||
The following is a list of Toki Pona '''corpora''' (or sources that can be readily used as such), useful for linguistic analysis or learning (machine or otherwise). |
|||
{| class="wikitable sortable" |
|||
{{Hatnote|This page was previously located at [https://pad.snopyta.org/lDb2EfOZQpmleu-ZktbDzg pad.snopyta.org].}} |
|||
|+ |
|||
!Name |
|||
== davidar’s metacorpus == |
|||
!Size (MB) |
|||
!Era |
|||
* type: varied |
|||
!Authors |
|||
* quality: varied (i’ve heard jan Kipo tampered with his corpus to conform to his idea of toki pona grammar) |
|||
!Fluency |
|||
* dialect: mostly old/pu |
|||
!Type |
|||
* size: 4675k ''but'' a bunch of it is in english or duplicated, gotta count it properly eventually |
|||
!Parallel? |
|||
* license: none, some under various cc licenses (247k, mostly by jan Kipo) |
|||
!License |
|||
* preprocessing: probably a lot |
|||
!Notes |
|||
* where: https://github.com/davidar/nltk-tp/tree/master/Corpus |
|||
|- |
|||
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection] |
|||
note that this corpus contains the entirety of: |
|||
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref> |
|||
|?-2017 |
|||
* the [https://github.com/matthewdeanmartin/tokipona.parser/tree/master/TokiPonaTools/TokiPona/corpus/forums TokiPonaTools] corpus (104k) |
|||
|Many |
|||
* the jan Kipo corpus [with english] (3281k) |
|||
|Varies |
|||
* the Little Prince translation [with english if you wanna look for it] (54k) |
|||
|Varies |
|||
|some English |
|||
== tatoeba [with english] == |
|||
|None |
|||
|Contains: {{tok|jan Kipo}}'s corpus (3.3 MB), Matthew Dean Martin's corpus (0.1 MB) |
|||
* type: single sentences, some with translations |
|||
|- |
|||
* quality: varied (can be filtered with <code>Users' sentence reviews</code>) |
|||
|[[Tatoeba]] |
|||
* dialect: mostly pu |
|||
|2.35 |
|||
* size: 1896k (as of 2021-06-12) |
|||
|2010- |
|||
* license: cc-by 2.0 fr
|||
|Many |
|||
* preprocessing: extract last column of tsv file, replace <code>Ton</code> and <code>Mali</code>/<code>Mewi</code>/<code>Mawi</code> with random names |
|||
|Varies |
|||
* where: https://tatoeba.org/en/downloads |
|||
|Translated sentences |
|||
|English, German, many others |
|||
audio recordings sorta available (under cc-by-sa 4.0) but you’ll have to scrape them yourself |
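The tatoeba preprocessing step above could be sketched roughly like this. It assumes the export is tab-separated with the sentence text in the last column (as the note says). The names in <code>REPLACEMENT_NAMES</code> are made up for illustration, and each occurrence of a placeholder gets its own random pick, which may or may not be what you want.

```python
import random
import re

# Placeholder names mentioned in the preprocessing note above.
PLACEHOLDER_NAMES = ("Ton", "Mali", "Mewi", "Mawi")
# Hypothetical replacement pool; swap in whatever list of names you prefer.
REPLACEMENT_NAMES = ("Ali", "Kepa", "Lina", "Musi", "Sani")

# Match the placeholders only as whole words.
_NAME_RE = re.compile(r"\b(" + "|".join(PLACEHOLDER_NAMES) + r")\b")


def last_column(line):
    """Return the last tab-separated field of one TSV line."""
    return line.rstrip("\n").split("\t")[-1]


def randomize_names(sentence, rng):
    """Replace each placeholder name with a randomly chosen name."""
    return _NAME_RE.sub(lambda _m: rng.choice(REPLACEMENT_NAMES), sentence)


def preprocess(lines, seed=0):
    """Extract sentence text from TSV lines and swap out placeholder names."""
    rng = random.Random(seed)
    return [randomize_names(last_column(line), rng) for line in lines if line.strip()]
```

Feed it the lines of the downloaded sentence file; the fixed <code>seed</code> only keeps runs reproducible.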
|||
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR] |
|||
|Data quality is marked in "Users' sentence reviews". |
|||
== wikipesija == |
|||
|- |
|||
|{{tp|[[lipu tenpo]]}} |
|||
* type: articles, some definitions, a bit of discussion in talk pages |
|||
|~0.7 |
|||
* quality: varied |
|||
|2021- |
|||
* dialect: varied |
|||
|Some |
|||
* size: 1084k raw, probably about half that once stripped (as of 2021-06-13)
|||
|High |
|||
* license: cc-by-sa 3.0 (some 4.0) |
|||
|Articles, poems |
|||
* preprocessing: strip templates and markup |
|||
|No |
|||
* where: |
|||
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0] |
|||
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ here]. |
|||
** https://wikipesija.org/wiki/ilo:Export (for chosen categories) |
|||
|- |
|||
|{{tp|[[lipu kule]]}} |
|||
== ma pona screenplay contest == |
|||
|? |
|||
|2021- |
|||
* type: screenplay |
|||
|Some |
|||
* quality: good, though grammar funnies abound
|||
|High |
|||
* dialect: ma ponian |
|||
|Articles |
|||
* size: 54.2k |
|||
|No |
|||
*# toki ala o e toki Inli - 1.5k |
|||
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0] |
|||
*# nanpa pi kipisi ala - 0.5k (not even really toki pona, not worth it) |
|||
| |
|||
*# tu kuntu - 35.6k |
|||
|- |
|||
*# mijomi telo - 10.somethingk (can’t copy first page for some reason) |
|||
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}] |
|||
*# wi lon - 7.1k |
|||
|? |
|||
* license: none |
|||
|2021- |
|||
* preprocessing: extract entries 3 and 4 (tu kuntu, mijomi telo) from their pdfs, uppercase names in 4, probably move names from header to somewhere else
|||
|Few |
|||
* where: |
|||
|High |
|||
*# [https://docs.google.com/document/d/1W21rSjx2eyYLjcipFGcmLEa-nQenge7wzLk87Tq-CuE/edit toki ala o e toki Inli] |
|||
|Translated fiction |
|||
*# [https://docs.google.com/document/d/1DXcXoUm8vSAGsAtXuhhiMG36jAGgbLGXG6h4b9QrcrY/edit nanpa pi kipisi ala] |
|||
|English |
|||
*# [https://drive.google.com/file/d/1fwvben0Uo3ddmhWBZEarWHt80ax9LQiK/view tu kuntu] |
|||
|[https://creativecommons.org/licenses/by/4.0/ CC BY 4.0] |
|||
*# [https://drive.google.com/file/d/1wGSEiI3XlJ32YKeFRmp6U-HMKW96Ac_4/view mijomi telo] |
|||
| |
|||
*# [https://docs.google.com/document/d/1xl5osTAdUfP96ILzYaHpEnSDcxdDVKZ4t01Y8j9ul7w/edit wi lon] |
|||
|- |
|||
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice] |
|||
== lipu kule == |
|||
|0.24 |
|||
|2021- |
|||
* type: essays |
|||
|Few |
|||
* quality: good |
|||
|Varies |
|||
* dialect: ma ponian |
|||
|Sentences, some translated |
|||
* size: ~75k as of 2022-06-25 |
|||
|No |
|||
* license: cc-by-sa 4.0 |
|||
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0] |
|||
* preprocessing: remove frontmatter, remove markdown |
|||
|Contains: {{tp|[[tu kuntu]]}}, {{tp|[[jan Sitata]]}}, {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble], some of {{tp|toki soweli}} (see {{tp|ante toki pona}}). |
|||
* where: https://github.com/lipukule/site/tree/main/content/tok/post |
|||
|- |
|||
|[https://drive.google.com/drive/folders/12H2xY06Wtwh4V6zoPOhDnE_R4aZ5sVJp Transcripts] of {{tp|[[kalama sin]]}} |
|||
== lipu tenpo == |
|||
|? |
|||
|2021- |
|||
* type: essays |
|||
|Some |
|||
* quality: good |
|||
|Varies |
|||
* dialect: ma ponian |
|||
|Spontaneous and scripted speech |
|||
* size: ~400k as of 2022-06-25 (assuming all are similar to lipu tenpo nanpa pan (~25k per issue)) |
|||
|No |
|||
* license: cc-by-sa 4.0 |
|||
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0] |
|||
* preprocessing: surprisingly easy thanks to [https://www.xpdfreader.com/pdftotext-man.html pdftotext], some manual checking still required though (e.g. nanpa pan has a sitelen pona thing) |
|||
| |
|||
* where: https://liputenpo.org/ (also: https://wikisource.org/wiki/Category:Lipu_tenpo) |
|||
|} |
|||
<references group="lower-alpha" /> |
|||
== jan Kita’s toki Ramble == |
|||
* type: essays with topic |
|||
* quality: depends on how much you detest my (edit: outdated) style |
|||
* dialect: very freeform ma ponian, [https://github.com/Sobsz/toki-pona/blob/master/kita.md some of it detailed here] |
|||
* size: 19.5k as of 2021-08-13 |
|||
* license: cc0 |
|||
* preprocessing: none, though maybe remove <code>toki nasa pi ike mi.txt</code> |
|||
* where: https://github.com/Sobsz/toki-ramble |
|||
== ante toki pona [with english] == |
|||
* type: prose, dialogue |
|||
* quality: varied |
|||
* dialect: varied |
|||
* size: ~650k when raw as of 2021-06-29, far less once the empty parts are filtered out |
|||
* license: cc-by 4.0 |
|||
* where: http://antetokipona.infinityfreeapp.com/csv/ |
|||
includes pepper & carrot |
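Filtering out the empty parts could look something like the sketch below. The csv layout isn't described here, so the column index is purely an assumption: check the actual files and pick the column that holds the toki pona text.

```python
import csv
import io


def nonempty_cells(csv_text, column):
    """Return the non-empty values of one column from CSV text.

    `column` is a zero-based index chosen by the caller; which column
    holds the toki pona text is an assumption to verify per file.
    A header row, if present, is returned like any other row.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    values = []
    for row in reader:
        # Skip short rows and cells that are empty or whitespace-only.
        if column < len(row) and row[column].strip():
            values.append(row[column].strip())
    return values
```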
|||
== kalama sin (transcribed by kon Itan) == |
|||
* type: spoken dialogue |
|||
* quality: as good as spontaneous speech can be |
|||
* dialect: varied, mostly ma ponian |
|||
* size: ~97k as of 2022-01-30 (assuming 50% formatting overhead) |
|||
* license: none that i know of |
|||
* preprocessing: convert from .srt to .txt (remove the first 2 lines after each blank line), remove speaker labels, deal with square brackets |
|||
* where: https://drive.google.com/drive/folders/12H2xY06Wtwh4V6zoPOhDnE_R4aZ5sVJp |
|||
* clean version: https://cdn.discordapp.com/attachments/773267909903908894/1094680735069257798/file {{dead link}} |
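A minimal sketch of the .srt to .txt conversion described above. Two things are guesses: that speaker labels look like <code>name: </code> at the start of a line (you pass in the names yourself, so ordinary colons in the text are left alone), and that "dealing with" square brackets means deleting bracketed annotations outright.

```python
import re

# Assumption: square brackets hold annotations (e.g. "[laughs]") to be dropped.
_BRACKET_RE = re.compile(r"\[[^\]]*\]")


def srt_to_text(srt, speakers=()):
    """Turn SRT subtitle text into plain text, one output line per cue line."""
    speaker_re = None
    if speakers:
        # Assumption: labels are "<name>:" at the start of a cue line.
        names = "|".join(re.escape(name) for name in speakers)
        speaker_re = re.compile(r"^(?:" + names + r"):\s*")

    out = []
    # Cues are separated by blank lines; normalise Windows line endings first.
    for block in re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip()):
        # Drop the first two lines of each cue: the index and the timestamps.
        for line in block.split("\n")[2:]:
            if speaker_re is not None:
                line = speaker_re.sub("", line)
            line = _BRACKET_RE.sub("", line)
            line = " ".join(line.split())  # tidy leftover whitespace
            if line:
                out.append(line)
    return "\n".join(out)
```

Passing the speaker names explicitly avoids eating sentences like <code>mi toki e ni: ...</code>, at the cost of having to list the speakers per episode.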
|||
== [todo] library == |
|||
https://docs.google.com/document/d/1IdMucmhPCzvoUF94Gp25XCwocWOl4PfQ_wfOkiU8cu8/edit?usp=sharing |
|||
== [todo] jan Telakoman’s blog [with English] == |
|||
* type: prose |
|||
* quality: good |
|||
* dialect: |
|||
* size: |
|||
* license: |
|||
* where: https://joelthomastr.github.io/tokipona/README_si; https://github.com/joelthomastr/tokipona |
|||
the english is a literal translation of the toki pona, with all the awkwardness that implies; this may or may not be desired |
|||
== mozilla common voice == |
|||
* type: sentences |
|||
* quality: varied |
|||
* dialect: varied |
|||
* size: 230k as of 2022-06-27 |
|||
* license: cc0 |
|||
* where: https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt |
|||
note that this corpus largely consists of these already listed sources: |
|||
* jan Kita’s toki Ramble |
|||
* ante toki pona (toki soweli) |
|||
* tu kuntu |
|||
* jan Sitata |
|||
audio recordings (of varying quality) available here: https://commonvoice.mozilla.org/en/datasets |
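Given the overlap listed above, combining Common Voice with those sources would count some sentences twice. One way to handle that is exact-match deduplication after light normalisation, sketched below; the function names are made up, and near-duplicates (different punctuation, edited wording) will still slip through.

```python
def normalize(sentence):
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(sentence.split()).casefold()


def drop_overlap(candidate_sentences, known_sentences):
    """Keep candidate sentences that are not already present in known_sentences."""
    seen = {normalize(s) for s in known_sentences}
    kept = []
    for sentence in candidate_sentences:
        key = normalize(sentence)
        if key and key not in seen:
            seen.add(key)  # also removes repeats within the candidate list
            kept.append(sentence)
    return kept
```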
|||
== jan Sitata [with english] == |
|||
* type: book |
|||
* quality: good |
|||
* dialect: pu |
|||
* size: ~60k |
|||
* license: cc0, explicit consent for machine learning by jan Sonja ([https://discord.com/channels/301377942062366741/912286596517220363/991845727095509002 ma pona]) |
|||
* where: https://tokipona.org/sitata/ |
|||
english: https://www.gutenberg.org/files/2500/2500-h/2500-h.htm |
|||
== [todo] davidar’s translations [with english] == |
|||
https://github.com/davidar/toki-ante |
|||
== storyweaver [with english] == |
|||
* type: translations of children's books |
|||
* quality: varied |
|||
* dialect: varied |
|||
* size: 17 books as of 2022-11-23, rough estimate: ~10k |
|||
* license: cc-by 4.0 |
|||
* preprocessing: good luck |
|||
* where: https://storyweaver.org.in/stories?language=Toki%20Pona |
|||
{{Media}} |
[[Category:Literature| ]] |
Revision as of 20:21, 4 April 2024
The following is a list of Toki Pona corpora (or sources that can be readily used as such), useful for linguistic analysis or learning (machine or otherwise).
Name | Size (MB) | Era | Authors | Fluency | Type | Parallel? | License | Notes |
---|---|---|---|---|---|---|---|---|
davidar's nltk-tp collection | <4.68[a] | ?-2017 | Many | Varies | Varies | some English | None | Contains: jan Kipo's corpus (3.3 MB), Matthew Dean Martin's corpus (0.1 MB) |
Tatoeba | 2.35 | 2010- | Many | Varies | Translated sentences | English, German, many others | CC BY 2.0 FR | Data quality is marked in "Users' sentence reviews". |
lipu tenpo | ~0.7 | 2021- | Some | High | Articles, poems | No | CC BY-SA 4.0 | Text can be extracted using the command-line utility pdftotext, though with imperfect formatting. Some articles are available in webpage form here. |
lipu kule | ? | 2021- | Some | High | Articles | No | CC BY-SA 4.0 | |
ante toki pona | ? | 2021- | Few | High | Translated fiction | English | CC BY 4.0 | |
Mozilla Common Voice | 0.24 | 2021- | Few | Varies | Sentences, some translated | No | CC0 1.0 | Contains: tu kuntu, jan Sitata, jan Kita's toki Ramble, some of toki soweli (see ante toki pona). |
Transcripts of kalama sin | ? | 2021- | Some | Varies | Spontaneous and scripted speech | No | CC BY-SA 4.0 | |
[a] This includes non-Toki Pona data.