Corpora: Difference between revisions
(oops wrong wikisource) Tag: 2017 source edit |
No edit summary |
||
Line 1: | Line 1: | ||
{{Needs work| |
{{Needs work|Complete metadata}} |
||
This is a list of Toki Pona '''corpora''' and other sources that can be readily used as such, useful for linguistic analysis or learning (machine or otherwise). |
|||
==List== |
|||
{| class="wikitable sortable" |
{| class="wikitable sortable" |
||
|+ |
|+ |
||
Line 15: | Line 17: | ||
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection] |
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection] |
||
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref> |
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref> |
||
|? |
|?–2017 |
||
|Many |
|Many |
||
|Varies |
|Varies |
||
Line 21: | Line 23: | ||
|some English |
|some English |
||
|None |
|None |
||
|Contains: |
|||
⚫ | |||
* {{tok|jan Kipo}}'s corpus (3.3 MB) |
|||
⚫ | |||
|- |
|- |
||
|[[Tatoeba]] |
|[[Tatoeba]] |
||
|2.35 |
|2.35 |
||
|2010– |
|||
|2010- |
|||
|Many |
|Many |
||
|Varies |
|Varies |
||
Line 31: | Line 35: | ||
|English, German, many others |
|English, German, many others |
||
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR] |
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR] |
||
|Data quality |
|Data quality marked in reviews |
||
|- |
|- |
||
|{{tp|[[lipu tenpo]]}} |
|{{tp|[[lipu tenpo]]}} |
||
|~0.7 |
|~0.7 |
||
|2021– |
|||
|2021- |
|||
|Some |
|Some |
||
|High |
|High |
||
Line 41: | Line 45: | ||
|No |
|No |
||
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0] |
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0] |
||
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ |
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ on their website]. |
||
|- |
|- |
||
|{{tp|[[lipu kule]]}} |
|{{tp|[[lipu kule]]}} |
||
|? |
|? |
||
|2021– |
|||
|2021- |
|||
|Some |
|Some |
||
|High |
|High |
||
Line 55: | Line 59: | ||
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}] |
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}] |
||
|? |
|? |
||
|2021– |
|||
|2021- |
|||
|Few |
|Few |
||
|High |
|High |
||
Line 65: | Line 69: | ||
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice] |
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice] |
||
|0.24 |
|0.24 |
||
|2021– |
|||
|2021- |
|||
|Few |
|Few |
||
|Varies |
|Varies |
||
|Sentences |
|Sentences |
||
|No |
|No |
||
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0] |
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0] |
||
|Contains: {{tp|[[tu kuntu]]}}, {{tp|[[jan Sitata]]}}, {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble], some of {{tp|toki soweli}} (see {{tp|ante toki pona}}). |
|Contains: {{tp|[[tu kuntu]]}}, {{tp|[[jan Sitata]]}}, {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble], some of {{tp|toki soweli}} (see {{tp|ante toki pona}}). |
||
|- |
|- |
||
|[[ |
|{{tp|[[kalama sin]]}} ([[oldwikisource:kalama sin|transcripts]]) |
||
|? |
|? |
||
|2021– |
|||
|2021- |
|||
|Some |
|Some |
||
|Varies |
|Varies |
||
Line 83: | Line 87: | ||
| |
| |
||
|} |
|} |
||
==Notes== |
|||
<references group="lower-alpha" /> |
<references group="lower-alpha" /> |
||
{{Media}} |
{{Media}} |
Revision as of 15:48, 18 May 2024
This is a list of Toki Pona corpora and other sources that can be readily used as such, useful for linguistic analysis or learning (machine or otherwise).
List
Name | Size (MB) | Era | Authors | Fluency | Type | Parallel? | License | Notes |
---|---|---|---|---|---|---|---|---|
davidar's nltk-tp collection | <4.68[a] | ?–2017 | Many | Varies | Varies | some English | None | Contains: |
Tatoeba | 2.35 | 2010– | Many | Varies | Translated sentences | English, German, many others | CC BY 2.0 FR | Data quality marked in reviews |
lipu tenpo | ~0.7 | 2021– | Some | High | Articles, poems | No | CC BY-SA 4.0 | Text can be extracted using the command-line utility pdftotext, though with imperfect formatting. Some articles are available in webpage form on their website. |
lipu kule | ? | 2021– | Some | High | Articles | No | CC BY-SA 4.0 | |
ante toki pona | ? | 2021– | Few | High | Translated fiction | English | CC BY 4.0 | |
Mozilla Common Voice | 0.24 | 2021– | Few | Varies | Sentences | No | CC0 1.0 | Contains: tu kuntu, jan Sitata, jan Kita's toki Ramble, some of toki soweli (see ante toki pona). |
kalama sin (transcripts) | ? | 2021– | Some | Varies | Spontaneous and scripted speech | No | CC BY-SA 4.0 | |
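The lipu tenpo row notes that pdftotext extracts the text with imperfect formatting. The following is a minimal Python sketch of one way to run it and tidy the result. It assumes pdftotext is installed and on the PATH. The function names and the example file name are hypothetical, and the clean-up rules are rough heuristics rather than a tested recipe for lipu tenpo's layout.

```python
import re
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext call; "-" as the output file sends text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"]


def tidy_extracted_text(raw):
    """Undo a few common extraction artifacts (heuristic, not exhaustive)."""
    text = raw.replace("\f", "\n\n")              # form feeds mark page breaks
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around newlines
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # join hard-wrapped lines
    text = re.sub(r"\n{2,}", "\n\n", text)        # one blank line between paragraphs
    return text.strip()


def extract_issue(pdf_path):
    """Run pdftotext on one PDF and return the tidied text."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return tidy_extracted_text(result.stdout.decode("utf-8", errors="replace"))


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python extract.py lipu-tenpo-issue.pdf
    if len(sys.argv) > 1:
        print(extract_issue(sys.argv[1]))
```

Joining hard-wrapped lines treats every single newline as a soft wrap, which would also flatten poems. For those, skipping that step (or passing pdftotext's `-layout` option) may preserve line breaks better.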
Notes
- [a] This includes non-Toki Pona data.