Corpora: Difference between revisions
(17 intermediate revisions by 6 users not shown)
{{Needs work|Complete metadata}}
This is a list of Toki Pona '''corpora''' and other sources that can be readily used as such, useful for linguistic analysis or learning (machine or otherwise).
==List==
{| class="wikitable sortable"
|+
!Name
!Size (MB)
!Era
!Authors
!Fluency
!Type
!Parallel?
!License
!Notes
|-
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection]
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref>
|?–2017
|Many
|Varies
|Varies
|Some English
|None
|Contains:
* {{tok|jan Kipo}}'s corpus (3.3 MB)
* Matthew Dean Martin's corpus (0.1 MB)
|-
|[[Tatoeba]]
|2.35
|2010–
|Many
|Varies
|Translated sentences
|English, German, many others
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR]
|Data quality marked in reviews
|-
|{{tp|[[lipu tenpo]]}}
|~0.7
|2021–
|Some
|High
|Articles, poems
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ on their website].
|-
|{{tp|[[lipu kule]]}}
|{{N/A|?}}
|2021–
|Some
|High
|Articles
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|
|-
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}]
|{{N/A|?}}
|2021–
|Few
|High
|Translated fiction
|English
|[https://creativecommons.org/licenses/by/4.0/ CC BY 4.0]
|
|-
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice]
|0.24
|2021–
|Few
|Varies
|Sentences
|No
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]
|Contains:
* {{tp|[[tu kuntu]]}}
* {{tp|[[jan Sitata]]}}
* {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble]
* Parts of {{tp|toki soweli}} (see also {{tp|ante toki pona}}).
|-
|{{tp|[[kalama sin]]}} ([[oldwikisource:kalama sin|transcripts]])
|{{N/A|?}}
|2021–
|Some
|Varies
|Spontaneous and scripted speech
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|
|}
==Notes==
<references group="lower-alpha" />
{{Media}}
[[Category:Literature| ]]
Latest revision as of 19:07, 18 May 2024
This is a list of Toki Pona corpora and other sources that can be readily used as such, useful for linguistic analysis or learning (machine or otherwise).
List
Name | Size (MB) | Era | Authors | Fluency | Type | Parallel? | License | Notes |
---|---|---|---|---|---|---|---|---|
davidar's nltk-tp collection | <4.68[a] | ?–2017 | Many | Varies | Varies | Some English | None | Contains: jan Kipo's corpus (3.3 MB); Matthew Dean Martin's corpus (0.1 MB) |
Tatoeba | 2.35 | 2010– | Many | Varies | Translated sentences | English, German, many others | CC BY 2.0 FR | Data quality marked in reviews |
lipu tenpo | ~0.7 | 2021– | Some | High | Articles, poems | No | CC BY-SA 4.0 | Text can be extracted using the command-line utility pdftotext, though with imperfect formatting. Some articles are available in webpage form on their website. |
lipu kule | ? | 2021– | Some | High | Articles | No | CC BY-SA 4.0 | |
ante toki pona | ? | 2021– | Few | High | Translated fiction | English | CC BY 4.0 | |
Mozilla Common Voice | 0.24 | 2021– | Few | Varies | Sentences | No | CC0 1.0 | Contains: tu kuntu; jan Sitata; jan Kita's toki Ramble; parts of toki soweli (see also ante toki pona) |
kalama sin (transcripts) | ? | 2021– | Some | Varies | Spontaneous and scripted speech | No | CC BY-SA 4.0 | |
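
The lipu tenpo entry notes that text can be pulled out of the PDFs with pdftotext, with imperfect formatting. The following is a minimal sketch of how that step might be scripted, assuming pdftotext is installed and on the PATH. The function names (`pdftotext_command`, `tidy`, `extract`) and the cleanup rules are illustrative assumptions, not part of any listed corpus or its tooling, and the cleanup is a rough heuristic that will not repair every layout problem.

```python
import re
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path: str, txt_path: str) -> list[str]:
    """Build the pdftotext invocation, asking for UTF-8 output."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]


def tidy(text: str) -> str:
    """Light, heuristic cleanup of extracted text (hypothetical rules)."""
    text = text.replace("\f", "\n")  # form feeds mark page breaks
    # Collapse runs of spaces and tabs, and trim each line.
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in text.splitlines()]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def extract(pdf_path: str, txt_path: str) -> str:
    """Run pdftotext on one PDF, then overwrite the output with tidied text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    cleaned = tidy(Path(txt_path).read_text(encoding="utf-8"))
    Path(txt_path).write_text(cleaned, encoding="utf-8")
    return cleaned
```

Calling `extract("issue.pdf", "issue.txt")` on a downloaded issue (the file names here are placeholders) would write a tidied plain-text file next to the PDF. Multi-column pages and decorative headings may still need manual correction.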
Notes
- [a] This includes non-Toki Pona data.