Corpora: Difference between revisions

From sona pona, the Toki Pona wiki
(link to wikisource for kalama sin)
Tag: 2017 source edit
No edit summary
 
(2 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{Needs work|complete metadata}}
{{Needs work|Complete metadata}}
The following is a list of Toki Pona '''corpora''' (or sources that can be readily used as such), useful for linguistic analysis or learning (machine or otherwise).
This is a list of Toki Pona '''corpora''' and other sources that can be readily used as such, useful for linguistic analysis or learning (machine or otherwise).

==List==
{| class="wikitable sortable"
{| class="wikitable sortable"
|+
|+
Line 15: Line 17:
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection]
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection]
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref>
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref>
|?-2017
|?–2017
|Many
|Many
|Varies
|Varies
Line 21: Line 23:
|some English
|some English
|None
|None
|Contains:
|Contains: {{tok|jan Kipo}}'s corpus (3.3 MB), Matthew Dean Martin's corpus (0.1 MB)
* {{tok|jan Kipo}}'s corpus (3.3 MB)
* Matthew Dean Martin's corpus (0.1 MB)
|-
|-
|[[Tatoeba]]
|[[Tatoeba]]
|2.35
|2.35
|2010–
|2010-
|Many
|Many
|Varies
|Varies
Line 31: Line 35:
|English, German, many others
|English, German, many others
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR]
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR]
|Data quality is marked in "Users' sentence reviews".
|Data quality marked in reviews
|-
|-
|{{tp|[[lipu tenpo]]}}
|{{tp|[[lipu tenpo]]}}
|~0.7
|~0.7
|2021–
|2021-
|Some
|Some
|High
|High
Line 41: Line 45:
|No
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ here].
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ on their website].
|-
|-
|{{tp|[[lipu kule]]}}
|{{tp|[[lipu kule]]}}
|{{N/A|?}}
|?
|2021–
|2021-
|Some
|Some
|High
|High
Line 54: Line 58:
|-
|-
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}]
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}]
|{{N/A|?}}
|?
|2021–
|2021-
|Few
|Few
|High
|High
Line 65: Line 69:
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice]
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice]
|0.24
|0.24
|2021–
|2021-
|Few
|Few
|Varies
|Varies
|Sentences, some translated
|Sentences
|No
|No
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]
|Contains:
|Contains: {{tp|[[tu kuntu]]}}, {{tp|[[jan Sitata]]}}, {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble], some of {{tp|toki soweli}} (see {{tp|ante toki pona}}).
* {{tp|[[tu kuntu]]}}
* {{tp|[[jan Sitata]]}}
* {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble]
* Parts of {{tp|toki soweli}} (see also {{tp|ante toki pona}}).
|-
|-
|[[wikisource:kalama sin|Transcripts]] of {{tp|[[kalama sin]]}}
|{{tp|[[kalama sin]]}} ([[oldwikisource:kalama sin|transcripts]])
|{{N/A|?}}
|?
|2021–
|2021-
|Some
|Some
|Varies
|Varies
Line 83: Line 91:
|
|
|}
|}

==Notes==
<references group="lower-alpha" />
<references group="lower-alpha" />
{{Media}}
{{Media}}

Latest revision as of 19:07, 18 May 2024

This article needs work: Complete metadata

This is a list of Toki Pona corpora and other sources that can be readily used as such, useful for linguistic analysis or learning (machine or otherwise).

List

{| class="wikitable sortable"
! Name !! Size (MB) !! Era !! Authors !! Fluency !! Type !! Parallel? !! License !! Notes
|-
| davidar's nltk-tp collection || <4.68[a] || ?–2017 || Many || Varies || Varies || some English || None || Contains: jan Kipo's corpus (3.3 MB); Matthew Dean Martin's corpus (0.1 MB)
|-
| Tatoeba || 2.35 || 2010– || Many || Varies || Translated sentences || English, German, many others || CC BY 2.0 FR || Data quality marked in reviews
|-
| lipu tenpo || ~0.7 || 2021– || Some || High || Articles, poems || No || CC BY-SA 4.0 || Text can be extracted using the command-line utility pdftotext, though with imperfect formatting. Some articles are available in webpage form on their website.
|-
| lipu kule || ? || 2021– || Some || High || Articles || No || CC BY-SA 4.0 ||
|-
| ante toki pona || ? || 2021– || Few || High || Translated fiction || English || CC BY 4.0 ||
|-
| Mozilla Common Voice || 0.24 || 2021– || Few || Varies || Sentences || No || CC0 1.0 || Contains: tu kuntu; jan Sitata; jan Kita's toki Ramble; parts of toki soweli (see also ante toki pona)
|-
| kalama sin (transcripts) || ? || 2021– || Some || Varies || Spontaneous and scripted speech || No || CC BY-SA 4.0 ||
|}
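
The pdftotext note in the lipu tenpo row can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: pdftotext must already be installed and on the PATH, the issues are assumed to have been downloaded as PDF files into a hypothetical directory named lipu_tenpo_pdfs, and the function names are made up for illustration. The whitespace cleanup is one possible way to soften the imperfect formatting, not a method described by the source.

```python
import re
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path: str) -> list[str]:
    """Build a pdftotext invocation that writes UTF-8 text to stdout."""
    # "-enc UTF-8" selects the output encoding, "-nopgbrk" omits the
    # form-feed page separators, and the trailing "-" sends text to stdout.
    return ["pdftotext", "-enc", "UTF-8", "-nopgbrk", pdf_path, "-"]


def tidy_extracted_text(raw: str) -> str:
    """Normalize whitespace left over from PDF extraction."""
    lines = []
    # Treat any remaining form feeds as line breaks.
    for line in raw.replace("\f", "\n").splitlines():
        # Collapse runs of spaces and tabs, then trim the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # drop lines that are empty after trimming
            lines.append(line)
    return "\n".join(lines)


def extract_issue(pdf_path: str) -> str:
    """Run pdftotext on one PDF and return the tidied text."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return tidy_extracted_text(result.stdout.decode("utf-8", errors="replace"))


if __name__ == "__main__":
    # Hypothetical directory holding downloaded issues (an assumption).
    for pdf in sorted(Path("lipu_tenpo_pdfs").glob("*.pdf")):
        text = extract_issue(str(pdf))
        pdf.with_suffix(".txt").write_text(text, encoding="utf-8")
```

Running the script writes one .txt file next to each PDF. Headings, columns, and captions may still come out interleaved, so the result is better suited to word-level analysis than to anything that depends on article structure.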

Notes

  a. This includes non-Toki Pona data.