Corpora: Difference between revisions

From sona pona, the Toki Pona wiki
(oops wrong wikisource)
Tag: 2017 source edit
No edit summary
Line 1: Line 1:
{{Needs work|complete metadata}}
{{Needs work|Complete metadata}}
The following is a list of Toki Pona '''corpora''' (or sources that can be readily used as such), useful for linguistic analysis or learning (machine or otherwise).
This is a list of Toki Pona '''corpora''' and other sources that can be readily used as such, useful for linguistic analysis or learning (machine or otherwise).

==List==
{| class="wikitable sortable"
{| class="wikitable sortable"
|+
|+
Line 15: Line 17:
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection]
|[https://github.com/davidar/nltk-tp/tree/master/Corpus davidar's nltk-tp collection]
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref>
|<4.68<ref group="lower-alpha">This includes non-Toki Pona data.</ref>
|?-2017
|?–2017
|Many
|Many
|Varies
|Varies
Line 21: Line 23:
|some English
|some English
|None
|None
|Contains:
|Contains: {{tok|jan Kipo}}'s corpus (3.3 MB), Matthew Dean Martin's corpus (0.1 MB)
* {{tok|jan Kipo}}'s corpus (3.3 MB)
* Matthew Dean Martin's corpus (0.1 MB)
|-
|-
|[[Tatoeba]]
|[[Tatoeba]]
|2.35
|2.35
|2010–
|2010-
|Many
|Many
|Varies
|Varies
Line 31: Line 35:
|English, German, many others
|English, German, many others
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR]
|[https://creativecommons.org/licenses/by/2.0/fr/ CC BY 2.0 FR]
|Data quality is marked in "Users' sentence reviews".
|Data quality marked in reviews
|-
|-
|{{tp|[[lipu tenpo]]}}
|{{tp|[[lipu tenpo]]}}
|~0.7
|~0.7
|2021–
|2021-
|Some
|Some
|High
|High
Line 41: Line 45:
|No
|No
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|[https://creativecommons.org/licenses/by-sa/4.0/ CC BY-SA 4.0]
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ here].
|Text can be extracted using the command-line utility [https://www.xpdfreader.com/pdftotext-man.html pdftotext], though with imperfect formatting. Some articles are available in webpage form [https://liputenpo.org/toki/ on their website].
|-
|-
|{{tp|[[lipu kule]]}}
|{{tp|[[lipu kule]]}}
|?
|?
|2021–
|2021-
|Some
|Some
|High
|High
Line 55: Line 59:
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}]
|[http://antetokipona.infinityfreeapp.com/csv/ {{tp|ante toki pona}}]
|?
|?
|2021–
|2021-
|Few
|Few
|High
|High
Line 65: Line 69:
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice]
|[https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt Mozilla Common Voice]
|0.24
|0.24
|2021–
|2021-
|Few
|Few
|Varies
|Varies
|Sentences, some translated
|Sentences
|No
|No
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]
|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]
|Contains: {{tp|[[tu kuntu]]}}, {{tp|[[jan Sitata]]}}, {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble], some of {{tp|toki soweli}} (see {{tp|ante toki pona}}).
|Contains: {{tp|[[tu kuntu]]}}, {{tp|[[jan Sitata]]}}, {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble], some of {{tp|toki soweli}} (see {{tp|ante toki pona}}).
|-
|-
|[[oldwikisource:kalama sin|Transcripts]] of {{tp|[[kalama sin]]}}
|{{tp|[[kalama sin]]}} ([[oldwikisource:kalama sin|transcripts]])
|?
|?
|2021–
|2021-
|Some
|Some
|Varies
|Varies
Line 83: Line 87:
|
|
|}
|}

==Notes==
<references group="lower-alpha" />
<references group="lower-alpha" />
{{Media}}
{{Media}}

Revision as of 15:48, 18 May 2024

This article needs work: Complete metadata

This is a list of Toki Pona corpora and other sources that can be readily used as such, useful for linguistic analysis or learning (machine or otherwise).

List

{| class="wikitable sortable"
! Name !! Size (MB) !! Era !! Authors !! Fluency !! Type !! Parallel? !! License !! Notes
|-
| davidar's nltk-tp collection || <4.68[a] || ?–2017 || Many || Varies || Varies || some English || None || Contains: jan Kipo's corpus (3.3 MB), Matthew Dean Martin's corpus (0.1 MB)
|-
| Tatoeba || 2.35 || 2010– || Many || Varies || Translated sentences || English, German, many others || CC BY 2.0 FR || Data quality marked in reviews
|-
| lipu tenpo || ~0.7 || 2021– || Some || High || Articles, poems || No || CC BY-SA 4.0 || Text can be extracted using the command-line utility pdftotext, though with imperfect formatting. Some articles are available in webpage form on their website.
|-
| lipu kule || ? || 2021– || Some || High || Articles || No || CC BY-SA 4.0 ||
|-
| ante toki pona || ? || 2021– || Few || High || Translated fiction || English || CC BY 4.0 ||
|-
| Mozilla Common Voice || 0.24 || 2021– || Few || Varies || Sentences || No || CC0 1.0 || Contains: tu kuntu, jan Sitata, jan Kita's toki Ramble, some of toki soweli (see ante toki pona).
|-
| kalama sin (transcripts) || ? || 2021– || Some || Varies || Spontaneous and scripted speech || No || CC BY-SA 4.0 ||
|}
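
The lipu tenpo entry notes that text can be extracted with pdftotext, with imperfect formatting. A minimal Python sketch of that workflow follows, assuming pdftotext is installed and on the PATH. The function names and the cleanup rules are illustrative assumptions, not part of any existing tool, and the cleanup only handles whitespace, so column and hyphenation problems remain.

```python
import re
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical page layout.
        command.append("-layout")
    command += [str(pdf_path), "-"]
    return command


def tidy_extracted_text(raw_text):
    """Light cleanup (an assumption, not a complete fix for layout issues):
    turn form feeds into line breaks, strip trailing spaces,
    and collapse runs of blank lines into a single blank line."""
    text = raw_text.replace("\f", "\n")
    lines = [line.rstrip() for line in text.splitlines()]
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return text.strip() + "\n"


def extract_pdf_text(pdf_path, keep_layout=False):
    """Run pdftotext on one PDF and return the tidied text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return tidy_extracted_text(result.stdout)


# Example call (the file name is hypothetical):
# text = extract_pdf_text("lipu-tenpo-issue.pdf")
```

Keeping the command construction and the cleanup in separate functions makes each part easy to adjust, for example to pass `keep_layout=True` for issues where the default reading order comes out scrambled.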

Notes

  a. This includes non-Toki Pona data.
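
Because some collections include non-Toki Pona data, a rough pre-filter can help before analysis. The sketch below is a heuristic only: it keeps lines whose words are mostly written with the 14 letters of the Toki Pona alphabet (a e i j k l m n o p s t u w). The function names and the 0.9 threshold are assumptions chosen for illustration. English words that happen to use only these letters will pass, and the check does not validate syllable structure or vocabulary.

```python
import re

# The 14 letters used to write Toki Pona in the Latin alphabet.
TOKI_PONA_LETTERS = set("aeijklmnopstuw")

# Runs of letters (no digits, underscores, or punctuation).
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def toki_pona_letter_ratio(line):
    """Return the share of words in the line spelled only with Toki Pona letters.

    Lines without any words score 0.0.
    """
    words = WORD_PATTERN.findall(line.lower())
    if not words:
        return 0.0
    matching = sum(1 for word in words if set(word) <= TOKI_PONA_LETTERS)
    return matching / len(words)


def filter_likely_toki_pona(lines, threshold=0.9):
    """Keep lines whose ratio meets the threshold (0.9 is an arbitrary choice)."""
    return [line for line in lines if toki_pona_letter_ratio(line) >= threshold]


sample = ["mi moku e kili.", "The quick brown fox.", "jan Kipo li pona."]
kept = filter_likely_toki_pona(sample)
```

Lowercasing before the check lets capitalized proper names such as "Kipo" pass, since they use the same letters.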