Corpora: Difference between revisions

From sona pona, the Toki Pona wiki
(link to wikisource for kalama sin)
Tag: 2017 source edit
(oops wrong wikisource)
Tag: 2017 source edit
Line 73:
  |Contains: {{tp|[[tu kuntu]]}}, {{tp|[[jan Sitata]]}}, {{tok|jan Kita}}'s [https://github.com/Sobsz/toki-ramble {{tok|toki}} Ramble], some of {{tp|toki soweli}} (see {{tp|ante toki pona}}).
  |-
- |[[wikisource:kalama sin|Transcripts]] of {{tp|[[kalama sin]]}}
+ |[[oldwikisource:kalama sin|Transcripts]] of {{tp|[[kalama sin]]}}
  |?
  |2021-

Revision as of 21:22, 4 April 2024

Under construction: this article needs work (complete metadata). If you know about this topic, you can help us by editing it.

The following is a list of Toki Pona corpora (or sources that can be readily used as such), useful for linguistic analysis or learning (machine or otherwise).

{| class="wikitable"
! Name !! Size (MB) !! Era !! Authors !! Fluency !! Type !! Parallel? !! License !! Notes
|-
| davidar's nltk-tp collection || <4.68[a] || ?-2017 || Many || Varies || Varies || some English || None || Contains: jan Kipo's corpus (3.3 MB), Matthew Dean Martin's corpus (0.1 MB)
|-
| Tatoeba || 2.35 || 2010- || Many || Varies || Translated sentences || English, German, many others || CC BY 2.0 FR || Data quality is marked in "Users' sentence reviews".
|-
| lipu tenpo || ~0.7 || 2021- || Some || High || Articles, poems || No || CC BY-SA 4.0 || Text can be extracted using the command-line utility pdftotext, though with imperfect formatting. Some articles are also available in webpage form.
|-
| lipu kule || ? || 2021- || Some || High || Articles || No || CC BY-SA 4.0 ||
|-
| ante toki pona || ? || 2021- || Few || High || Translated fiction || English || CC BY 4.0 ||
|-
| Mozilla Common Voice || 0.24 || 2021- || Few || Varies || Sentences, some translated || No || CC0 1.0 || Contains: tu kuntu, jan Sitata, jan Kita's toki Ramble, some of toki soweli (see ante toki pona).
|-
| Transcripts of kalama sin || ? || 2021- || Some || Varies || Spontaneous and scripted speech || No || CC BY-SA 4.0 ||
|}

a. This includes non-Toki Pona data.
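
The lipu tenpo entry notes that text can be pulled out of the PDFs with pdftotext, with imperfect formatting. The following is a minimal sketch of how that might look from Python, assuming pdftotext (from poppler-utils) is installed and on the PATH. The function names `pdf_to_text`, `tidy` and `word_counts` are made up for this illustration, and the cleanup rules are only a guess at what "imperfect formatting" may require in practice.

```python
import re
import subprocess
from collections import Counter


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext on one PDF and return the extracted text.

    Assumes the pdftotext binary is available; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")


def tidy(text: str) -> str:
    """Light cleanup: drop page breaks, squeeze spaces, collapse blank lines."""
    text = text.replace("\f", "\n")  # pdftotext separates pages with form feeds
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    kept = []
    for line in lines:
        # keep a blank line only if the previous kept line was not blank
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()


def word_counts(text: str) -> Counter:
    """Count lowercase Latin-letter tokens (names are lowercased too)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))
```

A typical use would be `word_counts(tidy(pdf_to_text("issue.pdf")))`, where `issue.pdf` is a placeholder path for a downloaded issue. Keeping the cleanup separate from the extraction step means `tidy` and `word_counts` can also be applied to the plain-text corpora in the table.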