Corpora

{{Extra license|[https://creativecommons.org/publicdomain/zero/1.0/ CC0 1.0]|it mostly consists of uncopyrightable data}} <!-- feel free to remove, i don't mind -->
 
{{Hatnote|This page was previously located at [https://pad.snopyta.org/lDb2EfOZQpmleu-ZktbDzg pad.snopyta.org].}}
 
== davidar’s metacorpus ==
 
* type: varied
* quality: varied (i’ve heard jan Kipo tampered with his corpus to conform to his idea of toki pona grammar)
* dialect: mostly old/pu
* size: 4675k ''but'' a bunch of it is in english or duplicated, gotta count it properly eventually
* license: mostly none; some parts are under various cc licenses (247k, mostly by jan Kipo)
* preprocessing: probably a lot
* where: https://github.com/davidar/nltk-tp/tree/master/Corpus
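a rough sketch of how the size could be counted more carefully, assuming the repository is checked out locally and the corpus sits in a directory called <code>Corpus</code> (the path and the function name <code>corpus_size</code> are just placeholders for illustration). it only skips files that are byte-for-byte identical, so english text and near-duplicates would still have to be filtered out some other way:

```python
import hashlib
from pathlib import Path


def corpus_size(root):
    """Return (total_bytes, unique_bytes) for all files under root.

    unique_bytes counts each distinct file content once, so exact
    duplicate files do not inflate the number.
    """
    seen = set()
    total = 0
    unique = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        total += len(data)
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique += len(data)
    return total, unique


if __name__ == "__main__":
    # "Corpus" is an assumed local checkout path, not something the page specifies
    total, unique = corpus_size("Corpus")
    print(f"total: {total // 1000}k, without exact duplicates: {unique // 1000}k")
```

hashing whole files keeps the sketch short; deduplicating per line or per sentence would catch more overlap but needs decisions about normalisation that this page doesn't make.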
 
note that this corpus contains the entirety of:
 
* the [https://github.com/matthewdeanmartin/tokipona.parser/tree/master/TokiPonaTools/TokiPona/corpus/forums TokiPonaTools] corpus (104k)
* the jan Kipo corpus [with english] (3281k)
* the Little Prince translation [with english if you wanna look for it] (54k)