Text compression

From sona pona, the Toki Pona wiki

Revision as of 00:09, 11 January 2024

Toki Pona's small size has attracted interest in text compression techniques, and some writing systems have been created expressly for this purpose.

In March 2010, jan Mato, looking to fit Toki Pona text into fewer characters on Twitter, collated several potential lossy and lossless compression schemes. Of the options presented, Toki Pona Script was noted as having the best compression ratio,[1] and it is lossless. Owing to poor Unicode support for Toki Pona Script at the time, jan Josan and jan Mato created a sitelen Kansi character set in July of that year.[2] Later equivalents to Toki Pona Script include the Sitelen Pona UCSUR block and the sitelen Emosi writing system, both of which also use only one Unicode character per word.
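As a rough illustration of why a one-character-per-word script does well when a limit is counted in characters, the sketch below maps each word of a sentence to a single code point and compares lengths. The word-to-code-point table is hypothetical: it hands out arbitrary Private Use Area code points starting at U+E000, not the actual Toki Pona Script, UCSUR, or sitelen Emosi assignments, and it ignores punctuation and proper names.

```python
# Hypothetical base code point (Private Use Area); NOT a real script assignment.
PUA_BASE = 0xE000


def build_table(vocabulary):
    """Assign one private-use code point to each distinct word (illustrative only)."""
    return {word: chr(PUA_BASE + i) for i, word in enumerate(sorted(set(vocabulary)))}


def compress(text, table):
    """Replace every word with its single character; no separators are needed."""
    return "".join(table[word] for word in text.split())


def decompress(packed, table):
    """Invert the mapping, restoring single spaces between words."""
    reverse = {char: word for word, char in table.items()}
    return " ".join(reverse[char] for char in packed)


def compression_ratio(text, packed):
    """Uncompressed length divided by compressed length, both in characters."""
    return len(text) / len(packed)


sentence = "mi olin e toki pona"
table = build_table(sentence.split())
packed = compress(sentence, table)

print(len(sentence), len(packed))           # 19 5
print(compression_ratio(sentence, packed))  # 3.8
print(decompress(packed, table) == sentence)  # True: the round trip is lossless
```

Because every word becomes exactly one character, the spaces between words can be dropped without losing information, which is where much of the saving on short words comes from.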

jan Misali's ASCII syllabary allows each syllable to be reduced to 7 bits. Most punctuation is lost upon conversion into this system, and there is no recommendation for how to mark proper names. A major limiting factor for the compression ratio is the need to separate words, which is generally done with the ASCII space.
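The effect of the word separators can be estimated without knowing the syllabary's actual character assignments. The sketch below assumes that each syllable and each separating space costs one 7-bit ASCII character, and it counts syllables by counting vowels, since every Toki Pona syllable contains exactly one vowel. The function names are illustrative and the figures are an estimate, not a measurement of the real system.

```python
VOWELS = set("aeiou")
BITS_PER_CHAR = 7  # assumed cost of one ASCII character (a syllable or a space)


def syllable_count(word):
    """Each Toki Pona syllable has exactly one vowel, so count the vowels."""
    return sum(1 for ch in word.lower() if ch in VOWELS)


def syllabary_bits(text):
    """Estimated size in bits: one character per syllable plus one per space."""
    words = text.split()
    syllables = sum(syllable_count(word) for word in words)
    separators = max(len(words) - 1, 0)
    return BITS_PER_CHAR * (syllables + separators)


def plain_ascii_bits(text):
    """Size in bits of the same text spelled out letter by letter in 7-bit ASCII."""
    return BITS_PER_CHAR * len(" ".join(text.split()))


sentence = "mi olin e toki pona"
print(syllabary_bits(sentence))    # 84  (8 syllables + 4 spaces)
print(plain_ascii_bits(sentence))  # 133 (19 characters)
```

In this example a third of the encoded characters are spaces, which shows how much of the output is spent on word separation rather than on the syllables themselves.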

References

  1. janMato (original poster), zeme. (19 March 2010). "Best compression for toki lili?". Toki Pona Forums. http://forums.tokipona.org/viewtopic.php?t=1389. Retrieved 10 January 2024.
  2. janMato (original poster), janKipo, jan Josan, et al. (18 July 2010). "toki pona in chinese/kanji?". Toki Pona Forums. http://forums.tokipona.org/viewtopic.php?t=1519. Retrieved 10 January 2024.