From e3082e64b3ab8a8aa0777d63be69eb8b6d50a654 Mon Sep 17 00:00:00 2001
From: Sam Atman
Date: Tue, 8 Jul 2025 12:12:20 -0400
Subject: Add Words.zig example to README

---
 NEWS.md | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

(limited to 'NEWS.md')

diff --git a/NEWS.md b/NEWS.md
index 8131878..0ccf151 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,5 +1,113 @@
 # News
 
+## zg v0.14.1 Release Notes
+
+In a flurry of activity during and after the `v0.14.0` beta, several
+features were added (including from a new contributor!), and a bug
+fixed.
+
+Presenting `zg v0.14.1`. As should be expected from a patch release,
+there are no breaking changes to the interface, just bug fixes and
+features.
+
+### Grapheme Zalgo Text Bugfix
+
+Until this release, `zg` was using a `u8` to store the length of a
+`Grapheme`. While this is much larger than any "real" grapheme, the
+Unicode grapheme segmentation algorithm allows graphemes of arbitrary
+size to be constructed, often called [Zalgo text][Zalgo] after a
+notorious and funny Stack Overflow answer making use of this affordance.
+
+Therefore, a crafted string could force an integer overflow, with all that
+comes with it. The `.len` field of a `Grapheme` is now a `u32`, like the
+`.offset` field. Due to padding, the `Grapheme` is the same size as it
+was, just making use of the entire 8 bytes.
+
+Actually, both fields are now `uoffset`, for reasons described next.
+
+[Zalgo]: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
+
+### Limits Section Added to README
+
+The README now clearly documents that some data structures and iterators
+in `zg` use a `u32`. I've also made it possible to configure the library
+to use a `u64` instead, and have included an explanation of why this is
+not the solution to actual problems which it at first might seem.
+
+My job as maintainer is to provide a useful library to the community, and
+comptime makes it easy and pleasant to tailor types to purpose. So for those
+who see a need for `u64` values in those structures, just pass `-Dfat_offset`
+or its equivalent, and you'll have them.
+
+I believe this to be neither necessary nor sufficient for handling data of
+that size. But I can't anticipate every requirement, and don't want to
+preclude it as a solution.
+
+### Iterators, Back and Forth
+
+A new contributor, Nemoos, took on the challenge of adding a reverse
+iterator to `Graphemes`. Thanks Nemoos!
+
+I've taken the opportunity to fill in a few bits of functionality to
+flesh these out. `code_point` now has a reverse iterator as well, and
+either a forward or backward iterator can be reversed in-place.
+
+Reversing an iterator will always return the last non-`null` result
+of calling that iterator. This is the only sane behavior, but
+might be a bit unexpected without prior warning.
+
+There's also `codePointAtIndex` and `graphemeAtIndex`. These can be
+given any index which falls within the Grapheme or Codepoint which
+is returned. These always return a value, and therefore cannot be
+called on an empty string.
+
+Finally, `Graphemes.iterateAfterGrapheme(string, grapheme)` will
+return a forward iterator which will yield the grapheme after
+`grapheme` when first called. `iterateBeforeGrapheme` has the
+signature and result one might expect from this.
+
+`code_point` doesn't have an equivalent of those, since it isn't
+useful: codepoints are one to four bytes in length, while obtaining
+a grapheme reliably, given only an index, involves some pretty tricky
+business to get right. The `Graphemes` API just described allows
+code to obtain a Grapheme cursor and then begin iterating in either
+direction, by calling `graphemeAtIndex` and providing it to either
+of those functions. For codepoints, starting an iterator at either
+`.offset` or `.offset + .len` will suffice, since the `CodePoint`
+iterator is otherwise stateless.
+
+### Words Module
+
+The [Unicode annex][tr29] with the canonical grapheme segmentation
+algorithm also includes algorithms for word and sentence segmentation.
+`v0.14.1` includes an implementation of the word algorithm.
+
+It works like `Graphemes`. There's forward and reverse iteration,
+`wordAtIndex`, and `iterate(Before|After)Word`.
+
+If anyone is looking for a challenge, there are open issues for sentence
+segmentation and [line breaking][tr14].
+
+[tr29]: https://www.unicode.org/reports/tr29/
+[tr14]: https://www.unicode.org/reports/tr14/
+
+#### Runeset Used
+
+As a point of interest:
+
+Most of the rules in the word breaking algorithm come from a distinct
+property table, `WordBreakProperties.txt` from the [UCD][UCD]. These
+are made into a data structure familiar from the other modules.
+
+One rule, WB3c, uses the Extended Pictographic property. This is also
+used in `Graphemes`, but to avoid a dependency on that library, I used
+a [Runeset][Rune]. This is included statically, with only just as much
+code as needed to recognize the sequences; `zg` itself remains free of
+transitive dependencies.
+
+[UCD]: https://www.unicode.org/reports/tr44/
+[Rune]: https://github.com/mnemnion/runeset
+
 ## zg v0.14.0 Release Notes
 
 This is the first minor point release since Sam Atman (me) took over
--
cgit v1.2.3