From e3082e64b3ab8a8aa0777d63be69eb8b6d50a654 Mon Sep 17 00:00:00 2001
From: Sam Atman
Date: Tue, 8 Jul 2025 12:12:20 -0400
Subject: Add Words.zig example to README

---
 NEWS.md | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

(limited to 'NEWS.md')

diff --git a/NEWS.md b/NEWS.md
index 8131878..0ccf151 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,5 +1,113 @@
 # News
 
+## zg v0.14.1 Release Notes
+
+In a flurry of activity during and after the `v0.14.0` beta, several
+features were added (including one from a new contributor!), and a bug
+was fixed.
+
+Presenting `zg v0.14.1`.  As should be expected from a patch release,
+there are no breaking changes to the interface, just bug fixes and
+features.
+
+### Grapheme Zalgo Text Bugfix
+
+Until this release, `zg` was using a `u8` to store the length of a
+`Grapheme`.  While 255 bytes is much longer than any "real" grapheme, the
+Unicode grapheme segmentation algorithm allows graphemes of arbitrary
+size to be constructed.  The result is often called [Zalgo text][Zalgo],
+after a notorious and funny Stack Overflow answer making use of this affordance.
+
+Therefore, a crafted string could force an integer overflow, with all that
+comes with it.  The `.len` field of a `Grapheme` is now a `u32`, like the
+`.offset` field.  Due to padding, the `Grapheme` is the same size as it
+was, just making use of the entire 8 bytes.
+
+Actually, both fields are now `uoffset`, for reasons described next.
+
+[Zalgo]: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
+
+### Limits Section Added to README
+
+The README now clearly documents that some data structures and iterators
+in `zg` use a `u32`.  I've also made it possible to configure the library
+to use a `u64` instead, and have included an explanation of why this is
+not the solution to real problems that it might at first seem to be.
+
+My job as maintainer is to provide a useful library to the community, and
+comptime makes it easy and pleasant to tailor types to purpose. So for those
+who see a need for `u64` values in those structures, just pass `-Dfat_offset`
+or its equivalent, and you'll have them.
+
+I believe this to be neither necessary nor sufficient for handling data of
+that size.  But I can't anticipate every requirement, and don't want to
+preclude it as a solution.
+
+### Iterators, Back and Forth
+
+A new contributor, Nemoos, took on the challenge of adding a reverse
+iterator to `Graphemes`.  Thanks Nemoos!
+
+I've taken the opportunity to fill in a few bits of functionality to
+flesh these out.  `code_point` now has a reverse iterator as well, and
+either a forward or backward iterator can be reversed in-place.
+
+Reversing an iterator will always return the last non-`null` result
+of calling that iterator.  This is the only sane behavior, but
+might be a bit unexpected without prior warning.
+
+There's also `codePointAtIndex` and `graphemeAtIndex`.  These can be
+given any index which falls within the codepoint or grapheme which
+is returned.  These always return a value, and therefore cannot be
+called on an empty string.
+
+Finally, `Graphemes.iterateAfterGrapheme(string, grapheme)` will
+return a forward iterator which will yield the grapheme after
+`grapheme` when first called.  `iterateBeforeGrapheme` has the
+signature and result one might expect from this.
+
+`code_point` doesn't have an equivalent of those, since it isn't
+useful: codepoints are one to four bytes in length, while obtaining
+a grapheme reliably, given only an index, involves some pretty tricky
+business to get right.  The `Graphemes` API just described allows
+code to obtain a Grapheme cursor and then begin iterating in either
+direction, by calling `graphemeAtIndex` and providing it to either
+of those functions.  For codepoints, starting an iterator at either
+`.offset` or `.offset + .len` will suffice, since the `CodePoint`
+iterator is otherwise stateless.
+
+### Words Module
+
+The [Unicode annex][tr29] with the canonical grapheme segmentation
+algorithm also includes algorithms for word and sentence segmentation.
+`v0.14.1` includes an implementation of the word algorithm.
+
+It works like `Graphemes`.  There are forward and reverse iteration,
+`wordAtIndex`, and `iterate(Before|After)Word`.
+
+If anyone is looking for a challenge, there are open issues for sentence
+segmentation and [line breaking][tr14].
+
+[tr29]: https://www.unicode.org/reports/tr29/
+[tr14]: https://www.unicode.org/reports/tr14/
+
+#### Runeset Used
+
+As a point of interest:
+
+Most of the rules in the word breaking algorithm come from a distinct
+property table, `WordBreakProperties.txt` from the [UCD][UCD].  These
+are made into a data structure familiar from the other modules.
+
+One rule, WB3c, uses the Extended Pictographic property.  This is also
+used in `Graphemes`, but to avoid a dependency on that library, I used
+a [Runeset][Rune].  This is included statically, with only as much
+code as needed to recognize the sequences; `zg` itself remains free of
+transitive dependencies.
+
+[UCD]: https://www.unicode.org/reports/tr44/
+[Rune]: https://github.com/mnemnion/runeset
+
 ## zg v0.14.0 Release Notes
 
 This is the first minor point release since Sam Atman (me) took over
-- 
cgit v1.2.3