path: root/NEWS.md
Diffstat (limited to 'NEWS.md')
-rw-r--r-- NEWS.md 108
1 files changed, 108 insertions, 0 deletions
diff --git a/NEWS.md b/NEWS.md
index 8131878..0ccf151 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,5 +1,113 @@
1# News
2
3## zg v0.14.1 Release Notes
4
5In a flurry of activity during and after the `v0.14.0` beta, several
6features were added (including one from a new contributor!), and a bug
7was fixed.
8
9Presenting `zg v0.14.1`. As should be expected from a patch release,
10there are no breaking changes to the interface, just bug fixes and
11features.
12
13### Grapheme Zalgo Text Bugfix
14
15Until this release, `zg` used a `u8` to store the length of a `Grapheme`.
16While the resulting limit of 255 is far larger than any "real" grapheme,
17the Unicode grapheme segmentation algorithm allows graphemes of arbitrary
18size to be constructed. Such text is often called [Zalgo text][Zalgo], after
19a notorious and funny Stack Overflow answer making use of this affordance.
20
21Therefore, a crafted string could force an integer overflow, with all that
22comes with it. The `.len` field of a `Grapheme` is now a `u32`, like the
23`.offset` field. The struct was already padded to 8 bytes, so `Grapheme`
24stays the same size; `.len` now occupies what used to be padding.
25
26More precisely, both fields are now of type `uoffset`, for reasons described in the next section.
27
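To make the size claim concrete, here is a minimal sketch using two stand-in structs. `OldGrapheme` and `NewGrapheme` are hypothetical names for illustration, not the definitions in `zg`, and the sketch assumes the usual layout in which the `u32` field forces 4-byte alignment. It also shows that a value past 255 simply does not fit in a `u8`.

```zig
const std = @import("std");

// Hypothetical stand-ins for illustration; not the definitions used in zg.
const OldGrapheme = struct { len: u8, offset: u32 };
const NewGrapheme = struct { len: u32, offset: u32 };

test "widening len costs no extra space" {
    // Both layouts occupy 8 bytes: the u8 version is padded out to the
    // alignment required by its u32 field.
    try std.testing.expectEqual(@as(usize, 8), @sizeOf(OldGrapheme));
    try std.testing.expectEqual(@as(usize, 8), @sizeOf(NewGrapheme));

    // std.math.cast returns null instead of wrapping when a value does not
    // fit, so a length of 300 cannot be represented in a u8.
    try std.testing.expect(std.math.cast(u8, @as(usize, 300)) == null);
    try std.testing.expect(std.math.cast(u32, @as(usize, 300)) != null);
}
```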
28[Zalgo]: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
29
30### Limits Section Added to README
31
32The README now clearly documents that some data structures and iterators
33in `zg` use a `u32`. I've also made it possible to configure the library
34to use a `u64` instead, and have included an explanation of why this does
35not solve real problems in the way it might at first seem to.
36
37My job as maintainer is to provide a useful library to the community, and
38comptime makes it easy and pleasant to tailor types to purpose. So for those
39who see a need for `u64` values in those structures, just pass `-Dfat_offset`
40or its equivalent, and you'll have them.
41
42I believe this to be neither necessary nor sufficient for handling data of
43that size. But I can't anticipate every requirement, and don't want to
44preclude it as a solution.
45
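As a rough illustration of how comptime can tailor the offset type, the sketch below selects `u32` or `u64` from a boolean. `Offset` and `GraphemeOf` are hypothetical names used only here. The notes confirm that `-Dfat_offset` exists and that the fields have type `uoffset`, but not how the option is wired up, so treat this as one possible shape rather than the library's implementation.

```zig
const std = @import("std");

// Hypothetical helper: choose the offset width at compile time.
fn Offset(comptime fat: bool) type {
    return if (fat) u64 else u32;
}

// Hypothetical grapheme-like struct parameterized on that choice.
fn GraphemeOf(comptime fat: bool) type {
    return struct {
        len: Offset(fat),
        offset: Offset(fat),
    };
}

test "offset width follows the flag" {
    try std.testing.expect(Offset(false) == u32);
    try std.testing.expect(Offset(true) == u64);
    // Two u32 fields take 8 bytes; two u64 fields take 16.
    try std.testing.expectEqual(@as(usize, 8), @sizeOf(GraphemeOf(false)));
    try std.testing.expectEqual(@as(usize, 16), @sizeOf(GraphemeOf(true)));
}
```

The wider variant doubles the size of every such struct, which is part of the cost to weigh before opting in.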
46### Iterators, Back and Forth
47
48A new contributor, Nemoos, took on the challenge of adding a reverse
49iterator to `Graphemes`. Thanks Nemoos!
50
51I've taken the opportunity to fill in a few bits of functionality to
52flesh these out. `code_point` now has a reverse iterator as well, and
53either a forward or backward iterator can be reversed in-place.
54
55Reversing an iterator will always return the last non-`null` result
56of calling that iterator. This is the only sane behavior, but
57might be a bit unexpected without prior warning.
58
59There are also `codePointAtIndex` and `graphemeAtIndex`. Each accepts
60any index that falls within the code point or grapheme it returns.
61Both always return a value, and therefore must not be called on an
62empty string.
63
64Finally, `Graphemes.iterateAfterGrapheme(string, grapheme)` will
65return a forward iterator which will yield the grapheme after
66`grapheme` when first called. `iterateBeforeGrapheme` has the
67signature and result one might expect from this.
68
69`code_point` doesn't have an equivalent of those, since it isn't
70useful: codepoints are one to four bytes in length, while obtaining
71a grapheme reliably, given only an index, involves some pretty tricky
72business to get right. The `Graphemes` API just described allows
73code to obtain a Grapheme cursor and then begin iterating in either
74direction, by calling `graphemeAtIndex` and providing it to either
75of those functions. For codepoints, starting an iterator at either
76`.offset` or `.offset + .len` will suffice, since the `CodePoint`
77iterator is otherwise stateless.
78
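To show why the code point case is simple, here is a minimal sketch that finds the start of the code point containing a given byte index by backing up over UTF-8 continuation bytes. `codePointStart` is a hypothetical helper written for this example, not part of `zg`, and it assumes valid UTF-8 and an in-bounds index.

```zig
const std = @import("std");

// Hypothetical helper, not part of zg: return the index of the first byte
// of the code point containing `index`.
// Assumes `bytes` is valid UTF-8 and `index < bytes.len`.
fn codePointStart(bytes: []const u8, index: usize) usize {
    var i = index;
    // Continuation bytes have the bit pattern 10xxxxxx. A code point has at
    // most three of them, so this loop steps back at most three times.
    while (i > 0 and (bytes[i] & 0xC0) == 0x80) i -= 1;
    return i;
}

test "index inside a multi-byte code point" {
    const s = "a\u{20AC}b"; // 'a', the 3-byte euro sign, then 'b'
    try std.testing.expectEqual(@as(usize, 1), codePointStart(s, 1));
    try std.testing.expectEqual(@as(usize, 1), codePointStart(s, 2));
    try std.testing.expectEqual(@as(usize, 1), codePointStart(s, 3));
    try std.testing.expectEqual(@as(usize, 4), codePointStart(s, 4));

    // The lead byte alone gives the length of the sequence.
    const len = try std.unicode.utf8ByteSequenceLength(s[1]);
    try std.testing.expectEqual(@as(u3, 3), len);
}
```

No comparable shortcut exists for graphemes, because a grapheme boundary depends on the properties of neighboring code points rather than on a fixed byte pattern.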
79### Words Module
80
81The [Unicode annex][tr29] with the canonical grapheme segmentation
82algorithm also includes algorithms for word and sentence segmentation.
83`v0.14.1` includes an implementation of the word algorithm.
84
85It works like `Graphemes`: there are forward and reverse iterators,
86`wordAtIndex`, and `iterate(Before|After)Word`.
87
88If anyone is looking for a challenge, there are open issues for sentence
89segmentation and [line breaking][tr14].
90
91[tr29]: https://www.unicode.org/reports/tr29/
92[tr14]: https://www.unicode.org/reports/tr14/
93
94#### Runeset Used
95
96As a point of interest:
97
98Most of the rules in the word breaking algorithm come from a distinct
99property table, `WordBreakProperties.txt` from the [UCD][UCD]. These
100are made into a data structure familiar from the other modules.
101
102One rule, WB3c, uses the Extended Pictographic property. This is also
103used in `Graphemes`, but to avoid a dependency on that library, I used
104a [Runeset][Rune]. This is included statically, with only as much
105code as needed to recognize the sequences; `zg` itself remains free of
106transitive dependencies.
107
108[UCD]: https://www.unicode.org/reports/tr44/
109[Rune]: https://github.com/mnemnion/runeset
110
111## zg v0.14.0 Release Notes
112
113This is the first minor point release since Sam Atman (me) took over