Diffstat (limited to 'NEWS.md')
-rw-r--r-- NEWS.md 114
1 files changed, 111 insertions, 3 deletions
diff --git a/NEWS.md b/NEWS.md
index a432c2f..0ccf151 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,5 +1,113 @@
# News

## zg v0.14.1 Release Notes

In a flurry of activity during and after the `v0.14.0` beta, several
features were added (including from a new contributor!), and a bug
was fixed.

Presenting `zg v0.14.1`. As should be expected from a patch release,
there are no breaking changes to the interface, just bug fixes and
features.

### Grapheme Zalgo Text Bugfix

Until this release, `zg` was using a `u8` to store the length of a
`Grapheme`. While 255 bytes is much larger than any "real" grapheme, the
Unicode grapheme segmentation algorithm allows graphemes of arbitrary
size to be constructed. Such text is often called [Zalgo text][Zalgo],
after a notorious and funny Stack Overflow answer making use of this
affordance.

Therefore, a crafted string could force an integer overflow, with all that
comes with it. The `.len` field of a `Grapheme` is now a `u32`, like the
`.offset` field. Due to padding, the `Grapheme` is the same size as it
was, just making use of the entire 8 bytes.

Actually, both fields are now `uoffset`, for reasons described next.
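
The size claim can be checked with a minimal sketch. The two structs
below are hypothetical stand-ins, not `zg`'s actual definitions: only
the field names and widths come from the text above, and the sample
grapheme is an assumed example.

```zig
const std = @import("std");

// Hypothetical stand-ins for the layouts before and after the fix.
const OldGrapheme = struct { len: u8, offset: u32 };
const NewGrapheme = struct { len: u32, offset: u32 };

test "widening len does not grow the struct" {
    // The u32 field forces 4-byte alignment, so the old layout was
    // already padded out to 8 bytes.
    try std.testing.expectEqual(@as(usize, 8), @sizeOf(OldGrapheme));
    try std.testing.expectEqual(@as(usize, 8), @sizeOf(NewGrapheme));
}

test "one grapheme can exceed what a u8 can count" {
    // 'a' followed by 200 combining acute accents (U+0301, two bytes
    // each in UTF-8) is a single grapheme of 401 bytes.
    const byte_len: usize = 1 + 200 * 2;
    try std.testing.expect(byte_len > std.math.maxInt(u8));
}
```

Both tests can be run with `zig test` on a file containing the block.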

[Zalgo]: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

### Limits Section Added to README

The README now clearly documents that some data structures and iterators
in `zg` use a `u32`. I've also made it possible to configure the library
to use a `u64` instead, and have included an explanation of why this is
not the solution to real problems that it might at first seem.

My job as maintainer is to provide a useful library to the community, and
comptime makes it easy and pleasant to tailor types to purpose. So for those
who see a need for `u64` values in those structures, just pass `-Dfat_offset`
or its equivalent, and you'll have them.

I believe this to be neither necessary nor sufficient for handling data of
that size. But I can't anticipate every requirement, and don't want to
preclude it as a solution.
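
As a rough illustration of how comptime can pick the offset type, here
is a sketch under stated assumptions. The names `uoffset` and
`fat_offset` come from the text above; how `zg` actually wires the
build flag to the type is not shown here, so the flag is modeled as a
plain constant.

```zig
const std = @import("std");

// Assumption: in a real build this value would be supplied by the build
// system when `-Dfat_offset` is passed. It is hard-coded here to keep
// the sketch self-contained.
const fat_offset = false;

/// Integer type for offsets and lengths, selected at compile time.
const uoffset = if (fat_offset) u64 else u32;

// Hypothetical struct using the selected type for both fields.
const Grapheme = struct { len: uoffset, offset: uoffset };

test "default offsets are 32 bits wide" {
    try std.testing.expectEqual(@as(usize, 4), @sizeOf(uoffset));
    try std.testing.expectEqual(@as(usize, 8), @sizeOf(Grapheme));
}
```

Setting `fat_offset` to `true` in this sketch doubles both sizes.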

### Iterators, Back and Forth

A new contributor, Nemoos, took on the challenge of adding a reverse
iterator to `Graphemes`. Thanks, Nemoos!

I've taken the opportunity to fill in a few bits of functionality to
flesh these out. `code_point` now has a reverse iterator as well, and
either a forward or backward iterator can be reversed in-place.

A reversed iterator will always return, on its first call, the last
non-`null` result of calling the original iterator. This is the only
sane behavior, but might be a bit unexpected without prior warning.
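
One way to see why turning around repeats the last result is to picture
the iterator as a cursor sitting between elements. The `Cursor` type
below is a hypothetical byte-level model of that idea. It is not part
of `zg`, and it uses a single bidirectional cursor rather than an
in-place reversal.

```zig
const std = @import("std");

/// Hypothetical cursor over the bytes of a slice, positioned between
/// elements rather than on one.
const Cursor = struct {
    bytes: []const u8,
    i: usize = 0,

    /// Yield the byte to the right of the cursor and step over it.
    fn next(self: *Cursor) ?u8 {
        if (self.i >= self.bytes.len) return null;
        defer self.i += 1;
        return self.bytes[self.i];
    }

    /// Yield the byte to the left of the cursor and step back over it.
    fn prev(self: *Cursor) ?u8 {
        if (self.i == 0) return null;
        self.i -= 1;
        return self.bytes[self.i];
    }
};

test "changing direction repeats the last result" {
    var c = Cursor{ .bytes = "abc" };
    try std.testing.expectEqual(@as(?u8, 'a'), c.next());
    try std.testing.expectEqual(@as(?u8, 'b'), c.next());
    // The cursor now sits between 'b' and 'c', so walking back yields
    // 'b' again.
    try std.testing.expectEqual(@as(?u8, 'b'), c.prev());
}
```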

There are also `codePointAtIndex` and `graphemeAtIndex`. These can be
given any index which falls within the grapheme or codepoint which
is returned. These always return a value, and therefore cannot be
called on an empty string.

Finally, `Graphemes.iterateAfterGrapheme(string, grapheme)` will
return a forward iterator which will yield the grapheme after
`grapheme` when first called. `iterateBeforeGrapheme` has the
signature and result one might expect from this.

`code_point` doesn't have an equivalent of those, since it wouldn't
be useful: codepoints are one to four bytes in length, so finding one
from an index is simple, while obtaining a grapheme reliably, given
only an index, involves some pretty tricky business to get right. The
`Graphemes` API just described allows code to obtain a grapheme cursor
and then begin iterating in either direction, by calling
`graphemeAtIndex` and providing the result to either of those
functions. For codepoints, starting an iterator at either `.offset` or
`.offset + .len` will suffice, since the `CodePoint` iterator is
otherwise stateless.
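
To illustrate why the codepoint case is the easy one, here is a minimal
sketch of finding the start of the codepoint that contains a given byte
index. The helper `codePointStart` is hypothetical and is not `zg`'s
implementation of `codePointAtIndex`. It assumes the input is valid
UTF-8 and the index is in bounds.

```zig
const std = @import("std");

/// Hypothetical helper: return the index of the first byte of the
/// codepoint containing `index`. Assumes valid UTF-8.
fn codePointStart(bytes: []const u8, index: usize) usize {
    var i = index;
    // Continuation bytes have the form 0b10xxxxxx. In valid UTF-8 the
    // lead byte is at most three bytes back.
    while (i > 0 and (bytes[i] & 0xC0) == 0x80) : (i -= 1) {}
    return i;
}

test "backing up over continuation bytes" {
    // 'a' at 0, the three-byte euro sign at 1..3, 'b' at 4.
    const s = "a\u{20AC}b";
    try std.testing.expectEqual(@as(usize, 1), codePointStart(s, 3));
    try std.testing.expectEqual(@as(usize, 4), codePointStart(s, 4));
}
```

No comparably short loop exists for graphemes, where finding a boundary
can depend on context well before the index.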

### Words Module

The [Unicode annex][tr29] with the canonical grapheme segmentation
algorithm also includes algorithms for word and sentence segmentation.
`v0.14.1` includes an implementation of the word algorithm.

It works like `Graphemes`. There's forward and reverse iteration,
`wordAtIndex`, and `iterate(Before|After)Word`.

If anyone is looking for a challenge, there are open issues for sentence
segmentation and [line breaking][tr14].

[tr29]: https://www.unicode.org/reports/tr29/
[tr14]: https://www.unicode.org/reports/tr14/

#### Runeset Used

As a point of interest:

Most of the rules in the word breaking algorithm come from a distinct
property table, `WordBreakProperty.txt` from the [UCD][UCD]. These
are made into a data structure familiar from the other modules.

One rule, WB3c, uses the Extended Pictographic property. This is also
used in `Graphemes`, but to avoid a dependency on that module, I used
a [Runeset][Rune]. This is included statically, with only as much
code as needed to recognize the sequences; `zg` itself remains free of
transitive dependencies.

[UCD]: https://www.unicode.org/reports/tr44/
[Rune]: https://github.com/mnemnion/runeset

## zg v0.14.0 Release Notes

This is the first minor point release since Sam Atman (me) took over
@@ -52,9 +160,9 @@ UTF-8 into codepoints. Concerningly, this interpreted overlong
sequences, which has been forbidden by Unicode for more than 20 years
due to the security risks involved.

This has been replaced with a DFA decoder based on the work of
[Björn Höhrmann][UTF], which has proven itself fast[^1] and reliable.
This is a breaking change; sequences such as `"\xc0\xaf"` will no longer
produce the code `'/'`, nor will surrogates return their codepoint
value.

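For readers unfamiliar with overlong encodings, the sketch below uses
the Zig standard library's validator, not `zg`'s DFA decoder, to show
which byte sequences the Unicode rule excludes. The surrogate example
bytes are an assumed illustration.

```zig
const std = @import("std");

test "overlong and surrogate encodings are not valid UTF-8" {
    // 0xC0 0xAF spreads the bits of '/' (U+002F) over two bytes, which
    // makes it an overlong encoding.
    try std.testing.expect(!std.unicode.utf8ValidateSlice("\xc0\xaf"));
    // 0xED 0xA0 0x80 would encode the surrogate U+D800.
    try std.testing.expect(!std.unicode.utf8ValidateSlice("\xed\xa0\x80"));
    // The only valid encoding of '/' is the single byte 0x2F.
    try std.testing.expect(std.unicode.utf8ValidateSlice("/"));
}
```

A decoder that accepts the first sequence lets `'/'` slip past any
check that only looks for the byte `0x2F`, which is the security risk
mentioned above.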