| author | 2025-07-08 12:15:32 -0400 | |
|---|---|---|
| committer | 2025-07-08 12:15:32 -0400 | |
| commit | 9427a9e53aaa29ee071f4dcb35b809a699d75aa9 (patch) | |
| tree | 2607c185fd8053b84d60041fadc35c05a0225d34 /NEWS.md | |
| parent | Merge pull request 'Fix benchmarks' (#56) from jacobsandlund/zg:benchmarks in... (diff) | |
| parent | Add Words.zig example to README (diff) | |
Diffstat (limited to 'NEWS.md')
| -rw-r--r-- | NEWS.md | 114 |
1 files changed, 111 insertions, 3 deletions
| @@ -1,5 +1,113 @@ | |||
| 1 | # News | 1 | # News |
| 2 | 2 | ||
| 3 | ## zg v0.14.1 Release Notes | ||
| 4 | |||
| 5 | In a flurry of activity during and after the `v0.14.0` beta, several | ||
| 6 | features were added (including one from a new contributor!), and a bug | ||
| 7 | was fixed. | ||
| 8 | |||
| 9 | Presenting `zg v0.14.1`. As should be expected from a patch release, | ||
| 10 | there are no breaking changes to the interface, just bug fixes and | ||
| 11 | features. | ||
| 12 | |||
| 13 | ### Grapheme Zalgo Text Bugfix | ||
| 14 | |||
| 15 | Until this release, `zg` was using a `u8` to store the length of a | ||
| 16 | `Grapheme`. While this is much larger than any "real" grapheme, the | ||
| 17 | Unicode grapheme segmentation algorithm allows graphemes of arbitrary | ||
| 18 | size to be constructed, often called [Zalgo text][Zalgo] after a | ||
| 19 | notorious and funny Stack Overflow answer making use of this affordance. | ||
| 20 | |||
| 21 | Therefore, a crafted string could force an integer overflow, with all that | ||
| 22 | comes with it. The `.len` field of a `Grapheme` is now a `u32`, like the | ||
| 23 | `.offset` field. Due to padding, the `Grapheme` is the same size as it | ||
| 24 | was, just making use of the entire 8 bytes. | ||
| 25 | |||
| 26 | Actually, both fields are now `uoffset`, for reasons described next. | ||
| 27 | |||
| 28 | [Zalgo]: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 | ||
| 29 | |||
| 30 | ### Limits Section Added to README | ||
| 31 | |||
| 32 | The README now clearly documents that some data structures and iterators | ||
| 33 | in `zg` use a `u32`. I've also made it possible to configure the library | ||
| 34 | to use a `u64` instead, and have included an explanation of why this is | ||
| 35 | not the solution to real problems that it might at first seem to be. | ||
| 36 | |||
| 37 | My job as maintainer is to provide a useful library to the community, and | ||
| 38 | comptime makes it easy and pleasant to tailor types to purpose. So for those | ||
| 39 | who see a need for `u64` values in those structures, just pass `-Dfat_offset` | ||
| 40 | or its equivalent, and you'll have them. | ||
| 41 | |||
| 42 | I believe this to be neither necessary nor sufficient for handling data of | ||
| 43 | that size. But I can't anticipate every requirement, and don't want to | ||
| 44 | preclude it as a solution. | ||
| 45 | |||
| 46 | ### Iterators, Back and Forth | ||
| 47 | |||
| 48 | A new contributor, Nemoos, took on the challenge of adding a reverse | ||
| 49 | iterator to `Graphemes`. Thanks Nemoos! | ||
| 50 | |||
| 51 | I've taken the opportunity to fill in a few bits of functionality to | ||
| 52 | flesh these out. `code_point` now has a reverse iterator as well, and | ||
| 53 | either a forward or backward iterator can be reversed in-place. | ||
| 54 | |||
| 55 | Once an iterator is reversed, its next call will always return the last | ||
| 56 | non-`null` result it produced before the reversal. This is the only sane | ||
| 57 | behavior, but might be a bit unexpected without prior warning. | ||
| 58 | |||
| 59 | There's also `codePointAtIndex` and `graphemeAtIndex`. These can be | ||
| 60 | given any index that falls within the returned Grapheme or Codepoint. | ||
| 61 | They always return a value, and therefore must not be called on an | ||
| 62 | empty string. | ||
| 63 | |||
| 64 | Finally, `Graphemes.iterateAfterGrapheme(string, grapheme)` will | ||
| 65 | return a forward iterator which will yield the grapheme after | ||
| 66 | `grapheme` when first called. `iterateBeforeGrapheme` has the | ||
| 67 | signature and result one might expect from this. | ||
| 68 | |||
| 69 | `code_point` doesn't have an equivalent of those, since it isn't | ||
| 70 | useful: codepoints are one to four bytes in length, while obtaining | ||
| 71 | a grapheme reliably, given only an index, involves some pretty tricky | ||
| 72 | business to get right. The `Graphemes` API just described allows | ||
| 73 | code to obtain a Grapheme cursor and then begin iterating in either | ||
| 74 | direction, by calling `graphemeAtIndex` and providing it to either | ||
| 75 | of those functions. For codepoints, starting an iterator at either | ||
| 76 | `.offset` or `.offset + .len` will suffice, since the `CodePoint` | ||
| 77 | iterator is otherwise stateless. | ||
| 78 | |||
| 79 | ### Words Module | ||
| 80 | |||
| 81 | The [Unicode annex][tr29] with the canonical grapheme segmentation | ||
| 82 | algorithm also includes algorithms for word and sentence segmentation. | ||
| 83 | `v0.14.1` includes an implementation of the word algorithm. | ||
| 84 | |||
| 85 | It works like `Graphemes`. There's forward and reverse iteration, | ||
| 86 | `wordAtIndex`, and `iterate(Before|After)Word`. | ||
| 87 | |||
| 88 | If anyone is looking for a challenge, there are open issues for sentence | ||
| 89 | segmentation and [line breaking][tr14]. | ||
| 90 | |||
| 91 | [tr29]: https://www.unicode.org/reports/tr29/ | ||
| 92 | [tr14]: https://www.unicode.org/reports/tr14/ | ||
| 93 | |||
| 94 | #### Runeset Used | ||
| 95 | |||
| 96 | As a point of interest: | ||
| 97 | |||
| 98 | Most of the rules in the word breaking algorithm come from a distinct | ||
| 99 | property table, `WordBreakProperties.txt` from the [UCD][UCD]. These | ||
| 100 | are made into a data structure familiar from the other modules. | ||
| 101 | |||
| 102 | One rule, WB3c, uses the Extended Pictographic property. This is also | ||
| 103 | used in `Graphemes`, but to avoid a dependency on that library, I used | ||
| 104 | a [Runeset][Rune]. This is included statically, with only as much | ||
| 105 | code as needed to recognize the sequences; `zg` itself remains free of | ||
| 106 | transitive dependencies. | ||
| 107 | |||
| 108 | [UCD]: https://www.unicode.org/reports/tr44/ | ||
| 109 | [Rune]: https://github.com/mnemnion/runeset | ||
| 110 | |||
| 3 | ## zg v0.14.0 Release Notes | 111 | ## zg v0.14.0 Release Notes |
| 4 | 112 | ||
| 5 | This is the first minor point release since Sam Atman (me) took over | 113 | This is the first minor point release since Sam Atman (me) took over |
| @@ -52,9 +160,9 @@ UTF-8 into codepoints. Concerningly, this interpreted overlong | |||
| 52 | sequences, which has been forbidden by Unicode for more than 20 years | 160 | sequences, which has been forbidden by Unicode for more than 20 years |
| 53 | due to the security risks involved. | 161 | due to the security risks involved. |
| 54 | 162 | ||
| 55 | This has been replaced with a DFA decoder based on the work of [Björn | 163 | This has been replaced with a DFA decoder based on the work of |
| 56 | Höhrmann][UTF], which has proven itself fast[^1] and reliable. This is | 164 | [Björn Höhrmann][UTF], which has proven itself fast[^1] and reliable. |
| 57 | a breaking change; sequences such as `"\xc0\xaf"` will no longer | 165 | This is a breaking change; sequences such as `"\xc0\xaf"` will no longer |
| 58 | produce the code `'/'`, nor will surrogates return their codepoint | 166 | produce the code `'/'`, nor will surrogates return their codepoint |
| 59 | value. | 167 | value. |
| 60 | 168 | ||
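
To make the decoder change in the last hunk concrete, here is a minimal sketch that uses only the Zig standard library, not `zg`'s own decoder (whose API does not appear in this diff). It assumes `std.unicode.utf8ValidateSlice` as a stand-in for strict UTF-8 handling, and `isWellFormed` is a hypothetical helper name chosen for illustration.

```zig
const std = @import("std");

/// Hypothetical helper for illustration only; not part of zg.
/// Reports whether `bytes` is well-formed UTF-8 according to the
/// Zig standard library's validator.
fn isWellFormed(bytes: []const u8) bool {
    return std.unicode.utf8ValidateSlice(bytes);
}

test "overlong and surrogate sequences are not well-formed UTF-8" {
    // "\xc0\xaf" is an overlong two-byte encoding of '/' (U+002F).
    try std.testing.expect(!isWellFormed("\xc0\xaf"));

    // "\xed\xa0\x80" would encode the surrogate U+D800.
    try std.testing.expect(!isWellFormed("\xed\xa0\x80"));

    // The shortest form of '/' is the single byte 0x2f, which is valid.
    try std.testing.expect(isWellFormed("/"));
}
```

Saved to a file, this can be run with `zig test`. A strict decoder treats the first two inputs as errors rather than yielding `'/'` or a surrogate value, which is the behavior change the release note describes.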