# News
## zg v0.16.0-pre Release Notes
This brings another major change to `zg`, touching basically everything.
The `zg` modules no longer require allocation to use. Everything is
kept in static memory, where it should be.
With compression gone in the last release, the inconvenience and startup
penalty of moving the data, already present _in_ static memory, over to
the heap, was purely wasted effort. Just CPU heat and extra clock time.
So that's gone.
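
To make the shape of the change concrete, here is a minimal sketch of the pattern, not `zg`'s actual code: a container whose table lives in static memory, with functions called directly on it. The `digits` container and `isDigit` are hypothetical names invented for this illustration.

```zig
const std = @import("std");

/// Hypothetical stand-in for a zg-style module: a container holding
/// only static data and functions, so there is nothing to init or free.
const digits = struct {
    // Lookup table computed at compile time and stored in static memory.
    const table: [256]bool = blk: {
        var t = [_]bool{false} ** 256;
        var c: usize = '0';
        while (c <= '9') : (c += 1) t[c] = true;
        break :blk t;
    };

    /// No allocator, no instance: callers use the container directly.
    pub fn isDigit(byte: u8) bool {
        return table[byte];
    }
};

test "static table needs no allocator" {
    try std.testing.expect(digits.isDigit('7'));
    try std.testing.expect(!digits.isDigit('x'));
}
```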
90% of the work here was done by Jacob Sandlund, who went on to write
his own rather interesting Unicode library, [uucode][uucode], which you
should check out if you like galaxy-brained comptime shenanigans (I
surely do).
Zig having moved on, I needed to merge that code via massage and
copypasta, so none of the commits survive. But it's his work, so
thanks Jacob!
I did manage to dispose of one last allocation in `CanonData`, by
integrating Vexu's very clever [comptime hash map][chm], so, that's
nice.

[uucode]: https://github.com/jacobsandlund/uucode
[chm]: https://github.com/Vexu/comptime_hash_map/blob/master/src/main.zig
### Migration
Is simplicity itself: call functions directly on the module, instead of
calling `.init` with an allocator, using the instance it returns, and
disposing of it when you're done with it.
Technically the laws of style dictate that, since these are now
containers, and not instantiable types, they should be lowercased.
I didn't see the point in changing all those names: it would add labor
to what should be a very brief and pleasant upgrade (but see below).
But feel free!
Pro tip: use LSP superpowers to rename the instance to the name of the
module, then just delete the initializer. Couldn't be simpler.
While further breaking changes are almost certain, this is the last
refactor of this total magnitude which `zg` is likely to see.
### zg: The Module
The take-what-you-need approach, of packaging the interface in a bunch of
separate modules, remains available for those who prefer it.
Or, your code can just import `"zg"`, a module containing all of the
other modules. Zig's lazy compilation model gives us take-what-you-need
already, so while there's no reason to remove the submodules, there's no
reason to prefer using them either.
As mentioned above, none of these are instance types any longer, and that
dictates that they take lower-case (as `code_point` and `ascii` always
have, for that reason). So in `zg`, the modules are styled in lower case.
I did not want to combine a purely stylistic change, and one which would
require editing build scripts, with the functional changes needed to use
the (much nicer!) allocation-free interface. It is possible that later
releases will lowercase the submodules as well, or maybe just remove
them in favor of importing `zg`. Then again, maybe not.
### Emoji Module
Also Jacob's work. Exposes the basic useful Unicode emoji properties.
### `graphemeClusterWidth`
@lch361 submitted a minor refactor which makes it cleaner to obtain the
display width of a grapheme cluster. Thanks Lich!
### Better Fast-pathing in Caseless Comparison
Caseless comparison only tries the ASCII fast-path when strings are the
same size, which is the only time it can work. The fast path has also
been SIMD accelerated when possible.
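
The gating logic can be sketched roughly as follows. This is an illustration under assumptions, not `zg`'s implementation: `isAllAscii` and `asciiCaselessFastPath` are hypothetical names, the 16-byte vector width is an arbitrary choice, and a `null` result stands for "fall through to the full Unicode comparison", which is not shown.

```zig
const std = @import("std");

/// True when every byte is below 0x80, checked 16 bytes at a time.
fn isAllAscii(bytes: []const u8) bool {
    const lanes = 16;
    const Chunk = @Vector(lanes, u8);
    var i: usize = 0;
    while (i + lanes <= bytes.len) : (i += lanes) {
        const chunk: Chunk = bytes[i..][0..lanes].*;
        // OR-ing all lanes keeps the high bit if any byte has it set.
        if (@reduce(.Or, chunk) >= 0x80) return false;
    }
    // Scalar tail for whatever did not fill a whole vector.
    for (bytes[i..]) |b| {
        if (b >= 0x80) return false;
    }
    return true;
}

/// Returns the fast-path answer, or null when the slow path is needed.
fn asciiCaselessFastPath(a: []const u8, b: []const u8) ?bool {
    // Two ASCII strings can only match caselessly at equal length.
    if (a.len != b.len) return null;
    if (!isAllAscii(a) or !isAllAscii(b)) return null;
    return std.ascii.eqlIgnoreCase(a, b);
}

test "fast path only answers for same-length ASCII" {
    try std.testing.expectEqual(@as(?bool, true), asciiCaselessFastPath("Zig", "zIG"));
    try std.testing.expectEqual(@as(?bool, null), asciiCaselessFastPath("Zig", "Zigs"));
}
```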
Canonicalization, and caseless comparison (which uses it), are in need
of attention. They do things in the most expensive possible fashion,
without taking advantage of any opportunities to do the cheaper thing.
While the result is correct, even in pathological cases, it is not
optimal, especially given the reality that Unicode text is, in a modern
context, nearly always in canonical form already.
Changes to that will have to wait for another release, despite my
inclinations to the contrary.
### `code_point.decode` Fully Deprecated
Slicing to decode a point is an anti-pattern, and calling this
deprecated function is now a `@compileError`, suggesting `decodeAtIndex`
instead. I suggest taking a look at `decodeAtCursor` as well, which
takes a pointer to an index and moves it to the next codepoint while
decoding; this is often what you want.
A future release will remove the function entirely.
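
The cursor idiom looks roughly like this. The sketch deliberately avoids guessing at `zg`'s exact signatures: `decodeNext` is a hypothetical stand-in built on the standard library's `std.unicode.Utf8Iterator`, it assumes the input is already valid UTF-8, and it uses a `usize` cursor and a bare `u21` where `zg` has its own index and `CodePoint` types.

```zig
const std = @import("std");

/// Hypothetical stand-in for a cursor-style decoder. Assumes `bytes`
/// is valid UTF-8; unlike zg, it performs no error substitution.
fn decodeNext(bytes: []const u8, cursor: *usize) ?u21 {
    var it = std.unicode.Utf8Iterator{ .bytes = bytes, .i = cursor.* };
    const cp = it.nextCodepoint() orelse return null;
    cursor.* = it.i; // advance past the codepoint just decoded
    return cp;
}

test "cursor walks the string without slicing" {
    const text = "a\u{e9}\u{20ac}"; // 1-, 2-, and 3-byte codepoints
    var cursor: usize = 0;
    try std.testing.expectEqual(@as(?u21, 'a'), decodeNext(text, &cursor));
    try std.testing.expectEqual(@as(?u21, 0xe9), decodeNext(text, &cursor));
    try std.testing.expectEqual(@as(?u21, 0x20ac), decodeNext(text, &cursor));
    try std.testing.expectEqual(@as(?u21, null), decodeNext(text, &cursor));
}
```

The caller never slices: the same `cursor` variable is handed back in until the function returns `null`.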
## zg v0.15.2-4 Release Notes
Better late than never. The notes, I mean. But the release too.
This was primarily a matter of getting up to Zig `0.15` compatibility.
Minor bugfixes in word wrapping, and the display width of certain
esoteric emoji, are also included.
Compression of the data was also removed. It was removed from stdlib,
but we could have vendored it: more importantly, it turned out to be
basically useless. Savings per data set were in the bytes to low
KiB range, and startup time was negatively affected.
## zg v0.14.1 Release Notes
In a flurry of activity during and after the `v0.14.0` beta, several
features were added (including from a new contributor!), and a bug
fixed.
Presenting `zg v0.14.1`. As should be expected from a patch release,
there are no breaking changes to the interface, just bug fixes and
features.
### Grapheme Zalgo Text Bugfix
Until this release, `zg` was using a `u8` to store the length of a
`Grapheme`. While this is much larger than any "real" grapheme, the
Unicode grapheme segmentation algorithm allows graphemes of arbitrary
size to be constructed, often called [Zalgo text][Zalgo] after a
notorious and funny Stack Overflow answer making use of this affordance.
Therefore, a crafted string could force an integer overflow, with all that
comes with it. The `.len` field of a `Grapheme` is now a `u32`, like the
`.offset` field. Due to padding, the `Grapheme` is the same size as it
was, just making use of the entire 8 bytes.
Actually, both fields are now `uoffset`, for reasons described next.

[Zalgo]: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
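
A small sketch of the arithmetic behind the bug, using only the standard library. The two struct layouts are hypothetical and exist purely for the size comparison; they are not `zg`'s definitions.

```zig
const std = @import("std");

// One grapheme cluster: a base letter followed by 200 combining acute
// accents (U+0301, two bytes each in UTF-8), 401 bytes in total.
const zalgo = "a" ++ ("\u{301}" ** 200);

// Hypothetical layouts, for size comparison only.
const NarrowGrapheme = struct { len: u8, offset: u32 };
const WideGrapheme = struct { len: u32, offset: u32 };

test "a u8 length cannot describe this cluster" {
    try std.testing.expectEqual(@as(usize, 401), zalgo.len);
    // The byte length does not fit in a u8, but fits easily in a u32.
    try std.testing.expect(std.math.cast(u8, zalgo.len) == null);
    try std.testing.expect(std.math.cast(u32, zalgo.len) != null);
    // Padding already rounded the narrow layout up to 8 bytes.
    try std.testing.expectEqual(@sizeOf(WideGrapheme), @sizeOf(NarrowGrapheme));
}
```

Widening the field costs nothing in memory, since alignment to the `u32` field had already padded the struct.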
### Limits Section Added to README
The README now clearly documents that some data structures and iterators
in `zg` use a `u32`. I've also made it possible to configure the library
to use a `u64` instead, and have included an explanation of why this is
not the solution to actual problems which it at first might seem.
My job as maintainer is to provide a useful library to the community, and
comptime makes it easy and pleasant to tailor types to purpose. So for those
who see a need for `u64` values in those structures, just pass `-Dfat_offset`
or its equivalent, and you'll have them.
I believe this to be neither necessary nor sufficient for handling data of
that size. But I can't anticipate every requirement, and don't want to
preclude it as a solution.
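
For the curious, selecting an integer type from a boolean takes very little comptime. The following is a sketch under assumptions, not the library's source: `Offset` and this `Grapheme` function are hypothetical names, and the boolean parameter merely plays the role that the `fat_offset` build option plays in `zg`.

```zig
const std = @import("std");

/// Picks the offset integer type from a boolean, the way a build
/// option could. `Offset` is a hypothetical name for this sketch.
fn Offset(comptime fat: bool) type {
    return if (fat) u64 else u32;
}

/// Hypothetical cursor type whose fields follow the selected width.
fn Grapheme(comptime fat: bool) type {
    return struct {
        len: Offset(fat),
        offset: Offset(fat),
    };
}

test "the wider offset doubles the cursor" {
    try std.testing.expectEqual(@as(usize, 8), @sizeOf(Grapheme(false)));
    try std.testing.expectEqual(@as(usize, 16), @sizeOf(Grapheme(true)));
}
```

In this sketch the wider variant doubles the size of every value of the type, which is one cost to weigh before reaching for it.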
### Iterators, Back and Forth
A new contributor, Nemoos, took on the challenge of adding a reverse
iterator to `Graphemes`. Thanks Nemoos!
I've taken the opportunity to fill in a few bits of functionality to
flesh these out. `code_point` now has a reverse iterator as well, and
either a forward or backward iterator can be reversed in-place.
Reversing an iterator will always return the last non-`null` result
of calling that iterator. This is the only sane behavior, but
might be a bit unexpected without prior warning.
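
A toy model may help with the reversal rule. This is not how `zg` implements its iterators: `ByteIterator` is a hypothetical type that walks single bytes and flips direction with a flag, chosen only because it makes the "last non-`null` result comes back first" behavior easy to see.

```zig
const std = @import("std");

/// Toy byte iterator whose direction can be flipped in place.
const ByteIterator = struct {
    bytes: []const u8,
    /// Boundary between bytes already yielded and bytes still ahead.
    pos: usize = 0,
    forward: bool = true,

    fn next(self: *ByteIterator) ?u8 {
        if (self.forward) {
            if (self.pos >= self.bytes.len) return null;
            const b = self.bytes[self.pos];
            self.pos += 1;
            return b;
        }
        if (self.pos == 0) return null;
        self.pos -= 1;
        return self.bytes[self.pos];
    }

    fn reverse(self: *ByteIterator) void {
        self.forward = !self.forward;
    }
};

test "reversing yields the last result again" {
    var it = ByteIterator{ .bytes = "abc" };
    try std.testing.expectEqual(@as(?u8, 'a'), it.next());
    try std.testing.expectEqual(@as(?u8, 'b'), it.next());
    it.reverse();
    // The first call after reversing repeats the last non-null result.
    try std.testing.expectEqual(@as(?u8, 'b'), it.next());
}
```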
There's also `codePointAtIndex` and `graphemeAtIndex`. These can be
given any index which falls within the Grapheme or Codepoint which
is returned. These always return a value, and therefore cannot be
called on an empty string.
Finally, `Graphemes.iterateAfterGrapheme(string, grapheme)` will
return a forward iterator which will yield the grapheme after
`grapheme` when first called. `iterateBeforeGrapheme` has the
signature and result one might expect from this.
`code_point` doesn't have an equivalent of those, since it isn't
useful: codepoints are one to four bytes in length, while obtaining
a grapheme reliably, given only an index, involves some pretty tricky
business to get right. The `Graphemes` API just described allows
code to obtain a Grapheme cursor and then begin iterating in either
direction, by calling `graphemeAtIndex` and providing it to either
of those functions. For codepoints, starting an iterator at either
`.offset` or `.offset + .len` will suffice, since the `CodePoint`
iterator is otherwise stateless.
### Words Module
The [Unicode annex][tr29] with the canonical grapheme segmentation
algorithm also includes algorithms for word and sentence segmentation.
`v0.14.1` includes an implementation of the word algorithm.
It works like `Graphemes`. There's forward and reverse iteration,
`wordAtIndex`, and `iterate(Before|After)Word`.
If anyone is looking for a challenge, there are open issues for sentence
segmentation and [line breaking][tr14].

[tr29]: https://www.unicode.org/reports/tr29/
[tr14]: https://www.unicode.org/reports/tr14/
#### Runeset Used
As a point of interest:
Most of the rules in the word breaking algorithm come from a distinct
property table, `WordBreakProperties.txt` from the [UCD][UCD]. These
are made into a data structure familiar from the other modules.
One rule, WB3c, uses the Extended Pictographic property. This is also
used in `Graphemes`, but to avoid a dependency on that library, I used
a [Runeset][Rune]. This is included statically, with only just as much
code as needed to recognize the sequences; `zg` itself remains free of
transitive dependencies.

[UCD]: https://www.unicode.org/reports/tr44/
[Rune]: https://github.com/mnemnion/runeset
## zg v0.14.0 Release Notes
This is the first minor point release since Sam Atman (me) took over
maintenance of `zg` from the inimitable José Colon Rodriguez, aka
@dude_the_builder. We're all grateful for everything he's done for
the Zig community.
The changes are fairly large, and most user code will need to be updated.
The result is substantially streamlined and easier to use, and updating
will mainly take place around importing, creating, and deinitializing.
### The Great Renaming
The most obvious change is on the surface API: more than half of the
modules have been renamed. There are no user-facing modules with `Data`
in the name, and some abbreviations have been spelled in full.
### No More Separation of Data and Functionality
It is no longer necessary to separately create, for example, a
`GraphemeData` structure, in order to use the functionality provided
by the `grapheme` module.
Instead there's just `Graphemes`, and the same for a couple of other
modules which worked the same way. This means that the cases where
functionality was provided by a wrapped pointer are now provided
directly from the struct with the necessary data.
This would make user structs larger in some cases, while eliminating a
pointer chase. If that isn't a desirable trade off for your code,
read on.
### All Allocated Data is Unmanaged
As of `v0.14`, all structs which need heap allocation no longer
have a copy of their allocator. We felt that this was redundant,
especially when several such structures were in use, and it reflects
a general trend in the standard library toward fewer managed data
structures.
Getting up to speed is a matter of passing the allocator to `deinit`.
This change comes courtesy of [lch361](https://lch361.net), in his
first contribution to the repo. Thanks Lich!
### `code_point` Now Unicode-Compliant
The `v0.13.x` decoder used a simple, fast, but naïve method to decode
UTF-8 into codepoints. Concerningly, this interpreted overlong
sequences, which have been forbidden by Unicode for more than 20 years
due to the security risks involved.
This has been replaced with a DFA decoder based on the work of
[Björn Höhrmann][UTF], which has proven itself fast[^1] and reliable.
This is a breaking change; sequences such as `"\xc0\xaf"` will no longer
produce the code `'/'`, nor will surrogates return their codepoint
value.
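
To see why `"\xc0\xaf"` is a problem, consider a mask-and-shift decoder with no range check. The sketch below is illustrative only: `naiveDecode2` is a hypothetical function, not the old `zg` code, and the standard library's `std.unicode.utf8ValidateSlice` stands in for a conforming validator rather than for the DFA decoder described here.

```zig
const std = @import("std");

/// Naive two-byte decode: masks and shifts without checking that the
/// result actually needs two bytes. Shown only to illustrate the bug.
fn naiveDecode2(lead: u8, cont: u8) u21 {
    return (@as(u21, lead & 0x1f) << 6) | @as(u21, cont & 0x3f);
}

test "std rejects what a naive decoder accepts" {
    // The overlong pair decodes to '/' when nothing checks for it.
    try std.testing.expectEqual(@as(u21, '/'), naiveDecode2(0xc0, 0xaf));
    // A conforming validator refuses the same bytes, and an encoded
    // surrogate (U+D800) as well.
    try std.testing.expect(!std.unicode.utf8ValidateSlice("\xc0\xaf"));
    try std.testing.expect(!std.unicode.utf8ValidateSlice("\xed\xa0\x80"));
}
```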
The new decoder faithfully implements §3.9.6 of the Unicode Standard,
_U+FFFD Substitution of Maximal Subparts_. While this is itself not
required to claim Unicode conformance, it is the W3C specification for
replacement behavior.
Along with this, `code_point.decode` is deprecated, and will be removed
in a later version of `zg`. It was basically an exposed piece of the
`Iterator` implementation, and is no longer used in that capacity.
Instead, prefer `decodeAtIndex([]const u8, u32) ?CodePoint`, or better
yet, `decodeAtCursor([]const u8, *u32)`. The latter advances its
second argument to the next possible index for a valid codepoint, which
is good for the fetch pipeline, and more ergonomic in many cases.

[UTF]: https://bjoern.hoehrmann.de/utf-8/decoder/dfa/

[^1]: A bit more than twice as fast as the standard library for
decoding, according to my (limited) benchmarks.
### DisplayWidth and CaseFolding Can Share Data
Both of these modules use another module to get the job done,
`Graphemes` for `DisplayWidth`, and `Normalize` for `CaseFolding`.
It is now possible to initialize them with a borrowed copy of those
modules, to make it simpler to write code which also needs the base
modules.
### Grapheme Iterator Creation
This is a modest streamlining of how a grapheme iterator is created.
Before:
```zig
const gd = try grapheme.GraphemeData.init(allocator);
defer gd.deinit();
var iter = grapheme.Iterator.init("🤘🏻some rad string! 🤘🏿", &gd);
```
Now:
```zig
const graphemes = try Graphemes.init(allocator);
defer graphemes.deinit(allocator);
var iter = graphemes.iterator("🤘🏻some rad string! 🤘🏿");
```
It remains possible to use
```zig
var iter = Graphemes.Iterator.init("stri̵̢̡̡̡̨̧̡̨̡̡̡̨̫̗̗̱̳̼̖͚͉̩̬̬͚̟̣̮̬̙̖̗͇̮͓̻̫͍͎͉͎̹̩̗͖͈̙̻̭̝̭̼̙̯̪͚̙͉͎͎͖̥̹͈̫͍̹͓̘̙͎͖̝̦͎̤̼̹͕͈̪̙̪̯̯͙̝͈͕̬̪̗̭͎͖̟͚̦̣̘͙̞̮̹̙͚̼̤̟͉̭͔̩͍͔͈̯͎̘͎̭̥̖̜͙̖̖͍̼͙͎͚̦̮̹̞̺͍̳̖̹̼̲̠̩̰̳͂̌̈́̓̄͋̇̎͜͜͠ͅͅͅͅng", &graphemes);
```
If one were to prefer doing so.
### Initialization vs. Setup
Every allocating module now has both an `init` function, which
returns the created struct, and a `setup` function. The latter
takes a mutable pointer, and an `Allocator`, returning
`Allocator.Error!void`.
So those who might prefer a single-pointer home for such modules
can allocate the struct on the heap with `allocator.create`, or
add a pointer field to some other struct, then use `setup` to
populate it.
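
The two entry points relate roughly as in this sketch. `Table` is a hypothetical allocating type invented for the example; only the shape of `init`, `setup`, and an allocator-taking `deinit` is taken from these notes.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

/// Hypothetical allocating module, standing in for a zg v0.14 type.
const Table = struct {
    data: []u8,

    /// Returns the populated struct by value.
    pub fn init(allocator: Allocator) Allocator.Error!Table {
        var self: Table = undefined;
        try self.setup(allocator);
        return self;
    }

    /// Populates a struct that already has a home.
    pub fn setup(self: *Table, allocator: Allocator) Allocator.Error!void {
        self.data = try allocator.alloc(u8, 16);
        @memset(self.data, 0);
    }

    /// Unmanaged: the caller passes the allocator back in.
    pub fn deinit(self: *const Table, allocator: Allocator) void {
        allocator.free(self.data);
    }
};

test "setup populates a heap-allocated struct" {
    const allocator = std.testing.allocator;
    const table = try allocator.create(Table);
    defer allocator.destroy(table);
    try table.setup(allocator);
    defer table.deinit(allocator);
    try std.testing.expectEqual(@as(usize, 16), table.data.len);
}
```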
In the process, the various spurious reader and decompression errors
have been turned `unreachable`, leaving only `error.OutOfMemory`.
Encountering any of the other errors would indicate an internal problem,
so we no longer make user code deal with that unlikely event.
### New DisplayWidth options
A `DisplayWidth` can now be compiled to treat `c0` and `c1` control codes
as having a width. Canonically, terminals don't print them, so they
would have a width of 0. However, some applications (`vim` for example)
need to escape control codes to make them visible. Setting these
options will let `DisplayWidth` return the correct widths when this
is done.
### Unicode 16.0
This updates `zg` to use the latest Unicode edition. This should be
the only change which will change behavior of user code, other than
through the use of the new `DisplayWidth` options.
### Tests
It is now possible to run all the tests, not just the `unicode-test`
subset. Accordingly, that step is removed, and `zig build test`
runs everything.
#### Allocations Tested
Every allocate-able now has a `checkAllAllocationFailures` test. This
process turned up two bugs. Also discovered were 8,663 allocations,
each individually freed on deinit, which were reduced to two. So
that's nice.
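
For anyone who has not used it, `std.testing.checkAllAllocationFailures` reruns a function once per allocation, failing a different allocation each time, and reports leaks or swallowed errors. A minimal sketch with a hypothetical two-allocation type (`Pair` and `initAndDeinit` are invented for the example, not part of `zg`):

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

/// Hypothetical type with two allocations, for illustration only.
const Pair = struct {
    a: []u8,
    b: []u8,

    fn init(allocator: Allocator) Allocator.Error!Pair {
        const a = try allocator.alloc(u8, 8);
        // Without this errdefer, a failure of the second allocation
        // would leak the first, and the check below would report it.
        errdefer allocator.free(a);
        const b = try allocator.alloc(u8, 8);
        return Pair{ .a = a, .b = b };
    }

    fn deinit(self: Pair, allocator: Allocator) void {
        allocator.free(self.a);
        allocator.free(self.b);
    }
};

/// The function under test takes the allocator as its first argument.
fn initAndDeinit(allocator: Allocator) Allocator.Error!void {
    const pair = try Pair.init(allocator);
    pair.deinit(allocator);
}

test "Pair survives every allocation failure" {
    try std.testing.checkAllAllocationFailures(std.testing.allocator, initAndDeinit, .{});
}
```

Removing the `errdefer` line is a quick way to watch the check catch a leak.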
#### That's It!
I hope you find converting over `zg v0.13` code to be fairly painless
and straightforward. There should be no need to make changes of this
magnitude in the future.