-rw-r--r--  NEWS.md             108
-rw-r--r--  README.md           156
-rw-r--r--  src/Words.zig        17
-rw-r--r--  src/code_point.zig    3
4 files changed, 257 insertions, 27 deletions
diff --git a/NEWS.md b/NEWS.md
index 8131878..0ccf151 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,5 +1,113 @@
1# News 1# News
2 2
3## zg v0.14.1 Release Notes
4
5In a flurry of activity during and after the `v0.14.0` beta, several
6features were added (including from a new contributor!), and a bug
7fixed.
8
9Presenting `zg v0.14.1`. As should be expected from a patch release,
10there are no breaking changes to the interface, just bug fixes and
11features.
12
13### Grapheme Zalgo Text Bugfix
14
15Until this release, `zg` was using a `u8` to store the length of a
16`Grapheme`. While this is much larger than any "real" grapheme, the
17Unicode grapheme segmentation algorithm allows graphemes of arbitrary
18size to be constructed, often called [Zalgo text][Zalgo] after a
19notorious and funny Stack Overflow answer making use of this affordance.
20
21Therefore, a crafted string could force an integer overflow, with all that
22comes with it. The `.len` field of a `Grapheme` is now a `u32`, like the
23`.offset` field. Due to padding, the `Grapheme` is the same size as it
24was, just making use of the entire 8 bytes.
25
26Actually, both fields are now `uoffset`, for reasons described next.
27
28[Zalgo]: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
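
Concretely, the change amounts to something like this (illustrative declarations only, not the actual source; see `Graphemes.zig` for the real thing):

```zig
const uoffset = u32; // u64 when zg is built with -Dfat_offset

// v0.14.0: a crafted grapheme longer than 255 bytes could overflow `len`.
const GraphemeOld = struct { offset: u32, len: u8 }; // 8 bytes, 3 of them padding

// v0.14.1: both fields use the offset type, so the struct is still 8 bytes
// by default, and Zalgo text can no longer overflow the length.
const GraphemeNew = struct { offset: uoffset, len: uoffset };
```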
29
30### Limits Section Added to README
31
32The README now clearly documents that some data structures and iterators
33in `zg` use a `u32`. I've also made it possible to configure the library
34to use a `u64` instead, and have included an explanation of why this is
35not the solution to real problems that it might at first appear to be.
36
37My job as maintainer is to provide a useful library to the community, and
38comptime makes it easy and pleasant to tailor types to purpose. So for those
39who see a need for `u64` values in those structures, just pass `-Dfat_offset`
40or its equivalent, and you'll have them.
41
42I believe this to be neither necessary nor sufficient for handling data of
43that size. But I can't anticipate every requirement, and don't want to
44preclude it as a solution.
45
46### Iterators, Back and Forth
47
48A new contributor, Nemoos, took on the challenge of adding a reverse
49iterator to `Graphemes`. Thanks Nemoos!
50
51I've taken the opportunity to fill in a few bits of functionality to
52flesh these out. `code_point` now has a reverse iterator as well, and
53either a forward or backward iterator can be reversed in-place.
54
55Reversing an iterator yields one whose first result is the last non-`null`
56value returned by the original, if it returned any. This is the only sane
57behavior, but might be a bit unexpected without prior warning.
58
59There are also `codepointAtIndex` and `graphemeAtIndex`. These can be
60given any index which falls within the `CodePoint` or `Grapheme` that
61is returned. They always return a value, and therefore cannot be
62called on an empty string.
63
64Finally, `Graphemes.iterateAfterGrapheme(string, grapheme)` will
65return a forward iterator which will yield the grapheme after
66`grapheme` when first called. `iterateBeforeGrapheme` has the
67signature and result one might expect from this.
68
69`code_point` doesn't have an equivalent of those, since it isn't
70useful: codepoints are one to four bytes in length, while obtaining
71a grapheme reliably, given only an index, involves some pretty tricky
72business to get right. The `Graphemes` API just described allows
73code to obtain a Grapheme cursor and then begin iterating in either
74direction, by calling `graphemeAtIndex` and providing it to either
75of those functions. For codepoints, starting an iterator at either
76`.offset` or `.offset + .len` will suffice, since the `CodePoint`
77iterator is otherwise stateless.
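
For example, picking up iteration from a known `CodePoint` needs nothing more than the decode functions documented in the README (the string and indices here are just for illustration):

```zig
const std = @import("std");
const code_point = @import("code_point");
const expectEqual = std.testing.expectEqual;

test "Restart codepoint iteration from a CodePoint" {
    const str = "ab😊c"; // illustrative sample
    // Any byte index inside the emoji recovers the whole codepoint:
    const cp = code_point.codepointAtIndex(str, 3).?;
    try expectEqual('😊', cp.code);
    // Decode forward from the end of that codepoint...
    var cursor = cp.offset + cp.len;
    try expectEqual('c', code_point.decodeAtCursor(str, &cursor).?.code);
    // ...or decode an earlier codepoint directly, given its offset.
    try expectEqual('b', code_point.decodeAtIndex(str, cp.offset - 1).?.code);
}
```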
78
79### Words Module
80
81The [Unicode annex][tr29] with the canonical grapheme segmentation
82algorithm also includes algorithms for word and sentence segmentation.
83`v0.14.1` includes an implementation of the word algorithm.
84
85It works like `Graphemes`. There's forward and reverse iteration,
86`wordAtIndex`, and `iterate(Before|After)Word`.
87
88If anyone is looking for a challenge, there are open issues for sentence
89segmentation and [line breaking][tr14].
90
91[tr29]: https://www.unicode.org/reports/tr29/
92[tr14]: https://www.unicode.org/reports/tr14/
93
94#### Runeset Used
95
96As a point of interest:
97
98Most of the rules in the word breaking algorithm come from a distinct
99property table, `WordBreakProperties.txt` from the [UCD][UCD]. These
100are made into a data structure familiar from the other modules.
101
102One rule, WB3c, uses the Extended Pictographic property. This is also
103used in `Graphemes`, but to avoid a dependency on that library, I used
104a [Runeset][Rune]. This is included statically, with only as much code
105as is needed to recognize the sequences; `zg` itself remains free of
106transitive dependencies.
107
108[UCD]: https://www.unicode.org/reports/tr44/
109[Rune]: https://github.com/mnemnion/runeset
110
3## zg v0.14.0 Release Notes 111## zg v0.14.0 Release Notes
4 112
5This is the first minor point release since Sam Atman (me) took over 113This is the first minor point release since Sam Atman (me) took over
diff --git a/README.md b/README.md
index 1da50f3..3abe480 100644
--- a/README.md
+++ b/README.md
@@ -2,21 +2,24 @@
2 2
3zg provides Unicode text processing for Zig projects. 3zg provides Unicode text processing for Zig projects.
4 4
5
5## Unicode Version 6## Unicode Version
6 7
7The Unicode version supported by zg is `16.0.0`. 8The Unicode version supported by zg is `16.0.0`.
8 9
10
9## Zig Version 11## Zig Version
10 12
11The minimum Zig version required is `0.14`. 13The minimum Zig version required is `0.14`.
12 14
15
13## Integrating zg into your Zig Project 16## Integrating zg into your Zig Project
14 17
15You first need to add zg as a dependency in your `build.zig.zon` file. In your 18You first need to add zg as a dependency in your `build.zig.zon` file. In your
16Zig project's root directory, run: 19Zig project's root directory, run:
17 20
18```plain 21```plain
19zig fetch --save https://codeberg.org/atman/zg/archive/v0.14.0-rc1.tar.gz 22zig fetch --save https://codeberg.org/atman/zg/archive/v0.14.1.tar.gz
20``` 23```
21 24
22Then instantiate the dependency in your `build.zig`: 25Then instantiate the dependency in your `build.zig`:
@@ -25,12 +28,14 @@ Then instantiate the dependency in your `build.zig`:
25const zg = b.dependency("zg", .{}); 28const zg = b.dependency("zg", .{});
26``` 29```
27 30
31
28## A Modular Approach 32## A Modular Approach
29 33
30zg is a modular library. This approach minimizes binary file size and memory 34zg is a modular library. This approach minimizes binary file size and memory
31requirements by only including the Unicode data required for the specified module. 35requirements by only including the Unicode data required for the specified module.
32The following sections describe the various modules and their specific use case. 36The following sections describe the various modules and their specific use case.
33 37
38
34### Init and Setup 39### Init and Setup
35 40
36The code examples will show the use of `Module.init(allocator)` to create the 41The code examples will show the use of `Module.init(allocator)` to create the
@@ -67,7 +72,7 @@ const code_point = @import("code_point");
67 72
68test "Code point iterator" { 73test "Code point iterator" {
69 const str = "Hi 😊"; 74 const str = "Hi 😊";
70 var iter = code_point.Iterator{ .bytes = str }; 75 var iter: code_point.Iterator = .init(str);
71 var i: usize = 0; 76 var i: usize = 0;
72 77
73 while (iter.next()) |cp| : (i += 1) { 78 while (iter.next()) |cp| : (i += 1) {
@@ -78,25 +83,60 @@ test "Code point iterator" {
78 83
79 if (i == 3) { 84 if (i == 3) {
80 try expect(cp.code == '😊'); 85 try expect(cp.code == '😊');
81
82 // The `offset` field is the byte offset in the 86 // The `offset` field is the byte offset in the
83 // source string. 87 // source string.
84 try expect(cp.offset == 3); 88 try expect(cp.offset == 3);
85 try expectEqual(code_point.CodePoint, code_point.decodeAtIndex(str, cp.offset)); 89 try expectEqual(cp, code_point.decodeAtIndex(str, cp.offset).?);
86
87 // The `len` field is the length in bytes of the 90 // The `len` field is the length in bytes of the
88 // code point in the source string. 91 // code point in the source string.
89 try expect(cp.len == 4); 92 try expect(cp.len == 4);
93 // There is also a 'cursor' decode, like so:
94 {
95 var cursor = cp.offset;
96 try expectEqual(cp, code_point.decodeAtCursor(str, &cursor).?);
97 // Which advances the cursor variable to the next possible
98 // offset, in this case, `str.len`. Don't forget to account
99 // for this possibility!
100 try expectEqual(cp.offset + cp.len, cursor);
101 }
102 // There's also this, for when you aren't sure if you have the
103 // correct start for a code point:
104 try expectEqual(cp, code_point.codepointAtIndex(str, cp.offset + 1).?);
90 } 105 }
106 // Reverse iteration is also an option:
107 var r_iter: code_point.ReverseIterator = .init(str);
108 // Both iterators can be peeked:
109 try expectEqual('😊', r_iter.peek().?.code);
110 try expectEqual('😊', r_iter.prev().?.code);
111 // Both kinds of iterators can be reversed:
112 var fwd_iter = r_iter.forwardIterator(); // or iter.reverseIterator();
113 // This will always return the last codepoint from
114 // the prior iterator, _if_ it yielded one:
115 try expectEqual('😊', fwd_iter.next().?.code);
91 } 116 }
92} 117}
93``` 118```
94 119
120Note that it's safe to call CodePoint functions on invalid
121UTF-8. Iterators and decode functions will return the Unicode
122Replacement Character `U+FFFD`, according to the Substitution of Maximal
123Subparts algorithm, for any invalid code unit sequences encountered.
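
As a quick sketch of what that looks like (the example bytes are illustrative):

```zig
const std = @import("std");
const code_point = @import("code_point");
const expect = std.testing.expect;

test "Invalid UTF-8 yields replacement characters" {
    // 0x80 is a stray continuation byte, invalid on its own.
    var iter: code_point.Iterator = .init("a\x80b");
    try expect(iter.next().?.code == 'a');
    try expect(iter.next().?.code == 0xFFFD);
    try expect(iter.next().?.code == 'b');
    try expect(iter.next() == null);
}
```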
124
125
95## Grapheme Clusters 126## Grapheme Clusters
96 127
97Many characters are composed from more than one code point. These are known as 128Many characters are composed from more than one code point. These
98Grapheme Clusters, and the `Graphemes` module has a data structure to represent 129are known as Grapheme Clusters, and the `Graphemes` module has a
99them, `Grapheme`, and an `Iterator` to iterate over them in a string. 130data structure to represent them, `Grapheme`, and an `Iterator` and
131`ReverseIterator` to iterate over them in a string.
132
133There is also `graphemeAtIndex`, which returns the grapheme containing
134the given index; the index does not have to be on a valid grapheme
135or codepoint boundary, but it is illegal to call this on an empty string.
136Finally, `iterateAfterGrapheme` and `iterateBeforeGrapheme` will provide
137forward or backward grapheme iterators over the string, from the grapheme
138provided. Thus, given an index, you can begin forward or backward
139iteration at that index without needing to slice the string.
100 140
101In your `build.zig`: 141In your `build.zig`:
102 142
@@ -139,6 +179,56 @@ test "Grapheme cluster iterator" {
139} 179}
140``` 180```
141 181
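
Here's a sketch of those index-based entry points. The method-call style follows the other examples in this README (`init`/`deinit` with an allocator, `.bytes()` on the result); check `Graphemes.zig` if the exact signatures differ, and the sample string and index are purely illustrative:

```zig
const std = @import("std");
const testing = std.testing;
const Graphemes = @import("Graphemes");

test "Grapheme lookup by index" {
    const graphemes = try Graphemes.init(testing.allocator);
    defer graphemes.deinit(testing.allocator);
    const str = "noël"; // illustrative sample
    // Index 3 falls inside "ë"; any index within a cluster finds that cluster.
    const g = graphemes.graphemeAtIndex(str, 3);
    try testing.expectEqualStrings("ë", g.bytes(str));
    // Resume forward iteration from that cluster, no slicing required.
    var iter = graphemes.iterateAfterGrapheme(str, g);
    try testing.expectEqualStrings("l", iter.next().?.bytes(str));
}
```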
182
183## Words
184
185Unicode has a standard word segmentation algorithm, which gives good
186results for most languages. Some languages, such as Thai, require a
187dictionary to find the boundary between words; these cases are not
188handled by the standard algorithm.
189
190`zg` implements that algorithm in the `Words` module. As a note, the
191iterators and functions provided here yield the segments between word
192_boundaries_, which are not always a "word" in the conventional sense.
193Specifically, the iterators in this module will return every segment of
194a string, ensuring that words are kept whole when encountered. If the
195word breaks are of primary interest, you'll want to use the `.offset`
196field of each iterated value, and handle `string.len` as the final case
197when the iteration returns `null`.
198
199The API is congruent with `Graphemes`: forward and backward iterators,
200`wordAtIndex`, and `iterateAfterWord` and `iterateBeforeWord`.
201
202In your `build.zig`:
203
204```zig
205exe.root_module.addImport("Words", zg.module("Words"));
206```
207
208In your code:
209
210```zig
211const Words = @import("Words");
212
213test "Words" {
214 const wb = try Words.init(testing.allocator);
215 defer wb.deinit(testing.allocator);
216 const word_str = "Metonym Μετωνύμιο メトニム";
217 var w_iter = wb.iterator(word_str);
218 try testing.expectEqualStrings("Metonym", w_iter.next().?.bytes(word_str));
219 // Spaces are "words" too!
220 try testing.expectEqualStrings(" ", w_iter.next().?.bytes(word_str));
221 const in_greek = w_iter.next().?;
222 // wordAtIndex doesn't care if the index is valid for a codepoint:
223 for (in_greek.offset..in_greek.offset + in_greek.len) |i| {
224 const at_index = wb.wordAtIndex(word_str, i).bytes(word_str);
225 try testing.expectEqualStrings("Μετωνύμιο", at_index);
226 }
227 _ = w_iter.next();
228 try testing.expectEqualStrings("メトニム", w_iter.next().?.bytes(word_str));
229}
230```
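
For instance, if it's the break positions rather than the segments that you're after, the `.offset` bookkeeping described above can look like this (the sample string and expected offsets are illustrative):

```zig
const std = @import("std");
const testing = std.testing;
const Words = @import("Words");

test "Word break offsets" {
    const wb = try Words.init(testing.allocator);
    defer wb.deinit(testing.allocator);
    const str = "so it goes"; // illustrative sample
    var breaks = std.ArrayList(u32).init(testing.allocator);
    defer breaks.deinit();
    var iter = wb.iterator(str);
    // Every segment's offset is a word break...
    while (iter.next()) |word| try breaks.append(word.offset);
    // ...and when the iterator returns null, str.len is the final one.
    try breaks.append(@intCast(str.len));
    try testing.expectEqualSlices(u32, &.{ 0, 2, 3, 5, 6, 10 }, breaks.items);
}
```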
231
142## Unicode General Categories 232## Unicode General Categories
143 233
144To detect the general category for a code point, use the `GeneralCategories` module. 234To detect the general category for a code point, use the `GeneralCategories` module.
@@ -279,24 +369,24 @@ Unicode normalization is the process of converting a string into a uniform
279representation that can guarantee a known structure by following a strict set 369representation that can guarantee a known structure by following a strict set
280of rules. There are four normalization forms: 370of rules. There are four normalization forms:
281 371
282Canonical Composition (NFC) 372**Canonical Composition (NFC)**
283: The most compact representation obtained by first 373: The most compact representation obtained by first
284decomposing to Canonical Decomposition and then composing to NFC. 374decomposing to Canonical Decomposition and then composing to NFC.
285 375
286Compatibility Composition (NFKC) 376**Compatibility Composition (NFKC)**
287: The most comprehensive composition obtained 377: The most comprehensive composition obtained
288by first decomposing to Compatibility Decomposition and then composing to NFKC. 378by first decomposing to Compatibility Decomposition and then composing to NFKC.
289 379
290Canonical Decomposition (NFD) 380**Canonical Decomposition (NFD)**
291: Only code points with canonical decompositions 381: Only code points with canonical decompositions
292are decomposed. This is a more compact and faster decomposition but will not 382are decomposed. This is a more compact and faster decomposition but will not
293provide the most comprehensive normalization possible. 383provide the most comprehensive normalization possible.
294 384
295Compatibility Decomposition (NFKD) 385**Compatibility Decomposition (NFKD)**
296: The most comprehensive decomposition method 386: The most comprehensive decomposition method
297where both canonical and compatibility decompositions are performed recursively. 387where both canonical and compatibility decompositions are performed recursively.
298 388
299zg has methods to produce all four normalization forms in the `Normalize` module. 389`zg` has methods to produce all four normalization forms in the `Normalize` module.
300 390
301In your `build.zig`: 391In your `build.zig`:
302 392
@@ -493,7 +583,7 @@ in the same fashion as shown for `CaseFolding` and `Normalize`.
493 583
494## Scripts 584## Scripts
495 585
496Unicode categorizes code points by the Script to which they belong. A Script 586Unicode categorizes code points by the Script to which they belong. A Script
497collects letters and other symbols that belong to a particular writing system. 587collects letters and other symbols that belong to a particular writing system.
498You can detect the Script for a code point with the `Scripts` module. 588You can detect the Script for a code point with the `Scripts` module.
499 589
@@ -522,21 +612,33 @@ test "Scripts" {
522 612
523## Limits 613## Limits
524 614
525Iterators, and fragment types such as `CodePoint`, `Grapheme` and `Word`, use a 615Iterators, and fragment types such as `CodePoint`, `Grapheme` and
526`u32` to store the offset into a string, and the length of the fragment 616`Word`, use a `u32` to store the offset into a string, and the length of
527(`CodePoint` uses a `u3` for length, actually). 617the fragment (`CodePoint` uses a `u3` for length, actually).
528 618
5294GiB is a lot of string. There are a few reasons to work with that much 6194GiB is a lot of string. There are a few reasons to work with that much
530string, log files primarily, but fewer to bring it all into memory at once, and 620string, log files primarily, but fewer to bring it all into memory at
531practically no reason at all to do anything to such a string without breaking 621once, and practically no reason at all to do anything to such a string
532it into smaller pieces to work with. 622without breaking it into smaller pieces to work with.
533 623
534Also, Zig compiles on 32 bit systems, where `usize` is 32. Code running on 624Also, Zig compiles on 32 bit systems, where `usize` is a `u32`. Code
535such systems has no choice but to handle slices in smaller pieces. In general, 625running on such systems has no choice but to handle slices in smaller
536if you want code to perform correctly when encountering multi- gigabyte 626pieces. In general, if you want code to perform correctly when
537strings, you'll need to code for that, at a level one or two steps above that 627encountering multi-gigabyte strings, you'll need to code for that, at a
538in which you'll want to, for example, iterate some graphemes of that string. 628level one or two steps above the one at which you'll want to, for example,
629iterate some graphemes of that string.
539 630
540That all said, `zg` modules can be passed the Boolean config option 631That all said, `zg` modules can be passed the Boolean config option
541`fat_offset`, which will make all of those data structures use a `u64` 632instead. I added this option not because you should use it (you generally
542You don't actually want to do this. But you can. 633should not), but to encourage awareness that code operating on strings
634should not, but to encourage awareness that code operating on strings
635needs to pay attention to the size of those strings, and have a plan for
636when sizes get out of specification. What would your code do with a
6371MiB region of string with no newline? There are many questions of this
638nature, and robust code must detect when data is out of the expected
639envelope, so it can respond accordingly.
640
641Code which does pay attention to these questions has no need for `u64`
642sized offsets, and code which does not will not be helped by them. But
643perhaps yours is an exception, in which case, by all means, configure
644accordingly.
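
If yours is such an exception, the option can be forwarded from your own `build.zig` like any other dependency option. A sketch (only the `fat_offset` name comes from zg; the rest of the wiring is illustrative):

```zig
// Expose a -Dfat_offset flag on your own build and forward it to zg.
const fat_offset = b.option(bool, "fat_offset", "Use u64 offsets in zg") orelse false;
const zg = b.dependency("zg", .{
    .fat_offset = fat_offset,
});
exe.root_module.addImport("Graphemes", zg.module("Graphemes"));
```
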
diff --git a/src/Words.zig b/src/Words.zig
index af82562..617c34d 100644
--- a/src/Words.zig
+++ b/src/Words.zig
@@ -674,6 +674,23 @@ test "ext_pict" {
674 try testing.expect(ext_pict.isMatch("\u{2701}")); 674 try testing.expect(ext_pict.isMatch("\u{2701}"));
675} 675}
676 676
677test "Words" {
678 const wb = try Words.init(testing.allocator);
679 defer wb.deinit(testing.allocator);
680 const word_str = "Metonym Μετωνύμιο メトニム";
681 var w_iter = wb.iterator(word_str);
682 try testing.expectEqualStrings("Metonym", w_iter.next().?.bytes(word_str));
683 // Spaces are "words" too!
684 try testing.expectEqualStrings(" ", w_iter.next().?.bytes(word_str));
685 const in_greek = w_iter.next().?;
686 for (in_greek.offset..in_greek.offset + in_greek.len) |i| {
687 const at_index = wb.wordAtIndex(word_str, i).bytes(word_str);
688 try testing.expectEqualStrings("Μετωνύμιο", at_index);
689 }
690 _ = w_iter.next();
691 try testing.expectEqualStrings("メトニム", w_iter.next().?.bytes(word_str));
692}
693
677test wordAtIndex { 694test wordAtIndex {
678 const wb = try Words.init(testing.allocator); 695 const wb = try Words.init(testing.allocator);
679 defer wb.deinit(testing.allocator); 696 defer wb.deinit(testing.allocator);
diff --git a/src/code_point.zig b/src/code_point.zig
index 16648af..7a638af 100644
--- a/src/code_point.zig
+++ b/src/code_point.zig
@@ -121,6 +121,9 @@ pub fn decodeAtCursor(bytes: []const u8, cursor: *uoffset) ?CodePoint {
121 } 121 }
122 if (st == RUNE_REJECT or cursor.* == bytes.len) { 122 if (st == RUNE_REJECT or cursor.* == bytes.len) {
123 @branchHint(.cold); 123 @branchHint(.cold);
124 // This, and the branch below, detect truncation, the
125 // only invalid state handled differently by the Maximal
126 // Subparts algorithm.
124 if (state_dfa[@intCast(u8dfa[byte])] == RUNE_REJECT) { 127 if (state_dfa[@intCast(u8dfa[byte])] == RUNE_REJECT) {
125 cursor.* -= 2; // +1 128 cursor.* -= 2; // +1
126 return .{ 129 return .{