path: root/README.md
author    Jacob Sandlund  2025-07-21 23:07:59 -0400
committer Jacob Sandlund  2025-07-21 23:07:59 -0400
commit    a6122728b265aeb7091e95d135ab83bb3dd1768c (patch)
tree      ce2a3301f29dfb112d18577c7da7748e656a4325 /README.md
parent    fix infinity (diff)
parent    Merge branch 'develop-next' (diff)
Merge branch 'master' into emoji
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  230
1 file changed, 136 insertions(+), 94 deletions(-)
diff --git a/README.md b/README.md
index 5912ce4..0b44402 100644
--- a/README.md
+++ b/README.md
@@ -2,21 +2,24 @@
 
 zg provides Unicode text processing for Zig projects.
 
+
 ## Unicode Version
 
 The Unicode version supported by zg is `16.0.0`.
 
+
 ## Zig Version
 
 The minimum Zig version required is `0.14`.
 
+
 ## Integrating zg into your Zig Project
 
 You first need to add zg as a dependency in your `build.zig.zon` file. In your
 Zig project's root directory, run:
 
 ```plain
-zig fetch --save https://codeberg.org/atman/zg/archive/v0.14.0-rc1.tar.gz
+zig fetch --save https://codeberg.org/atman/zg/archive/v0.14.1.tar.gz
 ```
 
 Then instantiate the dependency in your `build.zig`:
@@ -25,12 +28,14 @@ Then instantiate the dependency in your `build.zig`:
 const zg = b.dependency("zg", .{});
 ```
 
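+Each module is then wired into your root module with `addImport`; the
+sections below give the exact line for each module, following this
+pattern (shown here for the `code_point` module; treat the module-name
+strings as per-section, not guaranteed by this example):
+
+```zig
+exe.root_module.addImport("code_point", zg.module("code_point"));
+```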
+
 ## A Modular Approach
 
 zg is a modular library. This approach minimizes binary file size and memory
 requirements by only including the Unicode data required for the specified module.
 The following sections describe the various modules and their specific use cases.
 
+
 ### Init and Setup
 
 The code examples will show the use of `Module.init(allocator)` to create the
@@ -67,7 +72,7 @@ const code_point = @import("code_point");
 
 test "Code point iterator" {
     const str = "Hi 😊";
-    var iter = code_point.Iterator{ .bytes = str };
+    var iter: code_point.Iterator = .init(str);
     var i: usize = 0;
 
     while (iter.next()) |cp| : (i += 1) {
@@ -78,25 +83,60 @@ test "Code point iterator" {
 
         if (i == 3) {
             try expect(cp.code == '😊');
-
             // The `offset` field is the byte offset in the
             // source string.
             try expect(cp.offset == 3);
-            try expectEqual(code_point.CodePoint, code_point.decodeAtIndex(str, cp.offset));
-
+            try expectEqual(cp, code_point.decodeAtIndex(str, cp.offset).?);
             // The `len` field is the length in bytes of the
             // code point in the source string.
             try expect(cp.len == 4);
+            // There is also a 'cursor' decode, like so:
+            {
+                var cursor = cp.offset;
+                try expectEqual(cp, code_point.decodeAtCursor(str, &cursor).?);
+                // Which advances the cursor variable to the next possible
+                // offset, in this case, `str.len`. Don't forget to account
+                // for this possibility!
+                try expectEqual(cp.offset + cp.len, cursor);
+            }
+            // There's also this, for when you aren't sure if you have the
+            // correct start for a code point:
+            try expectEqual(cp, code_point.codepointAtIndex(str, cp.offset + 1).?);
         }
+        // Reverse iteration is also an option:
+        var r_iter: code_point.ReverseIterator = .init(str);
+        // Both iterators can be peeked:
+        try expectEqual('😊', r_iter.peek().?.code);
+        try expectEqual('😊', r_iter.prev().?.code);
+        // Both kinds of iterators can be reversed:
+        var fwd_iter = r_iter.forwardIterator(); // or iter.reverseIterator();
+        // This will always return the last codepoint from
+        // the prior iterator, _if_ it yielded one:
+        try expectEqual('😊', fwd_iter.next().?.code);
     }
 }
 ```
 
+Note that it's safe to call CodePoint functions on invalid
+UTF-8. Iterators and decode functions will return the Unicode
+Replacement Character `U+FFFD`, according to the Substitution of Maximal
+Subparts algorithm, for any invalid code unit sequences encountered.
+
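+A minimal sketch of that behavior (the exact substitution shown is an
+assumption following the algorithm named above):
+
+```zig
+// "\xF0\x9F\x98" is a truncated four-byte sequence (😊 missing its last
+// byte): one maximal subpart, so it yields a single replacement character.
+var bad_iter: code_point.Iterator = .init("\xF0\x9F\x98");
+try expectEqual(0xFFFD, bad_iter.next().?.code);
+```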
+
 ## Grapheme Clusters
 
-Many characters are composed from more than one code point. These are known as
-Grapheme Clusters, and the `Graphemes` module has a data structure to represent
-them, `Grapheme`, and an `Iterator` to iterate over them in a string.
+Many characters are composed from more than one code point. These
+are known as Grapheme Clusters, and the `Graphemes` module has a
+data structure to represent them, `Grapheme`, and an `Iterator` and
+`ReverseIterator` to iterate over them in a string.
+
+There is also `graphemeAtIndex`, which returns whatever grapheme
+belongs to the index; this does not have to be on a valid grapheme
+or codepoint boundary, but it is illegal to call on an empty string.
+Last, `iterateAfterGrapheme` or `iterateBeforeGrapheme` will provide
+forward or backward grapheme iterators over the string, starting from
+the grapheme provided. Thus, given an index, you can begin forward or
+backward iteration at that index without needing to slice the string.
 
 In your `build.zig`:
 
@@ -139,6 +179,56 @@ test "Grapheme cluster iterator" {
 }
 ```
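+
+A sketch of the index-based API described above (the `graphemes`
+instance mirrors the iterator example; the exact signatures here are
+assumptions, for illustration only):
+
+```zig
+const str = "He\u{301}y!"; // "é" spelled as 'e' + U+0301 combining acute
+// Any byte index inside the cluster finds the whole cluster, here "e\u{301}":
+const g = graphemes.graphemeAtIndex(str, 2);
+// Resume forward iteration from that grapheme without slicing `str`:
+var g_iter = graphemes.iterateAfterGrapheme(str, g);
+while (g_iter.next()) |next_g| {
+    _ = next_g.bytes(str); // bytes of each following cluster: "y", "!"
+}
+```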
 
+
+## Words
+
+Unicode has a standard word segmentation algorithm, which gives good
+results for most languages. Some languages, such as Thai, require a
+dictionary to find the boundary between words; these cases are not
+handled by the standard algorithm.
+
+`zg` implements that algorithm in the `Words` module. As a note, the
+iterators and functions provided here do not yield "words" in the
+conventional sense: they yield every segment of the string between
+word _boundaries_, ensuring that words are kept whole when
+encountered. If the word breaks themselves are of primary interest,
+use the `.offset` field of each iterated value, and handle
+`string.len` as the final case when the iteration returns `null`.
+
+The API is congruent with `Graphemes`: forward and backward iterators,
+`wordAtIndex`, and `iterateAfter` and before.
+
+In your `build.zig`:
+
+```zig
+exe.root_module.addImport("Words", zg.module("Words"));
+```
+
+In your code:
+
+```zig
+const Words = @import("Words");
+
+test "Words" {
+    const wb = try Words.init(testing.allocator);
+    defer wb.deinit(testing.allocator);
+    const word_str = "Metonym Μετωνύμιο メトニム";
+    var w_iter = wb.iterator(word_str);
+    try testing.expectEqualStrings("Metonym", w_iter.next().?.bytes(word_str));
+    // Spaces are "words" too!
+    try testing.expectEqualStrings(" ", w_iter.next().?.bytes(word_str));
+    const in_greek = w_iter.next().?;
+    // wordAtIndex doesn't care if the index is valid for a codepoint:
+    for (in_greek.offset..in_greek.offset + in_greek.len) |i| {
+        const at_index = wb.wordAtIndex(word_str, i).bytes(word_str);
+        try testing.expectEqualStrings("Μετωνύμιο", at_index);
+    }
+    _ = w_iter.next();
+    try testing.expectEqualStrings("メトニム", w_iter.next().?.bytes(word_str));
+}
+```
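+
+If the break offsets themselves are what you need, a sketch of the
+`.offset` approach described above (reusing `wb` and `word_str` from
+the test, and assuming `std` is imported):
+
+```zig
+var breaks = std.ArrayList(usize).init(testing.allocator);
+defer breaks.deinit();
+var b_iter = wb.iterator(word_str);
+while (b_iter.next()) |word| {
+    try breaks.append(word.offset); // each segment starts at a word break
+}
+// The final break is the end of the string, once `next()` returns null:
+try breaks.append(word_str.len);
+```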
+
 ## Unicode General Categories
 
 To detect the general category for a code point, use the `GeneralCategories` module.
@@ -279,24 +369,24 @@ Unicode normalization is the process of converting a string into a uniform
 representation that can guarantee a known structure by following a strict set
 of rules. There are four normalization forms:
 
-Canonical Composition (NFC)
+**Canonical Composition (NFC)**
 : The most compact representation obtained by first
 decomposing to Canonical Decomposition and then composing to NFC.
 
-Compatibility Composition (NFKC)
+**Compatibility Composition (NFKC)**
 : The most comprehensive composition obtained
 by first decomposing to Compatibility Decomposition and then composing to NFKC.
 
-Canonical Decomposition (NFD)
+**Canonical Decomposition (NFD)**
 : Only code points with canonical decompositions
 are decomposed. This is a more compact and faster decomposition but will not
 provide the most comprehensive normalization possible.
 
-Compatibility Decomposition (NFKD)
+**Compatibility Decomposition (NFKD)**
 : The most comprehensive decomposition method
 where both canonical and compatibility decompositions are performed recursively.
 
-zg has methods to produce all four normalization forms in the `Normalize` module.
+`zg` has methods to produce all four normalization forms in the `Normalize` module.
 
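+As an illustration of how the four forms differ (this is the standard
+example from UAX #15, not zg-specific output), take the sequence
+U+1E9B U+0323, "ẛ̣" (long s with dot above, plus combining dot below):
+
+```plain
+NFC:  U+1E9B U+0323         (composes what it canonically can)
+NFD:  U+017F U+0323 U+0307  (canonical decomposition only)
+NFKC: U+1E69                (compatibility folds ſ to s, then composes)
+NFKD: U+0073 U+0323 U+0307  (s + dot below + dot above)
+```
+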
 In your `build.zig`:
 
@@ -493,7 +583,7 @@ in the same fashion as shown for `CaseFolding` and `Normalize`.
 
 ## Scripts
 
 Unicode categorizes code points by the Script in which they belong. A Script
 collects letters and other symbols that belong to a particular writing system.
 You can detect the Script for a code point with the `Scripts` module.
 
@@ -548,83 +638,35 @@ test "Emoji" {
 }
 ```
 
-## Relation to Ziglyph
+## Limits
 
-zg is a total re-write of some of the components of Ziglyph. The idea was to
-reduce binary size and improve performance. These goals were achieved by using
-trie-like data structures (inspired by [Ghostty's implementation](https://mitchellh.com/writing/ghostty-devlog-006))
-instead of generated functions. Where Ziglyph uses a function call, zg uses an
-array lookup, which is quite faster. In addition, all these data structures in
-zg are loaded at runtime from compressed versions in the binary. This allows
-for smaller binary sizes at the expense of increased memory
-footprint at runtime.
-
-Benchmarks demonstrate the above stated goals have been met:
-
-```plain
-Binary sizes =======
-
-172K ziglyph_case
-109K zg_case
-
-299K ziglyph_caseless
-175K zg_caseless
-
-91K ziglyph_codepoint
-91K zg_codepoint
-
-108K ziglyph_grapheme
-109K zg_grapheme
-
-208K ziglyph_normalizer
-175K zg_normalize
-
-124K ziglyph_width
-109K zg_width
-
-Benchmarks ==========
-
-Ziglyph toUpperStr/toLowerStr: result: 7756580, took: 74
-Ziglyph isUpperStr/isLowerStr: result: 110959, took: 17
-zg toUpperStr/toLowerStr: result: 7756580, took: 58
-zg isUpperStr/isLowerStr: result: 110959, took: 11
-
-Ziglyph Normalizer.eqlCaseless: result: 626, took: 479
-zg CaseFolding.canonCaselessMatch: result: 626, took: 296
-zg CaseFolding.compatCaselessMatch: result: 626, took: 604
-
-Ziglyph CodePointIterator: result: 3691806, took: 2.5
-zg code_point.Iterator: result: 3691806, took: 3.3
-
-Ziglyph GraphemeIterator: result: 3691806, took: 78
-zg Graphemes.Iterator: result: 3691806, took: 31
-
-Ziglyph Normalizer.nfkc: result: 3856654, took: 411
-zg Normalize.nfkc: result: 3856654, took: 208
-
-Ziglyph Normalizer.nfc: result: 3878290, took: 56
-zg Normalize.nfc: result: 3878290, took: 31
-
-Ziglyph Normalizer.nfkd: result: 3928890, took: 163
-zg Normalize.nfkd: result: 3928890, took: 101
-
-Ziglyph Normalizer.nfd: result: 3950526, took: 160
-zg Normalize.nfd: result: 3950526, took: 101
-
-Ziglyph Normalizer.eql: result: 626, took: 321
-Zg Normalize.eql: result: 626, took: 60
-
-Ziglyph display_width.strWidth: result: 3700914, took: 89
-zg DisplayWidth.strWidth: result: 3700914, took: 46
-```
-
-These results were obtained on a MacBook Pro (2021) with M1 Pro and 16 GiB of RAM.
-
-In contrast to Ziglyph, zg does not have:
-
-- Word segmentation
-- Sentence segmentation
-- Collation
-
-It's possible that any missing functionality will be added in future versions,
-but only if enough demand is present in the community.
+Iterators, and fragment types such as `CodePoint`, `Grapheme` and
+`Word`, use a `u32` to store the offset into a string, and the length
+of the fragment (`CodePoint` in fact uses a `u3` for length).
+
+4GiB is a lot of string. There are a few reasons to work with that much
+string, log files primarily, but fewer to bring it all into memory at
+once, and practically no reason at all to do anything to such a string
+without breaking it into smaller pieces to work with.
+
+Also, Zig compiles on 32-bit systems, where `usize` is a `u32`. Code
+running on such systems has no choice but to handle slices in smaller
+pieces. In general, if you want code to perform correctly when it
+encounters multi-gigabyte strings, you'll need to plan for that at a
+level one or two steps above the one at which you'll want to, for
+example, iterate some graphemes of that string.
+
+That said, `zg` modules can be passed the Boolean config option
+`fat_offset`, which will make all of those data structures use a `u64`
+instead. I added this option not because you should use it, which you
+should not, but to encourage awareness that code operating on strings
+needs to pay attention to the size of those strings, and have a plan
+for when sizes get out of specification. What would your code do with
+a 1MiB region of string with no newline? There are many questions of
+this nature, and robust code must detect when data is out of the
+expected envelope, so it can respond accordingly.
+
+Code which does pay attention to these questions has no need for
+`u64`-sized offsets, and code which does not will not be helped by
+them. But perhaps yours is an exception, in which case, by all means,
+configure accordingly.
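+
+As a sketch of what opting in looks like (the option name `fat_offset`
+is as above; threading it through the `b.dependency` options struct is
+the standard Zig build mechanism, though the exact wiring is assumed
+here):
+
+```zig
+const zg = b.dependency("zg", .{ .fat_offset = true });
+```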