path: root/README.md
author    Jacob Sandlund  2025-07-21 23:07:59 -0400
committer Jacob Sandlund  2025-07-21 23:07:59 -0400
commit    a6122728b265aeb7091e95d135ab83bb3dd1768c (patch)
tree      ce2a3301f29dfb112d18577c7da7748e656a4325 /README.md
parent    fix infinity (diff)
parent    Merge branch 'develop-next' (diff)
Merge branch 'master' into emoji
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  230
1 file changed, 136 insertions(+), 94 deletions(-)
diff --git a/README.md b/README.md
index 5912ce4..0b44402 100644
--- a/README.md
+++ b/README.md
@@ -2,21 +2,24 @@
 
 zg provides Unicode text processing for Zig projects.
 
+
 ## Unicode Version
 
 The Unicode version supported by zg is `16.0.0`.
 
+
 ## Zig Version
 
 The minimum Zig version required is `0.14`.
 
+
 ## Integrating zg into your Zig Project
 
 You first need to add zg as a dependency in your `build.zig.zon` file. In your
 Zig project's root directory, run:
 
 ```plain
-zig fetch --save https://codeberg.org/atman/zg/archive/v0.14.0-rc1.tar.gz
+zig fetch --save https://codeberg.org/atman/zg/archive/v0.14.1.tar.gz
 ```
 
 Then instantiate the dependency in your `build.zig`:
@@ -25,12 +28,14 @@ Then instantiate the dependency in your `build.zig`:
 const zg = b.dependency("zg", .{});
 ```
 
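+Each module is then wired into your root module with `addImport`; the
+sections below give the exact line for each module, following this
+pattern (shown here for the `code_point` module; treat the module-name
+strings as per-section, not guaranteed by this example):
+
+```zig
+exe.root_module.addImport("code_point", zg.module("code_point"));
+```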
+
 ## A Modular Approach
 
 zg is a modular library. This approach minimizes binary file size and memory
 requirements by only including the Unicode data required for the specified module.
 The following sections describe the various modules and their specific use cases.
 
+
 ### Init and Setup
 
 The code examples will show the use of `Module.init(allocator)` to create the
@@ -67,7 +72,7 @@ const code_point = @import("code_point");
 
 test "Code point iterator" {
     const str = "Hi 😊";
-    var iter = code_point.Iterator{ .bytes = str };
+    var iter: code_point.Iterator = .init(str);
     var i: usize = 0;
 
     while (iter.next()) |cp| : (i += 1) {
@@ -78,25 +83,60 @@ test "Code point iterator" {
 
         if (i == 3) {
             try expect(cp.code == '😊');
-
             // The `offset` field is the byte offset in the
             // source string.
             try expect(cp.offset == 3);
-            try expectEqual(code_point.CodePoint, code_point.decodeAtIndex(str, cp.offset));
-
+            try expectEqual(cp, code_point.decodeAtIndex(str, cp.offset).?);
             // The `len` field is the length in bytes of the
             // code point in the source string.
             try expect(cp.len == 4);
+            // There is also a 'cursor' decode, like so:
+            {
+                var cursor = cp.offset;
+                try expectEqual(cp, code_point.decodeAtCursor(str, &cursor).?);
+                // Which advances the cursor variable to the next possible
+                // offset, in this case, `str.len`. Don't forget to account
+                // for this possibility!
+                try expectEqual(cp.offset + cp.len, cursor);
+            }
+            // There's also this, for when you aren't sure if you have the
+            // correct start for a code point:
+            try expectEqual(cp, code_point.codepointAtIndex(str, cp.offset + 1).?);
         }
+        // Reverse iteration is also an option:
+        var r_iter: code_point.ReverseIterator = .init(str);
+        // Both iterators can be peeked:
+        try expectEqual('😊', r_iter.peek().?.code);
+        try expectEqual('😊', r_iter.prev().?.code);
+        // Both kinds of iterators can be reversed:
+        var fwd_iter = r_iter.forwardIterator(); // or iter.reverseIterator();
+        // This will always return the last codepoint from
+        // the prior iterator, _if_ it yielded one:
+        try expectEqual('😊', fwd_iter.next().?.code);
     }
 }
 ```
 
+Note that it's safe to call CodePoint functions on invalid
+UTF-8. Iterators and decode functions will return the Unicode
+Replacement Character `U+FFFD`, according to the Substitution of Maximal
+Subparts algorithm, for any invalid code unit sequences encountered.
+
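+A minimal sketch of that behavior (the exact substitution shown is an
+assumption following the algorithm named above):
+
+```zig
+// "\xF0\x9F\x98" is a truncated four-byte sequence (😊 missing its last
+// byte): one maximal subpart, so it yields a single replacement character.
+var bad_iter: code_point.Iterator = .init("\xF0\x9F\x98");
+try expectEqual(0xFFFD, bad_iter.next().?.code);
+```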
+
 ## Grapheme Clusters
 
-Many characters are composed from more than one code point. These are known as
-Grapheme Clusters, and the `Graphemes` module has a data structure to represent
-them, `Grapheme`, and an `Iterator` to iterate over them in a string.
+Many characters are composed from more than one code point. These
+are known as Grapheme Clusters, and the `Graphemes` module has a
+data structure to represent them, `Grapheme`, and an `Iterator` and
+`ReverseIterator` to iterate over them in a string.
+
+There is also `graphemeAtIndex`, which returns whatever grapheme
+belongs to the index; this does not have to be on a valid grapheme
+or codepoint boundary, but it is illegal to call on an empty string.
+Last, `iterateAfterGrapheme` or `iterateBeforeGrapheme` will provide
+forward or backward grapheme iterators over the string, starting from
+the grapheme provided. Thus, given an index, you can begin forward or
+backward iteration at that index without needing to slice the string.
 
 In your `build.zig`:
 
@@ -139,6 +179,56 @@ test "Grapheme cluster iterator" {
 }
 ```
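+
+A sketch of the index-based API described above (the `graphemes`
+instance mirrors the iterator example; the exact signatures here are
+assumptions, for illustration only):
+
+```zig
+const str = "He\u{301}y!"; // "é" spelled as 'e' + U+0301 combining acute
+// Any byte index inside the cluster finds the whole cluster, here "e\u{301}":
+const g = graphemes.graphemeAtIndex(str, 2);
+// Resume forward iteration from that grapheme without slicing `str`:
+var g_iter = graphemes.iterateAfterGrapheme(str, g);
+while (g_iter.next()) |next_g| {
+    _ = next_g.bytes(str); // bytes of each following cluster: "y", "!"
+}
+```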
 
+
+## Words
+
+Unicode has a standard word segmentation algorithm, which gives good
+results for most languages. Some languages, such as Thai, require a
+dictionary to find the boundary between words; these cases are not
+handled by the standard algorithm.
+
+`zg` implements that algorithm in the `Words` module. As a note, the
+iterators and functions provided here do not yield "words" in the
+conventional sense: they yield every segment of the string between
+word _boundaries_, ensuring that words are kept whole when
+encountered. If the word breaks themselves are of primary interest,
+use the `.offset` field of each iterated value, and handle
+`string.len` as the final case when the iteration returns `null`.
+
+The API is congruent with `Graphemes`: forward and backward iterators,
+`wordAtIndex`, and `iterateAfter` and before.
+
+In your `build.zig`:
+
+```zig
+exe.root_module.addImport("Words", zg.module("Words"));
+```
+
+In your code:
+
+```zig
+const Words = @import("Words");
+
+test "Words" {
+    const wb = try Words.init(testing.allocator);
+    defer wb.deinit(testing.allocator);
+    const word_str = "Metonym Μετωνύμιο メトニム";
+    var w_iter = wb.iterator(word_str);
+    try testing.expectEqualStrings("Metonym", w_iter.next().?.bytes(word_str));
+    // Spaces are "words" too!
+    try testing.expectEqualStrings(" ", w_iter.next().?.bytes(word_str));
+    const in_greek = w_iter.next().?;
+    // wordAtIndex doesn't care if the index is valid for a codepoint:
+    for (in_greek.offset..in_greek.offset + in_greek.len) |i| {
+        const at_index = wb.wordAtIndex(word_str, i).bytes(word_str);
+        try testing.expectEqualStrings("Μετωνύμιο", at_index);
+    }
+    _ = w_iter.next();
+    try testing.expectEqualStrings("メトニム", w_iter.next().?.bytes(word_str));
+}
+```
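+
+If the break offsets themselves are what you need, a sketch of the
+`.offset` approach described above (reusing `wb` and `word_str` from
+the test, and assuming `std` is imported):
+
+```zig
+var breaks = std.ArrayList(usize).init(testing.allocator);
+defer breaks.deinit();
+var b_iter = wb.iterator(word_str);
+while (b_iter.next()) |word| {
+    try breaks.append(word.offset); // each segment starts at a word break
+}
+// The final break is the end of the string, once `next()` returns null:
+try breaks.append(word_str.len);
+```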
+
 ## Unicode General Categories
 
 To detect the general category for a code point, use the `GeneralCategories` module.
@@ -279,24 +369,24 @@ Unicode normalization is the process of converting a string into a uniform
 representation that can guarantee a known structure by following a strict set
 of rules. There are four normalization forms:
 
-Canonical Composition (NFC)
+**Canonical Composition (NFC)**
 : The most compact representation obtained by first
 decomposing to Canonical Decomposition and then composing to NFC.
 
-Compatibility Composition (NFKC)
+**Compatibility Composition (NFKC)**
 : The most comprehensive composition obtained
 by first decomposing to Compatibility Decomposition and then composing to NFKC.
 
-Canonical Decomposition (NFD)
+**Canonical Decomposition (NFD)**
 : Only code points with canonical decompositions
 are decomposed. This is a more compact and faster decomposition but will not
 provide the most comprehensive normalization possible.
 
-Compatibility Decomposition (NFKD)
+**Compatibility Decomposition (NFKD)**
 : The most comprehensive decomposition method
 where both canonical and compatibility decompositions are performed recursively.
 
-zg has methods to produce all four normalization forms in the `Normalize` module.
+`zg` has methods to produce all four normalization forms in the `Normalize` module.
 
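+As an illustration of how the four forms differ (this is the standard
+example from UAX #15, not zg-specific output), take the sequence
+U+1E9B U+0323, "ẛ̣" (long s with dot above, plus combining dot below):
+
+```plain
+NFC:  U+1E9B U+0323         (composes what it canonically can)
+NFD:  U+017F U+0323 U+0307  (canonical decomposition only)
+NFKC: U+1E69                (compatibility folds ſ to s, then composes)
+NFKD: U+0073 U+0323 U+0307  (s + dot below + dot above)
+```
+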
 In your `build.zig`:
 
@@ -493,7 +583,7 @@ in the same fashion as shown for `CaseFolding` and `Normalize`.
 
 ## Scripts
 
 Unicode categorizes code points by the Script in which they belong. A Script
 collects letters and other symbols that belong to a particular writing system.
 You can detect the Script for a code point with the `Scripts` module.
 
@@ -548,83 +638,35 @@ test "Emoji" {
 }
 ```
 
-## Relation to Ziglyph
+## Limits
 
-zg is a total re-write of some of the components of Ziglyph. The idea was to
-reduce binary size and improve performance. These goals were achieved by using
-trie-like data structures (inspired by [Ghostty's implementation](https://mitchellh.com/writing/ghostty-devlog-006))
-instead of generated functions. Where Ziglyph uses a function call, zg uses an
-array lookup, which is quite faster. In addition, all these data structures in
-zg are loaded at runtime from compressed versions in the binary. This allows
-for smaller binary sizes at the expense of increased memory
-footprint at runtime.
-
-Benchmarks demonstrate the above stated goals have been met:
-
-```plain
-Binary sizes =======
-
-172K ziglyph_case
-109K zg_case
-
-299K ziglyph_caseless
-175K zg_caseless
-
-91K ziglyph_codepoint
-91K zg_codepoint
-
-108K ziglyph_grapheme
-109K zg_grapheme
-
-208K ziglyph_normalizer
-175K zg_normalize
-
-124K ziglyph_width
-109K zg_width
-
-Benchmarks ==========
-
-Ziglyph toUpperStr/toLowerStr: result: 7756580, took: 74
-Ziglyph isUpperStr/isLowerStr: result: 110959, took: 17
-zg toUpperStr/toLowerStr: result: 7756580, took: 58
-zg isUpperStr/isLowerStr: result: 110959, took: 11
-
-Ziglyph Normalizer.eqlCaseless: result: 626, took: 479
-zg CaseFolding.canonCaselessMatch: result: 626, took: 296
-zg CaseFolding.compatCaselessMatch: result: 626, took: 604
-
-Ziglyph CodePointIterator: result: 3691806, took: 2.5
-zg code_point.Iterator: result: 3691806, took: 3.3
-
-Ziglyph GraphemeIterator: result: 3691806, took: 78
-zg Graphemes.Iterator: result: 3691806, took: 31
-
-Ziglyph Normalizer.nfkc: result: 3856654, took: 411
-zg Normalize.nfkc: result: 3856654, took: 208
-
-Ziglyph Normalizer.nfc: result: 3878290, took: 56
-zg Normalize.nfc: result: 3878290, took: 31
-
-Ziglyph Normalizer.nfkd: result: 3928890, took: 163
-zg Normalize.nfkd: result: 3928890, took: 101
-
-Ziglyph Normalizer.nfd: result: 3950526, took: 160
-zg Normalize.nfd: result: 3950526, took: 101
-
-Ziglyph Normalizer.eql: result: 626, took: 321
-Zg Normalize.eql: result: 626, took: 60
-
-Ziglyph display_width.strWidth: result: 3700914, took: 89
-zg DisplayWidth.strWidth: result: 3700914, took: 46
-```
-
-These results were obtained on a MacBook Pro (2021) with M1 Pro and 16 GiB of RAM.
-
-In contrast to Ziglyph, zg does not have:
-
-- Word segmentation
-- Sentence segmentation
-- Collation
-
-It's possible that any missing functionality will be added in future versions,
-but only if enough demand is present in the community.
+Iterators, and fragment types such as `CodePoint`, `Grapheme` and
+`Word`, use a `u32` to store the offset into a string, and the length
+of the fragment (`CodePoint` in fact uses a `u3` for length).
+
+4GiB is a lot of string. There are a few reasons to work with that much
+string, log files primarily, but fewer to bring it all into memory at
+once, and practically no reason at all to do anything to such a string
+without breaking it into smaller pieces to work with.
+
+Also, Zig compiles on 32-bit systems, where `usize` is a `u32`. Code
+running on such systems has no choice but to handle slices in smaller
+pieces. In general, if you want code to perform correctly when it
+encounters multi-gigabyte strings, you'll need to plan for that at a
+level one or two steps above the one at which you'll want to, for
+example, iterate some graphemes of that string.
+
+That said, `zg` modules can be passed the Boolean config option
+`fat_offset`, which will make all of those data structures use a `u64`
+instead. I added this option not because you should use it, which you
+should not, but to encourage awareness that code operating on strings
+needs to pay attention to the size of those strings, and have a plan
+for when sizes get out of specification. What would your code do with
+a 1MiB region of string with no newline? There are many questions of
+this nature, and robust code must detect when data is out of the
+expected envelope, so it can respond accordingly.
+
+Code which does pay attention to these questions has no need for
+`u64`-sized offsets, and code which does not will not be helped by
+them. But perhaps yours is an exception, in which case, by all means,
+configure accordingly.
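+
+As a sketch of what opting in looks like (the option name `fat_offset`
+is as above; threading it through the `b.dependency` options struct is
+the standard Zig build mechanism, though the exact wiring is assumed
+here):
+
+```zig
+const zg = b.dependency("zg", .{ .fat_offset = true });
+```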