author: Sam Atman, 2024-11-26 20:42:51 -0500
committer: Sam Atman, 2024-11-26 20:42:51 -0500
commit: 0a9c8f1418ecd54ffaabf4b5256e2d77502700ba
tree: a503957f99307a521c44886ebd7929f2f7e3e6cc
parent: Update README.md
Update URL in README
Also documents the `cjk` option, and how to enable it.
-rw-r--r-- README.md 52
1 file changed, 38 insertions, 14 deletions
diff --git a/README.md b/README.md
index 33213aa..c854fae 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,22 @@
1# zg 1# zg
2
2zg provides Unicode text processing for Zig projects. 3zg provides Unicode text processing for Zig projects.
3 4
4## Unicode Version 5## Unicode Version
6
5The Unicode version supported by zg is 15.1.0. 7The Unicode version supported by zg is 15.1.0.
6 8
7## Zig Version 9## Zig Version
10
8The minimum Zig version required is 0.14 dev. 11The minimum Zig version required is 0.14 dev.
9 12
10## Integrating zg into your Zig Project 13## Integrating zg into your Zig Project
14
11You first need to add zg as a dependency in your `build.zig.zon` file. In your 15You first need to add zg as a dependency in your `build.zig.zon` file. In your
12Zig project's root directory, run: 16Zig project's root directory, run:
13 17
14```plain 18```plain
15zig fetch --save https://codeberg.org/dude_the_builder/zg/archive/v0.13.3.tar.gz 19zig fetch --save https://codeberg.org/atman/zg/archive/v0.13.3.tar.gz
16``` 20```
17 21
18Then instantiate the dependency in your `build.zig`: 22Then instantiate the dependency in your `build.zig`:
@@ -22,11 +26,13 @@ const zg = b.dependency("zg", .{});
22``` 26```
23 27
24## A Modular Approach 28## A Modular Approach
29
25zg is a modular library. This approach minimizes binary file size and memory 30zg is a modular library. This approach minimizes binary file size and memory
26requirements by only including the Unicode data required for the specified module. 31requirements by only including the Unicode data required for the specified module.
27The following sections describe the various modules and their specific use case. 32The following sections describe the various modules and their specific use case.
28 33
29## Code Points 34## Code Points
35
30In the `code_point` module, you'll find a data structure representing a single code 36In the `code_point` module, you'll find a data structure representing a single code
31point, `CodePoint`, and an `Iterator` to iterate over the code points in a string. 37point, `CodePoint`, and an `Iterator` to iterate over the code points in a string.
32 38
@@ -68,6 +74,7 @@ test "Code point iterator" {
68``` 74```
69 75
70## Grapheme Clusters 76## Grapheme Clusters
77
71Many characters are composed from more than one code point. These are known as 78Many characters are composed from more than one code point. These are known as
72Grapheme Clusters and the `grapheme` module has a data structure to represent 79Grapheme Clusters and the `grapheme` module has a data structure to represent
73them, `Grapheme`, and an `Iterator` to iterate over them in a string. 80them, `Grapheme`, and an `Iterator` to iterate over them in a string.
@@ -115,6 +122,7 @@ test "Grapheme cluster iterator" {
115``` 122```
116 123
117## Unicode General Categories 124## Unicode General Categories
125
118To detect the general category for a code point, use the `GenCatData` module. 126To detect the general category for a code point, use the `GenCatData` module.
119 127
120In your `build.zig`: 128In your `build.zig`:
@@ -152,6 +160,7 @@ test "General Category" {
152``` 160```
153 161
154## Unicode Properties 162## Unicode Properties
163
155You can detect common properties of a code point with the `PropsData` module. 164You can detect common properties of a code point with the `PropsData` module.
156 165
157In your `build.zig`: 166In your `build.zig`:
@@ -182,7 +191,7 @@ test "Properties" {
182 // Accents, dieresis, and other combining marks. 191 // Accents, dieresis, and other combining marks.
183 try expect(pd.isDiacritic('\u{301}')); 192 try expect(pd.isDiacritic('\u{301}'));
184 193
185 // Unicode has a specification for valid identifiers like 194 // Unicode has a specification for valid identifiers like
186 // the ones used in programming and regular expressions. 195 // the ones used in programming and regular expressions.
187 try expect(pd.isIdStart('Z')); // Identifier start character 196 try expect(pd.isIdStart('Z')); // Identifier start character
188 try expect(!pd.isIdStart('1')); 197 try expect(!pd.isIdStart('1'));
@@ -204,6 +213,7 @@ test "Properties" {
204``` 213```
205 214
206## Letter Case Detection and Conversion 215## Letter Case Detection and Conversion
216
207To detect and convert to and from different letter cases, use the `CaseData` 217To detect and convert to and from different letter cases, use the `CaseData`
208module. 218module.
209 219
@@ -246,7 +256,8 @@ test "Case" {
246``` 256```
247 257
248## Normalization 258## Normalization
249Unicode normalization is the process of converting a string into a uniform 259
260Unicode normalization is the process of converting a string into a uniform
250representation that can guarantee a known structure by following a strict set 261representation that can guarantee a known structure by following a strict set
251of rules. There are four normalization forms: 262of rules. There are four normalization forms:
252 263
@@ -260,14 +271,14 @@ by first decomposing to Compatibility Decomposition and then composing to NFKC.
260 271
261Canonical Decomposition (NFD) 272Canonical Decomposition (NFD)
262: Only code points with canonical decompositions 273: Only code points with canonical decompositions
263are decomposed. This is a more compact and faster decomposition but will not 274are decomposed. This is a more compact and faster decomposition but will not
264provide the most comprehensive normalization possible. 275provide the most comprehensive normalization possible.
265 276
266Compatibility Decomposition (NFKD) 277Compatibility Decomposition (NFKD)
267: The most comprehensive decomposition method 278: The most comprehensive decomposition method
268where both canonical and compatibility decompositions are performed recursively. 279where both canonical and compatibility decompositions are performed recursively.
269 280
270zg has methods to produce all four normalization forms in the `Normalize` module. 281zg has methods to produce all four normalization forms in the `Normalize` module.
271 282
272In your `build.zig`: 283In your `build.zig`:
273 284
@@ -316,6 +327,7 @@ test "Normalization" {
316``` 327```
317 328
318## Caseless Matching via Case Folding 329## Caseless Matching via Case Folding
330
319Unicode provides a more efficient way of comparing strings while ignoring letter 331Unicode provides a more efficient way of comparing strings while ignoring letter
320case differences: case folding. When you case fold a string, it's converted into a 332case differences: case folding. When you case fold a string, it's converted into a
321normalized case form suitable for efficient matching. Use the `CaseFold` module 333normalized case form suitable for efficient matching. Use the `CaseFold` module
@@ -365,10 +377,11 @@ test "Caseless matching" {
365``` 377```
366 378
367## Display Width of Characters and Strings 379## Display Width of Characters and Strings
380
368When displaying text with a fixed-width font on a terminal screen, it's very 381When displaying text with a fixed-width font on a terminal screen, it's very
369important to know exactly how many columns or cells each character should take. 382important to know exactly how many columns or cells each character should take.
370Most characters will use one column, but there are many, like emoji and East- 383Most characters will use one column, but there are many, like emoji and East-
371Asian ideographs that need more space. The `DisplayWidth` module provides 384Asian ideographs that need more space. The `DisplayWidth` module provides
372methods for this purpose. It also has methods that use the display width calculation 385methods for this purpose. It also has methods that use the display width calculation
373to `center`, `padLeft`, `padRight`, and `wrap` text. 386to `center`, `padLeft`, `padRight`, and `wrap` text.
374 387
@@ -418,18 +431,29 @@ test "Display width" {
418 const wrapped = try dw.wrap(allocator, input, 10, 3); 431 const wrapped = try dw.wrap(allocator, input, 10, 3);
419 defer allocator.free(wrapped); 432 defer allocator.free(wrapped);
420 const want = 433 const want =
421 \\The quick 434 \\The quick
422 \\brown fox 435 \\brown fox
423 \\jumped 436 \\jumped
424 \\over the 437 \\over the
425 \\lazy dog! 438 \\lazy dog!
426 ; 439 ;
427 try expectEqualStrings(want, wrapped); 440 try expectEqualStrings(want, wrapped);
428} 441}
429``` 442```
430 443
444This has a build option, `"cjk"`, which will consider [ambiguous characters](https://www.unicode.org/reports/tr11/tr11-6.html) as double-width.
445
446To choose this option, add it to the dependency like so:
447
448```zig
449const zg = b.dependency("zg", .{
450 .cjk = true,
451});
452```
453
431## Scripts 454## Scripts
432Unicode categorizes code points by the Script in which they belong. A Script 455
456Unicode categorizes code points by the Script in which they belong. A Script
433collects letters and other symbols that belong to a particular writing system. 457collects letters and other symbols that belong to a particular writing system.
434You can detect the Script for a code point with the `ScriptsData` module. 458You can detect the Script for a code point with the `ScriptsData` module.
435 459
@@ -457,13 +481,14 @@ test "Scripts" {
457``` 481```
458 482
459## Relation to Ziglyph 483## Relation to Ziglyph
484
460zg is a total re-write of some of the components of Ziglyph. The idea was to 485zg is a total re-write of some of the components of Ziglyph. The idea was to
461reduce binary size and improve performance. These goals were achieved by using 486reduce binary size and improve performance. These goals were achieved by using
462trie-like data structures (inspired by [Ghostty's implementation](https://mitchellh.com/writing/ghostty-devlog-006)) 487trie-like data structures (inspired by [Ghostty's implementation](https://mitchellh.com/writing/ghostty-devlog-006))
463instead of generated functions. Where Ziglyph uses a function call, zg uses an 488instead of generated functions. Where Ziglyph uses a function call, zg uses an
464array lookup, which is quite faster. In addition, all these data structures in 489array lookup, which is quite faster. In addition, all these data structures in
465zg are loaded at runtime from compressed versions in the binary. This allows 490zg are loaded at runtime from compressed versions in the binary. This allows
466for smaller binary sizes at the expense of increased memory 491for smaller binary sizes at the expense of increased memory
467footprint at runtime. 492footprint at runtime.
468 493
469Benchmarks demonstrate the above stated goals have been met: 494Benchmarks demonstrate the above stated goals have been met:
@@ -535,4 +560,3 @@ In contrast to Ziglyph, zg does not have:
535 560
536It's possible that any missing functionality will be added in future versions, 561It's possible that any missing functionality will be added in future versions,
537but only if enough demand is present in the community. 562but only if enough demand is present in the community.
538