| author | 2024-11-26 20:42:51 -0500 | |
|---|---|---|
| committer | 2024-11-26 20:42:51 -0500 | |
| commit | 0a9c8f1418ecd54ffaabf4b5256e2d77502700ba (patch) | |
| tree | a503957f99307a521c44886ebd7929f2f7e3e6cc | |
| parent | Update README.md (diff) | |
Update URL in README
Also documents the `cjk` option, and how to enable it.
| -rw-r--r-- | README.md | 52 |
1 files changed, 38 insertions, 14 deletions
| @@ -1,18 +1,22 @@ | |||
| 1 | # zg | 1 | # zg |
| 2 | |||
| 2 | zg provides Unicode text processing for Zig projects. | 3 | zg provides Unicode text processing for Zig projects. |
| 3 | 4 | ||
| 4 | ## Unicode Version | 5 | ## Unicode Version |
| 6 | |||
| 5 | The Unicode version supported by zg is 15.1.0. | 7 | The Unicode version supported by zg is 15.1.0. |
| 6 | 8 | ||
| 7 | ## Zig Version | 9 | ## Zig Version |
| 10 | |||
| 8 | The minimum Zig version required is 0.14 dev. | 11 | The minimum Zig version required is 0.14 dev. |
| 9 | 12 | ||
| 10 | ## Integrating zg into your Zig Project | 13 | ## Integrating zg into your Zig Project |
| 14 | |||
| 11 | You first need to add zg as a dependency in your `build.zig.zon` file. In your | 15 | You first need to add zg as a dependency in your `build.zig.zon` file. In your |
| 12 | Zig project's root directory, run: | 16 | Zig project's root directory, run: |
| 13 | 17 | ||
| 14 | ```plain | 18 | ```plain |
| 15 | zig fetch --save https://codeberg.org/dude_the_builder/zg/archive/v0.13.3.tar.gz | 19 | zig fetch --save https://codeberg.org/atman/zg/archive/v0.13.3.tar.gz |
| 16 | ``` | 20 | ``` |
| 17 | 21 | ||
| 18 | Then instantiate the dependency in your `build.zig`: | 22 | Then instantiate the dependency in your `build.zig`: |
| @@ -22,11 +26,13 @@ const zg = b.dependency("zg", .{}); | |||
| 22 | ``` | 26 | ``` |
| 23 | 27 | ||
| 24 | ## A Modular Approach | 28 | ## A Modular Approach |
| 29 | |||
| 25 | zg is a modular library. This approach minimizes binary file size and memory | 30 | zg is a modular library. This approach minimizes binary file size and memory |
| 26 | requirements by only including the Unicode data required for the specified module. | 31 | requirements by only including the Unicode data required for the specified module. |
| 27 | The following sections describe the various modules and their specific use case. | 32 | The following sections describe the various modules and their specific use case. |
| 28 | 33 | ||
| 29 | ## Code Points | 34 | ## Code Points |
| 35 | |||
| 30 | In the `code_point` module, you'll find a data structure representing a single code | 36 | In the `code_point` module, you'll find a data structure representing a single code |
| 31 | point, `CodePoint`, and an `Iterator` to iterate over the code points in a string. | 37 | point, `CodePoint`, and an `Iterator` to iterate over the code points in a string. |
| 32 | 38 | ||
| @@ -68,6 +74,7 @@ test "Code point iterator" { | |||
| 68 | ``` | 74 | ``` |
| 69 | 75 | ||
| 70 | ## Grapheme Clusters | 76 | ## Grapheme Clusters |
| 77 | |||
| 71 | Many characters are composed from more than one code point. These are known as | 78 | Many characters are composed from more than one code point. These are known as |
| 72 | Grapheme Clusters and the `grapheme` module has a data structure to represent | 79 | Grapheme Clusters and the `grapheme` module has a data structure to represent |
| 73 | them, `Grapheme`, and an `Iterator` to iterate over them in a string. | 80 | them, `Grapheme`, and an `Iterator` to iterate over them in a string. |
| @@ -115,6 +122,7 @@ test "Grapheme cluster iterator" { | |||
| 115 | ``` | 122 | ``` |
| 116 | 123 | ||
| 117 | ## Unicode General Categories | 124 | ## Unicode General Categories |
| 125 | |||
| 118 | To detect the general category for a code point, use the `GenCatData` module. | 126 | To detect the general category for a code point, use the `GenCatData` module. |
| 119 | 127 | ||
| 120 | In your `build.zig`: | 128 | In your `build.zig`: |
| @@ -152,6 +160,7 @@ test "General Category" { | |||
| 152 | ``` | 160 | ``` |
| 153 | 161 | ||
| 154 | ## Unicode Properties | 162 | ## Unicode Properties |
| 163 | |||
| 155 | You can detect common properties of a code point with the `PropsData` module. | 164 | You can detect common properties of a code point with the `PropsData` module. |
| 156 | 165 | ||
| 157 | In your `build.zig`: | 166 | In your `build.zig`: |
| @@ -182,7 +191,7 @@ test "Properties" { | |||
| 182 | // Accents, dieresis, and other combining marks. | 191 | // Accents, dieresis, and other combining marks. |
| 183 | try expect(pd.isDiacritic('\u{301}')); | 192 | try expect(pd.isDiacritic('\u{301}')); |
| 184 | 193 | ||
| 185 | // Unicode has a specification for valid identifiers like | 194 | // Unicode has a specification for valid identifiers like |
| 186 | // the ones used in programming and regular expressions. | 195 | // the ones used in programming and regular expressions. |
| 187 | try expect(pd.isIdStart('Z')); // Identifier start character | 196 | try expect(pd.isIdStart('Z')); // Identifier start character |
| 188 | try expect(!pd.isIdStart('1')); | 197 | try expect(!pd.isIdStart('1')); |
| @@ -204,6 +213,7 @@ test "Properties" { | |||
| 204 | ``` | 213 | ``` |
| 205 | 214 | ||
| 206 | ## Letter Case Detection and Conversion | 215 | ## Letter Case Detection and Conversion |
| 216 | |||
| 207 | To detect and convert to and from different letter cases, use the `CaseData` | 217 | To detect and convert to and from different letter cases, use the `CaseData` |
| 208 | module. | 218 | module. |
| 209 | 219 | ||
| @@ -246,7 +256,8 @@ test "Case" { | |||
| 246 | ``` | 256 | ``` |
| 247 | 257 | ||
| 248 | ## Normalization | 258 | ## Normalization |
| 249 | Unicode normalization is the process of converting a string into a uniform | 259 | |
| 260 | Unicode normalization is the process of converting a string into a uniform | ||
| 250 | representation that can guarantee a known structure by following a strict set | 261 | representation that can guarantee a known structure by following a strict set |
| 251 | of rules. There are four normalization forms: | 262 | of rules. There are four normalization forms: |
| 252 | 263 | ||
| @@ -260,14 +271,14 @@ by first decomposing to Compatibility Decomposition and then composing to NFKC. | |||
| 260 | 271 | ||
| 261 | Canonical Decomposition (NFD) | 272 | Canonical Decomposition (NFD) |
| 262 | : Only code points with canonical decompositions | 273 | : Only code points with canonical decompositions |
| 263 | are decomposed. This is a more compact and faster decomposition but will not | 274 | are decomposed. This is a more compact and faster decomposition but will not |
| 264 | provide the most comprehensive normalization possible. | 275 | provide the most comprehensive normalization possible. |
| 265 | 276 | ||
| 266 | Compatibility Decomposition (NFKD) | 277 | Compatibility Decomposition (NFKD) |
| 267 | : The most comprehensive decomposition method | 278 | : The most comprehensive decomposition method |
| 268 | where both canonical and compatibility decompositions are performed recursively. | 279 | where both canonical and compatibility decompositions are performed recursively. |
| 269 | 280 | ||
| 270 | zg has methods to produce all four normalization forms in the `Normalize` module. | 281 | zg has methods to produce all four normalization forms in the `Normalize` module. |
| 271 | 282 | ||
| 272 | In your `build.zig`: | 283 | In your `build.zig`: |
| 273 | 284 | ||
| @@ -316,6 +327,7 @@ test "Normalization" { | |||
| 316 | ``` | 327 | ``` |
| 317 | 328 | ||
| 318 | ## Caseless Matching via Case Folding | 329 | ## Caseless Matching via Case Folding |
| 330 | |||
| 319 | Unicode provides a more efficient way of comparing strings while ignoring letter | 331 | Unicode provides a more efficient way of comparing strings while ignoring letter |
| 320 | case differences: case folding. When you case fold a string, it's converted into a | 332 | case differences: case folding. When you case fold a string, it's converted into a |
| 321 | normalized case form suitable for efficient matching. Use the `CaseFold` module | 333 | normalized case form suitable for efficient matching. Use the `CaseFold` module |
| @@ -365,10 +377,11 @@ test "Caseless matching" { | |||
| 365 | ``` | 377 | ``` |
| 366 | 378 | ||
| 367 | ## Display Width of Characters and Strings | 379 | ## Display Width of Characters and Strings |
| 380 | |||
| 368 | When displaying text with a fixed-width font on a terminal screen, it's very | 381 | When displaying text with a fixed-width font on a terminal screen, it's very |
| 369 | important to know exactly how many columns or cells each character should take. | 382 | important to know exactly how many columns or cells each character should take. |
| 370 | Most characters will use one column, but there are many, like emoji and East- | 383 | Most characters will use one column, but there are many, like emoji and East- |
| 371 | Asian ideographs that need more space. The `DisplayWidth` module provides | 384 | Asian ideographs that need more space. The `DisplayWidth` module provides |
| 372 | methods for this purpose. It also has methods that use the display width calculation | 385 | methods for this purpose. It also has methods that use the display width calculation |
| 373 | to `center`, `padLeft`, `padRight`, and `wrap` text. | 386 | to `center`, `padLeft`, `padRight`, and `wrap` text. |
| 374 | 387 | ||
| @@ -418,18 +431,29 @@ test "Display width" { | |||
| 418 | const wrapped = try dw.wrap(allocator, input, 10, 3); | 431 | const wrapped = try dw.wrap(allocator, input, 10, 3); |
| 419 | defer allocator.free(wrapped); | 432 | defer allocator.free(wrapped); |
| 420 | const want = | 433 | const want = |
| 421 | \\The quick | 434 | \\The quick |
| 422 | \\brown fox | 435 | \\brown fox |
| 423 | \\jumped | 436 | \\jumped |
| 424 | \\over the | 437 | \\over the |
| 425 | \\lazy dog! | 438 | \\lazy dog! |
| 426 | ; | 439 | ; |
| 427 | try expectEqualStrings(want, wrapped); | 440 | try expectEqualStrings(want, wrapped); |
| 428 | } | 441 | } |
| 429 | ``` | 442 | ``` |
| 430 | 443 | ||
| 444 | This has a build option, `"cjk"`, which will consider [ambiguous characters](https://www.unicode.org/reports/tr11/tr11-6.html) as double-width. | ||
| 445 | |||
| 446 | To choose this option, add it to the dependency like so: | ||
| 447 | |||
| 448 | ```zig | ||
| 449 | const zg = b.dependency("zg", .{ | ||
| 450 | .cjk = true, | ||
| 451 | }); | ||
| 452 | ``` | ||
| 453 | |||
| 431 | ## Scripts | 454 | ## Scripts |
| 432 | Unicode categorizes code points by the Script in which they belong. A Script | 455 | |
| 456 | Unicode categorizes code points by the Script in which they belong. A Script | ||
| 433 | collects letters and other symbols that belong to a particular writing system. | 457 | collects letters and other symbols that belong to a particular writing system. |
| 434 | You can detect the Script for a code point with the `ScriptsData` module. | 458 | You can detect the Script for a code point with the `ScriptsData` module. |
| 435 | 459 | ||
| @@ -457,13 +481,14 @@ test "Scripts" { | |||
| 457 | ``` | 481 | ``` |
| 458 | 482 | ||
| 459 | ## Relation to Ziglyph | 483 | ## Relation to Ziglyph |
| 484 | |||
| 460 | zg is a total re-write of some of the components of Ziglyph. The idea was to | 485 | zg is a total re-write of some of the components of Ziglyph. The idea was to |
| 461 | reduce binary size and improve performance. These goals were achieved by using | 486 | reduce binary size and improve performance. These goals were achieved by using |
| 462 | trie-like data structures (inspired by [Ghostty's implementation](https://mitchellh.com/writing/ghostty-devlog-006)) | 487 | trie-like data structures (inspired by [Ghostty's implementation](https://mitchellh.com/writing/ghostty-devlog-006)) |
| 463 | instead of generated functions. Where Ziglyph uses a function call, zg uses an | 488 | instead of generated functions. Where Ziglyph uses a function call, zg uses an |
| 464 | array lookup, which is quite faster. In addition, all these data structures in | 489 | array lookup, which is quite faster. In addition, all these data structures in |
| 465 | zg are loaded at runtime from compressed versions in the binary. This allows | 490 | zg are loaded at runtime from compressed versions in the binary. This allows |
| 466 | for smaller binary sizes at the expense of increased memory | 491 | for smaller binary sizes at the expense of increased memory |
| 467 | footprint at runtime. | 492 | footprint at runtime. |
| 468 | 493 | ||
| 469 | Benchmarks demonstrate the above stated goals have been met: | 494 | Benchmarks demonstrate the above stated goals have been met: |
| @@ -535,4 +560,3 @@ In contrast to Ziglyph, zg does not have: | |||
| 535 | 560 | ||
| 536 | It's possible that any missing functionality will be added in future versions, | 561 | It's possible that any missing functionality will be added in future versions, |
| 537 | but only if enough demand is present in the community. | 562 | but only if enough demand is present in the community. |
| 538 | |||
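
For context on the `cjk` option this commit documents, here is a minimal sketch of a complete `build.zig` that enables it and wires the `DisplayWidth` module into an executable. The executable name `demo` and the path `src/main.zig` are hypothetical. The import name `"DisplayWidth"` is assumed to match the module name used in the README. The `b.dependency("zg", .{ .cjk = true })` call is taken from the diff above.

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // Hypothetical executable; name and source path are placeholders.
    const exe = b.addExecutable(.{
        .name = "demo",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    // Enable the `cjk` build option so ambiguous-width characters
    // are treated as double-width (as documented in the diff above).
    const zg = b.dependency("zg", .{
        .cjk = true,
    });

    // Assumption: zg exposes a module named "DisplayWidth".
    exe.root_module.addImport("DisplayWidth", zg.module("DisplayWidth"));

    b.installArtifact(exe);
}
```

Because the option is set once on the dependency, every module taken from that `zg` handle is built with the same `cjk` setting.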