Update URL in README

Also documents the `cjk` option, and how to enable it.
author: Sam Atman 2024-11-26 20:42:51 -0500
committer: Sam Atman 2024-11-26 20:42:51 -0500
commit: 0a9c8f1418ecd54ffaabf4b5256e2d77502700ba (patch)
tree: a503957f99307a521c44886ebd7929f2f7e3e6cc
parent: Update README.md (diff)
1 files changed, 38 insertions, 14 deletions
diff --git a/README.md b/README.md
index 33213aa..c854fae 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,22 @@
 # zg
 zg provides Unicode text processing for Zig projects.
 ## Unicode Version
 The Unicode version supported by zg is 15.1.0.
 ## Zig Version
 The minimum Zig version required is 0.14 dev.
 ## Integrating zg into your Zig Project
 You first need to add zg as a dependency in your `build.zig.zon` file. In your
 Zig project's root directory, run:
 ```plain
-zig fetch --save https://codeberg.org/dude_the_builder/zg/archive/v0.13.3.tar.gz
+zig fetch --save https://codeberg.org/atman/zg/archive/v0.13.3.tar.gz
 ```
 Then instantiate the dependency in your `build.zig`:
@@ -22,11 +26,13 @@ const zg = b.dependency("zg", .{});
 ```
 ## A Modular Approach
 zg is a modular library. This approach minimizes binary file size and memory
 requirements by only including the Unicode data required for the specified module.
 The following sections describe the various modules and their specific use case.
 ## Code Points
 In the `code_point` module, you'll find a data structure representing a single code
 point, `CodePoint`, and an `Iterator` to iterate over the code points in a string.
@@ -68,6 +74,7 @@ test "Code point iterator" {
 ```
 ## Grapheme Clusters
 Many characters are composed from more than one code point. These are known as
 Grapheme Clusters and the `grapheme` module has a data structure to represent
 them, `Grapheme`, and an `Iterator` to iterate over them in a string.
@@ -115,6 +122,7 @@ test "Grapheme cluster iterator" {
 ```
 ## Unicode General Categories
 To detect the general category for a code point, use the `GenCatData` module.
 In your `build.zig`:
@@ -152,6 +160,7 @@ test "General Category" {
 ```
 ## Unicode Properties
 You can detect common properties of a code point with the `PropsData` module.
 In your `build.zig`:
@@ -182,7 +191,7 @@ test "Properties" {
    // Accents, dieresis, and other combining marks.
    try expect(pd.isDiacritic('\u{301}'));
-    // Unicode has a specification for valid identifiers like 
+    // Unicode has a specification for valid identifiers like
    // the ones used in programming and regular expressions.
    try expect(pd.isIdStart('Z')); // Identifier start character
    try expect(!pd.isIdStart('1'));
@@ -204,6 +213,7 @@ test "Properties" {
 ```
 ## Letter Case Detection and Conversion
 To detect and convert to and from different letter cases, use the `CaseData`
 module.
@@ -246,7 +256,8 @@ test "Case" {
 ```
 ## Normalization
-Unicode normalization is the process of converting a string into a uniform 
+Unicode normalization is the process of converting a string into a uniform
 representation that can guarantee a known structure by following a strict set
 of rules. There are four normalization forms:
@@ -260,14 +271,14 @@ by first decomposing to Compatibility Decomposition and then composing to NFKC.
 Canonical Decomposition (NFD)
 : Only code points with canonical decompositions
-are decomposed. This is a more compact and faster decomposition but will not 
+are decomposed. This is a more compact and faster decomposition but will not
 provide the most comprehensive normalization possible.
 Compatibility Decomposition (NFKD)
 : The most comprehensive decomposition method
 where both canonical and compatibility decompositions are performed recursively.
-zg has methods to produce all four normalization forms in the `Normalize` module. 
+zg has methods to produce all four normalization forms in the `Normalize` module.
 In your `build.zig`:
@@ -316,6 +327,7 @@ test "Normalization" {
 ```
 ## Caseless Matching via Case Folding
 Unicode provides a more efficient way of comparing strings while ignoring letter
 case differences: case folding. When you case fold a string, it's converted into a
 normalized case form suitable for efficient matching. Use the `CaseFold` module
@@ -365,10 +377,11 @@ test "Caseless matching" {
 ```
 ## Display Width of Characters and Strings
 When displaying text with a fixed-width font on a terminal screen, it's very
 important to know exactly how many columns or cells each character should take.
 Most characters will use one column, but there are many, like emoji and East-
-Asian ideographs that need more space. The `DisplayWidth` module provides 
+Asian ideographs that need more space. The `DisplayWidth` module provides
 methods for this purpose. It also has methods that use the display width calculation
 to `center`, `padLeft`, `padRight`, and `wrap` text.
@@ -418,18 +431,29 @@ test "Display width" {
    const wrapped = try dw.wrap(allocator, input, 10, 3);
    defer allocator.free(wrapped);
    const want =
-        \\The quick 
+        \\The quick
-        \\brown fox 
+        \\brown fox
-        \\jumped 
+        \\jumped
-        \\over the 
+        \\over the
        \\lazy dog!
    ;
    try expectEqualStrings(want, wrapped);
 }
 ```
+This has a build option, `"cjk"`, which will consider [ambiguous characters](https://www.unicode.org/reports/tr11/tr11-6.html) as double-width.
+To choose this option, add it to the dependency like so:
+```zig
+const zg = b.dependency("zg", .{
+    .cjk = true,
+});
+```
 ## Scripts
-Unicode categorizes code points by the Script in which they belong. A Script 
+Unicode categorizes code points by the Script in which they belong. A Script
 collects letters and other symbols that belong to a particular writing system.
 You can detect the Script for a code point with the `ScriptsData` module.
@@ -457,13 +481,14 @@ test "Scripts" {
 ```
 ## Relation to Ziglyph
 zg is a total re-write of some of the components of Ziglyph. The idea was to
 reduce binary size and improve performance. These goals were achieved by using
 trie-like data structures (inspired by [Ghostty's implementation](https://mitchellh.com/writing/ghostty-devlog-006))
-instead of generated functions. Where Ziglyph uses a function call, zg uses an 
+instead of generated functions. Where Ziglyph uses a function call, zg uses an
 array lookup, which is quite faster. In addition, all these data structures in
 zg are loaded at runtime from compressed versions in the binary. This allows
-for smaller binary sizes at the expense of increased memory 
+for smaller binary sizes at the expense of increased memory
 footprint at runtime.
 Benchmarks demonstrate the above stated goals have been met:
@@ -535,4 +560,3 @@ In contrast to Ziglyph, zg does not have:
 It's possible that any missing functionality will be added in future versions,
 but only if enough demand is present in the community.
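
For context, the `cjk` option added to the README above might sit in a complete `build.zig` roughly as follows. This is a minimal sketch, not taken from the commit. The executable name `demo` and the path `src/main.zig` are hypothetical, and it assumes the dependency exposes a module named `DisplayWidth`, matching the name the README uses for that section.

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // Hypothetical executable; name and source path are placeholders.
    const exe = b.addExecutable(.{
        .name = "demo",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    // Pass the `cjk` build option so ambiguous-width characters
    // are treated as double-width, as described in the README change.
    const zg = b.dependency("zg", .{
        .cjk = true,
    });

    // Assumed module name; import only the module that is needed.
    exe.root_module.addImport("DisplayWidth", zg.module("DisplayWidth"));

    b.installArtifact(exe);
}
```

Leaving out `.cjk = true` (passing `.{}` as in the README's original snippet) should keep the default width handling.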