Merge branch 'develop-next' (HEAD, v0.14.1, master)

author: Sam Atman 2025-07-08 12:15:32 -0400
committer: Sam Atman 2025-07-08 12:15:32 -0400
commit: 9427a9e53aaa29ee071f4dcb35b809a699d75aa9 (patch)
tree: 2607c185fd8053b84d60041fadc35c05a0225d34 /NEWS.md
parent: Merge pull request 'Fix benchmarks' (#56) from jacobsandlund/zg:benchmarks in... (diff)
parent: Add Words.zig example to README (diff)
1 files changed, 111 insertions, 3 deletions
diff --git a/NEWS.md b/NEWS.md
index a432c2f..0ccf151 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,5 +1,113 @@
 # News
+## zg v0.14.1 Release Notes
+In a flurry of activity during and after the `v0.14.0` beta, several
+features were added (including from a new contributor!), and a bug
+fixed.
+Presenting `zg v0.14.1`.  As should be expected from a patch release,
+there are no breaking changes to the interface, just bug fixes and
+features.
+### Grapheme Zalgo Text Bugfix
+Until this release, `zg` was using a `u8` to store the length of a
+`Grapheme`.  While this is much larger than any "real" grapheme, the
+Unicode grapheme segmentation algorithm allows graphemes of arbitrary
+size to be constructed, often called [Zalgo text][Zalgo] after a
+notorious and funny Stack Overflow answer making use of this affordance.
+Therefore, a crafted string could force an integer overflow, with all that
+comes with it.  The `.len` field of a `Grapheme` is now a `u32`, like the
+`.offset` field.  Due to padding, the `Grapheme` is the same size as it
+was, just making use of the entire 8 bytes.
+Actually, both fields are now `uoffset`, for reasons described next.
+[Zalgo]: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
+### Limits Section Added to README
+The README now clearly documents that some data structures and iterators
+in `zg` use a `u32`.  I've also made it possible to configure the library
+to use a `u64` instead, and have included an explanation of why this is
+not the solution to actual problems which it at first might seem.
+My job as maintainer is to provide a useful library to the community, and
+comptime makes it easy and pleasant to tailor types to purpose. So for those
+who see a need for `u64` values in those structures, just pass `-Dfat_offset`
+or its equivalent, and you'll have them.
+I believe this to be neither necessary nor sufficient for handling data of
+that size.  But I can't anticipate every requirement, and don't want to
+preclude it as a solution.
+### Iterators, Back and Forth
+A new contributor, Nemoos, took on the challenge of adding a reverse
+iterator to `Graphemes`.  Thanks Nemoos!
+I've taken the opportunity to fill in a few bits of functionality to
+flesh these out.  `code_point` now has a reverse iterator as well, and
+either a forward or backward iterator can be reversed in-place.
+Reversing an iterator will always return the last non-`null` result
+of calling that iterator.  This is the only sane behavior, but
+might be a bit unexpected without prior warning.
+There's also `codePointAtIndex` and `graphemeAtIndex`.  These can be
+given any index which falls within the Grapheme or Codepoint which
+is returned.  These always return a value, and therefore cannot be
+called on an empty string.
+Finally, `Graphemes.iterateAfterGrapheme(string, grapheme)` will
+return a forward iterator which will yield the grapheme after
+`grapheme` when first called.  `iterateBeforeGrapheme` has the
+signature and result one might expect from this.
+`code_point` doesn't have an equivalent of those, since it isn't
+useful: codepoints are one to four bytes in length, while obtaining
+a grapheme reliably, given only an index, involves some pretty tricky
+business to get right.  The `Graphemes` API just described allows
+code to obtain a Grapheme cursor and then begin iterating in either
+direction, by calling `graphemeAtIndex` and providing it to either
+of those functions.  For codepoints, starting an iterator at either
+`.offset` or `.offset + .len` will suffice, since the `CodePoint`
+iterator is otherwise stateless.
+### Words Module
+The [Unicode annex][tr29] with the canonical grapheme segmentation
+algorithm also includes algorithms for word and sentence segmentation.
+`v0.14.1` includes an implementation of the word algorithm.
+It works like `Graphemes`.  There's forward and reverse iteration,
+`wordAtIndex`, and `iterate(Before|After)Word`.
+If anyone is looking for a challenge, there are open issues for sentence
+segmentation and [line breaking][tr14].
+[tr29]: https://www.unicode.org/reports/tr29/
+[tr14]: https://www.unicode.org/reports/tr14/
+#### Runeset Used
+As a point of interest:
+Most of the rules in the word breaking algorithm come from a distinct
+property table, `WordBreakProperties.txt` from the [UCD][UCD].  These
+are made into a data structure familiar from the other modules.
+One rule, WB3c, uses the Extended Pictographic property.  This is also
+used in `Graphemes`, but to avoid a dependency on that library, I used
+a [Runeset][Rune].  This is included statically, with only just as much
+code as needed to recognize the sequences; `zg` itself remains free of
+transitive dependencies.
+[UCD]: https://www.unicode.org/reports/tr44/
+[Rune]: https://github.com/mnemnion/runeset
 ## zg v0.14.0 Release Notes
 This is the first minor point release since Sam Atman (me) took over
@@ -52,9 +160,9 @@ UTF-8 into codepoints.  Concerningly, this interpreted overlong
 sequences, which has been forbidden by Unicode for more than 20 years
 due to the security risks involved.
-This has been replaced with a DFA decoder based on the work of [Björn
+This has been replaced with a DFA decoder based on the work of
-Höhrmann][UTF], which has proven itself fast[^1] and reliable.  This is
+[Björn Höhrmann][UTF], which has proven itself fast[^1] and reliable.
-a breaking change; sequences such as `"\xc0\xaf"` will no longer
+This is a breaking change; sequences such as `"\xc0\xaf"` will no longer
 produce the code `'/'`, nor will surrogates return their codepoint
 value.
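
Note on the `Grapheme` length fix described in the v0.14.1 notes above: the following is a minimal sketch of why a `u8` length can overflow while the wider field costs no extra space. `OldGrapheme`, `NewGrapheme`, and `zalgo` are hypothetical stand-ins invented for illustration, not zg's actual declarations, and `extern struct` is used here only so the layout is well defined.

```zig
const std = @import("std");

// Hypothetical stand-ins, not zg's real declarations: the notes only say that
// `.len` was a `u8` and is now as wide as `.offset`.
const OldGrapheme = extern struct { len: u8, offset: u32 };
const NewGrapheme = extern struct { len: u32, offset: u32 };

// One base letter followed by 200 copies of U+0301 COMBINING ACUTE ACCENT.
// Each mark is two bytes in UTF-8, and the whole run is a single grapheme.
const zalgo: []const u8 = "a" ++ ("\u{301}" ** 200);

pub fn main() void {
    // Both layouts occupy 8 bytes; the old one spent 3 of them on padding.
    std.debug.print("old size: {d}, new size: {d}\n", .{
        @sizeOf(OldGrapheme),
        @sizeOf(NewGrapheme),
    });

    // The byte length does not fit in a u8, so a checked cast yields null.
    std.debug.print("bytes: {d}, fits in u8: {}\n", .{
        zalgo.len,
        std.math.cast(u8, zalgo.len) != null,
    });
}
```

The checked cast stands in for whatever arithmetic the old code performed; the point is only that a single crafted grapheme can exceed 255 bytes.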