| commit | e3082e64b3ab8a8aa0777d63be69eb8b6d50a654 (patch) |
|---|---|
| author date | 2025-07-08 12:12:20 -0400 |
| committer date | 2025-07-08 12:12:20 -0400 |
| tree | 2607c185fd8053b84d60041fadc35c05a0225d34 |
| parent | Add graphemeAtIndex + iterate before and after (diff) |
Add Words.zig example to README
| -rw-r--r-- | NEWS.md | 108 |
| -rw-r--r-- | README.md | 156 |
| -rw-r--r-- | src/Words.zig | 17 |
| -rw-r--r-- | src/code_point.zig | 3 |
4 files changed, 257 insertions, 27 deletions
| @@ -1,5 +1,113 @@ | |||
| 1 | # News | 1 | # News |
| 2 | 2 | ||
| 3 | ## zg v0.14.1 Release Notes | ||
| 4 | |||
| 5 | In a flurry of activity during and after the `v0.14.0` beta, several | ||
| 6 | features were added (including from a new contributor!), and a bug | ||
| 7 | fixed. | ||
| 8 | |||
| 9 | Presenting `zg v0.14.1`. As should be expected from a patch release, | ||
| 10 | there are no breaking changes to the interface, just bug fixes and | ||
| 11 | features. | ||
| 12 | |||
| 13 | ### Grapheme Zalgo Text Bugfix | ||
| 14 | |||
| 15 | Until this release, `zg` was using a `u8` to store the length of a | ||
| 16 | `Grapheme`. While this is much larger than any "real" grapheme, the | ||
| 17 | Unicode grapheme segmentation algorithm allows graphemes of arbitrary | ||
| 18 | size to be constructed, often called [Zalgo text][Zalgo] after a | ||
| 19 | notorious and funny Stack Overflow answer making use of this affordance. | ||
| 20 | |||
| 21 | Therefore, a crafted string could force an integer overflow, with all that | ||
| 22 | comes with it. The `.len` field of a `Grapheme` is now a `u32`, like the | ||
| 23 | `.offset` field. Due to padding, the `Grapheme` is the same size as it | ||
| 24 | was, just making use of the entire 8 bytes. | ||
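
As a sketch of the layout change (field shapes follow the description above; the actual definition in the `Graphemes` module may differ), two 32-bit fields pack into the same 8 bytes the padded `u8` version occupied:

```zig
const std = @import("std");

// Sketch only: assumed field shapes, per the notes above.
// Previously: offset: u32, len: u8 (+ 3 bytes of padding) = 8 bytes.
// Now: both fields are 32 bits wide, still 8 bytes total.
const Grapheme = struct {
    offset: u32,
    len: u32,
};

test "Grapheme is the same size as before" {
    try std.testing.expectEqual(@as(usize, 8), @sizeOf(Grapheme));
}
```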
| 25 | |||
| 26 | Actually, both fields are now `uoffset`, for reasons described next. | ||
| 27 | |||
| 28 | [Zalgo]: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 | ||
| 29 | |||
| 30 | ### Limits Section Added to README | ||
| 31 | |||
| 32 | The README now clearly documents that some data structures and iterators | ||
| 33 | in `zg` use a `u32`. I've also made it possible to configure the library | ||
| 34 | to use a `u64` instead, and have included an explanation of why this is | ||
| 35 | not the solution to actual problems that it might at first seem to be. | ||
| 36 | |||
| 37 | My job as maintainer is to provide a useful library to the community, and | ||
| 38 | comptime makes it easy and pleasant to tailor types to purpose. So for those | ||
| 39 | who see a need for `u64` values in those structures, just pass `-Dfat_offset` | ||
| 40 | or its equivalent, and you'll have them. | ||
| 41 | |||
| 42 | I believe this to be neither necessary nor sufficient for handling data of | ||
| 43 | that size. But I can't anticipate every requirement, and don't want to | ||
| 44 | preclude it as a solution. | ||
| 45 | |||
| 46 | ### Iterators, Back and Forth | ||
| 47 | |||
| 48 | A new contributor, Nemoos, took on the challenge of adding a reverse | ||
| 49 | iterator to `Graphemes`. Thanks Nemoos! | ||
| 50 | |||
| 51 | I've taken the opportunity to fill in a few bits of functionality to | ||
| 52 | flesh these out. `code_point` now has a reverse iterator as well, and | ||
| 53 | either a forward or backward iterator can be reversed in-place. | ||
| 54 | |||
| 55 | Reversing an iterator will always return the last non-`null` result | ||
| 56 | of calling that iterator. This is the only sane behavior, but | ||
| 57 | might be a bit unexpected without prior warning. | ||
| 58 | |||
| 59 | There's also `codepointAtIndex` and `graphemeAtIndex`. These can be | ||
| 60 | given any index which falls within the Grapheme or Codepoint which | ||
| 61 | is returned. These always return a value, and therefore cannot be | ||
| 62 | called on an empty string. | ||
| 63 | |||
| 64 | Finally, `Graphemes.iterateAfterGrapheme(string, grapheme)` will | ||
| 65 | return a forward iterator which will yield the grapheme after | ||
| 66 | `grapheme` when first called. `iterateBeforeGrapheme` has the | ||
| 67 | signature and result one might expect from this. | ||
| 68 | |||
| 69 | `code_point` doesn't have an equivalent of those, since it isn't | ||
| 70 | useful: codepoints are one to four bytes in length, while obtaining | ||
| 71 | a grapheme reliably, given only an index, involves some pretty tricky | ||
| 72 | business to get right. The `Graphemes` API just described allows | ||
| 73 | code to obtain a Grapheme cursor and then begin iterating in either | ||
| 74 | direction, by calling `graphemeAtIndex` and providing it to either | ||
| 75 | of those functions. For codepoints, starting an iterator at either | ||
| 76 | `.offset` or `.offset + .len` will suffice, since the `CodePoint` | ||
| 77 | iterator is otherwise stateless. | ||
| 78 | |||
| 79 | ### Words Module | ||
| 80 | |||
| 81 | The [Unicode annex][tr29] with the canonical grapheme segmentation | ||
| 82 | algorithm also includes algorithms for word and sentence segmentation. | ||
| 83 | `v0.14.1` includes an implementation of the word algorithm. | ||
| 84 | |||
| 85 | It works like `Graphemes`. There's forward and reverse iteration, | ||
| 86 | `wordAtIndex`, and `iterate(Before|After)Word`. | ||
| 87 | |||
| 88 | If anyone is looking for a challenge, there are open issues for sentence | ||
| 89 | segmentation and [line breaking][tr14]. | ||
| 90 | |||
| 91 | [tr29]: https://www.unicode.org/reports/tr29/ | ||
| 92 | [tr14]: https://www.unicode.org/reports/tr14/ | ||
| 93 | |||
| 94 | #### Runeset Used | ||
| 95 | |||
| 96 | As a point of interest: | ||
| 97 | |||
| 98 | Most of the rules in the word breaking algorithm come from a distinct | ||
| 99 | property table, `WordBreakProperties.txt` from the [UCD][UCD]. These | ||
| 100 | are made into a data structure familiar from the other modules. | ||
| 101 | |||
| 102 | One rule, WB3c, uses the Extended Pictographic property. This is also | ||
| 103 | used in `Graphemes`, but to avoid a dependency on that library, I used | ||
| 104 | a [Runeset][Rune]. This is included statically, with only as much | ||
| 105 | code as needed to recognize the sequences; `zg` itself remains free of | ||
| 106 | transitive dependencies. | ||
| 107 | |||
| 108 | [UCD]: https://www.unicode.org/reports/tr44/ | ||
| 109 | [Rune]: https://github.com/mnemnion/runeset | ||
| 110 | |||
| 3 | ## zg v0.14.0 Release Notes | 111 | ## zg v0.14.0 Release Notes |
| 4 | 112 | ||
| 5 | This is the first minor point release since Sam Atman (me) took over | 113 | This is the first minor point release since Sam Atman (me) took over |
| @@ -2,21 +2,24 @@ | |||
| 2 | 2 | ||
| 3 | zg provides Unicode text processing for Zig projects. | 3 | zg provides Unicode text processing for Zig projects. |
| 4 | 4 | ||
| 5 | |||
| 5 | ## Unicode Version | 6 | ## Unicode Version |
| 6 | 7 | ||
| 7 | The Unicode version supported by zg is `16.0.0`. | 8 | The Unicode version supported by zg is `16.0.0`. |
| 8 | 9 | ||
| 10 | |||
| 9 | ## Zig Version | 11 | ## Zig Version |
| 10 | 12 | ||
| 11 | The minimum Zig version required is `0.14`. | 13 | The minimum Zig version required is `0.14`. |
| 12 | 14 | ||
| 15 | |||
| 13 | ## Integrating zg into your Zig Project | 16 | ## Integrating zg into your Zig Project |
| 14 | 17 | ||
| 15 | You first need to add zg as a dependency in your `build.zig.zon` file. In your | 18 | You first need to add zg as a dependency in your `build.zig.zon` file. In your |
| 16 | Zig project's root directory, run: | 19 | Zig project's root directory, run: |
| 17 | 20 | ||
| 18 | ```plain | 21 | ```plain |
| 19 | zig fetch --save https://codeberg.org/atman/zg/archive/v0.14.0-rc1.tar.gz | 22 | zig fetch --save https://codeberg.org/atman/zg/archive/v0.14.1.tar.gz |
| 20 | ``` | 23 | ``` |
| 21 | 24 | ||
| 22 | Then instantiate the dependency in your `build.zig`: | 25 | Then instantiate the dependency in your `build.zig`: |
| @@ -25,12 +28,14 @@ Then instantiate the dependency in your `build.zig`: | |||
| 25 | const zg = b.dependency("zg", .{}); | 28 | const zg = b.dependency("zg", .{}); |
| 26 | ``` | 29 | ``` |
| 27 | 30 | ||
| 31 | |||
| 28 | ## A Modular Approach | 32 | ## A Modular Approach |
| 29 | 33 | ||
| 30 | zg is a modular library. This approach minimizes binary file size and memory | 34 | zg is a modular library. This approach minimizes binary file size and memory |
| 31 | requirements by only including the Unicode data required for the specified module. | 35 | requirements by only including the Unicode data required for the specified module. |
| 32 | The following sections describe the various modules and their specific use case. | 36 | The following sections describe the various modules and their specific use case. |
| 33 | 37 | ||
| 38 | |||
| 34 | ### Init and Setup | 39 | ### Init and Setup |
| 35 | 40 | ||
| 36 | The code examples will show the use of `Module.init(allocator)` to create the | 41 | The code examples will show the use of `Module.init(allocator)` to create the |
| @@ -67,7 +72,7 @@ const code_point = @import("code_point"); | |||
| 67 | 72 | ||
| 68 | test "Code point iterator" { | 73 | test "Code point iterator" { |
| 69 | const str = "Hi 😊"; | 74 | const str = "Hi 😊"; |
| 70 | var iter = code_point.Iterator{ .bytes = str }; | 75 | var iter: code_point.Iterator = .init(str); |
| 71 | var i: usize = 0; | 76 | var i: usize = 0; |
| 72 | 77 | ||
| 73 | while (iter.next()) |cp| : (i += 1) { | 78 | while (iter.next()) |cp| : (i += 1) { |
| @@ -78,25 +83,60 @@ test "Code point iterator" { | |||
| 78 | 83 | ||
| 79 | if (i == 3) { | 84 | if (i == 3) { |
| 80 | try expect(cp.code == '😊'); | 85 | try expect(cp.code == '😊'); |
| 81 | |||
| 82 | // The `offset` field is the byte offset in the | 86 | // The `offset` field is the byte offset in the |
| 83 | // source string. | 87 | // source string. |
| 84 | try expect(cp.offset == 3); | 88 | try expect(cp.offset == 3); |
| 85 | try expectEqual(code_point.CodePoint, code_point.decodeAtIndex(str, cp.offset)); | 89 | try expectEqual(cp, code_point.decodeAtIndex(str, cp.offset).?); |
| 86 | |||
| 87 | // The `len` field is the length in bytes of the | 90 | // The `len` field is the length in bytes of the |
| 88 | // code point in the source string. | 91 | // code point in the source string. |
| 89 | try expect(cp.len == 4); | 92 | try expect(cp.len == 4); |
| 93 | // There is also a 'cursor' decode, like so: | ||
| 94 | { | ||
| 95 | var cursor = cp.offset; | ||
| 96 | try expectEqual(cp, code_point.decodeAtCursor(str, &cursor).?); | ||
| 97 | // Which advances the cursor variable to the next possible | ||
| 98 | // offset, in this case, `str.len`. Don't forget to account | ||
| 99 | // for this possibility! | ||
| 100 | try expectEqual(cp.offset + cp.len, cursor); | ||
| 101 | } | ||
| 102 | // There's also this, for when you aren't sure if you have the | ||
| 103 | // correct start for a code point: | ||
| 104 | try expectEqual(cp, code_point.codepointAtIndex(str, cp.offset + 1).?); | ||
| 90 | } | 105 | } |
| 106 | // Reverse iteration is also an option: | ||
| 107 | var r_iter: code_point.ReverseIterator = .init(str); | ||
| 108 | // Both iterators can be peeked: | ||
| 109 | try expectEqual('😊', r_iter.peek().?.code); | ||
| 110 | try expectEqual('😊', r_iter.prev().?.code); | ||
| 111 | // Both kinds of iterators can be reversed: | ||
| 112 | var fwd_iter = r_iter.forwardIterator(); // or iter.reverseIterator(); | ||
| 113 | // This will always return the last codepoint from | ||
| 114 | // the prior iterator, _if_ it yielded one: | ||
| 115 | try expectEqual('😊', fwd_iter.next().?.code); | ||
| 91 | } | 116 | } |
| 92 | } | 117 | } |
| 93 | ``` | 118 | ``` |
| 94 | 119 | ||
| 120 | Note that it's safe to call CodePoint functions on invalid | ||
| 121 | UTF-8. Iterators and decode functions will return the Unicode | ||
| 122 | Replacement Character `U+FFFD`, according to the Substitution of Maximal | ||
| 123 | Subparts algorithm, for any invalid code unit sequences encountered. | ||
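
For instance (a sketch based on the iterator API above; under Maximal Subparts, a lone continuation byte is one invalid sequence):

```zig
const std = @import("std");
const code_point = @import("code_point");

test "invalid bytes become U+FFFD" {
    // "\x80" is a stray continuation byte: one maximal subpart,
    // so it decodes as a single replacement character.
    var iter: code_point.Iterator = .init("a\x80b");
    try std.testing.expectEqual('a', iter.next().?.code);
    try std.testing.expectEqual(0xFFFD, iter.next().?.code);
    try std.testing.expectEqual('b', iter.next().?.code);
}
```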
| 124 | |||
| 125 | |||
| 95 | ## Grapheme Clusters | 126 | ## Grapheme Clusters |
| 96 | 127 | ||
| 97 | Many characters are composed from more than one code point. These are known as | 128 | Many characters are composed from more than one code point. These |
| 98 | Grapheme Clusters, and the `Graphemes` module has a data structure to represent | 129 | are known as Grapheme Clusters, and the `Graphemes` module has a |
| 99 | them, `Grapheme`, and an `Iterator` to iterate over them in a string. | 130 | data structure to represent them, `Grapheme`, and an `Iterator` and |
| 131 | `ReverseIterator` to iterate over them in a string. | ||
| 132 | |||
| 133 | There is also `graphemeAtIndex`, which returns whatever grapheme | ||
| 134 | belongs to the index; this does not have to be on a valid grapheme | ||
| 135 | or codepoint boundary, but it is illegal to call on an empty string. | ||
| 136 | Last, `iterateAfterGrapheme` or `iterateBeforeGrapheme` will provide | ||
| 137 | forward or backward grapheme iterators of the string, from the grapheme | ||
| 138 | provided. Thus, given an index, you can begin forward or backward | ||
| 139 | iteration at that index without needing to slice the string. | ||
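
A sketch of starting iteration mid-string, assuming `Graphemes` follows the same instance-method shape as the `Words` example later in this README (exact signatures may differ):

```zig
const std = @import("std");
const Graphemes = @import("Graphemes");

test "iterate from an arbitrary index" {
    const graphemes = try Graphemes.init(std.testing.allocator);
    defer graphemes.deinit(std.testing.allocator);
    const str = "héllo";
    // Index 2 falls inside "é" (a two-byte codepoint);
    // any in-bounds index is accepted.
    const g = graphemes.graphemeAtIndex(str, 2);
    // Forward iteration resumes with the grapheme after `g`:
    var iter = graphemes.iterateAfterGrapheme(str, g);
    try std.testing.expectEqualStrings("l", iter.next().?.bytes(str));
}
```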
| 100 | 140 | ||
| 101 | In your `build.zig`: | 141 | In your `build.zig`: |
| 102 | 142 | ||
| @@ -139,6 +179,56 @@ test "Grapheme cluster iterator" { | |||
| 139 | } | 179 | } |
| 140 | ``` | 180 | ``` |
| 141 | 181 | ||
| 182 | |||
| 183 | ## Words | ||
| 184 | |||
| 185 | Unicode has a standard word segmentation algorithm, which gives good | ||
| 186 | results for most languages. Some languages, such as Thai, require a | ||
| 187 | dictionary to find the boundary between words; these cases are not | ||
| 188 | handled by the standard algorithm. | ||
| 189 | |||
| 190 | `zg` implements that algorithm in the `Words` module. Note that the | ||
| 191 | segmentation is based on word _boundaries_, so the iterators and | ||
| 192 | functions provided here will yield segments which are not a "word" in | ||
| 193 | the conventional sense: every segment of the string is returned, | ||
| 194 | with words kept whole when encountered. If the | ||
| 195 | word breaks are of primary interest, you'll want to use the `.offset` | ||
| 196 | field of each iterated value, and handle `string.len` as the final case | ||
| 197 | when the iteration returns `null`. | ||
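
That bookkeeping can be sketched like so (the helper name is illustrative, not part of `zg`):

```zig
const std = @import("std");
const Words = @import("Words");

// Collect every word-break offset in `str`: the `.offset` of each
// segment, plus `str.len` as the final boundary once `next()` is null.
fn wordBreaks(wb: Words, allocator: std.mem.Allocator, str: []const u8) ![]u32 {
    var breaks = std.ArrayList(u32).init(allocator);
    errdefer breaks.deinit();
    var iter = wb.iterator(str);
    while (iter.next()) |word| {
        try breaks.append(word.offset);
    }
    try breaks.append(@intCast(str.len));
    return breaks.toOwnedSlice();
}
```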
| 198 | |||
| 199 | The API is congruent with `Graphemes`: forward and backward iterators, | ||
| 200 | `wordAtIndex`, and `iterateAfterWord` and `iterateBeforeWord`. | ||
| 201 | |||
| 202 | In your `build.zig`: | ||
| 203 | |||
| 204 | ```zig | ||
| 205 | exe.root_module.addImport("Words", zg.module("Words")); | ||
| 206 | ``` | ||
| 207 | |||
| 208 | In your code: | ||
| 209 | |||
| 210 | ```zig | ||
| 211 | const Words = @import("Words"); | ||
| 212 | |||
| 213 | test "Words" { | ||
| 214 | const wb = try Words.init(testing.allocator); | ||
| 215 | defer wb.deinit(testing.allocator); | ||
| 216 | const word_str = "Metonym Μετωνύμιο メトニム"; | ||
| 217 | var w_iter = wb.iterator(word_str); | ||
| 218 | try testing.expectEqualStrings("Metonym", w_iter.next().?.bytes(word_str)); | ||
| 219 | // Spaces are "words" too! | ||
| 220 | try testing.expectEqualStrings(" ", w_iter.next().?.bytes(word_str)); | ||
| 221 | const in_greek = w_iter.next().?; | ||
| 222 | // wordAtIndex doesn't care if the index is valid for a codepoint: | ||
| 223 | for (in_greek.offset..in_greek.offset + in_greek.len) |i| { | ||
| 224 | const at_index = wb.wordAtIndex(word_str, i).bytes(word_str); | ||
| 225 | try testing.expectEqualStrings("Μετωνύμιο", at_index); | ||
| 226 | } | ||
| 227 | _ = w_iter.next(); | ||
| 228 | try testing.expectEqualStrings("メトニム", w_iter.next().?.bytes(word_str)); | ||
| 229 | } | ||
| 230 | ``` | ||
| 231 | |||
| 142 | ## Unicode General Categories | 232 | ## Unicode General Categories |
| 143 | 233 | ||
| 144 | To detect the general category for a code point, use the `GeneralCategories` module. | 234 | To detect the general category for a code point, use the `GeneralCategories` module. |
| @@ -279,24 +369,24 @@ Unicode normalization is the process of converting a string into a uniform | |||
| 279 | representation that can guarantee a known structure by following a strict set | 369 | representation that can guarantee a known structure by following a strict set |
| 280 | of rules. There are four normalization forms: | 370 | of rules. There are four normalization forms: |
| 281 | 371 | ||
| 282 | Canonical Composition (NFC) | 372 | **Canonical Composition (NFC)** |
| 283 | : The most compact representation obtained by first | 373 | : The most compact representation obtained by first |
| 284 | decomposing to Canonical Decomposition and then composing to NFC. | 374 | decomposing to Canonical Decomposition and then composing to NFC. |
| 285 | 375 | ||
| 286 | Compatibility Composition (NFKC) | 376 | **Compatibility Composition (NFKC)** |
| 287 | : The most comprehensive composition obtained | 377 | : The most comprehensive composition obtained |
| 288 | by first decomposing to Compatibility Decomposition and then composing to NFKC. | 378 | by first decomposing to Compatibility Decomposition and then composing to NFKC. |
| 289 | 379 | ||
| 290 | Canonical Decomposition (NFD) | 380 | **Canonical Decomposition (NFD)** |
| 291 | : Only code points with canonical decompositions | 381 | : Only code points with canonical decompositions |
| 292 | are decomposed. This is a more compact and faster decomposition but will not | 382 | are decomposed. This is a more compact and faster decomposition but will not |
| 293 | provide the most comprehensive normalization possible. | 383 | provide the most comprehensive normalization possible. |
| 294 | 384 | ||
| 295 | Compatibility Decomposition (NFKD) | 385 | **Compatibility Decomposition (NFKD)** |
| 296 | : The most comprehensive decomposition method | 386 | : The most comprehensive decomposition method |
| 297 | where both canonical and compatibility decompositions are performed recursively. | 387 | where both canonical and compatibility decompositions are performed recursively. |
| 298 | 388 | ||
| 299 | zg has methods to produce all four normalization forms in the `Normalize` module. | 389 | `zg` has methods to produce all four normalization forms in the `Normalize` module. |
| 300 | 390 | ||
| 301 | In your `build.zig`: | 391 | In your `build.zig`: |
| 302 | 392 | ||
| @@ -493,7 +583,7 @@ in the same fashion as shown for `CaseFolding` and `Normalize`. | |||
| 493 | 583 | ||
| 494 | ## Scripts | 584 | ## Scripts |
| 495 | 585 | ||
| 496 | Unicode categorizes code points by the Script in which they belong. A Script | 586 | Unicode categorizes code points by the Script in which they belong. A Script |
| 497 | collects letters and other symbols that belong to a particular writing system. | 587 | collects letters and other symbols that belong to a particular writing system. |
| 498 | You can detect the Script for a code point with the `Scripts` module. | 588 | You can detect the Script for a code point with the `Scripts` module. |
| 499 | 589 | ||
| @@ -522,21 +612,33 @@ test "Scripts" { | |||
| 522 | 612 | ||
| 523 | ## Limits | 613 | ## Limits |
| 524 | 614 | ||
| 525 | Iterators, and fragment types such as `CodePoint`, `Grapheme` and `Word`, use a | 615 | Iterators, and fragment types such as `CodePoint`, `Grapheme` and |
| 526 | `u32` to store the offset into a string, and the length of the fragment | 616 | `Word`, use a `u32` to store the offset into a string, and the length of |
| 527 | (`CodePoint` uses a `u3` for length, actually). | 617 | the fragment (`CodePoint` uses a `u3` for length, actually). |
| 528 | 618 | ||
| 529 | 4GiB is a lot of string. There are a few reasons to work with that much | 619 | 4GiB is a lot of string. There are a few reasons to work with that much |
| 530 | string, log files primarily, but fewer to bring it all into memory at once, and | 620 | string, log files primarily, but fewer to bring it all into memory at |
| 531 | practically no reason at all to do anything to such a string without breaking | 621 | once, and practically no reason at all to do anything to such a string |
| 532 | it into smaller pieces to work with. | 622 | without breaking it into smaller pieces to work with. |
| 533 | 623 | ||
| 534 | Also, Zig compiles on 32 bit systems, where `usize` is 32. Code running on | 624 | Also, Zig compiles on 32 bit systems, where `usize` is a `u32`. Code |
| 535 | such systems has no choice but to handle slices in smaller pieces. In general, | 625 | running on such systems has no choice but to handle slices in smaller |
| 536 | if you want code to perform correctly when encountering multi- gigabyte | 626 | pieces. In general, if you want code to perform correctly when |
| 537 | strings, you'll need to code for that, at a level one or two steps above that | 627 | encountering multi-gigabyte strings, you'll need to code for that, at a |
| 538 | in which you'll want to, for example, iterate some graphemes of that string. | 628 | level one or two steps above that in which you'll want to, for example, |
| 629 | iterate some graphemes of that string. | ||
| 539 | 630 | ||
| 540 | That all said, `zg` modules can be passed the Boolean config option | 631 | That all said, `zg` modules can be passed the Boolean config option |
| 541 | `fat_offset`, which will make all of those data structures use a `u64` instead. | 632 | `fat_offset`, which will make all of those data structures use a `u64` |
| 542 | You don't actually want to do this. But you can. | 633 | instead. I added this option not because you should use it (you |
| 634 | should not), but to encourage awareness that code operating on | ||
| 635 | strings needs to pay attention to the size of those strings, and | ||
| 636 | have a plan for when sizes fall outside specification. What would | ||
| 637 | your code do with a 1MiB region of string with no newline? There are | ||
| 638 | many questions of this nature, and robust code must detect when data | ||
| 639 | falls outside the expected envelope, so it can respond accordingly. | ||
| 640 | |||
| 641 | Code which does pay attention to these questions has no need for `u64` | ||
| 642 | sized offsets, and code which does not will not be helped by them. But | ||
| 643 | perhaps yours is an exception, in which case, by all means, configure | ||
| 644 | accordingly. | ||
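
One shape such a plan can take is an explicit guard at the boundary, before any iteration begins (a sketch; the limit and the error name are illustrative, not part of `zg`):

```zig
const std = @import("std");

// Reject out-of-envelope input up front instead of widening offsets.
fn checkEnvelope(str: []const u8) error{InputTooLarge}!void {
    if (str.len > std.math.maxInt(u32)) return error.InputTooLarge;
}

test "small strings are within the envelope" {
    try checkEnvelope("hello");
}
```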
diff --git a/src/Words.zig b/src/Words.zig
index af82562..617c34d 100644
--- a/src/Words.zig
+++ b/src/Words.zig
| @@ -674,6 +674,23 @@ test "ext_pict" { | |||
| 674 | try testing.expect(ext_pict.isMatch("\u{2701}")); | 674 | try testing.expect(ext_pict.isMatch("\u{2701}")); |
| 675 | } | 675 | } |
| 676 | 676 | ||
| 677 | test "Words" { | ||
| 678 | const wb = try Words.init(testing.allocator); | ||
| 679 | defer wb.deinit(testing.allocator); | ||
| 680 | const word_str = "Metonym Μετωνύμιο メトニム"; | ||
| 681 | var w_iter = wb.iterator(word_str); | ||
| 682 | try testing.expectEqualStrings("Metonym", w_iter.next().?.bytes(word_str)); | ||
| 683 | // Spaces are "words" too! | ||
| 684 | try testing.expectEqualStrings(" ", w_iter.next().?.bytes(word_str)); | ||
| 685 | const in_greek = w_iter.next().?; | ||
| 686 | for (in_greek.offset..in_greek.offset + in_greek.len) |i| { | ||
| 687 | const at_index = wb.wordAtIndex(word_str, i).bytes(word_str); | ||
| 688 | try testing.expectEqualStrings("Μετωνύμιο", at_index); | ||
| 689 | } | ||
| 690 | _ = w_iter.next(); | ||
| 691 | try testing.expectEqualStrings("メトニム", w_iter.next().?.bytes(word_str)); | ||
| 692 | } | ||
| 693 | |||
| 677 | test wordAtIndex { | 694 | test wordAtIndex { |
| 678 | const wb = try Words.init(testing.allocator); | 695 | const wb = try Words.init(testing.allocator); |
| 679 | defer wb.deinit(testing.allocator); | 696 | defer wb.deinit(testing.allocator); |
diff --git a/src/code_point.zig b/src/code_point.zig
index 16648af..7a638af 100644
--- a/src/code_point.zig
+++ b/src/code_point.zig
| @@ -121,6 +121,9 @@ pub fn decodeAtCursor(bytes: []const u8, cursor: *uoffset) ?CodePoint { | |||
| 121 | } | 121 | } |
| 122 | if (st == RUNE_REJECT or cursor.* == bytes.len) { | 122 | if (st == RUNE_REJECT or cursor.* == bytes.len) { |
| 123 | @branchHint(.cold); | 123 | @branchHint(.cold); |
| 124 | // This, and the branch below, detect truncation, the | ||
| 125 | // only invalid state handled differently by the Maximal | ||
| 126 | // Subparts algorithm. | ||
| 124 | if (state_dfa[@intCast(u8dfa[byte])] == RUNE_REJECT) { | 127 | if (state_dfa[@intCast(u8dfa[byte])] == RUNE_REJECT) { |
| 125 | cursor.* -= 2; // +1 | 128 | cursor.* -= 2; // +1 |
| 126 | return .{ | 129 | return .{ |