# Graphemer: Unicode Character Splitter 🪓

## Introduction

This library continues the work of [Grapheme Splitter](https://github.com/orling/grapheme-splitter) and supports the following Unicode versions:

- Unicode 15 and below `[v1.4.0]`
- Unicode 14 and below `[v1.3.0]`
- Unicode 13 and below `[v1.1.0]`
- Unicode 11 and below `[v1.0.0]` (Unicode 10 supported by `grapheme-splitter`)

In JavaScript there is not always a one-to-one relationship between string characters and what a user would call a separate visual "letter". Some symbols are represented by several characters. This causes problems when splitting a string inadvertently cuts a multi-char letter in half, or when you need the actual number of letters in a string.

For example, emoji characters like "🌷","🎁","💩","😜" and "👍" are represented by two JavaScript characters each (high surrogate and low surrogate). That is,

```javascript
'🌷'.length == 2;
```

Combined emoji are even longer:

```javascript
'🏳️‍🌈'.length == 6;
```

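To see where those lengths come from, here is a minimal sketch that uses only built-in string methods (it does not need `graphemer`) to list the UTF-16 code units and the code points of the two emoji above. The helper names `codeUnits` and `codePoints` are hypothetical, and the escape sequences are simply another way to write the same characters.

```javascript
// Hypothetical helpers for illustration; both rely only on built-in string methods.

// UTF-16 code units, which is what String.prototype.length counts.
function codeUnits(text) {
  return text.split('').map((unit) => unit.charCodeAt(0).toString(16).toUpperCase());
}

// Unicode code points; Array.from iterates a string code point by code point.
function codePoints(text) {
  return Array.from(text, (ch) => ch.codePointAt(0).toString(16).toUpperCase());
}

const tulip = '\u{1F337}'; // 🌷
const rainbowFlag = '\u{1F3F3}\uFE0F\u200D\u{1F308}'; // 🏳️‍🌈

console.log(codeUnits(tulip)); // [ 'D83C', 'DF37' ] (high and low surrogate)
console.log(codePoints(tulip)); // [ '1F337' ]
console.log(codeUnits(rainbowFlag).length); // 6
console.log(codePoints(rainbowFlag)); // [ '1F3F3', 'FE0F', '200D', '1F308' ]
```

Note that even splitting by code point leaves the flag in four pieces, so counting code points is not the same as counting user-perceived letters.
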
What's more, some languages make frequent use of combining marks: characters that modify the letter before them. Common examples are the German letter ü and the Spanish letter ñ. These can sometimes be represented either as a single character or as a letter followed by a combining mark, and both forms are equally valid:

```javascript
var two = '\u006E\u0303'; // "ñ" as two chars: n + combining tilde (decomposed)
var one = '\u00F1'; // "ñ" as a single precomposed char
console.log(one != two); // prints 'true'
```

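As a small illustration of what normalization does in this simple case, the sketch below converts between the two spellings of ñ with the built-in `String.prototype.normalize`. The variable names are my own.

```javascript
const decomposed = '\u006E\u0303'; // n + combining tilde
const precomposed = '\u00F1'; // ñ as a single code point

console.log(decomposed === precomposed); // false
console.log(decomposed.normalize('NFC') === precomposed); // true
console.log(precomposed.normalize('NFD') === decomposed); // true
console.log(decomposed.normalize('NFC').length); // 1
```
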
Unicode normalization, as performed by ECMAScript 6's `String.prototype.normalize`, can **sometimes** fix those differences and turn two-char sequences into single characters. But it is **not** enough in all cases. Some languages, like Hindi, make extensive use of combining marks on their letters, and the sheer number of possible combinations means that most of them have no dedicated single-code-point form.

For example, the Hindi word "अनुच्छेद" consists of 5 letters and 3 combining marks:

अ + न + ु + च + ् + छ + े + द

which a reader perceives as just 5 letters:

अ + नु + च् + छे + द

and which Unicode normalization would not combine into single characters.

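The point about normalization can be checked with built-ins alone. In this sketch the word is spelled out with escape sequences (my transcription of the code points listed above), and the combining marks are counted with the `\p{M}` Unicode property escape.

```javascript
// "अनुच्छेद" written code point by code point.
const word = '\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926';

// NFC composition has nothing to merge here: no precomposed forms exist.
const composed = word.normalize('NFC');

console.log(word.length); // 8
console.log(composed.length); // 8
console.log(composed === word); // true

// Count the combining marks (Unicode general category M).
const marks = word.match(/\p{M}/gu);
console.log(marks.length); // 3
```

So after normalization the string is still 8 JavaScript characters long, even though a reader sees 5 letters.
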
There are also the unusual letter+combining mark combinations which have no dedicated Unicode codepoint. The string Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘ obviously has 5 separate letters, but is in fact comprised of 58 JavaScript characters, most of which are combining marks.

Enter the `graphemer` library. It can be used to properly split JavaScript strings into what a human user would call separate letters (or "extended grapheme clusters" in Unicode terminology), no matter what their internal representation is. It is an implementation of the [Default Grapheme Cluster Boundary](http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table) rules of [UAX #29](http://www.unicode.org/reports/tr29/).

## Installation

Install `graphemer` using the NPM command below:

```
$ npm i graphemer
```

## Usage

If you're using [TypeScript](https://www.typescriptlang.org/) or a compiler like [Babel](https://babeljs.io/) (or something like Create React App), things are pretty simple: just import, initialize and use!

```javascript
import Graphemer from 'graphemer';

const splitter = new Graphemer();

// split the string into an array of grapheme clusters (one string each)
const graphemes = splitter.splitGraphemes(string);

// iterate over the string's grapheme clusters lazily (one string each)
const graphemeIterator = splitter.iterateGraphemes(string);

// or do this if you just need their number
const graphemeCount = splitter.countGraphemes(string);
```

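One common use of `splitGraphemes` is shortening text without cutting a letter in half. The following is a minimal sketch, not part of the library: `truncateGraphemes` is a hypothetical helper that accepts any object with a `splitGraphemes(string)` method. So that the snippet runs without the package installed, it uses a simplified stand-in (`standInSplitter`, also hypothetical) that only splits by code point.

```javascript
// Hypothetical helper: keep at most `maxGraphemes` user-perceived characters.
// `splitter` is any object with a splitGraphemes(string) method,
// for example a Graphemer instance.
function truncateGraphemes(splitter, text, maxGraphemes, ellipsis = '…') {
  const graphemes = splitter.splitGraphemes(text);
  if (graphemes.length <= maxGraphemes) {
    return text; // already short enough, return unchanged
  }
  return graphemes.slice(0, maxGraphemes).join('') + ellipsis;
}

// Stand-in for demonstration only: splits by code point, NOT by grapheme
// cluster. Replace it with `new Graphemer()` in real code.
const standInSplitter = { splitGraphemes: (text) => Array.from(text) };

console.log(truncateGraphemes(standInSplitter, 'abcdef', 3)); // abc…
console.log(truncateGraphemes(standInSplitter, 'abc', 3)); // abc
```

With the library installed, passing the `splitter` created above as the first argument makes the cut fall on grapheme cluster boundaries.
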
If you're using vanilla Node.js (CommonJS), you can use `require()` instead:

```javascript
const Graphemer = require('graphemer').default;

const splitter = new Graphemer();

const graphemes = splitter.splitGraphemes(string);
```

## Examples

```javascript
import Graphemer from 'graphemer';

const splitter = new Graphemer();

// plain latin alphabet - nothing spectacular
splitter.splitGraphemes('abcd'); // returns ["a", "b", "c", "d"]

// two-char emojis and six-char combined emoji
splitter.splitGraphemes('🌷🎁💩😜👍🏳️‍🌈'); // returns ["🌷","🎁","💩","😜","👍","🏳️‍🌈"]

// diacritics as combining marks, 10 JavaScript chars
splitter.splitGraphemes('Ĺo͂řȩm̅'); // returns ["Ĺ","o͂","ř","ȩ","m̅"]

// individual Korean characters (Jamo), 4 JavaScript chars
splitter.splitGraphemes('뎌쉐'); // returns ["뎌","쉐"]

// Hindi text with combining marks, 8 JavaScript chars
splitter.splitGraphemes('अनुच्छेद'); // returns ["अ","नु","च्","छे","द"]

// demonic multiple combining marks, 75 JavaScript chars
splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'); // returns ["Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍","A̴̵̜̰͔ͫ͗͢","L̠ͨͧͩ͘","G̴̻͈͍͔̹̑͗̎̅͛́","Ǫ̵̹̻̝̳͂̌̌͘","!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"]
```

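Reversing a string is another place where the difference shows. The sketch below contrasts a naive reversal by UTF-16 code unit with one that reverses the array returned by `splitGraphemes`. Both function names are hypothetical, and `codePointSplitter` is a simplified stand-in (code points only) so the snippet runs without the package.

```javascript
// Naive approach: reverses UTF-16 code units and so breaks surrogate pairs.
function naiveReverse(text) {
  return text.split('').reverse().join('');
}

// Hypothetical helper: reverse the clusters returned by splitGraphemes.
// `splitter` is any object with a splitGraphemes(string) method.
function reverseGraphemes(splitter, text) {
  return splitter.splitGraphemes(text).reverse().join('');
}

// Stand-in for demonstration only: splits by code point, NOT by grapheme
// cluster. Replace it with `new Graphemer()` in real code.
const codePointSplitter = { splitGraphemes: (text) => Array.from(text) };

const input = 'ab\u{1F337}'; // "ab🌷"
const expected = '\u{1F337}ba'; // "🌷ba"

console.log(naiveReverse(input) === expected); // false (the emoji is torn apart)
console.log(reverseGraphemes(codePointSplitter, input) === expected); // true
```

A code-point stand-in is enough for a single emoji, but sequences such as the rainbow flag or letters with combining marks need a real grapheme splitter to stay intact.
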
## TypeScript

Graphemer is built with TypeScript and includes type declarations.

```typescript
import Graphemer from 'graphemer';

const splitter = new Graphemer();

const split: string[] = splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞');
```

## Contributing

See [Contribution Guide](./CONTRIBUTING.md).

## Acknowledgements

This library is a fork of the incredible work done by Orlin Georgiev and Huáng Jùnliàng at https://github.com/orling/grapheme-splitter.

The original library was heavily influenced by Devon Govett's excellent [grapheme-breaker](https://github.com/devongovett/grapheme-breaker) CoffeeScript library.