Unicode Combining Forms, UTF-8 Encoding, and Combining Characters in Nvim

    ב֟  0x59f   CP    qarnei-parah
    ב֪  0x5aa   Cy    yerach-ben-yomo
    ב֫  0x5ab   Co    ole
    ב֬  0x5ac   Ci    iluy
    ב֭  0x5ad   Cd    dehi
    ב֮  0x5ae   Cn    zinor
    ב֯  0x5af   CC    masora circle

Combining forms:
    ﬠ  0xfb20  X`    Alternative ayin
    ﬡ  0xfb21  X'    Alternative alef
    ﬢ  0xfb22  X-d   Alternative dalet
    ﬣ  0xfb23  X-h   Alternative he
    ﬤ  0xfb24  X-k   Alternative kaf
    ﬥ  0xfb25  X-l   Alternative lamed
    ﬦ  0xfb26  X-m   Alternative mem-sofit
    ﬧ  0xfb27  X-r   Alternative resh
    ﬨ  0xfb28  X-t   Alternative tav
    ﬩  0xfb29  X-+   Alternative plus
    שׁ  0xfb2a  XW    shin+shin-dot
    שׂ  0xfb2b  Xw    shin+sin-dot
    שּׁ  0xfb2c  X..W  shin+shin-dot+dagesh
    שּׂ  0xfb2d  X..w  shin+sin-dot+dagesh
    אַ  0xfb2e  XA    alef+patah
    אָ  0xfb2f  XO    alef+qamats
    אּ  0xfb30  XI    alef+hiriq (mapiq)
    בּ  0xfb31  X.b   bet+dagesh
    גּ  0xfb32  X.g   gimel+dagesh
    דּ  0xfb33  X.d   dalet+dagesh
    הּ  0xfb34  X.h   he+dagesh
    וּ  0xfb35  Xu    vav+dagesh
    זּ  0xfb36  X.z   zayin+dagesh
    טּ  0xfb38  X.T   tet+dagesh
    יּ  0xfb39  X.y   yud+dagesh
    ךּ  0xfb3a  X.K   kaf sofit+dagesh
    כּ  0xfb3b  X.k   kaf+dagesh
    לּ  0xfb3c  X.l   lamed+dagesh
    מּ  0xfb3e  X.m   mem+dagesh
    נּ  0xfb40  X.n   nun+dagesh
    סּ  0xfb41  X.s   samech+dagesh
    ףּ  0xfb43  X.P   pe sofit+dagesh
    פּ  0xfb44  X.p   pe+dagesh
    צּ  0xfb46  X.x   tsadi+dagesh
    קּ  0xfb47  X.q   qof+dagesh
    רּ  0xfb48  X.r   resh+dagesh
    שּ  0xfb49  X.w   shin+dagesh
    תּ  0xfb4a  X.t   tav+dagesh
    וֹ  0xfb4b  Xo    vav+holam
    בֿ  0xfb4c  XRb   bet+rafe
    כֿ  0xfb4d  XRk   kaf+rafe
    פֿ  0xfb4e  XRp   pe+rafe
    ﭏ  0xfb4f  Xal   alef-lamed

==============================================================================
Using UTF-8                             *mbyte-utf8* *UTF-8* *utf-8* *utf8*
                                                        *Unicode* *unicode*
The Unicode character set was designed to include all characters from other
character sets. Therefore it is possible to write text in (almost) any
language using Unicode. And it's mostly possible to mix these languages in
one file, which is impossible with other encodings.

Unicode can be encoded in several ways. The most popular one is UTF-8, which
uses one or more bytes for each character and is backwards compatible with
ASCII. On MS-Windows UTF-16 is also used (previously UCS-2), which uses
16-bit words.
Nvim supports all of these encodings, but always uses UTF-8 internally.

Nvim supports double-width characters; this works best with 'guifontwide'.
When using only 'guifont' the wide characters are drawn at the normal width,
followed by a space to fill the gap.

EMOJI                                                               *emoji*

You can list emoji characters using this script: >vim
    :source $VIMRUNTIME/scripts/emoji_list.lua
<
                                                                *bom-bytes*
When reading a file a BOM (Byte Order Mark) can be used to recognize the
Unicode encoding:
    EF BB BF      UTF-8
    FE FF         UTF-16 big endian
    FF FE         UTF-16 little endian
    00 00 FE FF   UTF-32 big endian
    FF FE 00 00   UTF-32 little endian

UTF-8 is the recommended encoding. Note that it's difficult to tell UTF-16
and UTF-32 apart. UTF-16 is often used on MS-Windows, UTF-32 is not
widespread as a file format.

                                        *mbyte-combining* *mbyte-composing*
A composing or combining character is used to change the meaning of the
character before it. The combining characters are drawn on top of the
preceding character.

Nvim largely follows the definition of extended grapheme clusters in UAX#29
in the Unicode standard, with some modifications: An ASCII character will
always start a new cluster. In addition 'arabicshape' enables the combining
of some Arabic letters, when they are shaped to be displayed together in a
single cell.

Combined characters that are too big cannot be displayed, but they can still
be inspected using the |g8| and |ga| commands described below.

When editing text a composing character is mostly considered part of the
preceding character. For example "x" will delete a character and its
following composing characters by default. If the 'delcombine' option is on,
then pressing "x" will delete the combining characters, one at a time,

This section lists Hebrew cantillation marks and combining (presentation) forms with their Unicode code points and keymap sequences. It then discusses UTF-8 in Nvim: its backwards compatibility with ASCII, Nvim's use of UTF-8 internally, and support for double-width characters and emoji. It also describes how Nvim handles combining characters, which are drawn on top of the preceding character, and explains how editing commands such as "x" treat them depending on the 'delcombine' option.
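
The BOM table and the combining-character rule can be illustrated outside the editor. The following Python sketch is not Nvim's implementation; the names detect_bom and split_clusters are hypothetical, and the clustering is a deliberate simplification that only attaches characters in the Unicode "Mark" categories to the preceding character (it does not implement the full UAX#29 rules or 'arabicshape').

```python
import unicodedata

# BOM table taken from the text. Longer marks are listed first so that
# FF FE 00 00 is reported as UTF-32 little endian rather than matching
# the shorter UTF-16 little endian mark FF FE.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32 big endian"),
    (b"\xff\xfe\x00\x00", "UTF-32 little endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 big endian"),
    (b"\xff\xfe", "UTF-16 little endian"),
]


def detect_bom(data):
    """Return the encoding name for a leading BOM, or None (hypothetical helper)."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None


def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified assumption: a character combines if its general category
    starts with "M". An ASCII character always starts a new cluster.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark and ord(ch) >= 0x80:
            clusters[-1] += ch   # attach to the preceding character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


if __name__ == "__main__":
    # UTF-8 uses one byte for ASCII and more for other characters.
    for ch in ("a", "\u05d1", "\ufb31"):
        print(hex(ord(ch)), len(ch.encode("utf-8")), "byte(s)")
    print(detect_bom(b"\xef\xbb\xbfhello"))
    print(split_clusters("\u05d1\u05bc\u05b7a"))
```

With this sketch, bet followed by dagesh and patah forms a single cluster, so an "x"-style delete acting on whole clusters would remove all three code points at once. Because FF FE 00 00 could also be a UTF-16 little endian BOM followed by a NUL character, the ordering above is a guess, which matches the note that UTF-16 and UTF-32 are difficult to tell apart.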