
Problem

.split('') splits emojis in half.

Even Onur's solutions only work for some emojis; they can't handle more complex languages or combined emojis.

Consider this emoji being ruined:

[..."🏳️‍🌈"] // returns ["🏳", "️", "‍", "🌈"]  instead of ["🏳️‍🌈"]
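To make the failure concrete, here is a small sketch (not part of the original answer) comparing what .split('') and the spread operator actually iterate over. The variable names are just for illustration; the emoji are written as escapes so the code points are unambiguous.

```javascript
// A single emoji outside the Basic Multilingual Plane (U+1F4A9, the pile of poo)
const poo = "\u{1F4A9}";

// split('') works on UTF-16 code units, so the surrogate pair is torn apart
const codeUnits = poo.split("");

// spread uses the string iterator, which works on code points, so this emoji survives
const codePoints = [...poo];

// A combined emoji is several code points glued together with U+200D (zero-width joiner)
const flag = "\u{1F3F3}\uFE0F\u200D\u{1F308}"; // rainbow flag
const flagCodePoints = [...flag].map(
  cp => "U+" + cp.codePointAt(0).toString(16).toUpperCase()
);

console.log(codeUnits.length);  // 2
console.log(codePoints.length); // 1
console.log(flag.length);       // 6
console.log(flagCodePoints);    // [ 'U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308' ]
```

So spreading fixes the "split in half" problem for simple emojis, but still breaks anything built from more than one code point.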

Also consider this Hindi text अनुच्छेद which is split like this:

[..."अनुच्छेद"]  // returns   ["अ", "न", "ु", "च", "्", "छ", "े", "द"]

but should in fact be split like this:

["अ","नु","च्","छे","द"]

This happens because some of the characters are combining marks (think diacritics/accents in European languages).
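As a rough illustration of that point, the sketch below lists which code points in the Hindi word Unicode classifies as marks. The helper name findMarks is made up for this example, and it assumes a runtime with Unicode property escapes in regular expressions (ES2018).

```javascript
// Return the code points that Unicode puts in general category M (marks).
// \p{M} requires the "u" flag.
function findMarks(text) {
  return [...text].filter(ch => /\p{M}/u.test(ch));
}

// The Hindi word from above, written as escapes so the code points are explicit
const hindi = "\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926";

const marks = findMarks(hindi).map(
  ch => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log([...hindi].length); // 8
console.log(marks);             // [ 'U+0941', 'U+094D', 'U+0947' ]
```

Three of the eight code points are marks that attach to the preceding letter, which is why a per-code-point split gives the wrong result.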

Solution

You can use the grapheme-splitter library for this: https://github.com/orling/grapheme-splitter

It splits text into letters according to the Unicode standard and handles the hundreds of exotic edge cases correctly (yes, there are that many).

Install:
$ npm install --save grapheme-splitter

Usage:

const GraphemeSplitter = require("grapheme-splitter");
const splitter = new GraphemeSplitter();

// plain latin alphabet - nothing spectacular
splitter.splitGraphemes("abcd"); // returns ["a", "b", "c", "d"]

// two-char emojis and six-char combined emoji
splitter.splitGraphemes("🌷🎁💩😜👍🏳️‍🌈"); // returns ["🌷","🎁","💩","😜","👍","🏳️‍🌈"]

// diacritics as combining marks, 10 JavaScript chars
splitter.splitGraphemes("Ĺo͂řȩm̅"); // returns ["Ĺ","o͂","ř","ȩ","m̅"]

// individual Korean characters (Jamo), 4 JavaScript chars
splitter.splitGraphemes("뎌쉐"); // returns ["뎌","쉐"]

// Hindi text with combining marks, 8 JavaScript chars
splitter.splitGraphemes("अनुच्छेद"); // returns ["अ","नु","च्","छे","द"]

// demonic multiple combining marks, 75 JavaScript chars
splitter.splitGraphemes("Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"); // returns ["Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍","A̴̵̜̰͔ͫ͗͢","L̠ͨͧͩ͘","G̴̻͈͍͔̹̑͗̎̅͛́","Ǫ̵̹̻̝̳͂̌̌͘","!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"]
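For comparison only, and not what this answer proposes: newer runtimes ship a built-in Intl.Segmenter that can also split on grapheme boundaries. The sketch below assumes Intl.Segmenter is available in your environment, and splitWithSegmenter is a name invented for this example. Its results for some scripts may differ from the library's, depending on the Unicode version the runtime implements, so only the uncontroversial cases are shown.

```javascript
// Sketch: grapheme splitting with the built-in Intl.Segmenter instead of grapheme-splitter.
// Assumption: the runtime provides Intl.Segmenter.
function splitWithSegmenter(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // segment() returns an iterable of objects; the text of each piece is in .segment
  return Array.from(segmenter.segment(text), part => part.segment);
}

const rainbowFlag = "\u{1F3F3}\uFE0F\u200D\u{1F308}";

console.log(splitWithSegmenter("abcd"));             // [ 'a', 'b', 'c', 'd' ]
console.log(splitWithSegmenter(rainbowFlag).length); // 1
```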


Orlin Georgiev