My IDE already uses a font that visually distinguishes tabs from spaces, so why isn't this "invisible code" being rendered with the Unicode BMP Fallback font or the Last Resort font? Or, if you want to be very diligent, render everything that doesn't decode to a basic printable character that way, with a mouseover to show how it would normally be rendered.
I also don't understand the part about this being impossible to detect using static code analysis tools: isn't detecting things like weird Unicode literals pretty much the easiest task a linter can do? Heck, even the "eval(Buffer.from(s('unicode magic')).toString('utf-8'))" decoder example would be completely trivial to detect with static code analysis - surely you're already throwing up massive warning flags on seeing an "eval"?
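For what it's worth, a check like that really is linter-sized. Here's a rough sketch; the code-point ranges and the `eval` heuristic are illustrative choices of mine, not a complete or standard policy:

```javascript
// Illustrative ranges only -- a real linter would make these configurable.
const SUSPICIOUS_RANGES = [
  [0xE000, 0xF8FF],     // Basic Multilingual Plane Private Use Area
  [0xF0000, 0xFFFFD],   // Supplementary Private Use Area-A
  [0x100000, 0x10FFFD], // Supplementary Private Use Area-B
  [0x200B, 0x200F],     // zero-width characters and directional marks
];

function findSuspicious(source) {
  const hits = [];
  let offset = 0;
  for (const ch of source) { // for...of iterates by code point, not UTF-16 unit
    const cp = ch.codePointAt(0);
    if (SUSPICIOUS_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi)) {
      hits.push({ offset, found: 'U+' + cp.toString(16).toUpperCase() });
    }
    offset += ch.length; // astral code points occupy two UTF-16 units
  }
  const evalCall = source.search(/\beval\s*\(/);
  if (evalCall !== -1) hits.push({ offset: evalCall, found: 'eval()' });
  return hits;
}
```

A few dozen lines, no parser required - which is the point: a plain scan over code points catches the "invisible" payload before it ever reaches an interpreter.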
I tried to find any of the original text so I could try it in my editor, but couldn't, so I can't say for sure. But at least when copying and pasting a bunch of Unicode private-use characters and the like, they're not only rendered (as a box with an X through it) but also highlighted in bright red.
Presumably, opening this file I'd see some suspicious-looking code and a giant bright red block in the middle of it.
I have the benefit that I'm only working in English, so "flag anything that's not basic ASCII" is workable. I can see how this could get messy when you _are_ working with other languages and need to differentiate the compound characters and invisible characters that _are_ part of your normal use from those that aren't.
> The invisible code is rendered with Private Use Areas (PUA), which are ranges in the Unicode specification reserved for privately defined characters, such as custom emojis, flags, and other symbols. The code points represent every letter of the US alphabet when fed to computers, but their output is completely invisible to humans. People reviewing code or using static analysis tools see only whitespace or blank lines. To a JavaScript interpreter, the code points translate into executable code.
Surely the obvious answer is just to strip anything in that Unicode range out?
Why have you even got Unicode in your source anyway?
> Why have you even got Unicode in your source anyway?
Because not everyone uses English as their only language?
If you're a Japanese software company writing code for Japanese companies encoding Japan-specific business logic, you probably want to write your comments in Japanese. And even if you write those in English, you definitely need to embed Japanese strings to be displayed to the end user.
We need a configurable CI tool to inspect pull requests, so the CI fails when it encounters such a character.
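Such a gate is small enough to sketch. This is a hypothetical check, not an existing tool; the config shape (an allowlist of code-point ranges, so that e.g. a Japanese codebase can permit kana and kanji while still rejecting invisible characters) is my own assumption:

```javascript
// Hypothetical CI gate -- the allowlist-of-ranges config is an assumed
// interface, not that of any existing tool.
function checkSources(files, allowedRanges = []) {
  const failures = [];
  for (const [name, text] of Object.entries(files)) {
    let line = 1;
    for (const ch of text) {
      if (ch === '\n') { line += 1; continue; }
      const cp = ch.codePointAt(0);
      const ok = cp < 0x80 || // plain ASCII always passes
        allowedRanges.some(([lo, hi]) => cp >= lo && cp <= hi);
      if (!ok) {
        failures.push(
          `${name}:${line} U+${cp.toString(16).toUpperCase().padStart(4, '0')}`
        );
      }
    }
  }
  return failures; // a CI wrapper would exit non-zero when this is non-empty
}
```

A Japanese team might pass something like `[[0x3000, 0x30FF], [0x4E00, 0x9FFF], [0xFF00, 0xFFEF]]` to allow CJK punctuation, kana, common kanji, and full-width forms - private-use and zero-width characters still fail the check.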
Wasn't this basically a solved problem?
You should not have text strings hardcoded into your binary in a way that they can be treated as executable code.
Obviously, but that wasn't your question.
Obviously, Unicode is used in source files so that we can enjoy those nice and cool emojis in our code and READMEs! /s