The recommended file extension is .langur.
Langur source code is always UTF-8. Anything else would be uncivilized (see UTF-8 Everywhere). A BOM (code point FEFF) is not allowed. At least some Linux shells (maybe all?) would miss a shebang line if a BOM were present.
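The two rules above (valid UTF-8, no BOM) can be checked mechanically before a file ever reaches the interpreter. The following is a minimal Python sketch, not part of langur itself; the function name `check_source_bytes` and the wording of the messages are made up for illustration, and it assumes the BOM only matters at the very start of the file.

```python
import codecs

def check_source_bytes(data: bytes) -> list[str]:
    """Return a list of problems found in raw source bytes (empty if none).

    Hypothetical helper: checks only the UTF-8 and BOM rules described above.
    """
    problems = []
    # codecs.BOM_UTF8 is the three-byte UTF-8 encoding of U+FEFF (EF BB BF).
    if data.startswith(codecs.BOM_UTF8):
        problems.append("BOM (U+FEFF) at start of file")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append(f"invalid UTF-8 at byte offset {err.start}")
    return problems
```

An empty list means the bytes passed both checks; anything else names what to fix in your editor's encoding settings.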
The whitespace allowed between tokens is the horizontal tab (16x09), space (16x20), and line feed (16x0A).
Note that I'm deprecating the use of Windows line endings (CR/LF combination) in favor of Linux line endings (LF only), which most operating systems use anyway. Any decent plain text editor will allow you to specify the type of line endings to use. On Windows, I found Notepad++ useful, and I'm sure there are plenty of others. Notepad, which comes with Windows, is not useful in this regard, as it will not break a line at a bare line feed.
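If an editor has already saved a file with Windows line endings, the conversion is a simple byte replacement. This is a small Python sketch under that assumption; `to_lf` is a hypothetical helper name, and it deliberately leaves a lone CR untouched, since only the CR/LF combination is discussed here.

```python
def to_lf(data: bytes) -> bytes:
    """Replace each CR/LF pair with a single LF; all other bytes are unchanged."""
    return data.replace(b"\r\n", b"\n")
```

Working on bytes rather than decoded text avoids any newline translation that a text-mode file read might otherwise apply.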
Identifiers are ASCII only. Allowing just anything for identifiers has too much potential for confusion.
Use a shebang line at the very start of the file to specify the interpreter to use, such as...
Langur has two types of comments: single-line comments and multi-line (or inline) comments.
# single-line comment started with hash mark
/* multi-line (or inline) comment enclosed in C-style markers */
As of 0.6.11, comments may contain characters that Unicode designates as "Graphic," spaces, and Private Use Area code points. The invisible "spaces" listed below are also allowed, to make it easier to paste in international text. The idea behind restricting comments to "allowed" characters is to keep source code free of hidden text or codes, and so to allay confusion and deception. There may be more code points that need to be allowed.
invisible "spaces" list
U+180E (MONGOLIAN VOWEL SEPARATOR), U+200B (ZERO WIDTH SPACE), U+200C (ZERO WIDTH NON-JOINER), U+200D (ZERO WIDTH JOINER), U+200E (LEFT-TO-RIGHT MARK), U+202A (LEFT-TO-RIGHT EMBEDDING), U+202C (POP DIRECTIONAL FORMATTING), U+202D (LEFT-TO-RIGHT OVERRIDE), U+2060 (WORD JOINER)
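To make the comment rule concrete, here is a rough Python sketch of a per-character check. It is an approximation, not langur's actual implementation: it assumes "Graphic" means the Unicode general categories for letters, marks, numbers, punctuation, symbols, and space separators (L, M, N, P, S, Zs), treats Private Use as category Co, and adds the invisible "spaces" list above. The name `allowed_in_comment` is hypothetical, and line breaks inside multi-line comments are outside what this sketch models.

```python
import unicodedata

# The invisible "spaces" listed above, as code points.
INVISIBLE_SPACES = {
    0x180E, 0x200B, 0x200C, 0x200D, 0x200E,
    0x202A, 0x202C, 0x202D, 0x2060,
}

def allowed_in_comment(ch: str) -> bool:
    """Approximate check for whether one character may appear in a comment.

    Assumption: "Graphic" is modeled by general category, not taken from langur.
    """
    if ord(ch) in INVISIBLE_SPACES:
        return True
    cat = unicodedata.category(ch)
    # L*, M*, N*, P*, S* are graphic; Zs is a space; Co is Private Use.
    return cat[0] in "LMNPS" or cat in ("Zs", "Co")
```

Under these assumptions, a character such as U+202E (RIGHT-TO-LEFT OVERRIDE) is rejected because it is a format character that is not on the list, which is the kind of hidden code the rule is meant to keep out.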