Unicode is a global standard for representing text in computers. D fully supports Unicode in both the language and the standard library.
Computers, at the lowest level, have no notion of what text is, as they only deal with numbers. As a result, computer code needs a way to take text data and transform it to and from a binary representation. The method of transformation is called an encoding scheme, and Unicode is one such scheme.
To see the numerical representations underlying the strings in the example, simply run the code.
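As a minimal sketch of what that means (the literal `"ö"` here is just an illustration, not the tour's own example), casting a string to raw bytes exposes the numbers the text is stored as:

```d
import std.stdio : writeln;

void main()
{
    string s = "ö";
    // Cast to bytes to inspect the UTF-8 code units behind the text
    writeln(cast(immutable(ubyte)[]) s); // [195, 182]
}
```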
Unicode is unique in that its design allows it to represent all the languages of the world using the same encoding scheme. Before Unicode, computers made by different companies or shipped in different regions had a hard time communicating, and in some cases a machine didn't support a given encoding scheme at all, making it impossible to view the text on that computer.
For more information on Unicode and the technical details, see the Wikipedia article on Unicode in the "In-Depth" section.
Unicode has solved most of those problems and is supported on every modern machine. In D, all strings are Unicode strings, whereas strings in languages such as C and C++ are just arrays of bytes.
The types `string`, `wstring`, and `dstring` are UTF-8, UTF-16, and UTF-32 encoded strings respectively. Their character types are `char`, `wchar`, and `dchar`.
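A small sketch of the relationship between the three string types and their character types (the variable names and the literal `"ä"` are illustrative):

```d
import std.stdio : writeln;

void main()
{
    // The same text in each of D's three Unicode string types
    string  utf8  = "ä";  // UTF-8,  element type immutable(char)
    wstring utf16 = "ä"w; // UTF-16, element type immutable(wchar)
    dstring utf32 = "ä"d; // UTF-32, element type immutable(dchar)

    // .length counts code units, so it differs per encoding
    writeln(utf8.length);  // 2: 'ä' takes two UTF-8 code units
    writeln(utf16.length); // 1
    writeln(utf32.length); // 1

    // The character types of the three string types
    static assert(is(typeof(utf8[0])  == immutable(char)));
    static assert(is(typeof(utf16[0]) == immutable(wchar)));
    static assert(is(typeof(utf32[0]) == immutable(dchar)));
}
```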
According to the specification, it is an error to store non-Unicode data in the D string types; expect your program to fail in different ways if your string is improperly encoded.
In order to store other string encodings, or to obtain C/C++ behavior, you can use raw bytes with the types `ubyte[]` or `char*`.
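A brief sketch of both escape hatches (the variable names are illustrative; `toStringz` from `std.string` produces the zero-terminated pointer that C functions expect):

```d
import core.stdc.stdio : printf;
import std.string : toStringz;

void main()
{
    string s = "hello";

    // Reinterpret the string as raw bytes, with no Unicode meaning attached
    auto raw = cast(immutable(ubyte)[]) s;

    // Obtain a zero-terminated char* for passing to C functions
    immutable(char)* cstr = s.toStringz();
    printf("%s\n", cstr);
}
```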
Reading the gem on range algorithms is suggested for this section.
There are some important caveats to keep in mind when working with Unicode in D.
First, as a convenience, when iterating over a `string` using the range functions, each element of `string`s and `wstring`s is decoded into a UTF-32 code point as it is yielded. This practice, known as auto decoding, means that

```d
static assert(is(typeof(utf8.front) == dchar));
```
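In other words, even though a `string` stores UTF-8 code units, the range primitives hand you decoded `dchar`s. A minimal sketch (the variable `utf8` mirrors the example above):

```d
import std.range : front;
import std.stdio : writeln;

void main()
{
    string utf8 = "ä";

    // .length counts UTF-8 code units, but front decodes a full code point
    writeln(utf8.length); // 2
    static assert(is(typeof(utf8.front) == dchar));
    writeln(utf8.front);  // ä, one decoded dchar
}
```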
This behavior has many implications; the one that confuses most people is that `std.range.primitives.hasLength!(string)` evaluates to `false`. Why? Because, in terms of the range API, `string`'s `length` returns the number of elements in the string (code units), rather than the number of elements the range functions will iterate over (code points). From the example, you can see why these two numbers might not always be equal. As such, the range algorithms act as if `string`s do not have length information.
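A short sketch of that consequence (`walkLength` from `std.range.primitives` counts the elements actually iterated; the literal `"håll"` is illustrative):

```d
import std.range.primitives : hasLength, walkLength;
import std.stdio : writeln;

void main()
{
    // Range algorithms treat string as a range without length information
    static assert(!hasLength!string);

    string s = "håll"; // 4 code points, but 'å' takes 2 UTF-8 code units
    writeln(s.length);     // 5: code units stored in the array
    writeln(s.walkLength); // 4: elements yielded while iterating (decoded)
}
```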
For more information on the technical details of auto decoding, and what it means for your program, check the links in the "In-Depth" section.