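StringBuilder
The StringBuilder class in System.Text represents a mutable (editable) string. Its most common use is building up a string through repeated concatenation, without creating a new string object at each step. To clear the contents of a StringBuilder, either instantiate a new one or set its Length to zero.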
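Methods and properties
Member | Description |
---|---|
Append | Appends the string representation of a value to the end |
Insert | Inserts a value at the specified index |
Remove | Removes a range of characters, given a start index and a length |
Replace | Replaces all occurrences of a character or string with another |
AppendLine | Appends a string followed by a newline sequence |
AppendFormat | Accepts a composite format string, as with string.Format |
Length | Gets or sets the length of the string (a writable property) |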
```csharp
using System;
using System.Text;

StringBuilder sb = new StringBuilder();
for (int i = 0; i < 50; i++) sb.Append (i + ",");

// To get the final result, call ToString():
Console.WriteLine (sb.ToString());

sb.Remove (0, 60);       // Remove first 60 characters
sb.Length = 10;          // Truncate to 10 characters
sb.Replace (",", "+");   // Replace comma with +
Console.WriteLine (sb.ToString());

sb.Length = 0;           // Clear StringBuilder
```
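The members not exercised above (Insert, AppendLine and AppendFormat) can be sketched as follows. This is a minimal illustration rather than a recommended pattern; the strings are arbitrary, and Clear() is shown as an alternative to setting Length to zero.

```csharp
using System;
using System.Text;

var sb = new StringBuilder();

sb.Append("World");                               // "World"
sb.Insert(0, "Hello, ");                          // "Hello, World"
sb.AppendLine("!");                               // appends "!" plus Environment.NewLine
sb.AppendFormat("{0} + {1} = {2}", 1, 2, 1 + 2);  // composite format string

string result = sb.ToString();
Console.WriteLine(result);

sb.Clear();                                       // same effect as sb.Length = 0
Console.WriteLine(sb.Length);                     // 0
```

AppendLine uses the platform's newline sequence, so the exact characters in the result differ between Windows and Unix-like systems.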
Text Encodings and Unicode
A character set is an allocation of characters, each with a numeric code or code point.
Unicode
Has an address space of approximately one million characters (1,114,112 code points, from U+0000 to U+10FFFF).
Covers most of the world's spoken languages, as well as some historical languages and special symbols.
ASCII
Set of first 128 characters of the Unicode set.
Text Encoding
Maps characters from their numeric code point to a binary representation.
In .NET, text encoding matters when dealing with text files or streams.
A text encoding can restrict which characters can be represented, and it also affects storage efficiency.
Two categories of text encoding in .NET:
1. Map Unicode characters to another character set
2. Use standard Unicode encoding schemes
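Both effects mentioned above (restricted character range and differing storage cost) can be seen in a small sketch. The sample string is an arbitrary choice; the non-ASCII characters are written as escapes so the example does not depend on the source file's own encoding.

```csharp
using System;
using System.Text;

string text = "caf\u00E9 \u20AC";   // "café €": 'é' is U+00E9, '€' is U+20AC

// ASCII cannot represent 'é' or '€'; by default each is replaced with '?'.
byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
string asciiRoundTrip = Encoding.ASCII.GetString(asciiBytes);
Console.WriteLine(asciiRoundTrip);                      // caf? ?

// The Unicode encodings can represent every character, at different sizes.
int utf8Count  = Encoding.UTF8.GetByteCount(text);     // 3 + 2 + 1 + 3 = 9
int utf16Count = Encoding.Unicode.GetByteCount(text);  // 6 chars * 2 = 12
int utf32Count = Encoding.UTF32.GetByteCount(text);    // 6 code points * 4 = 24
Console.WriteLine($"{utf8Count} {utf16Count} {utf32Count}");
```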
Obtaining an Encoding Object
The Encoding class in System.Text is the common base type for classes that encapsulate text encoding.
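The easiest way to instantiate a correctly configured class is to call Encoding.GetEncoding with a standard IANA character set name. The most common encodings are also available through static properties on Encoding: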
Encoding Name | Static Property on Encoding |
---|---|
UTF-8 | Encoding.UTF8 |
UTF-16 | Encoding.Unicode |
UTF-32 | Encoding.UTF32 |
ASCII | Encoding.ASCII |
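```csharp
using System;
using System.Text;

// On .NET Core and .NET 5+, code-page encodings such as GB18030 are available
// only after registering the code pages provider:
Encoding.RegisterProvider (CodePagesEncodingProvider.Instance);

// The easiest way to instantiate a correctly configured encoding class is to
// call Encoding.GetEncoding with a standard IANA name:
Encoding utf8 = Encoding.GetEncoding ("utf-8");
Encoding chinese = Encoding.GetEncoding ("GB18030");

Console.WriteLine (utf8.EncodingName);
Console.WriteLine (chinese.EncodingName);

// The static GetEncodings method returns a list of all supported encodings:
foreach (EncodingInfo info in Encoding.GetEncodings())
  Console.WriteLine (info.Name);
```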
Encoding for file and stream I/O
The Encoding object controls how text is read from and written to a file or stream. UTF-8 is the default text encoding for all file and stream I/O.
```csharp
// Writes "Testing..." to a file called data.txt in UTF-16 encoding
System.IO.File.WriteAllText ("data.txt", "Testing...", System.Text.Encoding.Unicode);
```
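Reading the text back works the same way. The sketch below writes to a temporary file (the file name encoding-demo.txt is a made-up example) and passes the same Encoding object to File.ReadAllText. The exact file size is printed rather than assumed, since it includes any byte-order mark the writer emits.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical file name, placed in the temp directory to avoid clutter.
string path = Path.Combine(Path.GetTempPath(), "encoding-demo.txt");

File.WriteAllText(path, "Testing...", Encoding.Unicode);      // UTF-16

// Pass the same encoding when reading the text back.
string roundTrip = File.ReadAllText(path, Encoding.Unicode);
Console.WriteLine(roundTrip);                                 // Testing...

// At least 10 characters * 2 bytes each, plus any byte-order mark.
long fileSize = new FileInfo(path).Length;
Console.WriteLine(fileSize);

File.Delete(path);
```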
Encoding to byte arrays
An Encoding object can also convert text to and from a byte array. The GetBytes method converts from string to byte[] using the given encoding. The GetString method converts from byte[] back to string.
```csharp
byte[] utf8Bytes  = System.Text.Encoding.UTF8.GetBytes    ("0123456789");
byte[] utf16Bytes = System.Text.Encoding.Unicode.GetBytes ("0123456789");
byte[] utf32Bytes = System.Text.Encoding.UTF32.GetBytes   ("0123456789");

Console.WriteLine (utf8Bytes.Length);    // 10
Console.WriteLine (utf16Bytes.Length);   // 20
Console.WriteLine (utf32Bytes.Length);   // 40

string original1 = System.Text.Encoding.UTF8.GetString    (utf8Bytes);
string original2 = System.Text.Encoding.Unicode.GetString (utf16Bytes);
string original3 = System.Text.Encoding.UTF32.GetString   (utf32Bytes);

Console.WriteLine (original1);           // 0123456789
Console.WriteLine (original2);           // 0123456789
Console.WriteLine (original3);           // 0123456789
```
UTF-16 and surrogate pairs
A string’s Length property may be greater than its real character count.
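A single char is 16 bits wide, so it is not always enough to fully represent a Unicode character.
Characters with code points above U+FFFF need two 16-bit words (two chars) in UTF-16. Each of the two words is called a surrogate, and together they form a surrogate pair.
The char type provides static methods to check for surrogates: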
```csharp
bool IsSurrogate     (char c)
bool IsHighSurrogate (char c)
bool IsLowSurrogate  (char c)
bool IsSurrogatePair (char highSurrogate, char lowSurrogate)
```
The StringInfo class in the System.Globalization namespace also provides a range of methods and properties for working with two-word characters.
```csharp
using System;

int musicalNote = 0x1D161;

string s = char.ConvertFromUtf32 (musicalNote);
Console.WriteLine (s.Length);                                         // 2 (surrogate pair)

Console.WriteLine (char.ConvertToUtf32 (s, 0).ToString ("X"));        // 1D161 (consumes two chars)
Console.WriteLine (char.ConvertToUtf32 (s[0], s[1]).ToString ("X"));  // 1D161 (explicitly specify two chars)
```
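To illustrate the gap between Length and the real character count, the sketch below counts code points by hand with char.IsSurrogatePair and compares the result with StringInfo.LengthInTextElements. The sample string (U+1F600 between two ASCII letters) is an arbitrary choice for illustration.

```csharp
using System;
using System.Globalization;

// "a", then U+1F600 (outside the 16-bit range), then "b"
string s = "a" + char.ConvertFromUtf32(0x1F600) + "b";

Console.WriteLine(s.Length);          // 4: the middle character takes two chars

// Count code points by stepping over the low half of each surrogate pair.
int codePoints = 0;
for (int i = 0; i < s.Length; i++)
{
    if (char.IsSurrogatePair(s, i)) i++;   // skip the low surrogate
    codePoints++;
}
Console.WriteLine(codePoints);        // 3

// StringInfo treats a surrogate pair as a single text element.
int textElements = new StringInfo(s).LengthInTextElements;
Console.WriteLine(textElements);      // 3
```

Text elements and code points agree for this string, but they can differ when combining characters are involved, because StringInfo groups a base character with its combining marks.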