String Builder

The StringBuilder class, in the System.Text namespace, represents a mutable string. Its most popular use is concatenating many strings without allocating a new string at each step. To clear the contents of a StringBuilder, either instantiate a new one or set its Length to zero.

Functions

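Member        Description
Append        Appends a value to the end of the builder
Insert        Inserts a value at a given index
Remove        Removes a range of characters
Replace       Replaces all occurrences of a character or string
AppendLine    Appends a value and adds a newline sequence
AppendFormat  Accepts a composite format string, as in String.Format
Length        Property that gets or sets the length of the string

StringBuilder sb = new StringBuilder();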

for (int i = 0; i < 50; i++) sb.Append (i + ",");

// To get the final result, call ToString():
Console.WriteLine (sb.ToString());

sb.Remove (0, 60);		// Remove first 60 characters
sb.Length = 10;			// Truncate to 10 characters
sb.Replace (",", "+");	// Replace comma with +
sb.ToString().Dump();

sb.Length = 0;			// Clear StringBuilder
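
The snippet above does not exercise Insert, AppendLine or AppendFormat. The following is a minimal sketch of those members, written as a top-level program (which assumes C# 9 or later); the sample strings are made up for illustration.

```csharp
using System;
using System.Text;

var sb = new StringBuilder();

sb.Append ("world");                              // "world"
sb.Insert (0, "Hello, ");                         // "Hello, world"
sb.AppendLine ("!");                              // appends "!" plus Environment.NewLine
sb.AppendFormat ("{0} + {1} = {2}", 2, 3, 2 + 3); // composite format string

string result = sb.ToString();
Console.WriteLine (result);
// Hello, world!
// 2 + 3 = 5

sb.Clear();                                       // same effect as sb.Length = 0
Console.WriteLine (sb.Length);                    // 0
```

Clear is a convenience method that has the same effect as setting Length to zero.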

Text Encodings and Unicode

A character set is an allocation of characters each with a numeric code or code point.

Unicode

Has an address space of approximately one million code points (U+0000 to U+10FFFF), of which only a portion is currently allocated.

Covers most spoken world languages, historical languages and special symbols.

ASCII

Set of first 128 characters of the Unicode set.

Text Encoding

Maps characters from their numeric code point to a binary representation.

In .NET, text encoding is important when dealing with text files or streams.

A text encoding can restrict which characters can be represented, and it also affects storage efficiency.
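
Both effects can be seen with a short experiment. This is a sketch only: the sample word is an assumption chosen because it contains one non-ASCII character (é, written as \u00E9 so the source file's own encoding does not matter).

```csharp
using System;
using System.Text;

string text = "caf\u00E9";   // "café", with a precomposed é (U+00E9)

// ASCII cannot represent é, so by default it is replaced with '?'.
byte[] asciiBytes = Encoding.ASCII.GetBytes (text);
string asciiRoundTrip = Encoding.ASCII.GetString (asciiBytes);
Console.WriteLine (asciiRoundTrip);                      // caf?

// The same four characters take a different number of bytes per encoding.
int utf8Count  = Encoding.UTF8.GetBytes (text).Length;
int utf16Count = Encoding.Unicode.GetBytes (text).Length;
int utf32Count = Encoding.UTF32.GetBytes (text).Length;

Console.WriteLine (utf8Count);    // 5  (é needs two bytes)
Console.WriteLine (utf16Count);   // 8  (two bytes per character)
Console.WriteLine (utf32Count);   // 16 (four bytes per character)
```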

Two categories of text encoding in .NET:

1. Map Unicode characters to another character set

2. Use standard Unicode encoding schemes

Obtaining an Encoding Object

The Encoding class in System.Text is the common base type for classes that encapsulate text encoding.

The easiest way to obtain a correctly configured encoding is to call Encoding.GetEncoding with a standard IANA character set name. The most common encodings are also available as static properties on Encoding:

Encoding Name    Static Property on Encoding
UTF-8            Encoding.UTF8
UTF-16           Encoding.Unicode
UTF-32           Encoding.UTF32
ASCII            Encoding.ASCII

// Call Encoding.GetEncoding with a standard IANA name:
Encoding utf8 = Encoding.GetEncoding ("utf-8");

// On .NET Core and .NET 5+, code-page encodings such as GB18030 must first be
// registered: Encoding.RegisterProvider (CodePagesEncodingProvider.Instance);
Encoding chinese = Encoding.GetEncoding ("GB18030");

utf8.Dump();
chinese.Dump();

// The static GetEncodings method returns a list of all supported encodings:
foreach (EncodingInfo info in Encoding.GetEncodings())
	Console.WriteLine (info.Name);

Encoding for file and stream I/O

The Encoding object controls how text is read from and written to a file or stream. UTF-8 is the default text encoding for all file and stream I/O.

System.IO.File.WriteAllText ("data.txt", "Testing...", Encoding.Unicode);
// Writes "Testing..." to a file called data.txt in UTF-16 encoding
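
Reading the file back works the same way. The sketch below assumes a scratch file in the system temp directory (the file name is hypothetical) and passes the same encoding to both calls.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical scratch file, used only for this demonstration.
string path = Path.Combine (Path.GetTempPath(), "encoding-demo.txt");

// Write the text as UTF-16, then read it back with the same encoding.
File.WriteAllText (path, "Testing...", Encoding.Unicode);
string roundTrip = File.ReadAllText (path, Encoding.Unicode);

Console.WriteLine (roundTrip);   // Testing...

File.Delete (path);              // clean up the scratch file
```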

Encoding to byte arrays

An Encoding object can also convert text to and from a byte array. The GetBytes method converts from string to byte[] with the given encoding, and the GetString method converts from byte[] back to string.

byte[] utf8Bytes  = System.Text.Encoding.UTF8.GetBytes    ("0123456789");
byte[] utf16Bytes = System.Text.Encoding.Unicode.GetBytes ("0123456789");
byte[] utf32Bytes = System.Text.Encoding.UTF32.GetBytes   ("0123456789");

Console.WriteLine (utf8Bytes.Length);    // 10
Console.WriteLine (utf16Bytes.Length);   // 20
Console.WriteLine (utf32Bytes.Length);   // 40

string original1 = System.Text.Encoding.UTF8.GetString    (utf8Bytes);
string original2 = System.Text.Encoding.Unicode.GetString (utf16Bytes);
string original3 = System.Text.Encoding.UTF32.GetString   (utf32Bytes);

Console.WriteLine (original1);          // 0123456789
Console.WriteLine (original2);          // 0123456789
Console.WriteLine (original3);          // 0123456789

UTF-16 and surrogate pairs

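.NET stores strings as UTF-16, so each char is a 16-bit code unit. As a result, a string's Length property may be greater than its real character count.

A single char is not always enough to fully represent a Unicode character.

Characters outside the 16-bit range are encoded as two chars (two 16-bit words), called a surrogate pair: a high surrogate followed by a low surrogate.

Static methods on char to check for surrogates: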

bool IsSurrogate (char c)
bool IsHighSurrogate (char c)
bool IsLowSurrogate (char c)
bool IsSurrogatePair (char highSurrogate, char lowSurrogate)

The StringInfo class in the System.Globalization namespace also provides a range of methods and properties for working with two-word characters.

int musicalNote = 0x1D161;

string s = char.ConvertFromUtf32 (musicalNote);
s.Length.Dump();	// 2 (surrogate pair)

char.ConvertToUtf32 (s, 0).ToString ("X").Dump();			// Consumes two chars
char.ConvertToUtf32 (s[0], s[1]).ToString ("X").Dump();		// Explicitly specify two chars
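
As a rough illustration of the StringInfo class and the char surrogate checks, the sketch below compares Length with the number of text elements. The surrounding letters "A" and "B" are an assumption added so the difference is easy to see.

```csharp
using System;
using System.Globalization;

// One two-char (surrogate pair) character between two ordinary letters.
string s = "A" + char.ConvertFromUtf32 (0x1D161) + "B";

int charCount = s.Length;                                    // counts UTF-16 code units
int elementCount = new StringInfo (s).LengthInTextElements;  // counts text elements

Console.WriteLine (charCount);                               // 4
Console.WriteLine (elementCount);                            // 3

bool isPair = char.IsSurrogatePair (s[1], s[2]);
Console.WriteLine (char.IsHighSurrogate (s[1]));             // True
Console.WriteLine (char.IsLowSurrogate (s[2]));              // True
Console.WriteLine (isPair);                                  // True
```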
