Wednesday, April 04, 2012

Marshaling UTF-8: Harder than it ought to be.

A few days ago, I was looking at the codeplex project libspotifydotnet, published by a friend of mine, Scott. The project is a managed wrapper around Spotify's cross-platform API which, of course, is written in C.

Some of you may know Scott as the author of Jamcast, a super-awesome media server that lets you play your local media library over all kinds of UPnP devices: game consoles, PCs, purpose-built DLNA devices, even an Android smartphone. Some of the most interesting interop work I've ever done has been to support Jamcast, including a fascinating dive into JNI and the Android bionic runtime. So, I was not surprised that this particular interop API yielded some more interesting stuff.

In this case, the wrapper around libspotify brought up a couple of issues that are commonly found in third-party libraries, particularly cross-platform ones, but rarely arise when you're doing Windows P/Invoke. The one I want to focus on today is the string handling: libspotify strings are UTF-8 encoded, and if you're not careful, this can wreak havoc on your interop code. So, let's see how to be careful.

Needs More IntPtr

The libspotifydotnet code consists almost entirely of function imports (there's a handful of structs, enums, and delegates) that look like this:
public static extern IntPtr sp_playlist_track_message(IntPtr playlistPtr, int index);

public static extern IntPtr sp_playlist_name(IntPtr playlistPtr);

public static extern sp_error sp_playlist_rename(IntPtr playlistPtr, IntPtr newNamePtr);
You should immediately notice the problem here: almost every parameter and return value is an IntPtr! When interoperating with Windows itself, IntPtrs are usually a sign of a very poorly-thought-out interop wrapper; C# and the P/Invoke runtime give us a lot of choices in how to declare our elements correctly. A lot of that, however, relies on the fact that Microsoft implemented their API in a very P/Invoke-compatible way. (To be more accurate, they implemented P/Invoke primarily to target their API.) When dealing with third-party code, and especially cross-platform code, our options tend to be much more limited.

Part of the problem is that libspotify hides most of its key types internally, exposing them to the API as opaque pointers only. In the C header, these are declared as incomplete structure types (named, but never given a body), e.g.
typedef struct sp_track sp_track;
typedef struct sp_playlist sp_playlist;
C permits us to do this, as long as we never try to use the type as anything other than a pointer. If we try to take the sizeof() of an sp_track or read fields from an sp_playlist, then we'll get a compiler error. This is a very common technique in third-party libraries, where you aren't supposed to mess with the contents of their structures. Since all we get are pointers, the best we can do is IntPtr.

The bigger problem, though, is that some of those IntPtr values are strings. For example, the API header has this:
const char * sp_playlist_name(sp_playlist *playlist);
As you can see, it was translated using IntPtrs for both parameter and return value. Why not use a string? Because that char * points to a set of characters that P/Invoke doesn't know how to handle: UTF-8 encoded data. Unfortunately, while the modern development community (including Microsoft) has mostly settled on UTF-8 as the "correct" choice for encoding string data, Windows is still carrying around its pre-Unicode baggage, and here's where it shows.

Wide vs. Multi-Byte vs. Single-Byte Character Sets

The handling of character sets on Windows has a long and complex history, dating way back to the original IBM mainframes and their use of code pages -- actual tables of glyphs that represented a set of characters. I'll try to summarize the key points here, but I strongly recommend you go read Joel Spolsky's excellent treatise on the matter first. I'll wait.

Thanks to its lineage as an MS-DOS derivative, Windows text encoding is saddled with a need to support a number of obsolete standards, pseudo-standards, and proprietary systems for encoding character information. Fortunately, since the introduction of UTF-16 into Windows 2000, those have all been replaced with a stable encoding system based on Unicode, but the legacy cruft is still there. (For the nit-picky: Windows NT 3.x and 4 used UCS-2, an older standard that is a subset of UTF-16 with a more limited character set.)

Windows uses what Microsoft still calls "code pages", but would more properly be called character encodings, to deal with the encoding of text information as numeric data. Since Windows uses UTF-16 internally, the code page's primary job is to map legacy character data back to Unicode code points for internal processing and display, or vice versa for serializing to disk or network.

When Windows is installed, you select a default Windows code page based on your native language. (Microsoft stopped calling these "ANSI" code pages years ago, probably because they got sick of explaining why they had nothing to do with ANSI despite their name.) For most Western European users, this will be code page 1252 (often mistakenly called ISO-8859-1, a similar but not identical ISO standard), the Latin 1 code page. This encoding maps 256 characters from various European languages to a single 8-bit value. In Microsoft parlance, this is a single-byte character set.

To support languages with more than 256 characters, some code pages are multi-byte character sets (as far as I know, "multi" here is never larger than "double"). Again, this term is somewhat misleading; these are more accurately called variable-length encodings, but the term "multi-byte" is ingrained into the Windows API. In a multi-byte code page, certain characters are encoded as a single byte, typically those that overlap with ASCII, while the rest are encoded as two bytes. In order to properly decode such a string, you need to start from the beginning and read byte-for-byte, because a given byte value may appear as either the first or the second byte in a sequence. This makes things like indexing or extracting substrings much slower than they have to be. (UTF-8, as an example, does not have this problem, but it makes the encoding slightly bigger.)
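To make the parenthetical about UTF-8 concrete: every UTF-8 continuation byte has the bit pattern 10xxxxxx and no lead byte does, so a single byte is enough to tell whether you are at the start of a character. The following is only an illustrative sketch; the class and method names are made up for this example and it assumes well-formed input.

```csharp
// Illustrative helper (hypothetical names, not part of any library):
// counts code points in well-formed UTF-8 by skipping continuation bytes.
public static class Utf8Scan
{
    // Continuation bytes always look like 10xxxxxx.
    public static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Assumes the input is well-formed UTF-8; no validation is attempted.
    public static int CountCodePoints(byte[] utf8)
    {
        int count = 0;
        foreach (byte b in utf8)
        {
            if (!IsContinuation(b))
            {
                count++;
            }
        }
        return count;
    }
}
```

For the string "From Α to Φ", the UTF-8 form is 13 bytes long and the scan reports 11 code points, since each Greek letter contributes one lead byte and one continuation byte.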

To avoid this problem, the Windows NT line of operating systems introduced an alternative encoding scheme, taking advantage of the then-newly-developed Unicode standard, in which all characters were encoded as two bytes. By Windows 2000, Microsoft had settled on the UTF-16 encoding for all internal strings, which is how things have remained to this day. To support this, it also takes advantage of a new feature from the C90 standard, the wide character type, wchar_t. The size of a wchar_t, unlike char, is implementation-defined; in Windows this is a 16-bit value and almost universally contains UTF-16 data. (On many Unix systems, particularly Linux with glibc, wchar_t is a 32-bit value, encoded with UCS-4/UTF-32.)

While wide character strings (usually written as LPWSTR) made life easier for Windows developers, using UTF-16 to store or transmit text information can be a significant waste of space. Data that could be represented in a single-byte code page takes up double the space when encoded as UTF-16. As a result, most file and network information was still encoded using the legacy code pages. More importantly, the desktop OSes in the Windows 95 line did not support UTF-16, but they did support multi-byte code pages. To help here, the Windows API features introduced around this time typically had both "ANSI" versions, which used the system's current code page, and "Wide" versions, which used UTF-16. For applications that needed to deal with both, Microsoft introduced the WideCharToMultiByte and MultiByteToWideChar API calls.

If you look at the definition for those functions, you may start to get an idea why this is such a problem for P/Invoke functions. The first parameter to the conversion functions specifies which code page to use on the multi-byte side of the equation, but you can't just pass any code page you want. Primarily, you end up using one of two options here: CP_ACP, the "current Windows code page", and CP_UTF8. Starting with Windows 2000, UTF-8 is one of the available multi-byte code pages, but it's handled a bit differently from the others. Unlike the double-byte code pages, UTF-8 can take up to four bytes to encode a character, so it cannot be used everywhere other code pages can be. In particular, you cannot set it as the active code page (setlocale will fail if you try). So, while Microsoft strongly encourages developers to use UTF-8, and makes it easy to do so explicitly, you cannot configure Windows to do so automatically.

UTF-8 Data And P/Invoke

With that background out of the way, let's see how this all plays out in our interop code. To start with, we'll work up a very simple native C DLL, with a single exported function. This function returns a constant string, encoded as UTF-8. To demonstrate the various ways that P/Invoke handles strings, we're going to define three different interop methods with the same entry point.
[DllImport("nativelib", EntryPoint = "pinvoke_name", CharSet = CharSet.Ansi)]
[return: MarshalAs(UnmanagedType.LPStr)]
private static extern string pinvoke_name_a();

[DllImport("nativelib", EntryPoint = "pinvoke_name", CharSet = CharSet.Unicode)]
[return: MarshalAs(UnmanagedType.LPWStr)]
private static extern string pinvoke_name_w();

[DllImport("nativelib", EntryPoint = "pinvoke_name")]
private static extern IntPtr pinvoke_name_ptr();

A cross-platform library like libspotify most likely uses a Unicode support library, of which ICU is the most popular, but we'll just use the Windows API to get our UTF-8 string. First, let's try an implementation that matches what most third-party libraries are probably doing.
const wchar_t *data = L"From Α to Φ";
char *utf8 = (char*)malloc(30);
memset(utf8, 0, 30);

WideCharToMultiByte(CP_UTF8, 0, data, -1, utf8, 29, NULL, NULL);

return utf8;
This fails pretty spectacularly, and for reasons completely unrelated to the character set. The problem here is that malloc; whenever the P/Invoke code sees a string return value, it copies the unmanaged data into a managed string, then calls CoTaskMemFree on the original pointer. But our memory wasn't allocated by the COM allocator, so that free attempt blows up.

That alone is reason enough to force us to use IntPtr return values, but let's look deeper anyway. Let's change our native code to use the shared allocator that .NET expects, like a Windows-based library probably would:
const wchar_t *data = L"From Α to Φ";
char *utf8 = (char*)CoTaskMemAlloc(30);

WideCharToMultiByte(CP_UTF8, 0, data, -1, utf8, 29, NULL, NULL);

return utf8;
This gets us further, but not much. If we call either of our first two methods, and allow P/Invoke to produce a string for us, the results are not encouraging:
From Î‘ to Î¦
Clearly, the wide-character version is wrong; the data we got back is some meaningless bytes and what looks like parts of another string. (The "?" characters are put there by the Unicode "best fit" algorithm when it sees invalid code points, which happens all the time when you mix up encodings.) The "Ansi" version is closer, close enough that we may not even notice that it's broken! But since we happen to have a few non-Latin characters in our string, we can see where things go bad. What should have been a pair of Greek capital letters came out as two pairs of Latin-1 characters.

The problem here is that P/Invoke needs to convert our string data from a multi-byte character set, as indicated by the UnmanagedType.LPStr attribute, to the wide character set. To do this, it calls MultiByteToWideChar, as we'd expect, but it always passes a code page of CP_ACP. Since the active code page for my machine is windows-1252, a single-byte encoding, the two-byte UTF-8 sequences get translated into "Î‘" and "Î¦". The "Î" character in windows-1252 is mapped to one of the bytes used by UTF-8 to indicate a two-code-unit sequence, so seeing "Î" followed by a second character is another common indicator that you got the encoding wrong.
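The effect is easy to reproduce in managed code alone. The sketch below is illustrative only: it uses a small hand-written table for the windows-1252 range 0x80..0x9F instead of asking the runtime for code page 1252, so it does not depend on which code pages are installed, and the class and method names are invented for this example.

```csharp
using System.Text;

// Illustrative only (hypothetical names): decodes bytes the way a
// windows-1252 reader would, to reproduce the mojibake described above.
public static class Mojibake
{
    // windows-1252 differs from Latin-1 only in 0x80..0x9F; this table
    // covers that range (unassigned slots map to the matching C1 control).
    private static readonly char[] High =
    {
        '\u20AC', '\u0081', '\u201A', '\u0192', '\u201E', '\u2026', '\u2020', '\u2021',
        '\u02C6', '\u2030', '\u0160', '\u2039', '\u0152', '\u008D', '\u017D', '\u008F',
        '\u0090', '\u2018', '\u2019', '\u201C', '\u201D', '\u2022', '\u2013', '\u2014',
        '\u02DC', '\u2122', '\u0161', '\u203A', '\u0153', '\u009D', '\u017E', '\u0178'
    };

    // Treats every byte as one windows-1252 character.
    public static string DecodeAs1252(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
        {
            sb.Append(b >= 0x80 && b <= 0x9F ? High[b - 0x80] : (char)b);
        }
        return sb.ToString();
    }

    // Encodes as UTF-8, then misreads the bytes as windows-1252.
    public static string Garble(string s)
    {
        return DecodeAs1252(Encoding.UTF8.GetBytes(s));
    }
}
```

Α (U+0391) is encoded in UTF-8 as the bytes CE 91 and Φ (U+03A6) as CE A6, so Mojibake.Garble("From Α to Φ") yields "From Î‘ to Î¦": the byte CE becomes Î, 91 becomes a left single quotation mark, and A6 becomes a broken bar.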

So, it's clear we're gonna have to use IntPtr to get the string data out; let's see how easy that is. Maybe the Marshal class can do all the work for us:
var ptr = pinvoke_name_ptr();
var sptr = Marshal.PtrToStringAnsi(ptr);
Unfortunately, this gives us the same results as before. This shouldn't surprise us all that much; the P/Invoke code uses these same method calls at runtime to automatically marshal strings. If it was that easy, P/Invoke would just get it right for us. It looks like we're going to have to do the work ourselves.

To get our data, we'll need to first extract the data out of the IntPtr into a managed byte array. If we happen to know how long our string is (perhaps it's an output parameter in our API, or we knew ahead of time) we can bulk-copy it using Marshal.Copy(). If you do this, be sure not to include the terminating null character in your byte array, or you'll have one in your final string as well. If we don't know exactly how long our string is, we extract individual bytes from the pointer until we read the null terminating byte. Either way, once we have our UTF-8 data in a byte array, the Encoding.UTF8.GetString() method will do the rest:
var data = new List<byte>();
var ptr = pinvoke_name_ptr();
var off = 0;
while (true)
{
  // Read one byte at a time until we hit the terminating null.
  var ch = Marshal.ReadByte(ptr, off++);
  if (ch == 0)
    break;
  data.Add(ch);
}
string sptr = Encoding.UTF8.GetString(data.ToArray());
Finally, we get the string we're expecting. Not the most efficient way to go about it, but you get the point.
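A somewhat cheaper variant is to scan for the terminator first and then copy the bytes in a single Marshal.Copy call. This is only a sketch: the Utf8Marshal class and its method name are hypothetical, and it assumes the pointer refers to a null-terminated buffer that stays valid for the duration of the call. (Newer .NET runtimes also ship a built-in Marshal.PtrToStringUTF8, which is worth preferring where it is available.)

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper class, written for illustration.
public static class Utf8Marshal
{
    // Reads a NUL-terminated UTF-8 string from unmanaged memory.
    // Does not free the pointer; ownership stays with whoever allocated it.
    public static string PtrToStringUtf8(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero)
        {
            return null;
        }

        // First pass: find the terminating 0 byte.
        int length = 0;
        while (Marshal.ReadByte(ptr, length) != 0)
        {
            length++;
        }

        // Second pass: copy everything before the terminator in one call.
        var buffer = new byte[length];
        Marshal.Copy(ptr, buffer, 0, length);
        return Encoding.UTF8.GetString(buffer);
    }
}
```

With this in place, the earlier call site reduces to Utf8Marshal.PtrToStringUtf8(pinvoke_name_ptr()).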

To go the other way, and pass a UTF-8 string into an unmanaged library, we essentially reverse the process. Notice here that I'm explicitly leaving an extra 0 byte at the end of the buffer. Encoding.GetBytes will not automatically do this for you, but C will definitely expect it.
var s = "From Α to Φ";

var bytes = Encoding.UTF8.GetByteCount(s);
var buffer = new byte[bytes + 1];

Encoding.UTF8.GetBytes(s, 0, s.Length, buffer, 0);
var ptr = Marshal.AllocCoTaskMem(bytes + 1);
Marshal.Copy(buffer, 0, ptr, bytes + 1); // include the trailing 0 byte; AllocCoTaskMem does not zero memory
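Note that the buffer is sized with GetByteCount rather than s.Length, because the two differ as soon as the string contains non-ASCII characters. A minimal sketch of just the managed half of the conversion, with a hypothetical class and method name:

```csharp
using System.Text;

// Hypothetical helper, written for illustration.
public static class Utf8Buffer
{
    // Returns the UTF-8 bytes of s followed by a single 0 terminator.
    public static byte[] ToNullTerminated(string s)
    {
        int byteCount = Encoding.UTF8.GetByteCount(s);
        var buffer = new byte[byteCount + 1];   // new arrays are zero-filled
        Encoding.UTF8.GetBytes(s, 0, s.Length, buffer, 0);
        return buffer;
    }
}
```

For "From Α to Φ", s.Length is 11 but the returned array holds 14 bytes: 13 bytes of UTF-8 data plus the terminator.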

There's one more problem to deal with here: this implementation is leaking memory like a sieve. We are allocating memory for our string on every call, and it's never being freed. When it's memory we allocated in managed code, this is easy: call Marshal.FreeCoTaskMem after we're done with it. But what if the native code is the one doing the allocation, as in our prior examples?

How to handle that depends entirely on the third-party library itself, but there are a couple of common patterns. They all share the same basic idea: whichever native allocator you use to allocate your character buffer, you must use the same allocator to free it. Your basic options are:
  1. Do nothing. In the libspotify case, for example, the strings are managed entirely by the API; consumers of their API are expected not to free the strings they get back. This is very common when using opaque pointers.
  2. Use one of the Windows API allocators in your native code, and the corresponding free method in C#.  If the native code used LocalAlloc or CoTaskMemAlloc, we can use Marshal.FreeHGlobal or Marshal.FreeCoTaskMem, respectively, to free that memory. Microsoft generally encourages the use of the COM shared allocator for this purpose, which is why P/Invoke automatically tries to use that to free memory.
  3. Use explicit free methods in the native API. For example, if we want to continue to use malloc in our native code, we would need a pinvoke_free() method that called free(); or we could use new[] and a corresponding delete[]. The ffmpeg library, for example, behaves this way. 
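Whichever option applies, the pairing of allocation and release can be kept in one place on the managed side. The following is a sketch under assumptions rather than a definitive implementation: Utf8Interop, WithUtf8, and TakeUtf8 are hypothetical names, and the release delegate stands in for whatever free routine matches the native allocator.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper class, written for illustration.
public static class Utf8Interop
{
    // Allocates a NUL-terminated UTF-8 copy of s with the COM allocator,
    // hands it to `call`, and always frees it afterwards.
    public static T WithUtf8<T>(string s, Func<IntPtr, T> call)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        IntPtr ptr = Marshal.AllocCoTaskMem(bytes.Length + 1);
        try
        {
            Marshal.Copy(bytes, 0, ptr, bytes.Length);
            Marshal.WriteByte(ptr, bytes.Length, 0);   // terminator
            return call(ptr);
        }
        finally
        {
            Marshal.FreeCoTaskMem(ptr);
        }
    }

    // Decodes a native-owned UTF-8 string, then releases it with the
    // function that matches the native allocator.
    public static string TakeUtf8(IntPtr ptr, Action<IntPtr> release)
    {
        try
        {
            int length = 0;
            while (Marshal.ReadByte(ptr, length) != 0)
            {
                length++;
            }
            var buffer = new byte[length];
            Marshal.Copy(ptr, buffer, 0, length);
            return Encoding.UTF8.GetString(buffer);
        }
        finally
        {
            release(ptr);
        }
    }
}
```

For option 2 the release delegate would be Marshal.FreeCoTaskMem or Marshal.FreeHGlobal; for option 3 it would wrap an import of the library's own free routine, such as the pinvoke_free() suggested above. For option 1, where the library owns the string, TakeUtf8 is the wrong tool: decode the string and leave the pointer alone.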
When writing interop code that I know needs to work with C#, I almost universally go with LPWSTR and CoTaskMemAlloc specifically to avoid the problems we've seen here. But with third party libraries, we have to use whatever data encoding we have. Hopefully now you'll be better equipped to deal with UTF-8 as easily in unmanaged code as we do with managed.


alnoor said...

Great post! Thanks for all the gritty details.

hakuaika said...

Thanks a lot! Is it possible that in the last code listing it must be

Marshal.Copy(buffer, 0, ptr, bytes + 1);

instead of

Marshal.Copy(buffer, 0, ptr, bytes);


Mika said...

Excellent blog post, many thanks!