Friday, January 13, 2012

Hungarian Notation and Interop Code

Those of you without a lot of pre-.NET programming experience may have noticed the very bizarre names used by the various Windows SDK functions, interfaces, and types we looked at during the COM interop series. You may be wondering what, if anything, those cryptic jumbles of letters meant.

The short answer is that they are called Hungarian Notation (a reference to the style of proper names used in Hungary, the native country of the developer who invented it), and the convention has largely fallen out of favor. You can get into some very vicious arguments with long-time developers simply by declaring your love/hate for it. Personally, I don't like to use it, simply because it clashes with the current prefix-free style of C# naming conventions. I have already mentioned my preference for translating structure names into friendlier names; I generally do the same thing for fields and parameters as well (but, somewhat arbitrarily, not interfaces or functions... see below).

The longer answer is much more interesting, and somewhat of a tragedy, in that something with the potential to be very useful was misused to the point of meaninglessness. If you're interested in the details, read on...


The History of Hungarian

The term Hungarian Notation is, today, broadly defined as a naming convention whereby semantic information is encoded into the name of an identifier (variable, method, field, etc.) as a short prefix of letters, each representing a specific aspect of the identifier's function.

Notice that nowhere in that description did we use the word "type", because type information was not originally part of the notation. But I'm getting ahead of myself a bit...

Hungarian notation was formally invented at Microsoft by one of their lead architects, a Hungarian-born programmer named Charles Simonyi, who oversaw, among other things, the original development of Word and Excel. As part of this process, he defined a notation that was used by those products, which has since been retroactively named Applications Hungarian. Simonyi's original intent for Hungarian Notation was to encode meaningful information about the intended purpose of a variable or parameter into its name, in order to reduce the chances of misusing that variable. Unfortunately, in his original paper, Simonyi made a crucial mistake, referring to this information as a variable's "type".

Simonyi seemed to recognize the possible source of confusion, and tried very hard to explain what he meant. For example, he explained the concept of what he terms "Type Calculus", using the example of x and y coordinates. In his system, "x" and "y" would be two different types:
As suggested above, the concept of "type" in this context is determined by the set of operations that can be applied to a quantity. The test for type equivalence is simple: could the same set of operations be meaningfully applied to the quantities in question? If so, the types are thought to be the same. If there are operations that apply to a quantity in exclusion of others, the type of the quantity is different.
[...]The point is that "integers" x and y are not of the same type if Position (x,y) is legal but Position (y,x) is nonsensical. Here we can also sense how the concepts of type and name merge: x is so named because it is an x-coordinate, and it seems that its type is also an x-coordinate. Most programmers probably would have named such a quantity x. In this instance, the conventions merely codify and clarify what has been widespread programming practice.
Under this system, "type" applies to the way we ought to be using a variable, the meaning we apply to the data it contains, and the kinds of things we can do to that data. We can see some of this bleeding through into the Windows SDK, in parameters whose names begin with a "c". For example, the standard COM enumerator interfaces include a Next method typically defined as:
HRESULT Next(
  [in]   ULONG celt,
  [out]  ELEMENT *rgelt,
  [out]  ULONG *pceltFetched
);
The notations here tell us that the first and third parameters are counts of elements, and the second is a range of elements. They tell us little about the underlying data type of the parameters, because that's not what Hungarian notation is about; we know what type they are because it says so right there. In fact, the type of Hungarian used by the Applications team often has many prefixes for a single data type; pointers, for example, can be rg (range), mp (map), grp (group), or p (low-level pointer), while integers can be c (count), d (delta), rw (row), col (column), or cb (count of bytes, or size). You can see a heavy Excel influence in these names, but that's the point: the prefixes are semantically meaningful within the context of the program being written; the compiler has no idea that a row is different from a column or a count, but you do.

This notation is useful because we can now easily see when we are using the parameters in an incorrect manner. For example, if I tried to allocate celt bytes of space for my rgelt parameter, it would clearly be wrong: celt isn't a size, it's a count; I need celt * sizeof(ELEMENT) bytes. Similarly, if I tried to write a success code to pceltFetched, it would clearly be wrong. A very thorough explanation of this concept can be found in this Joel Spolsky article.
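To make that concrete on the managed side, here is a minimal sketch that drains an enumerator through the IEnumString definition shipped in System.Runtime.InteropServices.ComTypes, which follows the same celt/rgelt/pceltFetched convention; the helper name and batch size are my own.
using System;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;

static class EnumeratorHelper
{
    // The prefixes tell us how to use each parameter: celt is a count of elements
    // (so the buffer gets celt slots, not celt bytes), rgelt is a range of elements,
    // and pceltFetched receives a count of elements actually returned.
    public static void PrintAll(IEnumString enumerator)
    {
        const int celt = 16;                       // count of elements per batch
        var rgelt = new string[celt];              // celt slots -- not celt bytes
        IntPtr pceltFetched = Marshal.AllocCoTaskMem(sizeof(int));
        try
        {
            Marshal.WriteInt32(pceltFetched, 0);
            int hr;
            do
            {
                hr = enumerator.Next(celt, rgelt, pceltFetched);
                int fetched = Marshal.ReadInt32(pceltFetched);
                for (int i = 0; i < fetched; i++)
                    Console.WriteLine(rgelt[i]);
            }
            while (hr == 0);                       // S_OK = full batch; S_FALSE (1) = done
        }
        finally
        {
            Marshal.FreeCoTaskMem(pceltFetched);
        }
    }
}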

A few of these suggested prefixes do, in fact, encode the data type into the name. This was done because, in the C language, some data types had names that didn't always match their use. This is particularly true for the char data type, which is the only 8-bit data type available in C. So, when Simonyi suggested using "b" to mean byte and "sz" to mean zero-terminated string, it wasn't because he was trying to encode type information into the name. The "type" of those variables would probably be unsigned char and char *! He was trying to identify the set of operations you could perform on those variables: you could compare a character to 'A', but comparing a byte to 'A' was probably meaningless. (Meaningless in the sense that, to compare a byte to 'A', you have to assume it's actually a character, and thus it's not a byte anymore.)

But Simonyi's choice of the word "type", combined with an apparent overlap between data type and semantic type, would come back to haunt us. Someone else inside Microsoft (Scott Ludwig, of the Windows 3 team, blames the documentation group, who found "real" Hungarian "too dense") misinterpreted Simonyi's paper to mean data type, and a whole new set of useless prefixes was born. These prefixes are found almost everywhere in the Windows SDK and on MSDN, and were then spread to the world through Charles Petzold's classic Programming Windows series, which used this new form of Hungarian exclusively.

Under the new style, each type alias defined in the SDK had a unique prefix associated with it, which produced function signatures that looked like this:
DWORD GetRegionData(
  __in   HRGN hRgn,
  __in   DWORD dwCount,
  __out  LPRGNDATA lpRgnData
);
If you follow the true Hungarian philosophy, dwCount would be a patently ridiculous parameter name: it should be cRegion. And what if I accidentally assign dwCount = dwTotal? Is that right? And two of those parameter names are just the type name, repeated in camel case!

This new style of notation has most generously been termed Systems Hungarian, after the systems team that used it. More commonly you will hear it called things like Anti-Hungarian, "pidgin" Hungarian, or "brain-dead" Hungarian.

Applications Hungarian is a powerful tool. It is often cited as one of the key factors behind the success of Word and Excel. The increased productivity and decreased maintenance costs that came along with this notation allowed Microsoft to bring a competitive product to market with a fraction of the number of people (tens of developers vs. WordPerfect's or Lotus's 100-person teams). Unfortunately, this useful form of Hungarian stayed hidden away in the source code for Word and Excel. Meanwhile, Systems Hungarian was spreading like a plague. Developers heard claims that this new notation Microsoft was using was dramatically boosting its developers' effectiveness, and of course, we all wanted to use it. But all we had to go by was the bastardized Systems Hungarian found in the SDK and books like Petzold's.

This latter notation is what most people think of when they hear "Hungarian", except for the lucky few who use true Hungarian and often swear by it. The "brain-dead" version soon found itself universally scorned by high-profile developers such as Bjarne Stroustrup and Linus Torvalds. It developed such a bad reputation that Microsoft's design guidelines for .NET explicitly reject any form of prefix on public identifiers.

What This Means For Us

So what does all this mean, in practical terms, for our interop code?

During my introduction to COM Interop, I was careful to reuse the field and parameter names as-is when doing a translation. But, as you can probably guess from this article, I don't actually do that in production code.

Strictly speaking, we could use whatever names we want for nearly every aspect of an interop definition, because the managed names are meaningless to the unmanaged code; each element is matched up by something other than its name, with one key exception (see the sketch after this list):
  • P/Invoke functions are looked up by the EntryPoint field in the DllImport attribute (which defaults to the managed method name if you don't specify it)
  • COM interfaces are looked up by the IID in the Guid attribute
  • COM methods on IUnknown or dual interfaces are called by position in the interface.
    • Methods in a dispinterface, however, might be called by name, so those names must be preserved!
  • Parameters are passed based on their position in the method signature
  • Fields are laid out based on their size and offset within a structure
  • Enumerations are passed as integer values
  • User-defined types for parameters or fields are only used to decide how big they are (for marshalling purposes).
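As a quick sketch of that rule in action: the managed names below are entirely my own choices, while GetWindowRect and RECT are the real Win32 definitions being wrapped. The marshaler only cares that the export name (via EntryPoint), the parameter order, and the field layout line up with the native side.
using System;
using System.Runtime.InteropServices;

// The native RECT structure: only the field sizes and their order matter to the
// marshaler, so the managed type and field names are free to be friendlier.
[StructLayout(LayoutKind.Sequential)]
public struct WindowBounds
{
    public int Left;     // RECT.left
    public int Top;      // RECT.top
    public int Right;    // RECT.right
    public int Bottom;   // RECT.bottom
}

internal static class NativeMethods
{
    // The export is located through EntryPoint, so the managed method name and the
    // parameter names are ours to pick; only the parameter order and types have to
    // match BOOL GetWindowRect(HWND hWnd, LPRECT lpRect).
    [DllImport("user32.dll", EntryPoint = "GetWindowRect", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool GetWindowBounds(IntPtr windowHandle, out WindowBounds bounds);
}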
I have drawn my own somewhat arbitrary line at renaming interfaces or method names, but I routinely rename fields, parameters, and other type names to make them more friendly. So the following interface definition in IDL:
[
    object,
    uuid(0000010b-0000-0000-C000-000000000046),
    pointer_default(unique)
]
interface IPersistFile : IPersist
{
 
    typedef [unique] IPersistFile *LPPERSISTFILE;
 
    HRESULT IsDirty
    (
        void
    );
 
    HRESULT Load
    (
        [in] LPCOLESTR pszFileName,
        [in] DWORD dwMode
    );
 
    HRESULT Save
    (
        [in, unique] LPCOLESTR pszFileName,
        [in] BOOL fRemember
    );
 
    HRESULT SaveCompleted
    (
        [in, unique] LPCOLESTR pszFileName
    );
 
    HRESULT GetCurFile
    (
        [out] LPOLESTR *ppszFileName
    );
}
produces a managed C# definition that looks like this:
[ComImport]
[Guid("0000010b-0000-0000-C000-000000000046")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IPersistFile : IPersist
{
    // The runtime only lays out the methods declared directly on this interface,
    // so the inherited IPersist member must be repeated here to keep the vtable
    // slots lined up with the native interface.
    new void GetClassID(out Guid classId);

    // Methods declared with an int return keep the raw HRESULT, which requires
    // [PreserveSig]; without it, the marshaler would expect an extra [out, retval]
    // parameter on the native side.
    [PreserveSig]
    int IsDirty();

    void Load(
        [MarshalAs(UnmanagedType.LPWStr)] string fileName,
        int mode
    );

    [PreserveSig]
    int Save(
        [MarshalAs(UnmanagedType.LPWStr)] string fileName,
        [MarshalAs(UnmanagedType.Bool)] bool remember
    );

    void SaveCompleted(
        [MarshalAs(UnmanagedType.LPWStr)] string fileName
    );

    [PreserveSig]
    int GetCurFile(
        [MarshalAs(UnmanagedType.LPWStr)] out string fileName
    );
}
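And, as a hedged usage sketch: the shell's ShellLink coclass (CLSID_ShellLink) implements IPersistFile, so it makes a convenient test subject for the definition above. The file path is purely illustrative, and the IPersist base interface is assumed to come from earlier in this series.
using System;
using System.Runtime.InteropServices;

// CLSID_ShellLink: new ShellLink() becomes a CoCreateInstance call under the covers.
[ComImport]
[Guid("00021401-0000-0000-C000-000000000046")]
class ShellLink
{
}

static class Program
{
    static void Main()
    {
        var persistFile = (IPersistFile)new ShellLink();
        persistFile.Load(@"C:\some\existing\shortcut.lnk", 0);   // illustrative path; Load throws if it doesn't exist

        string currentFile;
        persistFile.GetCurFile(out currentFile);   // ignoring the S_OK/S_FALSE return here
        Console.WriteLine(currentFile);
    }
}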
Of course, you are free to use or not use Hungarian notation (either flavor), whichever you find most familiar. As long as you are consistent within your own code, everything else is just personal preference.

