16 Feb 2006

NSString, Apple Events and Intel Macs


yesterday, i dropped by the london apple store to test something that had been broken on Intel Macs for the Album Art Widget, and also the EyeTunes framework. the port to universal was quite smooth as i had read through all the documentation and made sure that i didn't do anything that would require byte swapping on my part.

so imagine the surprise i had when i started getting bug reports about it not working on intel macs. i scratched my head a couple of times and could see where the problem might be, but after reading and re-reading the universal binary programming guidelines, i couldn't figure out what i had done wrong.

after commandeering one of the imacs, i ran some test programs to exercise Apple Events. these are basically what AppleScript uses under the hood. to cut a long story short, there appears to be a bug on Intel Macs, despite what Apple tells you in this paragraph in the Byte Swapping Strategies section:


Mac OS X manages system-defined Apple event data types for you, handling them appropriately for the currently executing code. You don't need to perform any special tasks. When the data that your application extracts from an Apple event is system-defined, the system swaps the data for you before giving the event to your application to process. You will want to treat system-defined data types from Apple events as native endian. Similarly, if you put native-endian data into an Apple event that you are sending, and it is a system-defined data type, the receiver will be able to interpret the data in its own native endian format.



the emphasis (on treating system-defined data types from Apple events as native endian) is mine. i read that as: any Apple Event type that was not defined by me should be native endian, eg. on Intel Macs it would be Little Endian and on PPC Macs it would be Big Endian. WRONG! bzzzt.

it turns out that it is nearly true, except for receiving typeUnicodeText (or 'utxt' for those who use the 4 character type codes apple loves). typeUnicodeText is basically a buffer of UniChar, a 16 bit wide char value holding, you guessed it, unicode. turns out that the unicode inside is big endian regardless of architecture.
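to make the mis-decoding concrete, here's a small sketch (python 3 syntax, not part of the original code) of what happens when BOM-less big endian UTF-16 bytes hit a decoder that assumes the other byte order:

```python
# U+6C34 ("water") as big-endian UTF-16 with no BOM -- the raw bytes
# you get back from a typeUnicodeText ('utxt') Apple Event parameter
raw = b"\x6c\x34"

# decoded as big endian, you get the character back correctly
assert raw.decode("utf-16-be") == "\u6c34"

# a decoder that assumes little endian (the native order on Intel)
# reads the two bytes backwards and produces a different character
assert raw.decode("utf-16-le") == "\u346c"
```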

the kicker here is that i then plug the bytes of this string into -[NSString initWithBytes:length:encoding:] with NSUnicodeStringEncoding, and it reads them in as UTF-16LE (little endian) on Intel.

as far as i can see, NSString's interpretation is right. if you feed it what is allegedly UTF-16 without designating what byte order it is (eg. not using UTF-16LE or UTF-16BE), then it should use whatever is the native byte order. even apple's own documentation about unicode text files, later on the same page, explains this:

The constant typeUnicodeText indicates utxt text data, in native byte ordering format, with an optional BOM. This constant does not specify an explicit Unicode encoding or byte order definition.



so it should have extracted the text in native byte ordering, not big endian. for those who want to see the code plus the workaround, here it is:


err = AEGetParamPtr(replyEvent, keyDirectObject, typeUnicodeText, &resultType,
                    replyValue, resultSize, &resultSize);
if (err != noErr) {
    ETLog(@"Unable to get parameter of reply: %d", err);
    goto cleanup_reply_and_tempstring;
}

// workaround: unexpected big endian return value.
// an alternative workaround would be to prepend the big endian BOM.
for (i = 0; i < resultSize/2; i++)
    replyValue[i] = CFSwapInt16BigToHost(replyValue[i]);

replyString = [[[NSString alloc] initWithBytes:replyValue
                                        length:resultSize
                                      encoding:NSUnicodeStringEncoding] autorelease];
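the alternative workaround mentioned in the comment, prepending the big endian BOM instead of swapping in place, looks something like this in python 3 (a sketch of the idea, not the actual EyeTunes code):

```python
def decode_utxt(raw):
    """decode BOM-less big-endian 'utxt' bytes by sticking the big
    endian BOM (0xFE 0xFF) on the front, so the generic utf-16
    decoder picks the right byte order on any architecture."""
    return (b"\xfe\xff" + raw).decode("utf-16")

# U+6C34 ("water") as BOM-less big-endian UTF-16
assert decode_utxt(b"\x6c\x34") == "\u6c34"
```

in the Cocoa version the equivalent would be prepending the two BOM bytes to the buffer before handing it to NSString, instead of running the swap loop.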



although far from being definitive, this is what you get with python on a PPC machine:


>>> water = "\346\260\264"
>>> # water is the UTF-8 representation of the character "water" in chinese.
>>> water.decode('utf-8')
u'\u6c34'
>>> water.decode('utf-8').encode('utf-16')
'\xfe\xffl4'
>>> water.decode('utf-8').encode('utf-16le')
'4l'
>>> water.decode('utf-8').encode('utf-16be')
'l4'
>>> water.decode('utf-8').encode('utf-16be').decode('utf-16')
u'\u6c34'
>>> water.decode('utf-8').encode('utf-16le').decode('utf-16')
u'\u346c'


notice that in python, if you do not specify the flavour of UTF-16, you get the BOM at the front. that is the \xfe\xff you see in front of 'l4'. now, notice that i use UTF-16BE, which omits the BOM (correct again), and feed the result back to python's generic UTF-16 decoder. it assumes the native byte ordering (big endian on PPC) and outputs the correct unicode character, the same one as what we started with. if we feed it the little endian version, of course it comes out wrong because the bytes have been swapped.

now moving on to an intel linux machine:


>>> water = "\346\260\264"
>>> water.decode('utf-8')
u'\u6c34'
>>> water.decode('utf-8').encode('utf-16')
'\xff\xfe4l'
>>> water.decode('utf-8').encode('utf-16le').decode('utf-16')
u'\u6c34'


now note that this time, if we don't specify the flavour, python adds a different BOM to the front (\xff\xfe). this is the BOM for little endian. and contrary to the previous example, when the flavour of UTF-16 is not specified, python assumes little endian (native endian) all the way.
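putting the two sessions together: once a BOM is present, the host's native order stops mattering entirely. a quick python 3 check with the same character as above:

```python
# the BOM, not the host byte order, decides how the bytes are read
assert b"\xfe\xff\x6c\x34".decode("utf-16") == "\u6c34"  # BE BOM + BE bytes
assert b"\xff\xfe\x34\x6c".decode("utf-16") == "\u6c34"  # LE BOM + LE bytes
```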

so getting back to Apple Events, what the event decoder should have been doing is converting typeUnicodeText into a NATIVE ENDIAN UniChar * rather than just defaulting to big endian. i don't know whether iTunes runs in rosetta or natively on Intel Macs, but either way, Apple Events should be converted and byte swapped properly, as the documentation says.
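in python 3 terms, what the decoder should be doing with 'utxt' data is something like this (a sketch of the idea, not apple's actual code):

```python
import array
import sys

def utxt_to_host_order(raw):
    """turn big-endian 'utxt' bytes into 16-bit values in host byte
    order -- the same job as the CFSwapInt16BigToHost loop above."""
    chars = array.array("H", raw)  # raw interpreted in host order
    if sys.byteorder == "little":
        chars.byteswap()           # source was big endian, so swap
    return chars

# U+6C34 comes out right whether the host is big or little endian
assert list(utxt_to_host_order(b"\x6c\x34")) == [0x6C34]
```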

final note: creating an Apple Event descriptor does not exhibit this bug. that is, if you send an Apple Event with typeUnicodeText and you give it a little endian unicode string, the receiver on the other side of the Apple Event gets the right unicode string. so clearly this is a bug in receiving Apple Events on intel macs. now, i would file a radar bug report if only i were worth enough to do that. apparently worthiness is deemed by how much money you can chuck at apple to get an ADC Select account.

