17 Jan 2006

pros and cons of using preview to keyword files


i still haven't given up on my idea to tag files on my mac.

i've been noticing 10.4's preview.app's ability to keyword particular types of files such as PDF and JPEG. with JPEG, what it does is pretty standard: preview adds IPTC tags to the actual file! yes, it's actually doing the correct thing, unlike iphoto. to try it, fire up preview and open an image. in my case, it's a photo taken at the apple store in london on new year's eve :)

[sidenote: no, they don't let you count down to the new year at the apple store .. :P]

[screenshot: Preview Tag]

so you get there by pressing Cmd+I. when you save the file and look in finder, the file has the proper spotlight keywords attribute, as expected.

[screenshot: Preview Getinfo]

so notice the keyword field is there, along with the email address of one of the computers in the apple store, if you were so inclined to give those internet leechers some email :) if you drag the file into iphoto, iphoto actually recognises those keywords. i'm not sure whether this is an iphoto 6 development or what, but they do get imported.

checking the JPEG with libiptcdata shows that the keywords are indeed in there, embedded as IPTC:

[screenshot: Iptc Info]

[note: i couldn't get the keyword information out using exiv2's command line tool, even though it claims to extract IPTC information.]

so it seems like preview is doing the right thing, which is nice. somewhere inside preview's code, it actually deals with IPTC data.
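for the curious, here is roughly what a tool has to do to dig those keywords out of a JPEG, as a minimal python sketch. this is not libiptcdata's code. it assumes the layout that photoshop-style writers use: an APP13 segment starting with "Photoshop 3.0", an 8BIM resource with id 0x0404 inside it, and keywords stored as IPTC record 2, dataset 25. i haven't confirmed that preview writes exactly this layout, the function names are made up for illustration, and decoding the text as UTF-8 is a guess.

```python
import struct


def iptc_keywords(jpeg_bytes):
    """Return the IPTC keywords (record 2, dataset 25) found in a JPEG."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    pos = 2
    # Walk the segments that precede the image data.
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:  # start of scan: metadata segments are over
            break
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xED and payload.startswith(b"Photoshop 3.0\x00"):
            return _keywords_from_photoshop(payload[14:])
        pos += 2 + length
    return []


def _keywords_from_photoshop(data):
    """Scan 8BIM resource blocks for the IPTC one (resource id 0x0404)."""
    pos = 0
    while pos + 12 <= len(data) and data[pos:pos + 4] == b"8BIM":
        (res_id,) = struct.unpack(">H", data[pos + 4:pos + 6])
        name_len = data[pos + 6]
        pos += 7 + name_len
        if (1 + name_len) % 2:  # length byte + name is padded to an even size
            pos += 1
        (size,) = struct.unpack(">I", data[pos:pos + 4])
        block = data[pos + 4:pos + 4 + size]
        if res_id == 0x0404:
            return _keywords_from_iptc(block)
        pos += 4 + size + (size % 2)  # resource data is padded to even too
    return []


def _keywords_from_iptc(block):
    """Collect every record 2, dataset 25 value from a run of IPTC datasets."""
    keywords = []
    pos = 0
    while pos + 5 <= len(block) and block[pos] == 0x1C:
        record, dataset = block[pos + 1], block[pos + 2]
        (size,) = struct.unpack(">H", block[pos + 3:pos + 5])
        if size & 0x8000:  # extended-length dataset, not handled in this sketch
            break
        value = block[pos + 5:pos + 5 + size]
        if (record, dataset) == (2, 25):
            # Assumption: UTF-8 text; older writers may use other encodings.
            keywords.append(value.decode("utf-8", errors="replace"))
        pos += 5 + size
    return keywords
```

you would call it as iptc_keywords(open("photo.jpg", "rb").read()), where photo.jpg stands in for whatever file you tagged. it only looks at the first matching APP13 segment and gives up on anything unusual, so treat an empty list as "not found by this sketch" rather than "no keywords".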

however, things aren't so rosy for PDFs. i noticed the same keyword panel for PDFs, but do not use it for PDFs! if you do and then save the file, you will lose your table of contents and maybe some other useful data that is in the formatting. why?

because preview is using PDFKit to set the document attributes, via [PDFDocument documentAttributes]. not only does it strip the table of contents off the PDF file, if it had any, it also grows the file size by 20-50% depending on how the file was made. it seems like apple's own PDF generation is pretty lossy.

i haven't found anyone tackling this issue of preserving the PDF file contents while altering the metadata. is it that difficult to do? there are some ways to get the metadata out, such as using libextractor (i only realised because there is a security vulnerability for exactly this feature!), but no one seems to have written a libexif- or libiptcdata-type library that edits the data inside while preserving the general structure of the PDF.
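one route that looks feasible is the PDF format's incremental update mechanism: leave every original byte alone and append a new info dictionary, a one-entry cross-reference section, and a trailer that points back at the old one. the python below is only a sketch under heavy assumptions: the file uses a classic "xref" table (not a cross-reference stream), it is not encrypted, the keywords are plain ASCII, and you don't mind that the new info dictionary replaces the old one instead of merging with it (so an existing title or author would need copying over). it also drops any /ID entry from the trailer. append_keywords and pdf_string are names i made up.

```python
import re


def pdf_string(text):
    """Escape text as a PDF literal string, e.g. a(b) -> (a\\(b\\))."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def append_keywords(pdf_bytes, keywords):
    """Return pdf_bytes followed by an incremental update with new keywords."""
    start = pdf_bytes.rfind(b"startxref")
    if start < 0:
        raise ValueError("no startxref found")
    prev = int(pdf_bytes[start + 9:].split()[0])  # offset of the last xref
    if pdf_bytes[prev:prev + 4] != b"xref":
        raise ValueError("cross-reference stream: not handled by this sketch")

    # Pull /Size and /Root out of the existing trailer dictionary.
    trailer_at = pdf_bytes.find(b"trailer", prev)
    trailer = pdf_bytes[trailer_at:start]
    size = int(re.search(rb"/Size\s+(\d+)", trailer).group(1))
    root = re.search(rb"/Root\s+(\d+\s+\d+\s+R)", trailer).group(1)

    out = bytearray(pdf_bytes)  # original bytes are kept untouched
    if not out.endswith(b"\n"):
        out += b"\n"

    # New Info dictionary, stored under the first unused object number.
    obj_offset = len(out)
    info = "<< /Keywords %s >>" % pdf_string(", ".join(keywords))
    # .encode("ascii") fails loudly on non-ASCII keywords (see assumptions).
    out += b"%d 0 obj\n%s\nendobj\n" % (size, info.encode("ascii"))

    # One-entry xref section plus a trailer chained to the old one via /Prev.
    xref_offset = len(out)
    out += b"xref\n%d 1\n%010d 00000 n \n" % (size, obj_offset)
    out += b"trailer\n<< /Size %d /Root %s /Info %d 0 R /Prev %d >>\n" % (
        size + 1, root, size, prev)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

because nothing before the appended section changes, the table of contents and everything else in the original file survive byte for byte, and the file only grows by the size of the update. try it on a copy first; i would not trust this on files i care about without checking the result in a couple of readers.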

to go back to the first paragraph: this is all just messing around. i've rediscovered the joys of extended attributes after some prodding. xattr is part of the 10.4 BSD layer, and that means we can store all sorts of goodies in there with a nice API. my new approach to tagging files will be to store the tags in the extended attributes space (which allows for up to 4K of data, plenty if you're only storing text tags and some author and title information) and then write a spotlight importer that simply extracts those things and puts them into the spotlight database. simple. so simple that i think someone might have done it already. if not, i'm going to take a swipe at getting it working. this definitely beats dealing with mac os x's crazy mds (metadata server).
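the storage half of that plan might look something like the sketch below. it assumes the third-party xattr python module is installed and uses its setxattr and getxattr functions. the attribute name org.example.tags, the newline-separated UTF-8 format, and the helper names are all made up for illustration, and the 4096-byte cap is just the 4K figure from above turned into a check, not a limit i have measured.

```python
MAX_XATTR_BYTES = 4096          # assumption: the 4K figure quoted above
TAG_ATTR = "org.example.tags"   # hypothetical attribute name


def encode_tags(tags):
    """Serialise tags as sorted, de-duplicated, newline-separated UTF-8."""
    cleaned = sorted({t.strip() for t in tags if t.strip()})
    if any("\n" in t for t in cleaned):
        raise ValueError("tags may not contain newlines")
    data = "\n".join(cleaned).encode("utf-8")
    if len(data) > MAX_XATTR_BYTES:
        raise ValueError("tags need %d bytes, over the assumed limit" % len(data))
    return data


def decode_tags(data):
    """Inverse of encode_tags: bytes back to a list of tag strings."""
    return [t for t in data.decode("utf-8").split("\n") if t]


def write_tags(path, tags):
    """Store the tags in the file's extended attributes."""
    import xattr  # third-party module, imported lazily so the rest works without it
    xattr.setxattr(path, TAG_ATTR, encode_tags(tags))


def read_tags(path):
    """Read the tags back, or return [] if they cannot be read."""
    import xattr
    try:
        return decode_tags(xattr.getxattr(path, TAG_ATTR))
    except OSError:
        # Missing attribute, or any other OS-level failure reading it.
        return []
```

so write_tags("photo.jpg", ["london", "apple store"]) followed by read_tags("photo.jpg") should hand the tags back in sorted order, with photo.jpg standing in for a real file. the spotlight importer half is a separate plug-in and isn't sketched here.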

for those who want to know more about xattr, check out the ars technica description or the extremely simple xattr python module by Bob Ippolito.

