Has Microsoft Found Its Format?

6/2/2005 11:13:32 PM

I am so glad that Microsoft is close to abandoning the BIFF (Binary Interchange File Format) that Word and Excel use in favor of XML. It was designed for a different era and is really messy to process, hard to extend, and hard to keep compatible with prior versions. The existing file format, for example, makes it difficult for Excel to add support for more than 56 colors; for the longest time, Word was the only word processor that couldn’t do nested tables. Hack upon hack, kludge after kludge. Underdocumented, undocumented, misdocumented: the most recent Office file format documentation available on the Web dates from Office 97, contains sparse information, and is often not even accurate. Messy as the format is, it demands precision from the ISVs who construct it. The only reliable way to ensure that non-Office applications save Office file formats properly is to test against actual Office documents and analyze their raw bytes by hand.

The problem was obvious to every developer but, apparently, needed to bubble up and affect customers and product schedules before it attracted the attention of management. The new XML file format (and C++/CLI) gives me confidence that Office might never actually require a rewrite (like Windows 9X did) to rid itself of its 1980s heritage of arbitrary limitations and poor programming.

The new file format is simply brilliant. It is compatible with humans: it is as accessible to regular users as HTML and XML are. Any non-developer can change the file extension to .zip and immediately double-click the file to access all the internal XML files and objects such as pictures. A hundred years from now, people will still be able to read these documents without the original applications, because the documents are plain XML text inside a standard ZIP archive.
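
As a rough illustration of that accessibility, here is a minimal Python sketch that treats a document as an ordinary ZIP archive, lists the entries inside it, and reads one XML part back as text. The part names (document.xml, media/image1.png) and the XML content are made up for the demo and are not the actual Office 12 layout. The sketch builds a tiny stand-in archive in memory so that it runs without a real document.

```python
import io
import zipfile


def build_sample_package() -> bytes:
    """Build a tiny stand-in package in memory (hypothetical part names)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("document.xml", "<doc><p>Hello, world</p></doc>")
        archive.writestr("media/image1.png", b"\x89PNG placeholder bytes")
    return buffer.getvalue()


def list_parts(package_bytes: bytes) -> list:
    """Return the name of every entry stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        return archive.namelist()


def read_text_part(package_bytes: bytes, name: str) -> str:
    """Read one XML part back out as plain text."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        return archive.read(name).decode("utf-8")


package = build_sample_package()
print(list_parts(package))
print(read_text_part(package, "document.xml"))
```

Nothing here depends on Office itself: any tool that understands ZIP and text can do the same, which is the point being made above.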

Then I came to the realization that it might not be over just yet. Every other version of Office has introduced a new, incompatible file format.

  • Office 4 required that Office documents be moved to complicated compound documents (a FAT file system in a file), which were needed to support Object Linking and Embedding.
  • Office 97 introduced the last new incompatible Office file formats, keeping the same extensions. It also introduced some other oddities. Excel supported the “Fat File Format”, which combined both the Excel 95 and Excel 97 formats into one file to ease migration. Word introduced high-fidelity support for WordPerfect documents, possibly allowing them to be the default (for legal customers); this mirrors Excel’s strong support for Lotus files in an earlier version.
  • Office 2000 introduced full-fidelity HTML versions of Office documents in addition to the binary versions, for fear that the Web would make Office documents less relevant.
  • Office 2003 again introduced new XML versions of the Excel (actually introduced in Office XP) and Word file formats, in addition to the two earlier ones, HTML and BIFF.
  • Office 2006 will again introduce a new set of file formats (in addition to the previous three) based on a ZIP-compressed archive consisting of multiple XML files. The XML in these files will necessarily differ from that of 2003 in order to support new Office concepts such as parts and relationships, along with a new organization that separates different sections of the file and embedded objects into different files.
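
The idea of parts and relationships can be sketched in a few lines of Python. The file, element and attribute names below (relationships.xml, Relationship, Id, Target) are hypothetical placeholders chosen for illustration, not the real Office 12 schema, which this post does not describe. The sketch only shows the general shape: a small index part maps identifiers to other parts in the archive, so a reader can locate the main document or an embedded picture without scanning every file.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Hypothetical relationships part: maps an id to the part it points at.
RELATIONSHIPS_XML = """<Relationships>
  <Relationship Id="main" Target="document.xml"/>
  <Relationship Id="pic1" Target="media/image1.png"/>
</Relationships>"""


def build_package() -> bytes:
    """Build an in-memory ZIP with an index part and two content parts."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("relationships.xml", RELATIONSHIPS_XML)
        archive.writestr("document.xml", "<doc><picture ref='pic1'/></doc>")
        archive.writestr("media/image1.png", b"placeholder image bytes")
    return buffer.getvalue()


def load_relationships(archive: zipfile.ZipFile) -> dict:
    """Parse the index part into an {id: target part name} mapping."""
    root = ET.fromstring(archive.read("relationships.xml"))
    return {rel.get("Id"): rel.get("Target") for rel in root.findall("Relationship")}


def resolve_part(package_bytes: bytes, relationship_id: str) -> bytes:
    """Follow a relationship id to its target part and return the raw bytes."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        target = load_relationships(archive)[relationship_id]
        return archive.read(target)


package = build_package()
print(resolve_part(package, "main").decode("utf-8"))
print(len(resolve_part(package, "pic1")))
```

Keeping embedded objects in their own parts means a picture can be replaced or extracted without touching the XML that refers to it.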

Will this be the last file format? It could well be, since it is elegant and has proper extensibility.

One last wish: I hope Avalon considers following in Office's footsteps and abandons any plans to use OLE compound documents (DocFiles) to store Avalon documents (which use the .container extension). It should probably be using ZIP as well.

UPDATE (6/3): From Office Zealot, there are indications that Avalon is using a structured storage implementation based on ZIP. This could also be the same as the "12x" file format that Office 12 uses and that is available to third-party applications through the System.IO.Packaging namespace in the WinFX SDK.
