Office Open XML

11/23/2005 3:24:00 AM

Microsoft, in case you haven’t heard, just announced that it is submitting Office 12 XML to the ECMA standards body for standardization after feeling pressure from the European Union. ECMA will produce the official documentation, which anyone can use without encumbrance, but Microsoft will retain exclusive ownership of the format.

This is the same procedure used for the C# standard, although I do believe that the standards body in that case actually forced Microsoft to make a few changes, such as disallowing method invocation on a null reference. This submission to a standards body also mirrors the approach used by Adobe in standardizing PDF, which has been accepted as open by certain governments, notably Massachusetts and the EU.

Microsoft once had the dream (“Word Everywhere”) of making its formats pervasive on the Internet, not just the intranet, in the belief that simply building and distributing free Office document viewers would accomplish this, much as Acrobat Reader did for PDF. Those goals went unrealized because of the closed nature of the formats: potential users needed to contact Microsoft to secure access to file format documentation.

If Microsoft had opened up its formats earlier, maybe they could have been woven into the fabric of the Web just as PDF was. Maybe now they can be. One thing seems inevitable: more documents will be created by third-party software than by Microsoft software.

A quick browse around the blogosphere shows the perennial naysayers still putting their own negative spin on the announcement: (1) the formats are not truly open, since only Microsoft can extend them, and (2) Microsoft is forcing customers to deal with two document standards instead of one, OpenDocument.

My Own Tangle With Microsoft Formats

I previously looked at reading and writing Word binary documents from my own application. The documentation available on the web was outdated (current only as of Office 97), and access to the latest formats required a trip to Microsoft’s legal department.

I did manage to crack through the OLE compound document container and extract the main text and various other document records before determining that working directly with DOC files would be a waste of time, especially since the binary formats will become obsolete in another year and other products, such as OpenOffice, still have some difficulties with Office formats.
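
To give a flavor of what cracking the container involves, here is a minimal sketch of the very first step: recognizing a compound file by its signature and reading the sector sizes from the header. It is written in Python purely for illustration and is not the code from my application; the names read_cfb_header and CFB_SIGNATURE are made up for this example, and everything past the header (the FAT, the directory, the Word streams themselves) is left out.

```python
import struct

# Eight-byte signature that opens every OLE compound file,
# the container that wraps binary DOC files.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(data: bytes) -> dict:
    """Parse a few fixed fields from the start of a compound file header."""
    if len(data) < 34 or data[:8] != CFB_SIGNATURE:
        raise ValueError("not an OLE compound file")
    # Bytes 8..23 hold a CLSID (skipped here). Bytes 24..33 hold five
    # little-endian 16-bit values: minor version, major version,
    # byte-order mark, sector shift and mini-sector shift.
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    if byte_order != 0xFFFE:
        raise ValueError("unexpected byte-order mark")
    return {
        "major_version": major,
        "minor_version": minor,
        "sector_size": 1 << sector_shift,        # sizes are stored as powers of two
        "mini_sector_size": 1 << mini_shift,
    }
```

Passing the first few dozen bytes of a DOC file to read_cfb_header is enough to tell whether the file is a compound document at all. Locating the actual text means walking the sector chains from there, which is where the real work starts.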

I also looked at some third-party vendors’ libraries. One company wanted me to pay a minimum of $60,000 plus substantial royalties. Others designed their licenses for intranet use and imposed various restrictions that weren’t viable in a mainstream application.

I decided to only read and write RTF and HTML directly. RTF has a number of advantages over DOC files: RTF is as open as Office 12 XML; retains all document features; is text-based and regular, so it is easy to parse and roundtrip; and has public, up-to-date and complete documentation. Aside from the lack of XML and tool support, RTF has most of the advantages of Office 12 XML plus full support in all versions of Microsoft Word.
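
To illustrate why a text-based, regular format is easy to parse, here is a minimal RTF tokenizer sketch, again in Python and again not my application's code. It assumes only the basic lexical rules: braces delimit groups, a backslash followed by letters is a control word with an optional numeric parameter and an optional trailing space, and a backslash followed by any other character is a control symbol. The names tokenize_rtf and plain_text are hypothetical, and the sketch deliberately ignores details such as \'hh hex escapes, \bin data and skippable destinations.

```python
import re

# One token per match: group open/close, a control word with an optional
# numeric parameter (and one optional delimiting space), a control symbol
# (backslash + one non-letter), or a run of plain text.
RTF_TOKEN = re.compile(
    r"(?P<open>\{)"
    r"|(?P<close>\})"
    r"|\\(?P<word>[A-Za-z]+)(?P<param>-?\d+)? ?"
    r"|\\(?P<symbol>[^A-Za-z])"
    r"|(?P<text>[^\\{}]+)",
    re.DOTALL,
)


def tokenize_rtf(source):
    """Return a list of (kind, value, param) tuples for an RTF string."""
    tokens = []
    for m in RTF_TOKEN.finditer(source):
        if m.group("open"):
            tokens.append(("open", None, None))
        elif m.group("close"):
            tokens.append(("close", None, None))
        elif m.group("word") is not None:
            param = m.group("param")
            tokens.append(
                ("control", m.group("word"), int(param) if param is not None else None)
            )
        elif m.group("symbol") is not None:
            tokens.append(("symbol", m.group("symbol"), None))
        else:
            tokens.append(("text", m.group("text"), None))
    return tokens


def plain_text(tokens):
    """Join text runs and escaped literals, ignoring all formatting."""
    parts = []
    for kind, value, _ in tokens:
        if kind == "text":
            parts.append(value)
        elif kind == "symbol" and value in "{}\\":
            parts.append(value)  # \{ \} \\ are escaped literal characters
    return "".join(parts)
```

For example, tokenizing {\rtf1\ansi Hello \b world\b0 !} yields a group open, the control words rtf (parameter 1) and ansi, the text run "Hello ", and so on. Because every token is kept, writing the tokens back out in order is enough for a basic roundtrip.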

By reading and writing RTF, I also trivially enabled support for DOC and other document formats: whenever my application detected an installed version of Microsoft Word on the system, it called conversion functions in the Word object model to translate those files to and from RTF.
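
The shape of that fallback can be sketched as follows. This is an illustration under stated assumptions rather than my actual code: the word_converter callable is a hypothetical stand-in for whatever wraps the Word object model (it takes a path and returns the path of an RTF copy), and it is simply None when Word is not installed.

```python
import os
from typing import Callable, Optional

# Formats the application reads directly, with no help from Word.
NATIVE_EXTENSIONS = {".rtf", ".htm", ".html"}


def resolve_readable_path(
    path: str, word_converter: Optional[Callable[[str], str]]
) -> str:
    """Return a path the native RTF/HTML readers can open.

    word_converter is a hypothetical hook: given the path of a DOC (or
    other Word-readable) file, it returns the path of an RTF copy. Pass
    None when no installed copy of Word was detected.
    """
    extension = os.path.splitext(path)[1].lower()
    if extension in NATIVE_EXTENSIONS:
        return path  # already readable as-is
    if word_converter is None:
        raise ValueError(f"cannot open {path}: Microsoft Word is not available")
    return word_converter(path)  # let Word do the format conversion
```

Keeping the converter behind a single callable means the rest of the application only ever sees RTF or HTML, whether or not Word happens to be installed.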

I will be supporting Microsoft Word’s XML format directly when it comes out.






Net Undocumented is a blog about the internals of .NET including Xamarin implementations. Other topics include managed and web languages (C#, C++, Javascript), computer science theory, software engineering and software entrepreneurship.