BitTitan Code Ninjas Improve Google Vault Migrations

At BitTitan, we do whatever it takes to make sure you have the freedom to move your data between a large variety of sources and destinations. Sometimes those systems change, however, and it is our job to adapt.

Recently, while working on exporting data from Google Vault, we encountered a number of cases where we had to respond to changes in how the data was exported and presented to us.

Google Vault can store all emails for a given Gmail account, including those that have been permanently deleted, so that a historical record of emails sent and received can later be generated.

One popular feature of Gmail is its ability to tag an email with one or many labels. This functionality is used to replace the usual organization of email into different folders, and can be much more flexible.

Unfortunately, Google Vault does not export these labels with its emails. Instead, it exports just a subset of important labels (such as INBOX, SENT, DRAFTS, etc.).


Emails that are labeled but not tagged with one of these well-known labels are exported without any label information, and BitTitan migrates them to a folder called “Archive”.


However, in the span of just a few hours, this behavior changed, changed back and then changed again right under our feet.

During testing of our code, magically the custom labels started to appear in the exported data, in addition to the well-known ones. Thinking this would be an excellent feature to provide to our customers (exporting the labeled email into a folder with the same name), we began to investigate.

We discovered a number of limitations here, mainly due to the format that the labels are exported in. They are placed into an XML file as a comma-separated list of labels, e.g. “label1, label-2, label3”.

The problems here are:

  • It is possible in Gmail to name a label with a comma as part of its name (e.g. “label1,label2”). Unfortunately, there is no way to tell from the exported file that this was not an email tagged with both “label1” and “label2”, as the string looks exactly the same.
  • A similar thing happens with nested labels. Gmail allows labels to be nested beneath each other (e.g. “toplabel/middlelabel/bottomlabel”). Ideally this would be reflected in the folders that BitTitan creates at the destination. Unfortunately, these labels are presented in the output format as “toplabel-middlelabel-bottom-label”, and again, it is possible in Gmail to create a label that contains dashes as part of its name.
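
Both ambiguities can be reproduced in a few lines of Python. This is only a sketch: `export_labels` is a hypothetical stand-in that mimics the behavior described above (labels joined with commas, nesting slashes flattened to dashes), not Google's actual exporter.

```python
def export_labels(labels):
    """Mimic the observed export: flatten nesting to dashes, join with commas.

    Hypothetical helper based only on the behavior described in this post.
    """
    return ",".join(label.replace("/", "-") for label in labels)


# One label whose name contains a comma vs. two separate labels.
one_label_with_comma = ["label1,label2"]
two_separate_labels = ["label1", "label2"]

# A nested label vs. a flat label whose name contains a dash.
nested_label = ["toplabel/middlelabel"]
flat_label_with_dash = ["toplabel-middlelabel"]

# In both cases the exported strings are identical, so the original
# label structure cannot be recovered from the export alone.
print(export_labels(one_label_with_comma) == export_labels(two_separate_labels))
print(export_labels(nested_label) == export_labels(flat_label_with_dash))
```

Both comparisons print `True`, which is exactly why the original labels cannot be reconstructed reliably.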

 


We decided to take a relatively simple approach: splitting the label string at each comma and creating a folder for each element that we found.

For example, for a label string of “label1,label2, toplabel-middlelabel-bottom-label” we would create the folders:

  • label1
  • label2
  • toplabel-middlelabel-bottom-label

We would not attempt to reconstruct the hierarchical nature of the labels as they appear in Gmail. We modified our code to process the labels this way and performed our basic testing, and everything looked great.
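
The splitting step can be sketched as follows. The function name is hypothetical, and trimming whitespace and skipping empty or repeated names are assumptions added for the example. This is not the production code.

```python
def folders_from_label_string(label_string):
    """Split an exported label string at each comma into folder names.

    Dashes are left untouched, so no label hierarchy is reconstructed.
    """
    folders = []
    for part in label_string.split(","):
        name = part.strip()  # assumption: surrounding spaces are not significant
        if name and name not in folders:  # assumption: skip empties and repeats
            folders.append(name)
    return folders


print(folders_from_label_string("label1,label2, toplabel-middlelabel-bottom-label"))
# ['label1', 'label2', 'toplabel-middlelabel-bottom-label']
```

A label whose name really does contain a comma ends up as two folders under this approach, which is the trade-off accepted for simplicity.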

Then, during a large test scenario, a rogue folder called “UNREAD” started to appear, and the folders for the custom labels suddenly weren’t being created anymore. Even worse, we hadn’t changed our code since it was all working.


So we started debugging, and discovered that the format of the labels had indeed completely changed in just a few hours. We were now back to having no custom labels exported, but there was suddenly a new “well known” label of “UNREAD”.

Maybe someone at Google had added the UNREAD label and decided to roll back the change to export custom labels? It’s hard to tell, but we hope that one day they reappear so that we can offer this feature to our customers.

So far they are still not present, but we are ready as soon as Google is!

All of this is in addition to the export format that Google uses (an old UNIX format for mail called ‘mbox’), which doesn’t contain any folder information. This means we have to process an independent “index” file that tells us which labels belong to which email and then, when we process the emails, reunite each one with the appropriate labels.
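
A rough Python sketch of that reuniting step, using the standard-library `mailbox` module. The index is represented here by a plain dictionary keyed by Message-ID, which is an assumption made for the example. The real index file is XML and its layout is not shown in this post. Messages with no labels fall back to the “Archive” folder mentioned earlier.

```python
import mailbox
import os
import tempfile

SAMPLE_MBOX = (
    "From alice@example.com Mon Jan  1 00:00:00 2018\n"
    "Message-ID: <1@example.com>\n"
    "Subject: first\n"
    "\n"
    "Hello\n"
    "\n"
    "From bob@example.com Mon Jan  1 00:05:00 2018\n"
    "Message-ID: <2@example.com>\n"
    "Subject: second\n"
    "\n"
    "World\n"
)

# Hypothetical stand-in for the parsed index file: Message-ID -> labels.
LABEL_INDEX = {"<1@example.com>": ["INBOX", "UNREAD"]}


def folders_by_message(mbox_path, label_index):
    """Pair each message in the mbox file with its labels from the index."""
    result = {}
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            message_id = message["Message-ID"]
            labels = label_index.get(message_id, [])
            # No label information at all: use the "Archive" folder.
            result[message_id] = labels or ["Archive"]
    finally:
        box.close()
    return result


def demo():
    """Write the sample mbox to a temporary file and process it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "export.mbox")
        with open(path, "w", newline="\n") as handle:
            handle.write(SAMPLE_MBOX)
        return folders_by_message(path, LABEL_INDEX)


print(demo())
```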

This format has been around since the beginning of the Internet, and exists in multiple different (and incompatible!) variants in various mail systems around the world.

The mbox format was also invented before the popular adoption of mail attachments, or non-English characters, so over the years it has been adapted to be able to handle both of these, which leads to its own challenges in processing the files accurately.

We were using a third-party tool to process the files (which are in a deceptively easy format), and all was looking great until a single email in a mailbox of 4500+ messages failed to process.

Initially we suspected it was due to a malformed mbox file that had been exported, but on further investigation it became apparent that it was almost certainly a bug in the “MIME” decoder code in this tool.

[Figure: example of MIME content in an mbox message]

We could have attempted to single out and fix this bug, but that would leave the question as to how many more bugs that we hadn’t detected were still present in the code.

So we decided to replace this MIME parser with a much more robust one that we use elsewhere to process data. The problem was that the software we were using combined mbox and MIME parsing in such a way that separating them out was extremely difficult.

This led to us having to re-implement the mbox parser separately, and integrate that with our own well-tested and debugged MIME parser.
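
To illustrate the separation of concerns, here is a minimal Python sketch, not our actual implementation. `split_mbox` only cuts the file into raw messages, and Python's standard `email` package stands in for the separate MIME parser. The splitter assumes that every message starts with a line beginning with “From ” that follows a blank line. Real mbox variants differ on this and on how such lines are escaped inside message bodies, which this sketch does not handle.

```python
from email import message_from_bytes, policy

SAMPLE = (
    b"From alice@example.com Mon Jan  1 00:00:00 2018\n"
    b"Subject: first\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    b"Z3LDvMOfZQ==\n"
    b"\n"
    b"From bob@example.com Mon Jan  1 00:05:00 2018\n"
    b"Subject: second\n"
    b"\n"
    b"plain body\n"
)


def split_mbox(raw):
    """Yield the raw bytes of each message, without its "From " separator line."""
    current = []
    previous_blank = True  # the start of the file counts as a boundary
    for line in raw.splitlines(keepends=True):
        if line.startswith(b"From ") and previous_blank:
            if current:
                yield b"".join(current)
            current = []
        else:
            current.append(line)
        previous_blank = line.strip() == b""
    if current:
        yield b"".join(current)


def parse_messages(raw):
    """MIME parsing is its own step, so the parser can be swapped independently."""
    return [message_from_bytes(chunk, policy=policy.default)
            for chunk in split_mbox(raw)]


for message in parse_messages(SAMPLE):
    print(message["Subject"], "->", message.get_content().strip())
```

Because the splitter hands over plain bytes, a bug in MIME decoding can be fixed or the decoder replaced without touching the mbox handling.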

Once we had achieved this, the troublesome email (and the entire mailbox) that was causing the problems processed without error, and we were once again on the hunt to eliminate anything else that could compromise the integrity of our customers’ migrated data.
