How to fix Email transformation and indexing bug in 5.0
feb 23 2015
Categories : Notes
One of Loftux's clients recently started to experience performance problems, beginning right after the upgrade to 5.0.c Community. Once we dug into the issue, we quickly noted a few things.
- They use IMAP extensively as a way to get emails into Alfresco. Only 3 folders are exposed to IMAP. Once an email is dragged into one of those, it is classified inside Alfresco via a user dialog and then moved to final storage.
- The number of emails has now reached 150,000+, which is quite a lot. They do not stay in the IMAP folders permanently, though, since they are moved once classified.
- The index size on disk kept growing until it was several times larger than the actual content store. No matter how much we enlarged the disk, it was never enough.
Users could enter pretty much any random combination of 3-5 characters in a search and get a hit. This told us that something had to be wrong with the transformation of eml (RFC822 email file) to text. And sure enough, the EMLTransformer always returned the entire eml text, even for multipart emails. Since our client uses MS Exchange, there is also an attachment called winmail.dat that is base64 encoded. You would see something like the text below if you transformed the email to text. Transforming it should return only "Some email text", but it returns everything.
Some email text
If you want to know whether you are affected, create a rule on a folder that transforms email to text and drop an eml file into that folder. If the result contains anything similar to the text above (the internal email structure, such as MIME headers, boundary lines or base64 blocks), your system has the same issue. If not, your test email was probably not multipart (that is, it did not contain both a text and an HTML version, and had no attachments).
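If you have many transformed files to look through, a rough automated check can help. The sketch below is only a heuristic, not part of Alfresco: the class name `MimeLeakCheck`, the method `looksLikeRawMime` and the 60-character threshold are all made up for illustration. It flags text that contains MIME part headers or long lines consisting only of base64 characters.

```java
import java.util.regex.Pattern;

/** Heuristic check for leaked MIME structure in transformed text (illustrative only). */
final class MimeLeakCheck {

    // A MIME part header at the start of a line, e.g. "Content-Type: text/html".
    private static final Pattern PART_HEADER =
            Pattern.compile("(?im)^Content-(Type|Transfer-Encoding|Disposition):");

    // A line made up solely of 60+ base64 characters, typical of an encoded attachment.
    // The threshold of 60 is an arbitrary assumption, chosen to avoid matching normal words.
    private static final Pattern BASE64_LINE =
            Pattern.compile("(?m)^[A-Za-z0-9+/]{60,}={0,2}$");

    private MimeLeakCheck() {
    }

    /** Returns true if the text appears to contain raw MIME headers or base64 data. */
    static boolean looksLikeRawMime(String text) {
        return PART_HEADER.matcher(text).find() || BASE64_LINE.matcher(text).find();
    }

    public static void main(String[] args) {
        String clean = "Some email text";
        String leaked = "Some email text\n"
                + "Content-Type: application/ms-tnef; name=\"winmail.dat\"\n"
                + "Content-Transfer-Encoding: base64\n";
        System.out.println("clean:  " + looksLikeRawMime(clean));
        System.out.println("leaked: " + looksLikeRawMime(leaked));
    }
}
```

A plain email that happens to quote a `Content-Type:` line at the start of a line would be a false positive, so treat a hit as a reason to look at the file, not as proof.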
I have yet to verify whether there are situations where the multipart transformation does work. It may work for certain types of email, or for particular country and language codes.
The problem for our client wasn't only that the indexing got messed up; the preview of the email also became a problem. When they do the final classification, the dialog shows the metadata and a preview of the document. With 5.0, the preview is by default rendered by the HTML5 viewer. Since the text after transformation contained not only the email text but also the base64 encoding of attachments and more, the preview of an email could easily be 300-500 pages long, while the actual email text was usually just the first page. You can imagine the load this puts on the server when classifying a large number of emails per day.
Loftux maintains a build of Alfresco Community for our clients that is more up to date and where we add fixes on their behalf. In this case we fixed EMLTransformer.java. The main issue was these lines, which reset the multipart encoding so that the email was interpreted as an email containing only text.
if (charset != null)
    mimeMessage.setHeader("Content-Type", "text/plain; charset=" + charset.name());
Removing these lines made things work again. They are most likely a leftover from a recent email library upgrade: needed at the time, but they seem to break the email transformation in the current version.
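To see why overwriting the Content-Type header has this effect, consider a toy model of what a MIME parser does. The sketch below is not JavaMail or Alfresco code. The class `ContentTypeOverrideDemo`, the method `visibleSections`, the boundary `b1` and the sample body are invented for illustration, and the parsing is deliberately naive. The point is only that the boundary needed to split the parts lives in the Content-Type header, so once that header says `text/plain` the whole raw body is treated as the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of multipart splitting, driven only by the Content-Type value. */
final class ContentTypeOverrideDemo {

    // Extracts the boundary parameter, with or without surrounding quotes.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?i)boundary=\"?([^\";]+)\"?");

    private ContentTypeOverrideDemo() {
    }

    /** Returns the sections a naive reader would extract from the body. */
    static List<String> visibleSections(String contentType, String body) {
        List<String> sections = new ArrayList<>();
        boolean multipart =
                contentType.toLowerCase(Locale.ROOT).startsWith("multipart/");
        Matcher m = BOUNDARY.matcher(contentType);
        if (!multipart || !m.find()) {
            // Not multipart (or no boundary): the entire body counts as the text.
            sections.add(body);
            return sections;
        }
        String delimiter = "--" + m.group(1);
        for (String chunk : body.split(Pattern.quote(delimiter))) {
            String trimmed = chunk.trim();
            // Skip the empty preamble and the "--" left over from the closing delimiter.
            if (!trimmed.isEmpty() && !trimmed.equals("--")) {
                sections.add(trimmed);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        // Made-up two-part body: a text part and a base64 attachment placeholder.
        String body = "--b1\r\nContent-Type: text/plain\r\n\r\nSome email text\r\n"
                + "--b1\r\nContent-Type: application/ms-tnef\r\n"
                + "Content-Transfer-Encoding: base64\r\n\r\nAAAA\r\n--b1--\r\n";
        System.out.println(
                visibleSections("multipart/mixed; boundary=\"b1\"", body).size());
        System.out.println(
                visibleSections("text/plain; charset=UTF-8", body).size());
    }
}
```

With the original multipart header the body splits into separate parts, and a transformer can pick out the text part. With the overridden header there is a single section containing the boundaries, part headers and base64 data, which matches what ended up in the index and the preview.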
While we were at it, we also made eml files transform to HTML when the email contains an HTML version. You can find all the changes in the Loftux build of Alfresco Community: https://github.com/loftuxab/alfresco-community-loftux. More details can be found in the issue https://github.com/loftuxab/alfresco-community-loftux/issues/22. As of this writing, the fix is only available in the snapshot branch.