mb2md and mb2md-2 - Converting Mbox mailbox files to Maildir format

mb2md is a Perl script which can take one or more Mbox format mailbox files in a directory and convert them to Maildir format mailboxes.  It can also convert the /var/spool/mail/uuuu mailspool file into a Maildir.   This is for Unix/Linux and is public domain.

This is the OLD page for this script.  

Please refer to: http://batleth.sapienti-sat.org/projects/mb2md/ for the latest operational version.


Maintenance of mb2md and its descendants has very kindly been taken over by Juri Haberland.  This is the sort of program you probably only want once in your life, unless you are a consultant, so I salute Juri for making his additions and offering to maintain it on an ongoing basis!


The rest of this page remains as an historical reference - if you just want to get on with conversion, please go straight to the above URL.

- Robin Whittle   11 October 2002



The first part of this page covers my original mb2md script.

The second part covers a better version mb2md-2 which results from enhancements by several people - to do with file dates and flags such as whether a message has been read or not.

Be sure to read right down to the bottom.  The most capable version of this script is not my latest version, but the one developed from this by Juri Haberland.  Juri's version copes with a problem which my versions do not handle: timezone data in the "From " line in the mbox mailboxes - and he has added other valuable features.

Robin Whittle  Last update below here 13 September 2002  rw@firstpr.com.au

Back to the Web-mail directory - where you will find more about why Maildir is a faster and cleaner type of mailbox for an IMAP server.


Other programs which do the same job - though with less power or flexibility

There is another perl script to do Mbox to Maildir conversions, by Ragnar Kurm at:  http://home.uninet.ee/~ragnar/2md/ .

For a discussion of other programs for converting between Mbox and Maildir formats, see this part of another page here: ../RH71-Postfix-Courier-Maildrop-IMAP/index.html#mb2md and the User Contributed Maildir Support section of the qmail home page, which is on the mirror sites reachable at http://www.qmail.org .




1 - mb2md

(This is the basic version of the script.  You may well find that mb2md-2 is more suitable for your needs - so read this section and then the mb2md-2 section below.  mb2md-2 has better comments too.)

The script is fully documented by its own comments, so take a look:

mb2md.txt
It runs from command line arguments only, and is quite flexible.  It is smart enough to not transfer a dummy message such as the UW IMAPD puts at the start of Mbox mailboxes - and you could add your own search terms into the script to make it ignore other forms of dummy first message.
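
As an illustration only (written in Python, not the Perl of mb2md itself), the dummy-message check could be sketched as below.  The pattern list and the function name is_dummy_message are hypothetical, and the Subject text is assumed to be the one UW IMAP places on its internal-data message; mb2md's real test may look for something else.

```python
import re

# Assumed marker for the UW IMAP internal-data message.  Add further
# patterns here to skip other kinds of dummy first message.
DUMMY_PATTERNS = [
    re.compile(r"^Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA",
               re.MULTILINE),
]


def is_dummy_message(message_text):
    """Return True if the message's header block matches a dummy pattern."""
    # Only the headers (everything before the first blank line) are searched,
    # so a real message quoting the marker in its body is still converted.
    headers = message_text.split("\n\n", 1)[0]
    return any(pattern.search(headers) for pattern in DUMMY_PATTERNS)
```

A converter would call this on the first message of each Mbox file and simply not write it out when it returns True.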

You can also grab it gzipped: mb2md.gz .

mb2md in its current form only works within the user's directory, though I think with appropriate arguments prefixed with ../../ it could read and write files anywhere.  I understand that there are "virtual user" situations in which the user's mail directory is not at /home/{uid}/Maildir/ but somewhere like: /var/qmail/maildirs/{uid}/Maildir/ .  A small change to the script should adapt to this situation.

Please let me know if this script is useful and if you develop any modifications of it, or other scripts to drive it and so achieve things like:


File dates

The original mb2md script produces files for each message with the date of the files set to the time the script runs.  In normal operation of a mail server, each message would have the date and time of when it was written to the maildir - such as when the message was received.   Some clients, such as Microsoft's Outlook Express (which has long been so full of serious security holes that you would be mad to run it unless you like your computer being infected with viruses) apparently display the message date and/or sort on the date, using the file date of the message's file.    It seems that the IMAP protocol can report this "physical" message date which for a Maildir system is the file date and for an Mbox system is the date in the "From " line which is added at the start of each message.  These dates are typically the local date-time when the message was received.  I guess that in Mbox and Maildir systems, moving the message to another Mbox or Maildir retains that date/time.

Netscape and Mozilla ignore this physical date. When sorting on date, they sort and display the time of each message according to the date-time in the "Date: " header which came with the message - when the message was sent.  (This is not always what you want, when someone sends from a computer with its date set 17 years into the future, as mine once was!)

After running the original version of mb2md, all the file date-times are the time of conversion.  This screws up people using clients such as Outlook Express.  So it would be nice to have a modification of mb2md which made the date of each message file follow that in the "From " line of the message in the mbox.   Simon Hampton sent me a patched version of mb2md which I thought did this - but in fact it works from the "Date: " header, which is the time the message was sent.  Also, his patch does not correctly set the date of the last message file created.  That patched version (which includes his email address) is here: mb2md-date-sh.pl.txt .  But see below for mb2md-2, which works properly and uses the "From " line.


 On 30 August 2001, Michael Bartlett sent me a short perl script which he created to convert an Mbox mailbox at /var/mail/xxx to a Maildir mailbox at /var/mail.old/xxxx .  See here for that script:  box2dir.pl.txt .



Here is the usage documentation from mb2md:

# Run this as the user of the mailboxes, not as root.
#
#
mb2md  MBROOT MBDIR  [DEST]
#
#
#   MBROOT       Directory, relative to the user's home directory,
#                which is where the MBDIR directory is located.
#
#
#   MBDIR        Directory, relative to MBROOT where the Mbox files
#                are.  There are two special cases:
#
#                1 - "None"
#
#                2 - "Inbox"
#
#                If it is set to "None" then mailboxes in the MBROOT
#                directory will be converted and placed in the
#                DEST directory.  (Typically the Inbox directory
#                which in this instance is also functioning as a
#                folder for other mailboxes.)
#
#                If this is set to "Inbox" then the source will
#                be the single mailbox at /var/spool/mail/blah for
#                user blah and the destination mailbox will be the
#                DEST mailbox itself.
#
#                Except in this "Inbox" case, the MBDIR directory
#                name will be encoded into the new mailboxes' names.
#                See the examples below.
#
#                This script will not work with mailbox files which
#                contain spaces in their names.
#
#                Expect trouble if any files in the MBDIR directory
#                are not proper Mbox mailbox files.
#
#                This does not save an UW IMAP dummy message file
#                at the start of the Mbox file.  Small changes
#                in the code could adapt it for looking for
#                other distinctive patterns of dummy messages too.
#
#                Don't let the source directory you give as MBDIR
#                contain any "."s in its name, unless you want to
#                create subfolders from the IMAP user's point of
#                view.  See the example below.
#
#
#   DEST         Directory relative to user's home directory where the
#                Maildir format directories will be created.
#                If not given, then the destination will be ~/Maildir .
#                Typically, this is what the IMAP server sees as the
#                Inbox and the folder for all user mailboxes.
#
#
#
#  Example
#  =======
#
# We have a bunch of directories of Mbox mailboxes located at
# /home/blah/oldmail/
#
#     /home/blah/oldmail/fffff
#     /home/blah/oldmail/ggggg
#     /home/blah/oldmail/xxx/aaaa
#     /home/blah/oldmail/xxx/bbbb
#     /home/blah/oldmail/xxx/cccc
#     /home/blah/oldmail/xxx/dddd
#     /home/blah/oldmail/yyyy/huey
#     /home/blah/oldmail/yyyy/duey
#     /home/blah/oldmail/yyyy/louie
#
# With the UW IMAP server, fffff and ggggg would have appeared in the root
# of this mail server, along with the Inbox.  aaaa, bbbb etc, would have
# appeared in a folder called xxx from that root, and xxx was just a folder
# not a mailbox for storing messages.
#
# We also have the mailspool Inbox at:
#
#     /var/spool/mail/blah
#
#
# To convert these, as user blah, we give the first command:
#
#    mb2md xyz Inbox
#
# In this case, the first argument is irrelevant - "xyz" is ignored.
#
# The main Maildir directory will be created if it does not exist.
# (This is true of any argument options, not just MBDIR = "Inbox".)
#
#    /home/blah/Maildir/
#
# It has the following subdirectories:
#
#    /home/blah/Maildir/tmp/
#    /home/blah/Maildir/new/
#    /home/blah/Maildir/cur/
#
# Then the /var/spool/mail/blah file is read, split into individual files and
# written into /home/blah/Maildir/new/ .
#
# Now we give the second command:
#
#    mb2md  oldmail None
#
# This reads the fffff and ggggg Mbox mailboxes and creates:
#
#    /home/blah/Maildir/.fffff/
#    /home/blah/Maildir/.ggggg/
#
# Now we give the third command:
#
#    mb2md  oldmail xxx
#
# Then all the mailboxes:
#
#     /home/blah/oldmail/xxx/aaaa
#     /home/blah/oldmail/xxx/bbbb
#     /home/blah/oldmail/xxx/cccc
#     /home/blah/oldmail/xxx/dddd
#
# are converted into Maildir format mailboxes in the following
# directories:
#
#    /home/blah/Maildir/.xxx.aaaa/
#    /home/blah/Maildir/.xxx.bbbb/
#    /home/blah/Maildir/.xxx.cccc/
#    /home/blah/Maildir/.xxx.dddd/
#
# This suits Courier IMAP fine, and these will appear to the IMAP
# client as four mailboxes in the folder "xxx" within the Inbox
# folder.
#
# The final command:
#
#    mb2md  oldmail yyyy
#
# does the rest.  The result, from the IMAP client's point of view is:
#
#    Inbox -----------------
#        |
#        | fffff -----------
#        | ggggg -----------
#        |
#        - xxx
#        |   | aaaa --------
#        |   | bbbb --------
#        |   | cccc --------
#        |   | dddd --------
#        |
#        - yyyy
#             | huey -------
#             | duey -------
#             | louie ------
#
# Note that although ~/Maildir/.xxx/ and ~/Maildir/.yyyy may appear
# as folders to the IMAP client, the above commands do not generate
# any Maildir folders of these names.  These are simply elements
# of the names of other Maildir directories.
#
# With a separate run of this script, using the MBDIR = "None"
# approach, it would be possible to create mailboxes which
# appear at the same location as far as the IMAP client is
# concerned.  By having Mbox mailboxes in some directory:
# ~/oldmail/nn/ of the form:
#
#     /home/blah/oldmail/nn/xxx
#     /home/blah/oldmail/nn/yyyy
#
# then the command:
#
#   mb2md oldmail/nn None
#
# will create two new Maildirs:
#
#    /home/blah/Maildir/.xxx/
#    /home/blah/Maildir/.yyyy/
#
# Then what used to be the xxx and yyyy folders now function as
# mailboxes too.  Netscape 4.77 needed to be put to sleep and given ECT
# to recognise this - deleting the contents of (Win2k example):
#
#    C:\Program Files\Netscape\Users\uu\ImapMail\aaa.bbb.ccc\
#
# where "uu" is the user and "aaa.bbb.ccc" is the IMAP server
#
# I often find that deleting all this directory's contents, except
# "rules.dat", forces Netscape back to reality after its IMAP innards
# have become twisted.  Then maybe use File > Subscribe - but this
# seems incapable of subscribing to folders.
#
# For Outlook Express, select the mail server, then click the
# "IMAP Folders" button and use "Reset list".  In the "All"
# window, select the mailboxes you want to see in normal
# usage.
#
#
# This script does not recurse subdirectories or delete old mailboxes.
#
# Be sure not to be accessing the Mbox mailboxes while running this
# script.  It does not attempt to lock them.  Likewise, don't run two
# copies of this script either.
#
#
# Trickier usage . . .
# ====================
#
# If you have a bunch of mailboxes in a directory ~/oldmail/doors/
# and you want them to appear in folders such as:
#
# ~/Maildir/.music.bands.doors.Jim
# ~/Maildir/.music.bands.doors.John
#
# etc. so they appear in an IMAP folder:
#
#    Inbox -----------------
#        | music
#              | bands
#                    | doors
#                          | Jim
#                          | John
#                          | Robbie
#                          | Ray
#
# Then you should rename the source directory to:
#
#  ~/oldmail/music.bands.doors/
#
# then you can use:
#
#   mb2md oldmail music.bands.doors
#
#------------------------------------------------------------------------------
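
The naming rules in the usage notes above can be summarised in a short sketch.  This is Python for illustration, not code from mb2md; the function names maildir_path and make_maildir are hypothetical, and the rules are simply my reading of the examples (MBDIR of "Inbox" writes to DEST itself, "None" gives DEST/.mailbox, anything else gives DEST/.MBDIR.mailbox).

```python
import os


def maildir_path(home, mbdir, mailbox_name, dest="Maildir"):
    """Work out which Maildir directory one Mbox file is converted into."""
    base = os.path.join(home, dest)
    if mbdir == "Inbox":
        # The mailspool file goes into the main Maildir itself.
        return base
    if mbdir == "None":
        # Mailboxes found directly in MBROOT: only the mailbox name is used.
        return os.path.join(base, "." + mailbox_name)
    # Otherwise the MBDIR name is encoded into the new mailbox's name.
    # Any "." already in MBDIR becomes a further level of IMAP folder.
    return os.path.join(base, "." + mbdir + "." + mailbox_name)


def make_maildir(path):
    """Create the tmp, new and cur subdirectories if they do not exist."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(path, sub), exist_ok=True)
```

For instance, maildir_path("/home/blah", "music.bands.doors", "Jim") gives /home/blah/Maildir/.music.bands.doors.Jim, matching the "trickier usage" example.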
 



2 - mb2md-2

This section concerns a second version of the script, which I gave this new name to, and version numbers starting with 2.01.  Before describing what it does, here is a summary of contributions some people have made by sending me altered versions of mb2md.  
Note, it took me hours to sort through the various versions of mb2md to find out firstly exactly which lines were changed, and secondly exactly what the changes were intended to achieve - quite apart from figuring out whether the changes did what was intended.   The reason my program works and is usable is because I document things at length, even if imperfectly.  This means it is easy for some poor sod (maybe you, but probably me in the near future) to figure out what I was trying to do, and therefore where my mistakes are.

Those who do quick and dirty changes at the usual low documentation standard of too many computer programmers (as I am sometimes tempted to do as well) make it hard for anyone - themselves and me - to get reliable results from their work.  Simon Hampton did, however, document his changes with interspersed comments and with extra comments which helped me understand some code I copied.  Mark Lai sent me a highlighted Word file to show me which lines were changed.

Simon Hampton

As noted above his patched version of mb2md reads the date in the "Date: " line which is the SMTP header created by the sending client program, with the date-time and timezone for when the message was sent.   It then uses this to set the date-time of the message's file in the Maildir, using GNU touch, which is very flexible about the date formats it will handle.  A bug in his patch of mb2md meant that the last message processed did not have its date set correctly.

This "Date: " date-time is not what I think we want.  We want to use the time which is at the start of each message in the Mbox mailbox.

Here is an example of the headers and first few lines of a message from a UW-IMAP Mbox mailbox (I changed addresses to stop them being found by spammers):

From dduck@test.org  Wed Nov 24 11:05:35 1999
Return-Path: <dduck@test.org>
Received: from prussian-caravan.cloud9.net (prussian-caravan.cloud9.net [168.100.1.4])
        by localhost.localdomain (8.9.3/8.9.3) with ESMTP id LAA03277
        for <rw@firstpr.com.au>; Wed, 24 Nov 1999 11:05:33 +1100
Received: by prussian-caravan.cloud9.net (Postfix, from userid 54)
        id CBEC0763A6; Tue, 23 Nov 1999 19:06:46 -0500 (EST)
To: rw@firstpr.com.au
From: dduck@test.org
Subject: Confirmation for subscribe postfix-announce
Reply-To: dduck@test.org
Message-Id: <19991124000646.CBEC0763A6@prussian-caravan.cloud9.net>
Date: Tue, 23 Nov 1999 19:06:46 -0500 (EST)
Sender: dduck@test.org
Status: RO
X-Status: A
X-Keywords:

--

Someone (possibly you) has requested that your email address be added
to or deleted from the mailing list . . . . .

Simon's code used the date-time in the "Date: " header (Tue, 23 Nov 1999 19:06:46 -0500 (EST) in this example) but I want to use the date-time in the "From " line at the very top (Wed Nov 24 11:05:35 1999).  This "From " line is not part of the message itself: it is a line added by the IMAP server so it can find the start of each message in the Mbox mailbox, and so it can know the time it was received.

Typically the date found in the "From " line is the date-time when the previous IMAP server (the one which generated the Mbox mailboxes we are converting) received the message.  This may have nothing at all to do with the date the message was sent, according (as best the sending email client can know) to the "Date: " header which came as part of the message.  
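
To make the distinction concrete, here is a minimal sketch of pulling the received date-time out of a "From " line.  It is Python for illustration, not the Perl used in mb2md; the name parse_from_line is hypothetical, and the pattern assumes the plain "Wed Nov 24 11:05:35 1999" layout shown above with no timezone field.

```python
import re
from datetime import datetime

# "From ", the envelope sender, then a date such as "Wed Nov 24 11:05:35 1999".
# Single-digit days are space-padded in this layout, hence "[ \d]\d".
FROM_LINE = re.compile(
    r"^From (\S+)\s+(\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4})\s*$"
)


def parse_from_line(line):
    """Return (sender, naive datetime) for a "From " line, else None."""
    match = FROM_LINE.match(line)
    if match is None:
        return None
    sender, stamp = match.groups()
    # The result carries no timezone; it is the server's local time.
    return sender, datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
```

A "From " line which carries timezone data does not match this pattern and returns None, so a real converter would need an extra case for it.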

Mark Lai and Alex So

Based on Simon Hampton's patched version, they added a feature which they told me works with the clients "Microsoft outlook 98, express 5 & 2000", when migrating from UW IMAP to (I assume) Courier IMAP.  The feature is to determine whether the message had been read by the user and to convey this to the message in the Maildir mailbox.   The way it worked was to read the headers of the message looking for a line:
Status: RO
and if so, add the following at the end of the name of the message in the Maildir:
:2,S
The ":2," means the start of IMAP server flags (at least for Courier) and the flag "S" means that it has been read, or rather seen.   But looking at the code, I am not convinced this patch works properly.  A flag for whether a message has been seen or not is set to 0 before the processing loop, and may be set to 1 if a message has been "seen" (which should then cause the ":2,S" to be added to the filename), but there is no mechanism to set it back to 0.  So I expect this patch would flag every message after the first "seen" one as being read.
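
The fix for that kind of bug is to reset the flag at the start of every message.  The sketch below is Python for illustration and is not the patch itself; the name seen_suffixes is hypothetical, and it assumes that each message starts at a line beginning with "From " and that only a "Status:" header containing "R" counts as seen.

```python
def seen_suffixes(mbox_lines):
    """Return ':2,S' or '' for each message in an iterable of Mbox lines."""
    suffixes = []
    seen = False
    in_headers = False
    started = False
    for line in mbox_lines:
        line = line.rstrip("\n")
        if line.startswith("From "):
            if started:
                # Close off the previous message before starting a new one.
                suffixes.append(":2,S" if seen else "")
            started = True
            seen = False      # the reset which the patch appears to lack
            in_headers = True
        elif in_headers:
            if line == "":
                in_headers = False   # blank line ends the header block
            elif line.startswith("Status:") and "R" in line[len("Status:"):]:
                seen = True
    if started:
        suffixes.append(":2,S" if seen else "")
    return suffixes
```

Without the reset line, a read message followed by an unread one would give [':2,S', ':2,S'] rather than [':2,S', ''].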

Since I am not running UW IMAP at present and am not actively using Outlook or Outlook Express I do not have an easy way of testing this patch.  However I looked at my old Netscape-UW-IMAP Mbox mailboxes and found they used the "Status: RO" line - as shown in the message fragment above.

What is the "O" in the above Status line?

David Khoury

David's patched version of mb2md uses the HTTP::Date module by Gisle Aas: http://search.cpan.org/doc/GAAS/libwww-perl-5.64/lib/HTTP/Date.pm (Search http://search.cpan.org/ for the module "HTTP::Date") to read the "Date: " header in the message to set both the filename and date of the message in the Maildir.   This, I think, reads the SMTP "Date: " header, taking into account its timezone offsets in the wide variety of formats the HTTP::Date module supports.  This is the date the message was sent (as much as the sending client program knew - it could have all sorts of inaccuracies about the time and its timezone).

It is my understanding that, to be compatible with the way Courier IMAP and presumably others work, the file date-time should be the date-time when the message was received.   So this achieves the same functionality as Simon Hampton's patch and in my view this is not the way to proceed.

Michael Best using the code of Philip Mak

Michael Best sent me a version of mb2md which integrated the guts of a separate script perfect_maildir which was posted to the Mutt mailing list (Mutt is a Unix mail client - http://www.mutt.org ) on 25 December 2001: http://www.mail-archive.com/mutt-users@mutt.org/msg21872.html .  I haven't run it and I don't have a proper copy, due to his email client wrapping lines.  If you want a copy, please contact him at <mbest (get rid of this at) emergence.com > .    Note that Michael Best's patch of mb2md, like the Philip Mak original code, put the messages in the /cur/ subdirectory of the Maildir, rather than the /new/ directory, which is where I think new additions to a Maildir should go.

The first new functionality was to build the text of the Subject: line into the message file name - which is intriguing, but not something I would do for practical purposes, since Courier IMAP doesn't use this or create messages in this form and since I generally do not manually look in the Maildir directories.

The second part of the new functionality was extensive and cleanly done (though I haven't tested it).  It looks for the following header lines and creates flags on the end of the resulting message file name to convey the same meaning.   Presumably Mutt in combination with UW-IMAP would use some or all of the flags in the first column - and presumably Courier IMAP uses those in the second column in a way Mutt understands.

Flag found in header       Converted to flag at end    Meaning and notes
lines of the form:         of message filename, of
  Status: N                the form:
  X-Status: N                :2,N
------------------------   -------------------------   ------------------------------------------------
F                          F                           "Flagged".
A                          R                           "Replied".  Netscape - UW-IMAP uses this too.
R                          S                           "Read" = "Seen".
D                          T                           "Deleted" = "Trashed"(?) = "Tagged for deletion at the next IMAP Expunge".
 

Analysis

On 10 March 2002 I wanted to create a new version of mb2md which does the file date properly, based on the date-time in the "From " line, to indicate when the message was received.  I also wanted to add functionality to retain other information during the translation - whether the message has been:
  1. "Read" or "seen".
  2. Replied to.
  3. Flagged.  (A user facility for any use they choose, I think.)
  4. Tagged for Deletion.  Though I would imagine that people would generally physically delete these messages with IMAP Expunge before converting the Mbox mailboxes to Maildirs.

Note that there are evidently two ways in an Mbox mailbox in which a message could be indicated as having been "read", or "seen":
Status: R
X-Status: R
Looking at backup files of my old Netscape - UW-IMAP Mbox mailboxes I find that that system used the Status: RO approach, not the Status:  R approach.   There were "X-Status: " lines, often empty.  The only ones I found were for X-Status: A    for those which I had replied to.  I do not feel like firing up UW-IMAP to test how it generates the others.

While I understand the Michael Best / Philip Mak code was written for and tested with the Mutt client, I think that it should work fine with UW-IMAP Mbox mailboxes when Netscape, and probably other programs, have been used as the client.
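
The translation from "Status: " and "X-Status: " header flags to the flags at the end of the filename can be sketched as follows.  This is Python for illustration, not the Philip Mak / Michael Best code; the names MBOX_TO_MAILDIR and maildir_info are hypothetical.  The mapping is the one described above (F to F, A to R, R to S, D to T), any other letter such as "O" is ignored, and the flags are emitted in alphabetical order on the assumption that Maildir readers expect them sorted.

```python
import re

# Header flag -> filename flag.
MBOX_TO_MAILDIR = {"F": "F", "A": "R", "R": "S", "D": "T"}

# Matches both "Status: RO" and "X-Status: A", one header per line.
STATUS_HEADER = re.compile(r"^(?:X-)?Status:[ \t]*(\S*)[ \t]*$", re.MULTILINE)


def maildir_info(header_text):
    """Return the ':2,...' suffix for one message's header block."""
    flags = set()
    for match in STATUS_HEADER.finditer(header_text):
        for letter in match.group(1):
            if letter in MBOX_TO_MAILDIR:
                flags.add(MBOX_TO_MAILDIR[letter])
    return ":2," + "".join(sorted(flags))
```

For the example message shown earlier, with "Status: RO" and "X-Status: A", this gives ":2,RS": read and replied to.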

mb2md-2-01

Here is the new script, as of 12 March 2002, as text and as a gzip file.  The usage is the same as the original version.

Please see below for patches to this version which may well be superior to my version. Although I have not tested Juri Haberland's version, I am greatly impressed by what he has done.

The new features are:
  1. File date-time is now based on the "From " line in the Mbox mailbox.  
  2. Integrate the flag changes as written by Philip Mak and Michael Best.
  3. Tidy the file up in terms of comments.
  4. I also changed the filename of each message to be of a regular length:

           7654321.000123.mbox:2,xxx

    Where "7654321" is the Unix time in seconds when the script was run and "000123" is the message number, zero-padded to six digits, counting up as messages are converted from the Mbox file.  "xxx" represents zero or more of the flags F, R, S or T.

  5. Introduce version numbers - 2.01 etc.
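
The filename layout in item 4 can be sketched in a few lines.  This is Python for illustration rather than the script's Perl, and the function name message_filename is hypothetical.

```python
import time


def message_filename(run_time, message_number, flags=""):
    """Build a name of the form 7654321.000123.mbox:2,xxx ."""
    # run_time: Unix time in seconds when the conversion run started.
    # message_number: position of the message in the Mbox file,
    #                 zero-padded to six digits.
    # flags: zero or more of F, R, S, T, written in alphabetical order.
    return "%d.%06d.mbox:2,%s" % (run_time, message_number,
                                  "".join(sorted(flags)))


# Typical use: take the time once, then number the messages as they are
# converted.
run_time = int(time.time())
first_name = message_filename(run_time, 1, "S")
```
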
Text:       mb2md-2-01.pl.txt
Gzipped: mb2md-2-01.perl.gz
Note that the new version requires a copy of GNU "touch" on the system - or some other version of touch which can handle the date-time formats as found in the "From " line, such as: "Wed Nov 24 11:05:35 1999".  The script expects this to be at /bin/touch so you should alter one line in it if your version of touch is somewhere else.
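
For comparison, the same file date-time could be set without an external touch program.  The sketch below is Python for illustration and is not what mb2md-2 does (it calls touch); the name set_received_time is hypothetical, and the date-time is assumed to be in the server's local timezone, in the layout shown above.

```python
import os
import time
from datetime import datetime


def set_received_time(path, from_line_stamp):
    """Set a message file's access and modification times from a
    "From " line date-time such as "Wed Nov 24 11:05:35 1999"."""
    received = datetime.strptime(from_line_stamp, "%a %b %d %H:%M:%S %Y")
    # mktime() interprets the fields as local time.
    epoch = time.mktime(received.timetuple())
    os.utime(path, (epoch, epoch))
    return epoch
```
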

Patched versions from other people

Juri also observes that my original script does handle spaces in mailbox file names, provided the name is given in quotes.  I haven't tested any of this, but it looks like he knows what he is doing!
 
Also of potential interest is his FAQ on the ext3 journaled file system and how it closely relates to ext2.  http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html    Salute!


Bug reports, patches and requests for changes

I will continue to maintain this script, although I haven't got any practical use for it since I changed to Courier IMAP in July 2001.

Please let me know if this script works OK or otherwise, and send me any scripts you develop to drive this one to do mass migrations of many users and their multiple mailboxes.

I don't promise to add any new features - but will accept suggestions and patches from others if they look useful.  

If you send me a new version of this script, I insist you:
  1. Send it as a .gzip file, or via reference to a web site, or as an attachment - since text can be wrapped in emails.
  2. Clearly state what new functionality you have added and what sort of situation this is intended to help with.
  3. Clearly show with easily searched for comments (such as your initials at the start and end of the modified sections) in your file which lines you have deleted, changed or added.
  4. Clearly and fully comment your code with English sentences, with proper capitalisation and punctuation. This is the only way I can write code which works - or at least it is vastly easier than trying to do without proper comments - so be clear and make it easy on me and yourself.



Update history.