Archiving a Yahoo Group

Robin Whittle rw@firstpr.com.au 2013-09-25  (Page established 2013-09-24.)
Minor update 2013-10-01 and:

Update 2014-05-29: There is a commercial program, with free trial, which enables download of some items (messages, photos) but not yet all (files, but this is being worked on) from Neoized Yahoo Groups.  I haven't checked it out: http://www.personalgroupware.com .   What I write below is probably of limited value, since as far as I know it only applies to non-Neoized accounts, and I guess all or almost all Yahoo users have had their accounts Neoized by now.


<<< 1 - yahoo2mbox

<<< 2 - grabyahoogroup

<<< 3 - Reading the Messages Mbox file with Mozilla Thunderbird, and a preliminary analysis of why some messages do not display.

0 - Introduction

There are a number of cogent reasons why someone would want to archive the contents of their Yahoo Groups, especially the messages, which constitute a record of the discussions which may go back many years:

This last point is what has happened recently for all Yahoo Groups, for some user accounts initially and perhaps now for all user accounts.  This is the Neo interface:

https://en.wikipedia.org/wiki/Yahoo_Groups#Yahoo.21_Groups_remodel_2013_.22neo.22

By all reports Neo disrupts the ability to moderate and do many other things.  For instance, with Neo, as far as I know, it is impossible to view the original email or to display the message in a fixed width font.

Update 2013-10-01: There were some glitches in the Neo status of my account and I was able to capture screenshots of the homepage of a Yahoo Group with and without Neo:

LeCroy-Yahoo-Group-non-Neo-Home.png
LeCroy-Yahoo-Group-Neo-Home.png

TekScopes-Yahoo-Group-non-Neo-Home.png
TekScopes-Yahoo-Group-Neo-Home.png

In mid-September 2013 I discovered the introduction of Neo and set to work to archive the contents of two Yahoo Groups.  I found that the Neoization had not been applied to my account yet, which enabled me to use these two Perl scripts.  I succeeded in getting most of the messages from one Yahoo Group (using yahoo2mbox) and in getting all the messages, files, attachments and photos from another (using grabyahoogroup).  On 2013-09-24 when I set about writing this page to document how I did so, I found my account had been Neoized, so I can't do any of the following any more.  I wrote the following from notes I took during the process.

Maybe these scripts will be updated to cope with Neo - but perhaps this will be impossible.

The following will only be of practical interest to someone who has the username and password of a Yahoo Group account which has not yet been Neoized.  Maybe, by now (2013-09-24) all accounts have been Neoized.  If so, then the following will be of no practical use to anyone.

All the following is written as if my account had not been Neoized.  I found two Perl scripts which are intended to archive a Yahoo Group:

1 - yahoo2mbox

This only works on messages, and while it has one advantage - a header line in each message with the Yahoo Groups message number - I used the next script instead.  My explanation of this script forms part of what is required to understand the second:

2 - grabyahoogroup

This archives Messages, Attachments, Files, Photos and the list of Member information.

(Later, I found a commercial program (with a free trial version) which is intended to be able to download messages from a Yahoo Group.  The author is working on changes to make it work with a Neoized account:

http://www.personalgroupware.com

I have not investigated this yet.)

Neither of these will work with a Yahoo Groups account which has been converted to Neo: https://en.wikipedia.org/wiki/Yahoo_Groups#Yahoo.21_Groups_remodel_2013_.22neo.22.

Both these approaches face some challenges, including:

Getting them to run on a particular computer, due to their need for a variety of Perl modules AKA "Perl packages".

The Yahoo Groups server recognizing what it regards as excessive requests from a particular IP address for a particular user account.  It will block any attempt to log in or get any file for two or maybe more hours, with a "999" error message.

The server doing unexpected things, including things the script's programmer did not anticipate.


In the following explanations:

uuuu = the Yahoo Groups user name.
pppp = the password for this account.
gggg = the name of the Yahoo Group.

aaa\bbb = the base directory in which the Perl script is located and from which the MSDOS box was running when I gave the commands below.

1 - yahoo2mbox

This is the work of Vadim Zeitlin in France:

http://www.tt-solutions.com/en/portfolio/yahoo2mbox

Perl scripts are computer programs written in the Perl language: https://en.wikipedia.org/wiki/Perl. Perl is a good language for dealing with text strings and files.  Perl programs are typically interpreted: rather than being compiled in advance into a standalone executable, the script is read and executed by an interpreter each time it is run.  To run a Perl script, the computer needs to have a Perl interpreter installed.  These may be known as a Perl runtime, Perl binary distribution or whatever.  I will refer to them as Perl interpreters.

Side note on why I didn't run these two scripts under Linux:

Linux machines frequently have a Perl interpreter already installed.  Both these Perl scripts require additional Perl modules (AKA Perl packages) in addition to the basic Perl interpreter.  Each script stops with an error message naming the first of these packages which the Perl interpreter cannot find.  These modules perform particular functions such as accessing a remote server, so the authors of these Perl scripts don't need to write code to do this.

First I tried running the Linux version of yahoo2mbox on a Debian 7 Linux machine.  It wouldn't run due to at least one missing Perl package.  When I tried to install the package with the usual cpan system, I got into all sorts of confusion.  It turns out to be difficult to install Perl packages into Debian systems, for reasons explained in this article, in which apparently well informed people debate the merits of three seemingly undesirable alternatives:  http://www.perlmonks.org/?node_id=753416 .

I tried installing Perl packages with cpan in an ages-old CentOS system and got into nearly interminable cycles of installing further packages to satisfy dependencies. 

I was able to run yahoo2mbox on a Windows XP 32 bit machine by installing a Perl interpreter from:

http://www.activestate.com/activeperl

I installed ActivePerl-5.16.3.1603-MSWin32-x86-296746.msi .  Once this is installed, typing perl and the full name of a Perl script in the current directory of an MSDOS box (Command prompt), and pressing Enter, will cause the interpreter to execute the script.

yahoo2mbox only attempts to download the messages from a Yahoo Group.  I found that I needed to restart it frequently and to do so with some thought.  It collects the messages in a single text file which is an Mbox format mailbox file:

Mbox (excuse my capitalization of this word)  https://en.wikipedia.org/wiki/Mbox contains one or more email messages, with headers, with the start of each message being marked by a line starting with "From ".  Note the absence of a colon after this.  If a line in the body of a message starts with "From ", a '>' character is prepended to it so that it is not mistaken for the start of a new message.  This may not be reversed by whatever software reads the file, so it is not surprising that messages in some email systems have these extra '>' signs appearing in them.
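The separator and quoting rules above can be illustrated with a minimal Python sketch.  This is not how either Perl script works internally.  The function name split_mbox and the sample text are made up for illustration, and removing exactly one '>' from lines such as ">From " is an assumption about how the file was written (it matches one common Mbox variant, often called mboxrd).

```python
import re

def split_mbox(text):
    """Split Mbox text into a list of messages, each a list of lines."""
    messages = []
    current = None
    for line in text.splitlines():
        if line.startswith("From "):
            # A "From " line (no colon) marks the start of a new message.
            current = []
            messages.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first separator line
        # Undo the '>' quoting of body lines which began with "From ".
        # Assumption: the writer added exactly one '>' per such line.
        if re.match(r">+From ", line):
            line = line[1:]
        current.append(line)
    return messages

# Hypothetical two-message sample.
sample = (
    "From alice Wed Sep 25 10:00:00 2013\n"
    "Subject: first\n"
    "\n"
    ">From here on it gets tricky.\n"
    "From bob Wed Sep 25 11:00:00 2013\n"
    "Subject: second\n"
    "\n"
    "Hello.\n"
)
msgs = split_mbox(sample)
```

If the file was written without the quoting step, a body line starting with "From " would be taken as a new message, which is why the quoting exists in the first place.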

Mbox files are convenient in some respects.  However, for really large numbers of messages, a single file is excessively long and insertions and deletions take excessive CPU time.  An alternative storage for emails is the Maildir format, in which each message is a separate file in some specially structured directories:  https://en.wikipedia.org/wiki/Maildir .

Both these Perl scripts have usage instructions inside them.  By editing the file with a text editor, these can be found and printed.

Once the ActivePerl interpreter was installed and I had an MSDOS box running in my base directory (type C: or D: or whatever and Enter for the relevant hard drive and then cd aaa\bbb to change to the desired directory) then the following command line set yahoo2mbox to work:  

perl yahoo2mbox-0.25.pl gggg --user=uuuu  --pass=pppp

The program created an Mbox format file in the same directory, with the name being that of the Yahoo Group specified.

It stopped a number of times for various reasons.  Sometimes it would fail to get a particular message number, reporting this on the console.  Then it would succeed with multiple messages and fail to get another.  Other times it would fail to get all messages.  When it did this, I used Control C (multiple times) to stop it.

If I restarted it in the same way as above, it would read the existing file and try to get messages which were missing from the end, so it was based on the last message number stored.

Eventually after half a day or so I got a file which contained almost all of the 8,000 or so messages.  I planned to run it again, or to run grabyahoogroup, to have another go.  However, my account was Neoized.  If I had done it a second time, I would have had two text files with largely the same messages, and hopefully no message missing from both.  Then I was going to use an excellent program, for Linux, Windows and Mac, called Beyond Compare (http://www.scootersoftware.com) to compare the two files side by side.  The missing messages would be obvious and Beyond Compare has an easy way of copying each block which is missing.  This would have made a good file of the entire set of messages. 

A major benefit of this approach is that each message had an extra line in its header such as:

X-Yahoo-Message-Num: 4665

showing the message number in the archives.  This is valuable, since some people refer to other messages via their message number.
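Since each message carries its number, it is possible to list which numbers are absent from a file.  The following is a rough Python sketch, not something I ran at the time.  The function names are hypothetical, and it assumes the header always appears on a line of its own in exactly the form shown above.

```python
import re

NUM_HEADER = re.compile(r"^X-Yahoo-Message-Num:\s*(\d+)\s*$")

def message_numbers(lines):
    """Return the set of numbers found in X-Yahoo-Message-Num lines."""
    found = set()
    for line in lines:
        m = NUM_HEADER.match(line)
        if m:
            found.add(int(m.group(1)))
    return found

def missing_numbers(found):
    """Numbers between the lowest and highest seen which never appeared."""
    if not found:
        return []
    return [n for n in range(min(found), max(found) + 1) if n not in found]

# Hypothetical sample lines, as they might appear in the Mbox file.
sample = [
    "X-Yahoo-Message-Num: 4665",
    "Subject: something",
    "X-Yahoo-Message-Num: 4666",
    "X-Yahoo-Message-Num: 4669",
]
found = message_numbers(sample)
gaps = missing_numbers(found)
```

For a real file the lines could come from open("gggg", errors="replace").  A gap does not necessarily mean a failed download, since messages deleted from the Yahoo Group would leave gaps too.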

See the section below "3 - Reading the Messages Mbox file with Mozilla Thunderbird" for how an Mbox file with thousands, even hundreds of thousands, of messages can be read with Thunderbird.

2 - grabyahoogroup

This is a much more extensive approach, written over many years by Mithun Bhattacharya:  

http://sourceforge.net/projects/grabyahoogroup/


I found this to be an excellent script.  The most up-to-date files are not in the file listed on the above page.  To get the latest code, I had to follow various links.  These two Perl scripts are the latest (ca. 2013-09-23), and these are all I used:

http://sourceforge.net/p/grabyahoogroup/code/HEAD/tree/trunk/yahoo_group/

grabyahoogroup.pl  2013-08-17 mithun [r128] 1. Updated login form change in structure
mboxify.pl  2011-03-08 mithun [r118] Skip the mbox file if created previously while ...

As noted above, I was unable to get it running on Linux due to my inability to install the required Perl packages, including one called Crypt/SSLeay.

I was able to get it running on Windows, by using the program which was also installed as part of the Active Perl installation described above:

Start > Programs > ActivePerl 5.16.3 Build 1603 > Perl Package Manager

This may take some time at first to download the latest package information.  When this is done, here is how to use it:
  1. Click the left box on the toolbar (mouseover: View all packages).
  2. In the search bar type in SSLeay
  3. This should find a package Crypt-SSLeay.  Select this.
  4. Click the icon to the right of the search bar (mouseover:  Mark for install).
  5. Now the right green arrow two icons to the right lights up.  Click it.
  6. When all is done, exit the program

I used a text editor on grabyahoogroup.pl to find the usage instructions, which were not particularly clear, since there were no examples.

Messages

Here are the commands I used:

grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --messages --increasing

When it stopped, which it did many times over the two days or so it took to get ~100,000 messages, I restarted it with the number of the next message to get, based on the last one it reported getting OK, and by looking into the destination directory (described below):

grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --messages --increasing --begin 12345
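Finding the number to give to --begin by looking in the destination directory can be sketched in Python as follows.  This assumes the layout described below, in which each message is a file named by its number.  The function name next_message_number is hypothetical, and the demonstration uses a temporary directory rather than a real MESSAGES directory.

```python
import os
import re
import tempfile

def next_message_number(messages_dir):
    """Return one more than the highest purely numeric file name, or 1."""
    numbers = [int(name) for name in os.listdir(messages_dir)
               if re.fullmatch(r"[0-9]+", name)]
    return max(numbers) + 1 if numbers else 1

# Small demonstration with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("1", "2", "3", "12344", "gggg.mbox"):
        open(os.path.join(demo_dir, name), "w").close()
    next_number = next_message_number(demo_dir)
```

If the script was interrupted while writing the last file, that file may be incomplete, so it may be safer to begin one message earlier than this suggests.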

Eventually it finished, and presumably ran the mboxify.pl script which was in the same directory aaa\bbb\.  The main phase of this operation involves getting messages and writing each one as a file, named, in order:

1
2
3
. . . .
10102
10103

in the directory:

D:\aaa\bbb\gggg\MESSAGES\

Some messages had been deleted from the Yahoo Group, so there was no file for these.  When the script finished, it created from these, in the same directory, a file:

gggg.mbox

For this particular Yahoo Group, this file was 403,752,757 bytes.  Initially I didn't think I had any program on Windows which would read this (but see below for notes on Thunderbird), so I copied it to a Linux machine and took a look at it with the View function of Midnight Commander (http://www.midnight-commander.org).

I think all the messages are there, but there are no message numbers in the headers of each message.  Maybe I could modify the mboxify.pl script to process the individual files again and make a single file in which each message had the message number in a header line and/or in the Subject: line and/or as the first line of the message.  The latter two would help if this archive of messages was loaded into another system with a search engine, or if the text was simply searched as a single file.
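A rough Python sketch of that idea, rather than a modification of mboxify.pl itself, might look like this.  It rests on assumptions I have not checked: that each file in MESSAGES is a bare message starting with its header lines (with no "From " separator line of its own), and that a placeholder separator line with an arbitrary sender and date is acceptable.  The names to_mbox_entry and build_mbox are hypothetical.

```python
import os
import re

# Placeholder separator line; the sender and date here are arbitrary.
SEPARATOR = "From MAILER-DAEMON Thu Jan  1 00:00:00 1970"

def to_mbox_entry(raw_message, number):
    """Turn one message's text into an Mbox entry carrying its number."""
    lines = ["X-Yahoo-Message-Num: %d" % number]
    lines += raw_message.rstrip("\n").split("\n")
    # Quote body lines which would otherwise look like a separator line.
    quoted = [">" + line if re.match(r">*From ", line) else line
              for line in lines]
    return SEPARATOR + "\n" + "\n".join(quoted) + "\n\n"

def build_mbox(messages_dir, out_path):
    """Combine the numerically named files in messages_dir into one file."""
    names = sorted((n for n in os.listdir(messages_dir)
                    if re.fullmatch(r"[0-9]+", n)), key=int)
    # latin-1 maps every byte to one character, so nothing is lost.
    with open(out_path, "w", encoding="latin-1", newline="\n") as out:
        for name in names:
            path = os.path.join(messages_dir, name)
            with open(path, encoding="latin-1", newline="") as f:
                out.write(to_mbox_entry(f.read(), int(name)))

# Hypothetical single message, numbered 4665.
entry = to_mbox_entry("Subject: hi\n\nFrom now on, read this.\n", 4665)
```

Putting the number in the Subject: line or at the start of the body would need the header block to be parsed properly, which this sketch does not attempt.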

These individual files and/or the single file would be sufficient to populate the archives of a website which continued the discussions of this Yahoo Group in another system.

The really useful - and I think unique - highlighting text search function of Mozilla Firefox:

Firefox can be used to open a text file on the local machine.  It is best to have the filename extension ".txt".  Firefox will display the file in a fixed width font.  I found I could open a .txt file of 14MB OK - it just takes a minute or two to load.  However, when I tried to open a 50MB file, Firefox crashed.

It is possible to search for a piece of text using Edit > Find: and then to use the Highlight option (this is on the bottom search bar, which appears with this command) and in the entire file, all instances of the matching text will be highlighted.  Then it is possible to step back and forth through these instances with Next and Previous.

Example of highlighting multiple instances of searched for text in Mozilla Firefox

Since the file I created was way too long for most programs to use (but see below for notes on Thunderbird), I used this nifty freeware program to split it into smaller chunks, which I could edit with a text editor:

http://www.hjsplit.org/

I chose 50MB chunks and found these could be edited without any fuss using the widely used and highly regarded open source text editor for Windows, Notepad++:

http://notepad-plus-plus.org 
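A general-purpose file splitter cuts at byte offsets, so a message can end up straddling two chunks.  As an alternative, here is a minimal Python sketch which only breaks at the "From " lines which start each message.  The function name chunk_mbox is hypothetical, and it assumes that body lines starting with "From " were quoted with '>' when the Mbox file was written.

```python
def chunk_mbox(lines, max_bytes):
    """Yield lists of lines, starting a new list only at a "From " line
    once the current list has reached max_bytes."""
    current, size = [], 0
    for line in lines:
        if line.startswith(b"From ") and current and size >= max_bytes:
            yield current
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        yield current

# Hypothetical three-message sample, as lines of bytes.
sample = [
    b"From a\n", b"Subject: 1\n", b"\n", b"body one\n",
    b"From b\n", b"Subject: 2\n", b"\n", b"body two\n",
    b"From c\n", b"Subject: 3\n",
]
chunks = list(chunk_mbox(sample, 20))
```

With a real file, the lines could come from open("gggg.mbox", "rb") with max_bytes set to 50 * 1024 * 1024, writing each chunk to its own file as it is produced.  A chunk can exceed max_bytes by up to one message, since it is only closed at the next message boundary.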

I don't like the tabbed approach to editing multiple documents, but Notepad++ has the ability to open a session in a separate instance of the program, by right clicking the tab itself.  Very cool.

In this way I was able to make separate text files for messages in each year.  However, I needn't have bothered, since I discovered Thunderbird was by far the best way of using this single file, which contains 100,000 messages in 403 megabytes.  See section 3 below.


Attachments

I used this command line:

grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --attachments

This created a bunch of directories, each containing one or more attachment files in:

D:\aaa\bbb\gggg\MESSAGES\

together with an index.html file there which listed the subject of the message, the date, the number of attachments, the author and a link to the message at the Yahoo Groups site itself.  That link is now, for me, reloaded by the server to a new URL with Neo interface.

Files

I used this command line:

grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --files

This created a bunch of directories, each containing one or more files, and each with a meaningful name about the contents in:

D:\aaa\bbb\gggg\FILES\

There was no HTML or any other index file.

Photos

I used this command line:

grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --photos

This created a bunch of directories, each containing one or more image files in:

D:\aaa\bbb\gggg\PHOTOS\

together with an index.html file there which listed the name of each album (one for each numerically named directory), the creator, the number of photos and the date it was last modified.

Members


I used this command line:

grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --members

This created an index.html file in:

D:\aaa\bbb\gggg\MEMBERS\

which lists:
  • Yahoo ID.
  • Name = Yahoo Groups account name.
  • Real name.  Only a few people filled this in.
  • Age.  Only a few people filled this in.
  • Gender.  Only a few people filled this in.
  • Location.  Only a few people filled this in.
  • Email address, but not the server.  So "joe_blow@...".
in these divisions:
  • Members.
  • Moderators.
  • Bouncing.  (These are for email accounts which the Yahoo Groups email server has determined are not responding properly.)
  • Pending.
  • Banned.
Thanks Mithun Bhattacharya for this great Perl script!



3 - Reading the Messages Mbox file with Mozilla Thunderbird

Thunderbird is an open-source email client program which runs on Windows, Linux and Mac:

http://www.mozilla.org/en-US/thunderbird/

Assuming we have an Mbox file from one of the above scripts and we want to read it and/or search it, for our own purposes, there are a number of problems in doing so using a text editor, or a simple text viewing and searching system such as opening the file as a .txt file in Firefox.

Firstly, the file can be too long for these programs.  I couldn't find a text editor which would handle a 403MB file.  Firefox can't handle such sizes either.

Secondly, while the emails are (ideally) to our eyes, relatively simple pieces of text, the way they are encoded in the file can look very different.  They could be in HTML.  Even ordinary "plain text" emails can be encoded in various ways, such as "quoted printable" which frequently places a '=' character and a newline character in the middle of words and encodes some characters in forms such as: "=C2=A0".    It is also possible for an email's body of text to be UUEncoded, which makes it look like a block of completely impenetrable gobbledygook.
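To show what "quoted printable" decoding involves, here is a small Python example using the standard quopri module.  The sample text is made up.  It assumes the decoded bytes are UTF-8, which is what a sequence such as "=C2=A0" (a non-breaking space) suggests, though a given message may declare some other character set.

```python
import quopri

# A trailing '=' is a soft line break, to be removed along with the newline.
# "=C2=A0" is two bytes, which in UTF-8 form one non-breaking space.
encoded = b"a soft line break splits wo=\nrds, and =C2=A0 is a space"
decoded = quopri.decodestring(encoded).decode("utf-8")
```

An email client does this decoding for each part of a message according to that part's Content-Transfer-Encoding header, which is why the text reads normally there but not in a text editor.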

These problems are resolved by loading the Mbox file into the "Local Folders" section of Thunderbird.  I found Thunderbird was perfectly happy with a 403MB 100,000 message Mbox file.  There was no problem sorting the view of this mailbox by date, by subject, by sender etc. or by viewing a list of messages which had a certain piece of text in their Sender, Recipients or Subject line etc.  For those who are unfamiliar with Thunderbird, here is an image of what it looks like searching for some text in the abovementioned lines:  Thunderbird-100k-messages.png .  Note that Yahoo Groups deletes the server part of all email addresses.

However, some messages did not display any text when opened.  For each such message, it is possible to view the source text of the message with View > Message Source, in which case the text we want to read might be plainly readable, or might be in HTML or "quoted-printable".  I haven't figured out what the cause of this non-display is, but quite a few of the errant messages include this in their headers:

Content-Type: multipart/alternative; . . . 

Here is how I got the Mbox file into my Windows XP Thunderbird:

The file itself doesn't need to be altered or renamed.  Any name will do.

Find your Thunderbird Profile Directory with the instructions here: ../../web-mail/Mozilla-mail/ Let's call this directory (in a Windows machine) C:\xx\yy\zz\ .

From that directory find with Windows Explorer or whatever where Thunderbird keeps its local folders AKA local mailbox files, which will be something like C:\xx\yy\zz\Mail\Local Folders\.  For instance, on my WinXP and Windows 7 machines respectively, with "xyz" and "zyx" here standing in for a pseudo-random directory name:

C:\Documents and Settings\Robin\Application Data\Thunderbird\Profiles\xyz.default\Mail\Local Folders\
  
C:\Users\Robin\AppData\Roaming\Thunderbird\Profiles\zyx.default\Mail\Local Folders\



Close Thunderbird.

Copy the Mbox file to the above directory.

Run Thunderbird and find this mailbox in Local Folders.

Searching seems to work, but I haven't tested it extensively.  Right click on the mailbox name and select Search, which enables searching for words in the body of the message as well as in the Subject, Sender etc. etc.  Even with multipart messages which display blankly, as noted above, and with searching for text which is actually split over different lines in the source, such as searching for
half to find some comment
when it is in the source as:
half to find some co=
mment
seems to find the message OK.


Analysis of why some messages do not display text in Thunderbird

I haven't looked at this problem in detail.  Here are my initial impressions, comparing a message which was retrieved with grabyahoogroup (yahoo2mbox messages may have the same problem) which does not display anything in its body with Thunderbird, with the same message as sent by the Yahoo Groups system and stored in my local email system (which is our own Postfix and Courier IMAP server).

The errant messages have the following characteristics.  This is true of the individual message files in \MESSAGES\ and of the message when it is combined into the final Mbox format file.  Thunderbird reports the same text for these with View > Message Source.

In each comparison below, the first fragment (green) is from the good file and the second (red) is from the errant file.  Some of the header lines are in a different order.  I am not sure how significant this is.

- - - -

MIME-Version: 1.0

Mime-Version: 1.0 (Apple Message framework v1085)


- - - -

Content-Type: multipart/alternative;
 boundary="Apple-Mail-18--155222812"

Content-Type: multipart/alternative; boundary=Apple-Mail-18--155222812


- - - -


The first part of the body of the message:  

--Apple-Mail-18--155222812
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
    charset=us-ascii

--Apple-Mail-18--155222812 Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset=us-ascii


- - - -


At the end of the text of the first part, where "xxxx" is the single word on the last line:

xxxx
--Apple-Mail-18--155222812
Content-Type: text/html; charset=US-ASCII
Content-Transfer-Encoding: 7bit

xxxx --Apple-Mail-18--155222812 Content-Transfer-Encoding: quoted-printable
Content-Type: text/html;
charset=us-ascii


- - - -


The HTML is quite different.  At the very end of it:

</body></html>
--Apple-Mail-18--155222812--

</div><div><br></div><div>xxxx</div></body></html> --Apple-Mail-18--155222812--

Note that the errant file's text is all on one line.

- - - -


This is just from looking at a single message.  It looks really messy to fix.  No doubt I could write a C program to alter the text to resolve these problems, but it would be a lot of work to figure out what to do and to refine it to cope with the variations in likely thousands of non-displaying messages.
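As a starting point only, here is a Python sketch which addresses just one of the symptoms above: boundary markers which have been joined onto the neighbouring text.  It puts each "--boundary" and "--boundary--" marker back on a line of its own.  The function name isolate_boundaries is hypothetical, the boundary string must be supplied by the caller, and I have not tested whether this alone is enough to make Thunderbird display such a message.

```python
import re

def isolate_boundaries(text, boundary):
    """Put every '--boundary' or '--boundary--' marker on its own line."""
    marker = re.compile(
        r"[ \t]*(--" + re.escape(boundary) + r"(?:--)?)[ \t]*")
    out = []
    for line in text.split("\n"):
        if not marker.search(line):
            out.append(line)  # untouched, including blank lines
            continue
        # Splitting on a captured group keeps the markers in the result.
        out.extend(part for part in marker.split(line) if part)
    return "\n".join(out)

# Fragments in the style of the errant message shown above.
errant = (
    "xxxx --Apple-Mail-18--155222812 Content-Transfer-Encoding: 7bit\n"
    "Content-Type: text/html;\n"
    "\n"
    "</body></html> --Apple-Mail-18--155222812--"
)
fixed = isolate_boundaries(errant, "Apple-Mail-18--155222812")
```

It does not restore the blank line which should separate each part's headers from its text, nor the quoting and folding of the Content-Type header, so a complete fix would need more than this.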




