Archiving a Yahoo Group
Robin Whittle rw@firstpr.com.au 2013-09-25 (Page established 2013-09-24.)
Minor update 2013-10-01 and:
Update 2014-05-29: There is a commercial program, with free trial,
which enables download of some items (messages, photos) but not yet all
(files, but this is being worked on) from Neoized Yahoo Groups. I
haven't checked it out: http://www.personalgroupware.com .
What I write below is probably of limited value, since as
far as I know it only applies to non-Neoized accounts, and I guess all
or almost all Yahoo users have had their accounts Neoized by now.
<<< 1 - yahoo2mbox
<<< 2 - grabyahoogroup
<<< 3 - Reading the Messages Mbox file with Mozilla Thunderbird, and a preliminary analysis of why some messages do not display.
0 - Introduction
There are a number of cogent reasons
why someone would want to archive the contents of their Yahoo Groups,
especially the messages, which constitute a record of the discussions
which may go back many years:
- The Yahoo Group could disappear at any time. Yahoo might delete it.
- The owner of the group may be incapacitated, or unwilling to allow
anyone to continue contributing to or reading the archives of the Group.
- You may get sick of depending on any company, or on Yahoo in
particular, with their advertising, for the mailing list or discussion
forum you are part of, or running, and want to run it independently on
a mailing list manager such as GNU Mailman http://www.gnu.org/software/mailman/ or
perhaps this similarly open-source project which aims to support
messages and files in a way which might be roughly equivalent to Yahoo
Groups: http://groupserver.org/groupserver/features/ .
- The way Yahoo provides access to the Group may change in a way which makes it unusable.
This last point is what has happened recently for all Yahoo Groups, for
some user accounts initially and perhaps now for all user accounts. This is the Neo interface:
By all reports Neo disrupts the ability to moderate and do many other
things. For instance, with Neo, as far as I know, it is
impossible to view the original email or to display the message in a fixed
width font.
Update 2013-10-01:
There were some glitches in the Neo status of my account and I was able
to capture screenshots of the homepage of a Yahoo Group with and without
Neo:
In mid-September 2013 I discovered the introduction of Neo and set to work to archive the
contents of two Yahoo Groups. I found that the Neoization had not
been applied to my account yet, which enabled me to use these two Perl
scripts. I succeeded in getting most of the messages from one
Yahoo Group (using yahoo2mbox) and in getting all the messages, files,
attachments and photos from another (using grabyahoogroup).
On 2013-09-24 when I set about writing this page to document how I did
so, I found my account had been Neoized, so I can't do any of the
following any more. I wrote the following from notes I took
during the process.
Maybe these scripts will be updated to cope with Neo - but perhaps this will be impossible.
The following will only be of
practical interest to someone who has the username and password of a
Yahoo Group account which has not yet been Neoized. Maybe, by now
(2013-09-24) all accounts have been Neoized. If so, then the
following will be of no practical use to anyone.
All the following is written as if my account had not been Neoized. I found two Perl scripts which are
intended to archive a Yahoo Group:
1 - yahoo2mbox
This only works on messages, and while
it has one advantage - a header line in each message with the Yahoo
Groups message number - I used the next script instead. My
explanation of this script forms part of what is required to understand
the second:
2 - grabyahoogroup
This archives Messages, Attachments,
Files, Photos and the list of Member information.
(Later, I found a commercial program
(with a free trial version) which is intended to be able to download
messages from a Yahoo Group. The author is working on changes to
make it work with a Neoized account:
I have not investigated this yet.)
Neither of these will work with a Yahoo Groups account which has been
converted to Neo:
https://en.wikipedia.org/wiki/Yahoo_Groups#Yahoo.21_Groups_remodel_2013_.22neo.22.
Both these approaches face some challenges, including:
- Getting them to run on a particular computer, since they need a
variety of Perl modules AKA "Perl packages".
- The Yahoo Groups server recognizing what it regards as excessive
requests from a particular IP address for a particular user
account. It will then block any attempt to log in or get any file for
two or maybe more hours, with a "999" error message.
- The server doing unexpected things, including things the script's programmer did not anticipate.
In the following explanations:
uuuu = the Yahoo Groups user name.
pppp = the password for this account.
gggg = the name of the Yahoo Group.
aaa\bbb = the base
directory in which the Perl script is located and from which the MSDOS
box was running when I gave the commands below.
1 - yahoo2mbox
This is the work of Vadim Zeitlin in France:
Perl scripts are computer programs written in the Perl language:
https://en.wikipedia.org/wiki/Perl.
Perl is a good language for dealing with text strings and files.
Perl programs are typically interpreted - there is no separate compiled
executable file. Each time the script is run, the interpreter reads the
source text, compiles it internally and then executes it. To run a Perl
script, the computer needs to have a Perl interpreter installed.
These may be known as a Perl runtime, Perl binary distribution or
whatever. I will refer to them as Perl interpreters.
Side note on why I didn't run these two scripts under Linux:
Linux machines frequently have a Perl interpreter already installed.
Both these Perl scripts require additional Perl modules (AKA Perl
packages) as well as the basic Perl interpreter. These modules perform
particular functions such as accessing a remote server, so the authors
of these Perl scripts don't need to write code to do this. If a module
is not installed where the Perl interpreter can find it, the script
stops with an error message naming the first such module.
First I tried running the Linux version of yahoo2mbox on a Debian 7
Linux machine. It wouldn't run due to at least one missing Perl
package. When I tried to install the package with the usual cpan
system, I got into all sorts of confusion. It turns out to be
difficult to install Perl packages into Debian systems, for reasons
explained in this article, in which apparently well informed people
debate the merits of three seemingly undesirable alternatives: http://www.perlmonks.org/?node_id=753416 .
I tried installing Perl packages with cpan
in an ages-old CentOS system and got into nearly interminable cycles of
installing further packages to satisfy dependencies.
I was able to run yahoo2mbox on a Windows XP 32 bit machine by
installing a Perl interpreter from:
I installed ActivePerl-5.16.3.1603-MSWin32-x86-296746.msi . Once this
is installed, typing perl and the full name of a Perl script in the
current directory of an MSDOS box (Command prompt), and pressing Enter,
will cause the interpreter to execute the script.
yahoo2mbox only attempts to download the messages from a Yahoo Group.
I found that I needed to restart it frequently and to do so with some
thought. It collects the messages in a single text file which is an
Mbox format mailbox file:
Mbox (excuse my capitalization of this word)
https://en.wikipedia.org/wiki/Mbox contains one or more email messages,
with headers, with the start of each message being delineated by a line
starting with "From ".
Note the absence of a colon after this. If a line in the email
text starts with "From " or "from ", a '>' character is prepended to
this. This may not be reversed by whatever software reads the
file, so it is not surprising that messages in some email systems have
these extra '>' signs appearing in them.
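As a rough illustration of the Mbox conventions just described, here is a minimal Python sketch. It is not part of yahoo2mbox or grabyahoogroup. The function names split_mbox and unescape_body_line are hypothetical, and the sketch assumes the whole file fits in memory as text. Removing a '>' is only a guess: a line which really was a quoted ">From ..." in the original email cannot be told apart from an escaped one.

```python
import re


def split_mbox(text):
    """Split Mbox text into messages, each returned as a list of lines.

    A new message starts at every line beginning with "From " (no colon).
    """
    messages = []
    current = None
    for line in text.splitlines():
        if line.startswith("From "):
            current = [line]          # the separator line opens a new message
            messages.append(current)
        elif current is not None:
            current.append(line)      # header or body line of the current message
    return messages


def unescape_body_line(line):
    """Strip one leading '>' from a body line that looks like an escaped "From " line."""
    if re.match(r">+From ", line):
        return line[1:]
    return line
```

Each returned message begins with its "From " separator line, followed by the headers and body exactly as stored.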
Mbox files are convenient in some respects. However, for really
large numbers of messages, a single file is excessively long and
insertions and deletions take excessive CPU time. An alternative
storage for emails is the Maildir format, in which each message is a
separate file in some specially structured directories:
https://en.wikipedia.org/wiki/Maildir .
Both these Perl scripts have usage instructions inside them. By
editing the file with a text editor, these can be found and printed.
Once the ActivePerl interpreter was installed and I had an MSDOS box
running in my base directory (type C: or D: or whatever and Enter for
the relevant hard drive and then cd aaa\bbb to change to the desired
directory) then the following command line set yahoo2mbox to work:
perl yahoo2mbox-0.25.pl gggg --user=uuuu --pass=pppp
The program created an Mbox format file in the same directory, with the name being that of the Yahoo Group specified.
It stopped a number of times for various reasons. Sometimes it
would fail to get a particular message number, reporting this on the
console. Then it would succeed with multiple messages and fail to
get another. Other times it would fail to get any further messages.
When it did this, I used Control C (multiple times) to stop it.
If I restarted it in the same way as above, it would read the existing
file and resume from the last message number stored there, so it only
tried to get the messages which were missing from the end.
Eventually after half a day or so I got a file which contained almost
all of the 8,000 or so messages. I planned to run it again, or to
run grabyahoogroup, to have another go. However, my account was
Neoized. If I had done it a second time, I would have had two text
files with largely the same messages, and hopefully no message missing
from both. Then I was going to use an excellent program, for Linux,
Windows and Mac, called Beyond Compare (http://www.scootersoftware.com)
to compare the two files side by side. The missing messages would
be obvious and Beyond Compare has an easy way of copying each block
which is missing. This would have made a good file of the entire
set of messages.
A major benefit of this approach is that each message had an extra line in its header such as:
X-Yahoo-Message-Num: 4665
showing the message number in the archives. This is valuable,
since some people refer to other messages via their message number.
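The side-by-side merge of two partial downloads could, in principle, also be automated by using the X-Yahoo-Message-Num header as a key. The following is only a sketch under assumptions: the function name merge_by_message_number is hypothetical, it uses Python's standard mailbox module, it skips any message which lacks the header, and it has not been tried on a real Yahoo Groups archive.

```python
import mailbox


def merge_by_message_number(path_a, path_b, out_path):
    """Merge two Mbox files, keeping one copy of each numbered message.

    Returns the sorted list of message numbers written to out_path.
    """
    merged = {}
    for path in (path_a, path_b):
        box = mailbox.mbox(path, create=False)
        try:
            for msg in box:
                num = msg["X-Yahoo-Message-Num"]
                if num is None:
                    continue              # no number header: cannot be matched
                try:
                    key = int(num)
                except ValueError:
                    continue              # malformed number: skip it
                merged.setdefault(key, msg)   # first copy seen wins
        finally:
            box.close()

    out = mailbox.mbox(out_path)
    try:
        for key in sorted(merged):
            out.add(merged[key])
        out.flush()
    finally:
        out.close()
    return sorted(merged)
```

Keeping the first copy seen is an arbitrary choice. If one download is known to be more reliable, pass it as path_a.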
See the section below "3 - Reading the Messages Mbox file with Mozilla
Thunderbird" for how an Mbox file with thousands, even hundreds of
thousands, of messages can be read with Thunderbird.
2 - grabyahoogroup
This is a much more extensive approach, written over many years by
Mithun Bhattacharya:
I found this to be an excellent script. The most up-to-date files are
not in the file listed on the above page. To get the latest code, I
had to follow various links. These two Perl scripts are the latest
(ca. 2013-09-23), and these are all I used:
grabyahoogroup.pl 2013-08-17 mithun [r128] 1. Updated login form change in structure
mboxify.pl 2011-03-08 mithun [r118] Skip the mbox file if created previously while ...
As noted above, I was unable to get it running on Linux due to my
inability to install the required Perl packages, including one called
Crypt/SSLeay.
I was able to get it running on Windows, by using the program which was
also installed as part of the ActivePerl installation described above:
Start > Programs > ActivePerl 5.16.3 Build 1603 > Perl Package Manager
This may take some time at first to download the latest package information. When this is done, here is how to use it:
- Click the left box on the toolbar (mouseover: View all packages).
- In the search bar type in SSLeay.
- This should find a package Crypt-SSLeay. Select this.
- Click the icon to the right of the search bar (mouseover: Mark for install).
- Now the right green arrow two icons to the right lights up. Click it.
- When all is done, exit the program.
I used a text editor on grabyahoogroup.pl to find the usage instructions, which were not particularly clear, since there were no examples.
Messages
Here are the commands I used:
grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --messages --increasing
When it stopped, which it did many times over the two days or so it
took to get ~100,000 messages, I restarted it with the number of the
next message to get, based on the last one it reported getting OK, and
by looking into the destination directory (described below):
grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --messages --increasing --begin 12345
Eventually it finished, and presumably ran the mboxify.pl script which was in the same directory aaa\bbb\. The main phase of this operation involves getting messages and writing each one as a file, named, in order:
1
2
3
. . . .
10102
10103
in the directory:
D:\aaa\bbb\gggg\MESSAGES\
Some messages had been deleted from the Yahoo Group, so there
was no file for these. When the script finished, it created from
these, in the same directory, a file:
gggg.mbox
For this particular Yahoo Group, this file was 403,752,757 bytes.
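Working out the number to give to --begin after each stop, by looking into the MESSAGES directory, can be sketched in Python as follows. The function names are hypothetical. The sketch assumes, as described above, that each downloaded message is a file whose name is just its message number. A gap in the numbering may be either a message deleted from the Yahoo Group or a failed download; the file listing alone cannot tell these apart.

```python
import os


def downloaded_numbers(messages_dir):
    """Return the sorted message numbers for which a file exists."""
    return sorted(
        int(name) for name in os.listdir(messages_dir) if name.isdecimal()
    )


def next_begin(messages_dir):
    """Message number to resume from: one past the highest file present."""
    numbers = downloaded_numbers(messages_dir)
    return numbers[-1] + 1 if numbers else 1


def missing_numbers(messages_dir):
    """Numbers below the highest one present which have no file."""
    numbers = downloaded_numbers(messages_dir)
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(1, numbers[-1] + 1) if n not in present]
```

For example, next_begin(r"D:\aaa\bbb\gggg\MESSAGES") would give the value to use after --begin. Files such as gggg.mbox are ignored because their names are not purely numeric.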
Initially I didn't think I had any program on Windows which would read
this Mbox file (but see below for notes on Thunderbird), so I copied it
to a Linux machine and took a look at it with the View function of
Midnight Commander (http://www.midnight-commander.org).
I think all the messages are there, but there are no message numbers in
the headers of each message. Maybe I could modify the mboxify.pl
script to process the individual files again and make a single file in
which each message had the message number in a header line and/or in
the Subject: line and/or as the first line of the message. The
latter two would help if this archive of messages was loaded into
another system with a search engine, or if the text was simply searched
as a single file.
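Rather than modifying mboxify.pl, the same idea can be sketched in Python with the standard mailbox module. This is a hypothetical sketch, not a description of what mboxify.pl does. It assumes each numbered file in the MESSAGES directory holds one complete raw email (headers, a blank line, then the body), and it reuses the X-Yahoo-Message-Num header name which yahoo2mbox writes. The function name build_numbered_mbox is my own invention.

```python
import mailbox
import os


def build_numbered_mbox(messages_dir, out_path):
    """Combine numbered message files into one Mbox file.

    Each message gets an X-Yahoo-Message-Num header taken from its file
    name. Returns the number of messages written.
    """
    names = sorted(
        (n for n in os.listdir(messages_dir) if n.isdecimal()), key=int
    )
    out = mailbox.mbox(out_path)
    count = 0
    try:
        for name in names:
            with open(os.path.join(messages_dir, name), "rb") as f:
                msg = mailbox.mboxMessage(f.read())
            del msg["X-Yahoo-Message-Num"]      # avoid a duplicate header
            msg["X-Yahoo-Message-Num"] = name
            out.add(msg)
            count += 1
        out.flush()
    finally:
        out.close()
    return count
```

Putting the number in a header leaves the Subject: line and the body untouched. Adding it to those as well would be a small change inside the loop.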
These individual files and/or the single file would be sufficient to
populate the archives of a website which continued the discussions of
this Yahoo Group in another system.
The really useful - and I think unique - highlighting text search function of Mozilla Firefox:
Firefox can be used to open a text file on the local machine. It is
best to have the filename extension ".txt". Firefox will display the
file in fixed width font. I found I could open a .txt file of 14MB
OK - it just takes a minute or two to load. However, when I tried to
open a 50MB file, Firefox crashed.
It is possible to search for a piece of text using Edit > Find: and then to use the Highlight
option (this is on the bottom search bar, which appears with this
command) and in the entire file, all instances of the matching text
will be highlighted. Then it is possible to step back and forth
through these instances with Next and Previous.
Since the file I created was way too
long for most programs to use (but see below for notes on Thunderbird),
I used this nifty freeware program to split it into smaller chunks,
which I could edit with a text editor:
I chose 50MB chunks and found these could be edited without any fuss
using the widely used and highly regarded open source text editor for
Windows, Notepad++:
I don't like the tabbed approach to editing multiple documents, but
Notepad++ has the ability to open a session in a separate instance of
the program, by right clicking the tab itself. Very cool.
In this way I was able to make separate text files for messages in each
year. However, I needn't have bothered, since I discovered
Thunderbird was by far the best way of using this single file, which
contains 100,000 messages in 403 megabytes. See section 3 below.
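For anyone who does want one file per year, the splitting need not be done by hand in a text editor. A minimal Python sketch, assuming each message has a parseable Date: header, could look like this. The function names are hypothetical, and messages without a usable date are collected in a separate "unknown" file.

```python
import mailbox
import os
from email.utils import parsedate_to_datetime


def message_year(msg):
    """Return the year of the Date: header as a string, or "unknown"."""
    try:
        return str(parsedate_to_datetime(str(msg.get("Date", ""))).year)
    except (TypeError, ValueError):
        return "unknown"


def split_mbox_by_year(src_path, out_dir):
    """Write one Mbox file per year into out_dir; return counts per year."""
    os.makedirs(out_dir, exist_ok=True)
    src = mailbox.mbox(src_path, create=False)
    outputs = {}
    counts = {}
    try:
        for msg in src:
            year = message_year(msg)
            if year not in outputs:
                outputs[year] = mailbox.mbox(os.path.join(out_dir, year + ".mbox"))
            outputs[year].add(msg)
            counts[year] = counts.get(year, 0) + 1
    finally:
        for box in outputs.values():
            box.flush()
            box.close()
        src.close()
    return counts
```

The year comes from whatever the Date: header claims, so a message with a wrong clock setting will land in the wrong file.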
Attachments
I used this command line:
grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --attachments
This created a bunch of directories, each containing one or more attachment files in:
D:\aaa\bbb\gggg\MESSAGES\
together with an index.html
file there which listed the subject of the message, the date, the
number of attachments, the author and a link to the message at the
Yahoo Groups site itself. That link is now, for me, redirected by
the server to a new URL with the Neo interface.
Files
I used this command line:
grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --files
This created a bunch of directories, each containing one or more files, and each with a meaningful name about the contents in:
D:\aaa\bbb\gggg\FILES\
There was no HTML or any other index file.
Photos
I used this command line:
grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --photos
This created a bunch of directories, each containing one or more image files in:
D:\aaa\bbb\gggg\PHOTOS\
together with an index.html file
there which listed the name of each album (one for each numerically named
directory), the creator, the number of photos and the date it was last
modified.
Members
I used this command line:
grabyahoogroup.pl --username uuuu --password pppp --verbose --verbose --group gggg --members
This created an index.html file in:
D:\aaa\bbb\gggg\MEMBERS\
which lists:
- Yahoo ID.
- Name = Yahoo Groups account name.
- Real name. Only a few people filled this in.
- Age. Only a few people filled this in.
- Gender. Only a few people filled this in.
- Location. Only a few people filled this in.
- Email address, but not the server. So "joe_blow@...".
in these divisions:
- Members.
- Moderators.
- Bouncing. (These are for email accounts which the Yahoo Groups email server has determined are not responding properly.)
- Pending.
- Banned.
Thanks Mithun Bhattacharya for this great Perl script!
3 - Reading the Messages Mbox file with Mozilla Thunderbird
Thunderbird is an open-source email client program which runs on Windows, Linux and Mac:
Assuming we have an Mbox file from one of the above scripts and we want
to read it and/or search it, for our own purposes, there are a number of
problems in doing so using a text editor, or a simple text viewing and
searching system such as opening the file as a .txt file in Firefox.
Firstly, the file can be too long for
these programs. I couldn't find a text editor which would handle
a 403MB file. Firefox can't handle such sizes either.
Secondly, while the emails are (ideally), to our eyes, relatively simple
pieces of text, the way they are encoded in the file can look very
different. They could be in HTML. Even ordinary "plain
text" emails can be encoded in various ways, such as "quoted printable"
which frequently places a '=' character and a newline character in the
middle of words and encodes some characters in forms such as:
"=C2=A0". It is also possible for an email's body of
text to be UUEncoded, which makes it look like a block of completely
impenetrable gobbledygook.
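To show what "quoted printable" does to the text, here is a minimal Python sketch using the standard quopri module. The function name decode_quoted_printable is hypothetical, and UTF-8 is only an assumed default: the real character set is given by the charset parameter of each message's Content-Type header.

```python
import quopri


def decode_quoted_printable(raw, charset="utf-8"):
    """Decode quoted-printable bytes into readable text.

    An '=' at the end of a line is a soft line break and is removed, so
    words split across lines are rejoined. '=XX' is one byte in hex.
    """
    return quopri.decodestring(raw).decode(charset, errors="replace")
```

Decoding b"a=C2=A0b" this way gives "a", a non-breaking space, then "b", since C2 A0 is the UTF-8 encoding of that character.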
These problems are resolved by loading the Mbox file into the "Local
Folders" section of Thunderbird. I found Thunderbird was perfectly happy
with a 403MB 100,000 message Mbox file. There was no problem
sorting the view of this mailbox by date, by subject, by sender etc. or
viewing a list of messages which had a certain piece of text in
their Sender, Recipients or Subject line etc. For those who are
unfamiliar with Thunderbird, here is an image of what it looks like
searching for some text in the abovementioned lines:
Thunderbird-100k-messages.png . Note that Yahoo Groups deletes the
server part of all email addresses.
However, some messages did not display any text when opened. For
each such message, it is possible to view the source text of the
message with View > Message Source, in which case the text we want
to read might be plainly readable, or might be in HTML or
"quoted-printable". I haven't figured out what the cause of this
non-display is, but quite a few of the errant messages include this in
their headers:
Content-Type: multipart/alternative; . . .
Here is how I got the Mbox file into my Windows XP Thunderbird:
- The file itself doesn't need to be altered or renamed. Any name will do.
- Find your Thunderbird Profile Directory with the instructions here:
../../web-mail/Mozilla-mail/ . Let's call this directory (on a Windows machine)
C:\xx\yy\zz\ .
- From that directory, find with Windows Explorer or whatever where
Thunderbird keeps its local folders AKA local mailbox files, which will
be something like C:\xx\yy\zz\Mail\Local Folders\. For instance, on my
WinXP and Windows 7 machines respectively, with "xyz" and "zyx" here
standing in for a pseudo-random directory name:
C:\Documents and Settings\Robin\Application Data\Thunderbird\Profiles\xyz.default\Mail\Local Folders\
C:\Users\Robin\AppData\Roaming\Thunderbird\Profiles\zyx.default\Mail\Local Folders\
- Close Thunderbird.
- Copy the Mbox file to the above directory.
- Run Thunderbird and find this mailbox in Local Folders.
Searching seems to work, but I haven't tested it extensively.
Right click on the mailbox name and select Search, which enables
searching for words in the body of the message as well as in the
Subject, Sender etc. This seems to find the message OK even with
multipart messages which display blankly, as noted above, and even when
the text is actually split over different lines in the source, such as
searching for
half to find some comment
when it is in the source as:
half to find some co=
mment
Analysis of why some messages do not display text in Thunderbird
I haven't looked at this problem in
detail. Here are my initial impressions, comparing a message
which was retrieved with grabyahoogroup (yahoo2mbox messages may have the same problem)
which does not display anything in its body with Thunderbird, with the
same message as sent by the Yahoo Groups system and stored in my local
email system (which is our own Postfix and Courier IMAP server).
The errant messages have the following characteristics. This is true of the individual message files in \MESSAGES\
and of the message when it is combined into the final Mbox format
file. Thunderbird reports the same text for these with View >
Source.
In each pair of fragments below, the first is from the good file and
the second is from the errant file. Some of the header
lines are in a different order. I am not sure how significant
this is.
- - - -
MIME-Version: 1.0
Mime-Version: 1.0 (Apple Message framework v1085)
- - - -
Content-Type: multipart/alternative;
boundary="Apple-Mail-18--155222812"
Content-Type: multipart/alternative; boundary=Apple-Mail-18--155222812
- - - -
The first part of the body of the message:
--Apple-Mail-18--155222812
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset=us-ascii
--Apple-Mail-18--155222812 Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset=us-ascii
- - - -
At the end of the text of the first part, where "xxxx" is the single word on the last line:
xxxx
--Apple-Mail-18--155222812
Content-Type: text/html; charset=US-ASCII
Content-Transfer-Encoding: 7bit
xxxx --Apple-Mail-18--155222812 Content-Transfer-Encoding: quoted-printable
Content-Type: text/html;
charset=us-ascii
- - - -
The HTML is quite different. At the very end of it:
</body></html>
--Apple-Mail-18--155222812--
</div><div><br></div><div>xxxx</div></body></html> --Apple-Mail-18--155222812--
Note that the errant file's text is all on one line.
- - - -
This is just from looking at a single message. It looks really
messy to fix. No doubt I could write a C program to alter the
text to resolve these problems, but it would be a lot of work to figure
out what to do and to refine it to cope with the variations in likely
thousands of non-displaying messages.
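A first step towards any such repair would be finding which messages are affected. The sketch below, in Python rather than C, looks for the one symptom noted above: a MIME boundary delimiter which shares its line with other text. It is based only on the single message examined here. The function names are hypothetical, and the way it finds the boundary (the first "boundary=" in the raw text) is a simplification which ignores nested multipart messages and boundaries containing spaces.

```python
import re


def find_boundary(raw_message):
    """Return the first MIME boundary named in the raw text, or None."""
    match = re.search(r'boundary="?([^";\s]+)"?', raw_message, re.IGNORECASE)
    return match.group(1) if match else None


def misplaced_delimiters(raw_message):
    """Return lines where a boundary delimiter is not alone on its line."""
    boundary = find_boundary(raw_message)
    if boundary is None:
        return []
    delimiter = "--" + boundary
    allowed = (delimiter, delimiter + "--")   # opening/middle and closing forms
    return [
        line
        for line in raw_message.splitlines()
        if delimiter in line and line.strip() not in allowed
    ]
```

A message for which misplaced_delimiters returns a non-empty list is a candidate for the non-display problem. Whether moving the delimiters onto their own lines would be enough to make Thunderbird display such a message is untested.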