Rendering Everything as Text
by Philip Hollenback
Originally Published May 26, 2005 on LinuxDevCenter.
You can display any computer data as text. For many types of data, this is obvious: we've all seen HTML converted to text right in our web browsers. However, this idea can extend much further. Although the notion of converting all data to text may not sound immediately useful, it can be surprisingly powerful. Why wait for graphical web browsers or image editors to load before you find out what's in a file? With a few helper applications and scripts, you can quickly display textual information about any type of data.
Why Go to All This Work?
You might wonder why anyone would go through the trouble of figuring out how to display all files as text, especially in this day of 21-inch LCD screens. One big reason is a uniformity of experience: you never have to leave your mail reader to evaluate a file. Suppose that someone sends you a Word document. To learn anything at all about this file, you have to first save it and then run OpenOffice (or maybe even Microsoft Word). Then you have to sit around and wait for Word to load, wait for the file to load, and so forth. Finally, you find out that the attachment is just a copy of this month's TPS report with a note that you forgot the cover sheet.
If you have your text-based mail reader and the helper tools configured properly, you can bypass those extra steps. With the file immediately converted into text right inside your mail program, you can easily see that there's no need to run OpenOffice - you just need to send a reply email.
Another big advantage of the anything-to-text approach is that it helps you avoid viruses and other Internet shenanigans. Spammers often send bizarrely formatted HTML mail to disguise their actions. If you convert everything to text before viewing it, you can clearly see what is going on. You have also avoided your GUI and web browser, where viruses often attack.
The Tools
This technique revolves around the command line. In particular, converting files to text is most useful in command-line mail clients such as mutt. Of course, there's no reason you can't convert an attachment to text in a graphical mail program and then display it in an Xterm. Remember also that you have just the command line when you SSH into a remote system, unless you take additional steps such as forwarding X over SSH.
Additionally, while these ideas are oriented toward Unix and Linux in particular, with modifications they apply to other systems. In particular, all of this works on Mac OS X, which I use on a daily basis.
It's important to understand how mail clients process attachments. The mailcap file controls how each attachment is processed (first your private ~/.mailcap and then the shared /etc/mailcap). Every email attachment has a MIME type, which is assigned by the sending mail program. Whenever a MIME-aware program encounters a MIME type (such as text/html), it consults the mailcap to find a matching entry. Each line in the mailcap file constitutes one entry. If the MIME attachment is of type text/html, the matching mailcap entry might be:
text/html; view-html %s; copiousoutput; nametemplate=%s.html
which instructs the calling program (mutt) to use the view-html program on all text/html attachments.
That's just a quick overview of the mailcap mechanism. An excellent resource for more details is the mutt manual.
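To make the lookup concrete, here is a minimal shell sketch of how a program might find the command for a MIME type. It is an illustration only, not how mutt is implemented: the function name mailcap_lookup is made up for this example, and it handles only exact type matches with one entry per line (no wildcards such as image/*, no escaped semicolons, no test= fields).

```shell
#!/bin/sh
# Hypothetical helper: print the command field of the first mailcap entry
# whose type field exactly matches the requested MIME type.
# Usage: mailcap_lookup TYPE MAILCAP_FILE...
mailcap_lookup() {
    mime_type=$1
    shift
    for f in "$@"; do
        # Skip mailcap files that do not exist or cannot be read.
        [ -r "$f" ] || continue
        result=$(awk -F';' -v t="$mime_type" '
            /^[ \t]*#/ { next }                      # ignore comment lines
            {
                type = $1
                gsub(/^[ \t]+|[ \t]+$/, "", type)    # trim whitespace
                if (type == t) {
                    cmd = $2
                    gsub(/^[ \t]+|[ \t]+$/, "", cmd)
                    print cmd
                    exit
                }
            }' "$f")
        if [ -n "$result" ]; then
            printf '%s\n' "$result"
            return 0
        fi
    done
    return 1    # no matching entry in any file
}
```

Calling mailcap_lookup text/html "$HOME/.mailcap" /etc/mailcap searches the private file first and then the shared one, which mirrors the order described above.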
A Basic Example
Let's start with the HTML attachment. This is easy to convert to text since it already is text with additional markup. You could send the raw HTML out to the console as a start. However, you can do better. Sending the HTML through a text-based web browser can preserve some of the original formatting such as paragraphs and tables. Either the standard Lynx text web browser or a more sophisticated text browser such as w3m will work fine. The view-html script from the previous section might look like:
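#!/bin/sh
# Render the HTML file named in $1 as plain text on stdout.
if type w3m >/dev/null 2>&1; then
    # Map non-breaking spaces (octal 240) to plain spaces.
    # The replacement argument was lost in the source; a space is assumed.
    w3m -T text/html -cols 80 -dump "$1" | tr '\240' ' '
elif type lynx >/dev/null 2>&1; then
    lynx -dump -force_html /dev/fd/0 <"$1"
else
    echo "$0: can't find w3m or lynx" >&2
    exit 1
fi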
The idea is to call either w3m or Lynx and tell that program to dump
the rendered output to stdout as text. That odd /dev/fd/0
file in the Lynx command line is necessary to trick Lynx into
accepting data on standard input - it says to open file descriptor 0,
which is stdin.
Now using the mailcap entry from the last section and the script above, mutt can display any HTML as text. Were you wondering what that copiousoutput thing in the mailcap is about? That flag tells the calling program that the results from the mailcap entry will be text output, with no interaction necessary. Entries without that flag may require user interaction; for example, if you sent an image to a graphical image viewer. Mutt can use this information to display the text inline while you are viewing a message, instead of making you go to a separate screen. To enable this, add auto_view entries to your ~/.muttrc config file for each MIME type you wish to view as inline text, like this:
auto_view text/html application/msword
Keep in mind that many data formats are easy to convert into HTML, so this recipe is a useful building block for other conversions.
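One way to make that building block reusable is to wrap the HTML-to-text step in a filter that reads HTML on standard input. The sketch below is my own arrangement rather than one of the original scripts: the function name html_to_text is hypothetical, and it assumes w3m reads standard input when no file is named and that Lynx accepts /dev/fd/0 as described earlier.

```shell
#!/bin/sh
# Hypothetical filter: read HTML on stdin, write rendered text on stdout.
html_to_text() {
    if command -v w3m >/dev/null 2>&1; then
        # With no file argument, w3m reads the document from stdin.
        w3m -T text/html -cols 80 -dump
    elif command -v lynx >/dev/null 2>&1; then
        # /dev/fd/0 makes Lynx read the document from stdin.
        lynx -dump -force_html /dev/fd/0
    else
        echo "html_to_text: can't find w3m or lynx" >&2
        return 1
    fi
}
```

Any converter that writes HTML to standard output can then be piped straight into html_to_text.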
Extracting Text from Microsoft Files
The closed nature of the Microsoft programs and their associated data files makes it highly challenging to extract text from them. However, plenty of people have worked diligently on these data files to achieve a large measure of success. It is possible to extract the text from Word, Excel, and PowerPoint files, thanks to wvHtml, xlhtml, and ppthtml. The wvHtml program is part of the wvWare suite. The other two programs are part of the xlhtml package.
I said previously that the HTML-to-text conversion is a useful stepping-stone. Here is an example of that: the tools for the Microsoft files all convert to HTML. By piping the output of (for example) xlhtml through a text-mode HTML viewer, you can often obtain very readable text. Here's the sample script, similar to the one above for HTML to text:
#!/bin/sh
# Convert the Excel file named in $1 to HTML, then render it as text.
if type w3m >/dev/null 2>&1; then
    xlhtml "$1" 2>/dev/null | w3m -T text/html -cols 80 -dump | tr '\240' ' '
elif type lynx >/dev/null 2>&1; then
    xlhtml "$1" 2>/dev/null | lynx -dump -force_html /dev/fd/0
else
    echo "$0: can't find w3m or lynx" >&2
    exit 1
fi
Again, it's good to use w3m if you have it. This is particularly true with Excel files, as the table rendering in w3m is so much better than the rendering in Lynx.
The process is much the same for Microsoft Word files, but you have to play some tricks with wvHtml to make it send the file to stdout:
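#!/bin/sh
# Convert the Word file named in $1 to HTML, then render it as text.
# Naming /dev/fd/1 as the output file makes wvHtml write to stdout.
if type w3m >/dev/null 2>&1
then
    wvHtml "$1" /dev/fd/1 2>/dev/null | w3m -T text/html -cols 80 -dump |
        tr '\240' ' '
elif type lynx >/dev/null 2>&1
then
    wvHtml "$1" /dev/fd/1 2>/dev/null | lynx -dump -force_html /dev/fd/0
else
    echo "$0: can't find w3m or lynx" >&2
    exit 1
fi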
The basic approach for data files that are mostly text based is pretty simple: find a utility to convert the file to HTML and convert that HTML to text. Again, the mailcap file determines how to process a file (or MIME attachment). Here are the entries for the Microsoft file formats:
application/msexcel; view-excel %s; copiousoutput; nametemplate=%s.xls
application/msword; view-msword %s; copiousoutput; nametemplate=%s.doc
The nametemplate= entry ensures that the file goes to the conversion program with a proper file extension. Some programs insist on the correct extensions for files.
One big annoyance you will see quite often with Microsoft file formats is a MIME attachment with type application/octet-stream. Basically, that is the default MIME type. If the sending program can't (or won't) figure out what kind of file it is sending, it can just throw up its hands and say, "Hey, here's a stream of bytes - you figure it out." Using the power of Unix/Linux on the receiving end, you can fix that problem. The octet-filter script uses the file extension and calls the file utility to reconstruct the proper MIME type. Then it hands the file off to the right helper. The proper mailcap entry is:
application/octet-stream; octet-filter %s;copiousoutput
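The type-guessing half of such a filter might look something like the sketch below. This is not the actual octet-filter script: the function name guess_mime_type and the short extension table are assumptions made for illustration, and the fallback relies on the -b and --mime-type options of the common file implementation, which older versions may not support.

```shell
#!/bin/sh
# Hypothetical sketch: guess a MIME type from the file name, falling back
# to content sniffing with the file utility.
guess_mime_type() {
    case $1 in
        *.[Dd][Oo][Cc]) echo application/msword ;;
        *.[Xx][Ll][Ss]) echo application/msexcel ;;
        *.[Hh][Tt][Mm]|*.[Hh][Tt][Mm][Ll]) echo text/html ;;
        *.[Tt][Aa][Rr]) echo application/x-tar ;;
        *)
            # Unknown extension: ask file to inspect the contents.
            # -b suppresses the file name, --mime-type prints only the type.
            file -b --mime-type "$1"
            ;;
    esac
}
```

A real filter would then dispatch on the result, for example by running view-msword when the guess is application/msword.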
Two other Microsoft formats worth mentioning are RTF files and TNEF attachments. RTF (Rich Text Format) is a simple text-based markup language, and TNEF is a mechanism that Microsoft servers use to encapsulate MIME data for Microsoft clients. Again, there are utilities to handle both of these, such as TNEF, the TNEF decoder, and rtfreader, an RTF-to-text converter. The mailcap entries are:
application/ms-rtf; rtfreader %s; copiousoutput
application/ms-tnef; tnef2txt %s; copiousoutput
Falling Back to File Manifests
What is the textual representation of a ZIP file? The best answer I have come up with is a file manifest. This is the answer for any MIME attachment that is a collection of files. Examples include .zip, .tar, and .jar files. In each case, you can run the corresponding command on the attachment to list the files within. This is certainly better than doing nothing, because it gives the user a chance to see what he's downloaded before opening it up. Here's the mailcap entry to generate a manifest for a .tar file:
application/x-tar; tar -tf - ; copiousoutput;
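If you would rather point several archive types at one helper, a small dispatcher along these lines could work. Treat it as a sketch: the function name archive_manifest is hypothetical, the choice of extensions is mine, and it assumes tar, gzip, and unzip are installed.

```shell
#!/bin/sh
# Hypothetical helper: list the contents of an archive without extracting it.
archive_manifest() {
    case $1 in
        *.tar)          tar -tf "$1" ;;
        *.tar.gz|*.tgz) gzip -dc "$1" | tar -tf - ;;
        *.zip|*.jar)    unzip -l "$1" ;;     # a .jar file is a ZIP archive
        *)
            echo "archive_manifest: unsupported archive: $1" >&2
            return 1
            ;;
    esac
}
```

Because the helper takes a file name rather than reading standard input, a mailcap entry that calls it would pass %s.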
The manifest idea also applies for images (although there's a much more creative approach in the next section). At the very least, you can extract some basic data from the image and display it. Typically this includes the file size, number of colors, and embedded comments. The identify program (which comes with the ImageMagick collection of image tools) prints the following information about a JPEG file:
Format: JPEG (Joint Photographic Experts Group JFIF format)
Geometry: 195x195
Class: DirectClass
Type: true color
Depth: 8 bits-per-pixel component
Colors: 25594
Resolution: 28x28 pixels/centimeter
Filesize: 12.1k
Interlace: None
Background Color: grey100
Border Color: #DFDFDF
Matte Color: grey74
Dispose: Undefined
Iterations: 0
Compression: JPEG
comment: Test Image
signature: 7e546210e516fd2e870ee9df47f0bfc15a9ec0d431c5abeb5a92cf0e811f9f2a
Tainted: False
User Time: 0.0u
Elapsed Time: 0:01
Again, that's better than nothing, right? Here's the mailcap entry:
image/*;identify -verbose %s;copiousoutput
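The verbose report is long, so you may prefer to keep only a few fields. Here is one possible filter; the function name image_summary and the choice of fields are my own, and it assumes the field labels look like the ones in the sample output above.

```shell
#!/bin/sh
# Hypothetical filter: keep a few headline fields from "identify -verbose"
# output arriving on stdin. Leading whitespace before a label is allowed.
image_summary() {
    grep -E '^[[:space:]]*(Format|Geometry|Filesize|[Cc]omment):'
}
```

You would use it as identify -verbose picture.jpg | image_summary, where picture.jpg stands in for the attachment.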
You can turn MP3 files into text in a similar way. The ID3 standard defines a set of text tags such as artist and title that can be embedded into an MP3 file. A utility such as id3v2 can extract this information and display it as text:
$ id3v2 -l Yeah_Yeah_Yeahs-Machine.mp3
TT2 (Title/songname/content description): Machine
TP1 (Lead performer(s)/Soloist(s)): Yeah Yeah Yeahs
TAL (Album/Movie/Show title): Machine
TYE (Year): 2002
TCO (Content type): Rock (17)
With a little formatting, that makes a great text representation of the file.
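As a rough idea of what "a little formatting" could mean, the sketch below shortens each tag line to the first part of its description plus the value. The function name format_id3 is hypothetical, and the pattern assumes the line layout shown in the sample output above; a value that itself contains "): " would be split in the wrong place.

```shell
#!/bin/sh
# Hypothetical filter: turn lines such as
#   TT2 (Title/songname/content description): Machine
# into
#   Title: Machine
# Lines that do not look like tag lines are dropped.
format_id3() {
    sed -n 's/^[A-Z0-9]\{3,4\} (\([^/:]*\).*): \(.*\)$/\1: \2/p'
}
```

Typical use would be id3v2 -l song.mp3 | format_id3, with song.mp3 standing in for the attachment.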
A More Complicated Technique: Images to Text
A summary of an image is pretty interesting. Wouldn't it be even better if you could see some sort of textual representation of the image itself? This final conversion does just that. It's more a cute hack than a real tool, but it does illustrate my mantra of anything-to-text.
The trick here is to use the aalib ASCII-art library. aalib is a graphics driver that displays images using only ASCII characters, in the style of the old line-printer art. The aalib algorithms are smart enough that the result is something that looks vaguely like the original image from a few feet away.
The viewer that uses aalib to convert images to text is asciiview. Unfortunately, it doesn't work exactly like a filter, so a bit of scripting is necessary. This Perl script:
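#!/usr/bin/perl
# Run asciiview on the image named in the first argument and print one frame.
$ARGV[0] || die "must supply image file";
# "echo q" feeds asciiview a quit keystroke so it exits after drawing.
open(ASCII, "echo q | asciiview -driver stdout -kbddriver stdin $ARGV[0] 2>/dev/null |")
    or die "failed to open $ARGV[0]";
# Skip everything up to the first form feed.
while(<ASCII>) { last if /\x0C/; }
# Print lines until the next form feed.
while(<ASCII>) { last if /\x0C/; print; }
close ASCII;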
takes any image and displays it as text. Figure 1 shows the original image.
Here's the result from asciiview:
|=+=|==++)SZZZZZXXXd#Z211YSYSZ####qpoZX#mqmXA2+:.:-:S2XXX(
=%vxliliii|=*UZXmX##Zexl*???Tqgu*S1dX#ZmZ#XXZXma::..::::!S(
)xXa%|||++|=a3XXmZ#UXvissaauSixYWmApdXmX#Z#mXZXXXoi,..:==||;
=xuXXi====<xuSXYXmZXoIl*?SouyXmau3ZZm2XXXZdXZS2nnXn>-.=+|=+`
=vnoov+=;:<xX1x1wdZX2nxlss%xi?X##mXqon22SnZXXXoxnvnn=::|;==;
<lv1i|=;:=sxlx3Sx2XXXXXXnoX1XonaoXZ#ZZZoox12SonnxYnnn>;=+=:
)=;:-=++|iiiiIiixXXZ#Z#Z#ZXXXXXXXXSXS2oXnx%{inxS1x1nxli===;.
:==<=xi|i||||iivXXZUZ###UUZ#ZUZZZZZXXXZXooxxivIliiIIxIxi;::.
:||xnns<n|||||xnXXZZ#Z#Z#U##Z#########Z#ZZXXonlii|||iino>==.
:|+}IoodXo==+ioXXZZ#Z################Z##Z#ZZZoi||||>+<XSoc:.
:+auXXZZXX===|oSXZZZZUZUZZZUZ#ZZZZXXSSX#ZZ#ZUXi|+=+||a1XXoc:
=XSnXoXSXXcii|*l*!!++++IXSXXXX2I|=+++!!!YSXZZl=iii1o2onS2o;
)2noXXonxXXss:===...:....-+S1::;:..:.:==+|=suai|xXoZX2o2S2(
=nnInSXXoxxXo==|+=;;:.:===d#mmc;::-:::=||||xXZXnxoXZ22ooSnS(
)Ioox*noXXsIS%oaaa>||><saX##m##qa%|=|=|<aau#ZZZXd#ZXn2nzxxS(
)lv2on%InSoIl1XXXXZ#ZZXXXX##m#Z#ZXXZ#Z#UZZ#ZUZZUZXZSnnos++<;
)lvxISosiI1n||2XXZXX2noXYSSXXXSYSXXoX1ZXZZZZZZUZZZSzvxni:=<;
)xx}xI1ns<||{io22onvuXZXooxxxnuXZ#ZZXonnSXXXZZZX1nn1xvni:=+;
=nnncxlx%i|;:;3o2SoXX????Y!?Y??YY?YYY3XoXXXXZZZXoviun1li=;=:
)vosi1|x%l=::=)n2SXXXoc|YISXXZ#X2nlisuXZZXZXXXXonn21Iil>::;.
=vnonx%s<ii>||<i12SSSXXaaxx12121XXXZZZZXZXZXXXZZ2IllvIx(..:.
=vvnnvvxiii==+|||{1XSXXSSXXoXXXXXXZZZZZZZXS2SXZZm,:.-.:<;::.
-{nxvx1%i|i=;++==+|i*XXXZZZ##Z#U#Z#ZZZZS11ndXZ#ZZ#a;:.=x|+=,
.:+1nIx1ii|=.==|+|||i|*S2XXXXXXXXXXYY1IxxSSXZUZZ#U##a=unl%%;
.:;=*|l*i|==||=:;==+==|iI1IIIIIIIllIvvooSXXZZZUZ#ZZ(x1svno(
As you can see, the converted image won't win any awards for fidelity, but it is vaguely similar to the original. The key point here is that you could quickly evaluate the text image and decide whether it was worth it to open the original in a graphical viewer. For example, some people send email with tiled background images. When you are using a text-based mail reader, you can never tell what this image is until you open it in a viewer. When you see that it's a piece of notebook paper with pink flowers on it, you realize what a waste of effort it was to open it. Maybe if you could take a quick look at the ASCII version of the image you could save some time.
I should emphasize that it takes way too much processing power to create the ASCII image. This makes the conversion of images into ASCII more of a technology demonstration than a useful tool, unless you have a lot of cycles to burn.
Conclusion
There's no reason to limit yourself to the GUI world! The console screen may at first seem like a step backward in technology. However, you may find (as I have) that you can perform many tasks more efficiently without the graphical distraction. If you're a mutt user, you may find my mutt scripts useful.
As I said in the introduction, you can convert any computer data to text in some way. With text-based data such as HTML or Word documents, the conversion is quite faithful to the original. However, even with completely nontext data, some text representation is always possible. You can check file manifests and text tags for text descriptions. Finally, images can become surprisingly realistic ASCII art (for some definitions of realistic).
The quest then becomes the search for the best text representation of each type of data. For example, is there a way to describe music as text? We already do that with scores. Perhaps it will be possible to use some pattern matching and web search to find the lyrics for a song.
With a little clever application of mailcap entries and helper scripts, you can convert any data into some form of textual representation. The end result is a more efficient work environment as you avoid the overhead of the GUI. As an added benefit, you might even be a little more secure as you reduce your exposure to mail viruses.