Saturday, November 2, 2013

Make sense of a lot of messy text with TreeLine

Marco Fioretti takes a look at TreeLine, an open source outliner for Linux, and explains how he uses it to reformat and structure huge, messy, pre-existing blocks of text.

A software outliner is a program that lets you create and store units of data, normally text, in a more or less hierarchical structure. TreeLine is an open source outliner for Linux that, according to its own web site, can double as a Personal Information Manager (PIM). I've tried TreeLine and, while I can confirm both of those assertions, I mostly use it to make sense of pre-existing documents. I'll explain what I mean in a moment. First, let's take a look at the main features of TreeLine.


Getting started with TreeLine

Installing TreeLine is really simple. If there isn't a ready-made binary package for your distribution, all you have to do is download the compressed archive from the web site, unpack it in a temporary folder, and run the included script as root.

TreeLine can manage tree structures made of generic nodes. Each node can contain several fields, more or less like the single records of a relational database. Out of the box, the software contains templates with nodes (Figure A) that can store Books, Contacts, or ToDo Lists, plus generic plain texts or HTML documents.

Figure A
TreeLine templates.
The node fields can be dates, numbers, URLs, booleans (variables with only two opposite values, like Yes or No), and generally anything that is or can be expressed with text. Once you've become familiar with the program, you can define both new types of templates and nodes and the way TreeLine displays them in its window. 

The left panel of TreeLine shows the nodes, either as a tree or a flat list. The right-hand panel, instead, has three tabs: one displays the currently selected node(s), the others are for editing either the content or just the title of a node. Basic search and filter capabilities help you to find the nodes that contain certain strings quickly.

Getting stuff in and out of TreeLine

When it comes to importing, TreeLine is pretty good. It will load OpenDocument files, several bookmark formats (Figure B), and various types of text files (Figure C). While TreeLine native files can be compressed and/or encrypted automatically, they're nothing but XML -- terribly verbose, yes, but plain text that you may easily process with any text parsing tool that you're familiar with.
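Because an uncompressed, unencrypted TreeLine file is plain XML, a few lines of scripting are enough to look inside one. The sketch below uses Python's standard `xml.etree.ElementTree` module and makes no assumption about TreeLine's actual element names: it simply walks whatever XML it is given. The `SAMPLE` string is a made-up stand-in for a real file, and `outline` and `count_elements` are hypothetical helper names chosen for this illustration.

```python
import xml.etree.ElementTree as ET


def outline(xml_text, max_depth=2):
    """Return indented lines showing each element's tag and a text preview."""
    root = ET.fromstring(xml_text)
    lines = []

    def walk(elem, depth):
        if depth > max_depth:
            return
        text = (elem.text or "").strip()
        # Show at most 40 characters of text next to the tag name.
        label = elem.tag + (": " + text[:40] if text else "")
        lines.append("  " * depth + label)
        for child in elem:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


def count_elements(xml_text):
    """Count every XML element in the document, root included."""
    return sum(1 for _ in ET.fromstring(xml_text).iter())


# Invented sample data; a real TreeLine file will use different names.
SAMPLE = """<ROOT item="y">
  <Name>Essay notes</Name>
  <NOTE item="y"><Name>First idea</Name></NOTE>
  <NOTE item="y"><Name>Second idea</Name></NOTE>
</ROOT>"""

if __name__ == "__main__":
    print(count_elements(SAMPLE), "elements")
    print("\n".join(outline(SAMPLE)))
```

To try this on a real file, read its contents into a string first and pass that to `outline`; opening the file in a text editor beforehand is the quickest way to see which element names your TreeLine version actually writes.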

Figure B
TreeLine files.

Figure C
Text import methods.
The export function is equally flexible (Figure D), and it's also possible to generate HTML or PDF versions of your notes.

Figure D
TreeLine export function.


Pros and cons of TreeLine

Even if you just stuck to the few main functions I've mentioned, TreeLine could be very useful. The interface puts all the buttons you need to enter, move around, indent, or edit nodes in plain view. In my experience, the stable version of TreeLine (1.4.1) has two issues:
  • When dealing with big trees (trees with thousands of nodes, with a total file size of about 15 MB), TreeLine is slow. It can take several seconds to move from one point of the tree to another, or to open or move a node.
  • It's not possible from the user interface to merge several nodes into one. The TreeLine author said that this function may be added with a plugin. However, since TreeLine files are just XML, it's possible to write quick and dirty scripts that do the same thing outside of TreeLine (see my final notes below).

Reverse outlining

The main reason I launch TreeLine these days is for what I call "reverse outlining" -- that is, instead of sorting data and notes into one well-structured tree as they come to me, I use this program to reformat and structure huge, messy, pre-existing blobs of text. 

For example, I had a big text file of notes for an essay that I'd like to write some day, and it was as chaotic as they come -- an embarrassing collage of handwritten reminders, whole mailboxes, web pages copied and pasted with comments, headers and all, by many different people, for a total of thousands of paragraphs that were formatted in almost as many different ways. 

Transforming that mess into something usable as a reference for the actual work of writing the essay would have taken weeks if I had had to do it with a normal text editor. Fortunately, TreeLine loaded and converted the whole thing into its internal format without any particular problems. At that point, I was able to edit and rearrange paragraphs from the TreeLine interface (Figure E).

Figure E
The TreeLine interface.
It was slow work, due to the size of the file. However, it was still much quicker and more structured than attempting the same task with a normal text editor or word processor. I highly recommend that you try TreeLine whenever you have a problem similar to mine.

Final note

Personally, I've solved the "cannot merge nodes" problem in a very quick and dirty way. When I have many consecutive nodes that I want to merge, I mark the first one with the "MERGE_HERE:" string at its beginning (Figure F), save the file, and close TreeLine. 

Figure F
Merging nodes workaround.
Then, I run a Perl script that merges that paragraph and all of the consecutive paragraphs at the same level of the hierarchy. When I reopen the file in TreeLine, that part of the tree looks like Figure G.

Figure G
Run a Perl script to merge nodes.
I'm not posting the code here, simply because it's a really quick and dirty solution that may not work in future versions of TreeLine. If someone really wants that code, just let me know -- and, of course, you're welcome to post your solution in the discussion thread below. In any case, rest assured that it is possible to integrate and complement TreeLine with any free text-processing software.
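
As a rough starting point, the merge workaround could be sketched in Python along the following lines. This is not the Perl script mentioned above, and it rests on assumptions that must be checked against a real file: that every tree node is an XML element carrying an `item="y"` attribute, and that a node's text sits in a child element called `Text` (a hypothetical field name). It also assumes a merge run ends at the next marked node, at a sibling that has child nodes of its own, or at the end of the sibling list.

```python
import xml.etree.ElementTree as ET

MARKER = "MERGE_HERE:"
TEXT_TAG = "Text"  # hypothetical name of the field holding a node's text


def is_node(elem):
    # Assumption: tree nodes carry item="y"; field elements do not.
    return elem.get("item") == "y"


def node_text(node):
    field = node.find(TEXT_TAG)
    return (field.text or "") if field is not None else ""


def has_child_nodes(node):
    return any(is_node(child) for child in node)


def merge_marked(parent):
    """Merge each marked node with the sibling nodes that follow it."""
    target = None  # node currently absorbing the siblings after it
    for child in list(parent):  # iterate over a copy, since we remove items
        if not is_node(child):
            continue
        text = node_text(child)
        if text.startswith(MARKER):
            # Start a new run and strip the marker from the merged node.
            target = child
            child.find(TEXT_TAG).text = text[len(MARKER):].lstrip()
        elif target is not None and not has_child_nodes(child):
            # Append this sibling's text to the run, then drop the sibling.
            field = target.find(TEXT_TAG)
            field.text = field.text + "\n" + text
            parent.remove(child)
        else:
            target = None  # the run is over
    # Repeat the same procedure one level down.
    for child in parent:
        if is_node(child):
            merge_marked(child)


# Invented sample data; a real TreeLine file will use different names.
SAMPLE = """<ROOT item="y"><Text>Notes</Text>
<NOTE item="y"><Text>Intro</Text></NOTE>
<NOTE item="y"><Text>MERGE_HERE: first</Text></NOTE>
<NOTE item="y"><Text>second</Text></NOTE>
<NOTE item="y"><Text>third</Text></NOTE>
</ROOT>"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    merge_marked(root)
    print(ET.tostring(root, encoding="unicode"))
```

For a file on disk, `ET.parse(path)` followed by `merge_marked(tree.getroot())` and `tree.write(new_path, encoding="utf-8", xml_declaration=True)` would apply the same logic. Work on a copy of the file, since a wrong guess about the element names could silently leave the tree unchanged or damage it.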