Synchronizing a subset of (text) files
Like many people, I store most of my data in plain text files. Because of the tools I use (Notational Velocity, or rather, the alt version), I store them all in a single folder. It works great: my data is only a single key combination away, ready to be searched without leaving the keyboard.
There is a small caveat, though: as I’ve been doing this for a while (and migrating old data to this system), I now have many text files: 1134 at the time of this writing. Even though this does not cause any trouble with NV, the same cannot be said of iPhone applications. I have tried both Simplenote and Elements, and both present the same issue: upon starting them, it takes a while before I’m able to reliably search for information, typically the time needed to finish synchronizing way too many files.
Now I really appreciate being able to edit and read notes on the go, and I’m ready to restrict myself to a few notes that I would keep for quick reference and editing. It would also be great to have access to all my notes when I need to find something a little less common, even if it were in another application. This post is about how to achieve this.
The basic idea is fairly simple: keep a copy of a subset of my text files in another folder. It needs to be a copy, so that I can access the whole set with NV without any trouble. This of course raises three main issues, which I will address in turn:
- how to tell the system which files to copy;
- how to make sure the contents of these files stay synchronized;
- how to add files to, or remove files from, the copy.
Specifying the files to copy
The first one is easy. I use a variant of Merlin Mann’s ‘q’ trick, except I’m using ‘a’. (It’s a nice key: it’s in the same position in Dvorak and Qwerty keyboards…) More precisely, at the end of the file name (right before the extension), I put 3 to 5 ‘a’ between brackets to indicate that the file is currently needed. These are the files I copy to the small data set (SD in the rest of this post).
There is a little bit more to it than this: not only are these the files I keep a copy of in SD, but I will also preserve the invariant that only files ending with aaa] are there. If some files in SD do not end with aaa], we need to do something special.
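As a rough illustration of the naming convention, here is a minimal bash sketch that tests whether a file name carries the marker. The function name is_marked and the sample file names are made up for the example, and it assumes the marker sits right before the extension, as described above.

```shell
#!/bin/bash
# is_marked NAME: succeed if NAME ends with [aaa], [aaaa] or [aaaaa]
# right before its extension. Hypothetical helper, not part of any tool.
is_marked() {
    local name="${1##*/}"            # drop any leading directory
    local re='\[a{3,5}\]\.[^.]+$'    # 3 to 5 "a" in brackets, then .ext
    [[ "$name" =~ $re ]]
}

# Sample names, invented for the example.
for f in "groceries [aaa].txt" "talk ideas [aaaaa].md" "old report.txt"; do
    if is_marked "$f"; then echo "sync: $f"; else echo "skip: $f"; fi
done
```

The first two names are reported as sync and the third as skip. This check is stricter than what the rest of the post relies on, where any name ending in aaa] before the extension counts.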
Synchronizing the contents of these files
File content synchronization is not too tricky. I’ve been using the Unison file synchronizer to keep my data synchronized between machines for many years. And the great thing is that it works just as well locally. Here is the configuration file I use.
root = /Users/schmitta/Documents/Dropbox/Elements
root = /Users/schmitta/Documents/Dropbox/Text
ignore = Name ?*
ignore = Name .*
ignorenot = Name *aaa].txt
ignorenot = Name *aaa].md
backup = Name *
maxbackups = 100
The first two lines give the folders I want to synchronize. The “Elements” one is SD, the “Text” one contains everything (let’s call it “Big Data Set” or “BD”).
The next batch of lines tells Unison to first ignore every file when synchronizing (thus doing nothing), then to not ignore files that end with aaa]. I could be more lax and allow every extension, but I’m using only the txt and md extensions at the moment.
The last batch of lines tells Unison to keep a backup every time it makes a change, and to keep up to 100 versions of each file. It may seem like a lot, but these are tiny text files. (I just checked and I have 2.3 MB of backups for 5.6 MB of text files.)
The next problem is making sure Unison is run each time a file changes. The “brute force” approach would be to use a cron job (or some launchd variant) to run it every minute or so. One could also use the -repeat n option that runs Unison continuously, pausing for n seconds between each run. I wanted to be smart, so I simply set up Hazel to monitor these folders: each time a file is modified there, it calls a script to run Unison.
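For readers without Hazel, the “run something when a folder changes” idea can be approximated with a small polling loop. This is only a sketch under stated assumptions: changed_since and watch_dir are hypothetical names, the 60-second interval in the usage comment is arbitrary, and it relies on nothing more than the standard find -newer test against a timestamp file.

```shell
#!/bin/bash
# changed_since STAMP DIR: succeed if DIR, or anything inside it, was
# modified more recently than the file STAMP.
changed_since() {
    local stamp="$1" dir="$2"
    [ -n "$(find "$dir" -newer "$stamp" 2>/dev/null)" ]
}

# watch_dir DIR INTERVAL COMMAND...: poll DIR every INTERVAL seconds and
# run COMMAND after each detected change. Loops until interrupted.
watch_dir() {
    local dir="$1" interval="$2" stamp
    shift 2
    stamp="$(mktemp)" || return 1
    while true; do
        if changed_since "$stamp" "$dir"; then
            # Refresh the stamp first, so that changes made while the
            # command runs are picked up on the next round.
            touch "$stamp"
            "$@"
        fi
        sleep "$interval"
    done
}

# Hypothetical usage (not run here):
#   watch_dir "$HOME/Documents/Dropbox/Text" 60 ~/bin/unison_text.sh
```

Since the synchronization itself may touch the watched folder, the loop can trigger one extra run that finds nothing to do. With Hazel, no such polling is needed: a rule simply calls a script whenever a file changes.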
Why a script, you may ask? Because we need to make sure our invariant is preserved before we do anything, because there are some options we need to set, and because we may want to keep a log of what happens. This is what my script looks like.
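#!/bin/bash
DATE=`date +"%Y.%m.%d"`
LOGFILE="$HOME/Documents/Dropbox/Inbox/$DATE-Rlog-Text Sync Log.txt"
# The small data set (SD), i.e. the first root of the Unison configuration.
ELEMENTS="$HOME/Documents/Dropbox/Elements"
echo "Sync Elements <--> Text at `date`" >> "$LOGFILE"
# Check the invariant: bail out if any file in SD lacks the aaa] marker.
if [ -n "$(ls "$ELEMENTS" | grep -v 'aaa]')" ]; then
    echo "* Some files do not end with aaa].
* We need to wait for Hazel first." >> "$LOGFILE"
    echo "" >> "$LOGFILE"
    exit 1
fi
# -batch: ask no questions; -terse: only report what was synchronized.
~/bin/unison -batch -terse text >> "$LOGFILE"
echo "" >> "$LOGFILE"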
(If this is ugly, don’t hesitate to drop me a line to help me improve it…)
This script uses a specially named file to log what is going on. It first checks whether our invariant is preserved, and bails out if it is not (we’ll see in the next step how to preserve the invariant). The options Unison is called with are -batch (ask no questions at all) and -terse (do not print status messages, just the synchronizations that are made). The last argument, text, is the name of the configuration file to use, the one I showed above. With these options, Unison runs without supervision. If a file has changed on both sides, it’s a conflict: the files will not be changed and the conflict will be indicated in the log.
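The invariant check in the script above works on the output of ls and grep. A slightly more explicit variant, which also reports the offending names, could look like the sketch below. The function names unmarked_files and invariant_holds are hypothetical, and the two glob patterns simply mirror the ignorenot lines of the configuration file.

```shell
#!/bin/bash
# unmarked_files DIR: print the name of every regular file in DIR that
# does not end with aaa].txt or aaa].md (same patterns as the
# ignorenot lines of the Unison configuration).
unmarked_files() {
    local dir="$1" f name
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # also skips an unmatched glob
        name="${f##*/}"
        case "$name" in
            *'aaa].txt'|*'aaa].md') ;;   # marked: nothing to report
            *) printf '%s\n' "$name" ;;
        esac
    done
}

# invariant_holds DIR: succeed if every file in DIR is marked.
invariant_holds() {
    [ -z "$(unmarked_files "$1")" ]
}

# Hypothetical usage (not run here):
#   invariant_holds "$HOME/Documents/Dropbox/Elements" || exit 1
```

Unlike the grep version, this only accepts the marker at the very end of the name, right before a txt or md extension.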
So I am now able to synchronize a subset of these files. I next need to make sure everything will work fine when I want to add files to the subset or remove files from it.
Adding and removing synchronized files
For this last part, a little knowledge of how Unison works is required. Unison keeps an archive, which maps file paths to content hashes. To detect changes at a path, Unison hashes the file and compares the result to the value in the archive. If it’s different, the file has changed. (There are of course many optimizations to avoid hashing everything all the time.) What’s more interesting is dealing with file creation and file deletion. If a file path is present in the archive but not on the file system, it means it has been deleted, so the deletion is propagated to the synchronized folder. Conversely, if a file path is in the file system but not in the archive, the file has been created, and this new file is then created in the synchronized folder.
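To make the archive idea concrete, here is a toy bash sketch of that kind of change detection. It is not how Unison is actually implemented: the “archive” is a plain text file, cksum stands in for a real hash, and the function names snapshot and detect_changes are made up for the example.

```shell
#!/bin/bash
# snapshot DIR: print one "checksum<TAB>name" line per regular file in DIR.
# The saved output plays the role of the archive.
snapshot() {
    local dir="$1" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s\t%s\n' "$(cksum < "$f" | cut -d' ' -f1)" "${f##*/}"
    done
}

# detect_changes ARCHIVE CURRENT: compare two snapshot files and report
# which names were created, changed, or deleted since ARCHIVE was taken.
detect_changes() {
    awk -F'\t' '
        FILENAME == ARGV[1] { old[$2] = $1; next }   # load the archive
        {
            seen[$2] = 1
            if (!($2 in old))       print "created: " $2
            else if (old[$2] != $1) print "changed: " $2
        }
        END {
            for (name in old)
                if (!(name in seen)) print "deleted: " name
        }
    ' "$1" "$2"
}

# Hypothetical usage (not run here):
#   snapshot "$HOME/Documents/Dropbox/Text" > archive.tsv
#   ... edit some files ...
#   snapshot "$HOME/Documents/Dropbox/Text" > now.tsv
#   detect_changes archive.tsv now.tsv
```

A path that is in the archive but no longer on disk shows up as deleted, and a path on disk that the archive does not know about shows up as created, which are exactly the two cases discussed next.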
So let’s look at what happens when we create, delete, or rename files in the subset of data (SD), or in the big data set (BD).
- Creating a file in BD whose name ends with aaa]: Unison will catch this and create a file in SD.
- Deleting a file in BD that ends with aaa]: Unison will delete the file in SD.
- Creating a file in SD that ends with aaa]: Unison will create the file in BD.
- Deleting a file in SD that ends with aaa]: Unison will delete the file in BD. Thus deleting a file in SD does not mean “remove from the data set” but “do delete the file”.
- Creating or deleting a file in BD that does not end with aaa]: this will do nothing in SD.
- Renaming a file in BD so that it ends with aaa]: this will be seen by Unison as deleting the old file (thus no change in SD) and creating a new file (which gets propagated to SD as it has the right name).
- Renaming a file in BD to remove the aaa] suffix: this will delete the file in SD (which is what is wanted: we do not want to have it there anymore).
- Creating a file in SD that does not end with aaa]: this will do nothing (unfortunately): no file will be created in BD.
- Renaming a file in SD to remove the aaa] suffix: this will delete the corresponding file in BD (unfortunately), leaving this file only in SD.
Note that the last two cases are the ones where there are files in SD
that do not obey the invariant (every file ends in aaa]). As we
check the invariant is preserved before running Unison, this is
fine. But we still need to take care of these files using yet another
script.
This script is called with the file name (that does not end with
aaa]) as argument. It basically goes through all the cases (is the
file present on the other side, with or without the aaa] extension)
and changes the file accordingly. Here it is with some comments
inside.
#!/bin/bash
DATE=`date +"%Y.%m.%d"`
LOGFILE="$HOME/Documents/Dropbox/Inbox/$DATE-Rlog-Text Sync Log.txt"
NAME=`basename "$1"`
BNAME="${NAME%.*}"
EXT="${NAME##*.}"
THISDIR="$HOME/Documents/Dropbox/Elements"
OTHERDIR="$HOME/Documents/Dropbox/Text"
OTHERFILE="$OTHERDIR/$NAME"
NEWNAME="$BNAME [aaa].$EXT"
NEWNAME2="$BNAME [aaaa].$EXT"
NEWNAME3="$BNAME [aaaaa].$EXT"
OTHERNEWFILE="$OTHERDIR/$NEWNAME"
OTHERNEWFILE2="$OTHERDIR/$NEWNAME2"
OTHERNEWFILE3="$OTHERDIR/$NEWNAME3"
echo "Running Hazel_sync at `date`" >> "$LOGFILE"
if [ -f "$OTHERFILE" ]; then
    # This should occur only if the user creates files on both sides
    # simultaneously.
    echo "*** Something is wrong: simultaneous file sync start: $NAME" >> "$LOGFILE"
    diff -q "$1" "$OTHERFILE" > /dev/null
    if [ $? = 1 ]; then
        echo "*** Something is very wrong: the files differ.
*** Not renaming the other one.
*** It will be duplicated." >> "$LOGFILE"
    else
        mv "$OTHERFILE" "$OTHERNEWFILE"
    fi
    mv "$1" "$THISDIR/$NEWNAME"
else
    # This is the normal case. First we check whether the file was being
    # synchronized. If so, it means that we stop the synchronization.
    if [[ -f "$OTHERNEWFILE" || -f "$OTHERNEWFILE2" || -f "$OTHERNEWFILE3" ]]; then
        echo "File sync stop: $NAME" >> "$LOGFILE"
        if [ -f "$OTHERNEWFILE" ]; then FILENAME="$OTHERNEWFILE"; fi
        if [ -f "$OTHERNEWFILE2" ]; then FILENAME="$OTHERNEWFILE2"; fi
        if [ -f "$OTHERNEWFILE3" ]; then FILENAME="$OTHERNEWFILE3"; fi
        diff -q "$1" "$FILENAME" > /dev/null
        if [ $? = 1 ]; then
            echo "*** Something is wrong: the files differ.
*** Keeping both files." >> "$LOGFILE"
            cp "$1" "$OTHERFILE-CONFLICT-`date`"
        fi
        mv "$FILENAME" "$OTHERFILE"
        rm -f "$1"
    else
        # There is no file on the other side, so it was created and now
        # needs to be propagated and synchronized.
        echo "File creation: $NAME" >> "$LOGFILE"
        mv "$1" "$THISDIR/$NEWNAME"
    fi
fi
~/bin/unison_text.sh
To call this script, I use Hazel. Any file in SD that does not have
the aaa] extension is passed to this script.
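The renaming step in the script above relies on bash parameter expansion to split a name into base and extension. Here is that step in isolation, as a small sketch. The function name marked_name and the sample names in the usage comments are hypothetical.

```shell
#!/bin/bash
# marked_name PATH: print the file name with " [aaa]" inserted right
# before the extension, as the script above does with BNAME and EXT.
marked_name() {
    local name="${1##*/}"      # strip the directory part
    local base="${name%.*}"    # everything before the last dot
    local ext="${name##*.}"    # everything after the last dot
    printf '%s [aaa].%s\n' "$base" "$ext"
}

# Hypothetical usage (not run here):
#   marked_name "/some/folder/shopping list.txt"   # shopping list [aaa].txt
#   marked_name "notes.v2.md"                      # notes.v2 [aaa].md
```

One caveat of this approach: a name without any dot has no extension to split off, so base and ext both end up equal to the whole name. This is fine as long as every note has a txt or md extension.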
Last words
Well, this was much longer than I thought. And looking at it, it may not be simple enough. But it works great here, and I can enjoy having twenty text files in Elements and, when I really need them, more than a thousand text files in Simplenote, all in sync. Thanks for reading this far (if anyone does…)