Thursday, 22 August 2013

FIFOs and Mapping with Bowtie using Gzipped Fastq files

So this will be a short one, but I didn't find any good solution to this problem online. I have compressed Fastqs that need Bowtie mapping for the genotyping of CNVs (Copy Number Variations) using CNVer. CNVer is a nice genotyping algorithm for CNVs that uses information from the read-depth and  paired-end mappings, but it needs a very specific mapping of the reads using the Bowtie aligner.  A minor technical difficulty arises from the fact that Bowtie doesn't understand gzipped files and requires vanilly Fastq files. Instead of running a gunzip, then Bowtie, then gzip, thus writing lots of stuff on the disk and suffering from the associated IO overhead, I wanted to do all this in a single step.

Another complication comes from the fact that Bowtie can't read data from the stdin, which prevents me from doing something like zcat my_compressed_fastq.fastq.gz | bowtie.
So the simple solution to this problem is to use named pipes, also known as FIFOs (First In First Out). Conceptually, they are the same as queues in real life. Let's say I'm waiting to pay for groceries at the store. If I was the first in line, I will be the first out of the line, unless someone cuts the line, but bytes never do that because they are respectful little creatures.

Concretely, this is done by using the mkfifo name_of_the_queue.fifo command in Linux or Mac OS. This command will create a queue on the filesystem that acts just like a regular file: we can write to it and read from it, but with the constraint of not being able to seek generic positions from it. In our case, we will create a queue, write the decompressed Fastq, and tell the second process in the pipeline (Bowtie) to read it's input from the previously create file-like-thingy (file-like-thingy == queue == FIFO).

mkfifo my_queue.fifo

zcat my_compressed_file.fastq.gz > my_queue.fifo &
bowtie reference.fasta my_queue.fifo
rm my_queue.fifo

The amperstand on line 2 is meant to start the zcat in a process without blocking the terminal. This way, Bowtie can start streaming from the queue as it is getting written.

To conclude, this method is efficient for two reasons. The first one is that there is no need to recompress the file using gzip as it would've been necessary to do if gunzip had been used. The second is that the decompression occurs concurrently with the mapping.
It is also important to note that since we're using a queue, it is impossible to "go back" in the file. This will only work if the file is read from start to end without ever looking back (you should never look back).

This was a simple article, but I hope it helps some of you saving some time and space.

Sunday, 27 January 2013

Computing Fibonacci

The Fibonacci sequence is often used to introduce the concept of recursion to new programmers. It has also been given a lot of interest because of his relationship to φ (phi), the mathematical constant that has often been observed in nature such as a ratio describing the arrangement of a pine cone, the spiral of snail's shells and other natural phenomena. Here, I will share an experiment I made where different algorithms computing the Fibonacci sequence were implemented in Python and compared.

The Naive recursion

The mathematical definition of Fibonacci is the following:

\[ f(n)=\begin{cases}
0 & if\ n=0\\
1 & if\ n=1\\
f(n-1)+ f(n-2) & if\ n\text{>1}
\end{cases} \]

The first terms of the sequence are: \(0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...\)

To compute this beautiful sequence, a programmer will be tempted to write a naive implementation that is very similar to the mathematical definition of the sequence. I wrote a sample implementation of this algorithm in Python:

def fibo_rec(n):
    """Naive implementation using the mathematical definition of
       Fibonacci numbers. This is by far the slowest method.

    if n <= 1:
        return n
        return fibo_rec(n - 1) + fibo_rec(n - 2)
As you can see, this implementation has the advantage of being really close to it's mathematical English counterpart and the code is an elegant recursion.

EDIT.Some people suggested an implementation similar to the one on the Python tutorial (except for the fact that the one on the tutorial is iterative). According to a comment from Reddit, the "more Pythonic" recursive implementation (see comments) might even be twice as fast. I chose the implementation presented here because it felt more readable to non Pythonists. It is also interesting to note that, in theory, the difference in speed is asymptotically negligible as \( O(n) \in O(2 \cdot n) \).

Unfortunately, this is pretty much the worst implementation one could write. Some even argue that it shouldn't be the de facto standard for teaching recursion in programming or computer science classes, as it currently is the case [1].
The following Figure representing the recursive calls to fibo_rec when fibo_rec(5) is called. It is easy to notice the problem with this solution: the same Fibonacci numbers are computed again and again.
Figure 1. The recursion tree with fibo_rec(5) as the initial call.
In fact, the number of recursive calls for this example is described in Table I.

Table I. Number of recursive calls to various fibo_rec(n).
n Number of calls
0 3*
1 5
2 3
3 2
4 1
5 1
* Note that if f(1) called f(0), as it is the case in a similar recursive implementation of this algorithm, we would count a total of 5 + 3 = 8 calls for f(0).

As the careful reader will notice, the number of recursive calls to different values of n follows a Fibonacci sequence. In this case, this results in a 9 redundant calls, but this number would quickly grow as we compute larger numbers. Nobody likes redundant calls, so we will look into a better solution.

EDIT. Some people were curious about a recursive implementation that memorised the previously generated numbers to avoid call redundancy. I quickly implemented an algorithm exploiting this idea:
def fibo_rec_mem(n, i = 0, j = 1):
    """Recursive version using tail-recursion

    if n <= 1:
        return j
        return fibo_rec_mem(n - 1, j, i + j)
This algorithm's performance is probably very similar to the iterative version that will be discussed shortly. The only problem is that it reaches the maximum recursion depth at n = 994, which makes it very impractical to generate Fibonacci numbers. Also note that the bitbucket repo has been updated with this version.

The intuitive iteration

To improve the previous algorithm's performance, we will memorize the 2 previously generated Fibonacci numbers and compute the sum at every iteration, until we end up with the desired number. The algorithm is the following:
def fibo_iter(n):
    i = 1
    j = 0
    for k in xrange(n):
        j = i + j
        i = j - i
    return j
I chose this elegantly written implementation even if it could be harder to understand. The first statement of the loop computes the Fibonacci number (j) and the second statement memorizes the previous number (i). After n iterations, we return j which is the desired number. This method apparently runs in O(n), but no formal proof will be attempted here. We will now show another interesting property that will allow O(log n) time complexity.

EDIT. Some comments mentioned the dynamic programming solution. This solution, from what I read about it, fills an array (e.g. Python list) with the Fibonacci numbers iteratively using the previously generated values to compute every subsequent number. This solution is very similar to the one presented here, but instead of using O(n) space to store the values in an array, I store what is needed for the computations (i and j) which is O(1) space. The dynamic programming versions is probably more useful if you need to memorize all of the numbers for later use. Jeff Erickson's interesting discussion on the subject is available online [6].

Fun with matrices

The interesting mathematical property we are all so excited about is the following:
\[ \left({
\begin{array}{c} {1} & {1} \\ {1} & {0} \\ \end{array} }\right)^n = \left({ \begin{array}{c} {f_{n+1}} & {f_n} \\ {f_n} & {f_{n-1}} \\ \end{array} }\right) \]
The proof of this identity probably requires mathematical knowledge that is beyond my current capacity, but you can empirically see for yourself using WolframAlpha:
or see fancy maths in Ly et al. [2].

EDIT. Some people mentioned an easy proof by induction of the aforementioned property. I attempted such a proof:
The basis of induction, for n = 1 is the following:
\[ \left(\begin{array}{cc}
1 & 1\\
1 & 0
f_{\text{1+1}} & f_{1}\\
f_{1} & f_{1-1}
f_{\text{2}} & f_{1}\\
f_{1} & f_{0}
Which does not invalidate our hypothesis.
Let's assume, for all n that our induction hypothesis is true and prove for n + 1:
\[ \left(\begin{array}{cc}
1 & 1\\
1 & 0
1 & 1\\
1 & 0
1 & 1\\
1 & 0
f_{\text{n+1}} & f_{n}\\
f_{n} & f_{n-1}
1 & 1\\
1 & 0
by the induction hypothesis
\[ =\left(\begin{array}{cc}
1\cdot f_{n+1}+1\cdot f_{n} & 1\cdot f_{n}+1\cdot f_{n-1}\\
1\cdot f_{n+1}+0\cdot f_{n} & 1\cdot f_{n}+0\cdot f_{n-1}
f_{n}+f_{n+1} & f_{n-1}+f_{n}\\
f_{n+1} & f_{n}
by matrix multiplication
\[ =\left(\begin{array}{cc}
f_{n+2} & f_{n+1}\\
f_{n+1} & f_{n}
\] by the definition of the Fibonacci sequence.
Since the final matrix corresponds to what would be expected for n + 1, our induction hypothesis holds and the proof is complete.

For us, that means that a better algorithm can be implemented using these properties:
def fibo_log(n):
    """This method uses interesting algorithmic strategies
       and is supposed to be O(log n). 

    i = 1
    j = 0
    k = 0
    h = 1
    while n > 0:
        if n % 2 != 0:
            t = j * h
            j = i * h + j * k + t
            i = i * k + t
        t = h ** 2
        h = 2 * k * h + t
        k = k ** 2 + t
        n = int( n / 2 )
    return j
I won't go in the details of this method, as it is visibly more complex than the previous solutions, but I will compare the performance of my different implementations empirically.

Empirical comparison of those beautiful algorithms

In order to compare the algorithms, I wrote a Python function that uses the cProfile module (available in the standard library). The function that allowed me to do so takes a function as an argument, and passes the other arguments and named arguments to the given function.
def time_call(f, *args, **kwargs):
    """Function that measures execution time of a given function f
       with args and kwargs as arguments.


    code = "{0}(".format(f.__name__)
    if args:
        for i in xrange(len(args)):
            if i != 0:
                code += ", "
            if type(args[i]) is str:
                code += "\"{0}\"".format(args[i])
                code += "{0}".format(str(args[i]))

    if kwargs:
        base = ", " if args else ""
        i = 0
        for k, v in kwargs.iteritems():
            if i != 0:
                base = ", "
            if type(v) is str:
                code += "{0}{1}=\"{2}\"".format(base, k, v)
                code += "{0}{1}={2}".format(base, k, str(v))
            i += 1
    code += ")"
This can be used to easily (but probably not optimally) compare different implementations. In this case, it allowed me to compile execution time, denoted T(n) for different Fibonacci numbers fn. The result is shown in Figure 2.
Figure 2. Graph comparing the execution time T(n) in seconds of different implementations of algorithms that compute Fibonacci numbers at different indexes n (notice the log-scale). The observation of Figure 2 show that the first algorithm that was described, the naive recursion is by far the slowest. It becomes unusable (in reasonable time) around n=30 where T(n) = 15.255s. The second slowest algorithm is the iterative version, but still is a better choice than the first implementation. It takes around 37s to compute the 786 432th Fibonacci number. Finally, the best implementation (shown here) is the one based on the matrix identity. This final implementation took 32s to compute the 9 000 000th Fibonacci number. Note that tests were ran on a virtual machine (using VirtualBox with 2048MB of RAM and 4 accessible CPU cores.

Finally, I just wanted to briefly mention another method that is really fast, but was too complicated to implement correctly for this brief article. The Fibonacci recurrence can be expressed as a simple expression known as Binet's formula: \[f_n=\frac{1}{\sqrt{5}}(\varphi^n-\varphi'^n)\] where \( \varphi=\frac{1+\sqrt{5}}{2},\ \varphi'=\frac{-1}{\varphi} \).
A variation of this formula can be used to compute Fibonacci numbers:
SQRT5 = 5 ** 0.5
PHI = (1 + SQRT5) * 0.5
def fibo_lin(n):
    """This method uses Binet's formula, which is mathematically
       correct and very fast, but is not easily implemented
       because of overflow errors at the PHI ** n step.

       Also, floating number arithmetic seems to introduce
       errors when n > 70.
    if n == 0: return n
    return round(PHI ** n / SQRT5)
This is actually very, very fast. The only problem comes from the handling of large floating point numbers, which makes it hard to compute numbers over n ≥ 70 without rounding errors. A solution to this problem exists, as described in Robin Houston's blog [3]. This is probably the best way to compute the Fibonacci numbers.

To conclude, I wrote this article to give a real-life perspective on different approaches to the computation of the Fibonacci sequence. If you're interested in reading more about Fibonacci, you can look at the Cited Works section, or these excellent Wikipedia articles on the subject [4, 5]. I hope it helped some of you and, I'm open to your feedback or suggestions. Also note that all the code that is described here is available on my Bitbucket page.

Cited Works