From a computational point of view,
deoxyribonucleic acid (DNA) is a long string over a four-letter alphabet
{A, C, G, T}. From a biological point of view, DNA is the hereditary
molecule in almost all living things. It contains the instructions for the
growth, functioning and reproduction of an organism. DNA is made up of
four chemical nucleotides: adenine (A), guanine (G), cytosine (C) and thymine
(T). A DNA molecule has a double-stranded structure in which the two strands
are held together by pairing between their bases, A with T and C with G. These
pairs are called base pairs, and a series of base pairs makes up a DNA
fragment. The procedure of determining the order of nucleotides in a DNA
fragment is called sequencing. Present technologies cannot read the entire
sequence of a genome at once; they can only sequence small DNA fragments
consisting of a few hundred nucleotides.
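
The string view and the base-pairing rule can be made concrete with a minimal
Python sketch. The function names (`complement`, `reverse_complement`) and the
example strand are illustrative assumptions; only the pairing rule itself
(A with T, C with G) comes from the text.

```python
# Base-pairing rule from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of each nucleotide in the strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in the reverse direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    strand = "ATCG"  # hypothetical example fragment
    print(complement(strand))
    print(reverse_complement(strand))
```

Treating a strand as a plain string keeps the sketch short; real tools usually
also handle ambiguous bases such as N, which this example does not.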

In 1977 Frederick Sanger
developed a sequencing technology known as Sanger sequencing. Although this
technology was refined continually over more than 30 years, it can sequence
only about 500 to 1000 base pairs (bp) of DNA at a time. Producing reads with
the Sanger technique is slow and costly. Nevertheless, the technique dominated
genome sequencing for over two decades and led to a number of milestones,
including the completion of the human genome sequence. Sanger
sequencing is generally known as the first-generation sequencing technique (Sanger et al., 1977).

The Human Genome Project (HGP) was
an international research project that started in 1990. Its goal was to
sequence the entire human genome. The first draft of the human genome was
announced in 2000, and the project was finished in 2003, when the first
complete human genome sequence was published. About one third of the HGP
sequences were produced using the Sanger sequencing technique. Sanger
sequencing also played an important role in obtaining the DNA sequence of the
mouse (Collins et al., 2001).

DNA sequencing has driven
recent advances in science and technology. It is widely used in applied
fields such as medicine, genetic engineering and food science (Sperber
et al., 2008). At present, Next Generation Sequencing (NGS) is the most
advanced DNA sequencing technology, providing greater accuracy and speed
than the earlier Sanger sequencing (Buermans et al., 2014). Paired-end
sequencing in NGS, which sequences a DNA fragment from both ends (forward and
reverse), has further increased accuracy and made it possible to detect
indels that single-end sequencing could not (Grimm et al., 2013). NGS
produces millions of short reads, and assembling them without a reference
genome is a challenging task for de novo assemblers (Shendure et al., 2008).

In the past few years, several de novo
sequence assembly algorithms have been developed to handle this large amount
of data and assemble it into contigs, but choosing the appropriate assembler
for paired-end or single-end data is still a challenging job (Baker et al.,
2012).

The currently available
assembly algorithms include de Bruijn Graph (DBG), Overlap Layout Consensus
(OLC), string graph, greedy and hybrid algorithms (Miller et al., 2010). DBG is
a graph algorithm based on k-mers: it splits the short reads into smaller
substrings of length k (k-mers), and each k-mer overlaps the next one by k-1
nucleotides. Dividing the sequences into pieces of equal size also helps to
mitigate the problem of differing initial read lengths. OLC is also a
graph-based algorithm; it builds an overlap graph by finding reads whose
sequences overlap (Kang et al., 2013). Finding the overlapping sequences is
usually the slowest part of the assembly. The overlaps are then used to lay
out fragments of the overlap graph and merge them into
contigs.
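
The k-mer splitting used by DBG assemblers can be sketched as follows. This is
a minimal illustration, not the implementation of any particular assembler:
the function names (`kmers`, `de_bruijn_graph`) and the example read are
assumptions made here. Nodes are the (k-1)-mers and each k-mer contributes one
edge from its prefix to its suffix.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    """Split a read into all substrings of length k, shifted by one base."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for kmer in kmers(read, k):
            # Consecutive k-mers share k-1 bases, so prefix -> suffix is an edge.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    print(kmers("ATGGC", 3))
    print(de_bruijn_graph(["ATGGC"], 3))
```

Repeated k-mers are kept as repeated edges in this sketch; practical
assemblers additionally track k-mer counts and correct sequencing errors,
which is omitted here.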

DBG algorithms are fast, whereas OLC
algorithms perform better on longer sequence reads. The string graph algorithm
is a variant of OLC that builds a global overlap graph and simplifies it by
eliminating redundant sequences (Li et al., 2012). Greedy algorithms start by
joining the short reads with the best overlaps to produce contigs. Most
greedy assemblers use heuristics designed to avoid
misassembly of repetitive sequences (Pop et al., 2002). A hybrid assembly
algorithm combines several assembly algorithms. It is used to
reduce the number of contigs and errors produced by the individual algorithms
(Koren et al., 2012).
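
The greedy strategy of repeatedly joining the best-overlapping reads can be
sketched in a few lines. This is a toy illustration under stated assumptions:
the names `overlap` and `greedy_assemble`, the `min_len` threshold and the
example reads are all hypothetical, overlaps must be exact, and the repeat
handling heuristics that real greedy assemblers rely on are left out.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no remaining overlap of at least min_len
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGGC", "GGCTA", "CTAAG"]))
```

The all-pairs overlap search is quadratic in the number of reads, which
reflects the earlier remark that finding overlaps tends to be the slowest part
of assembly; the sketch makes no attempt to speed this up.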