What is Genomic Spiral Sieve

Welcome to the genomic-spiral-sieve wiki!

#Motivation

Genomic medicine is the eventual future of intelligent healing. It could find cures for diseases that our generation believed to be incurable.

#Introduction

Living organisms have DNA.

DNA is made of a long chain of nucleotides. This chain is often referred to as a DNA sequence.

The following image is a textual representation of what a DNA sequence looks like:

#Problem

These DNA sequences are very long. Comparing two DNA sequences, or finding a disease-causing sub-sequence in a person's DNA, takes too much time.

#Solution

In information theory, there is the concept of compression.

There are two basic types of information compression: lossless, which lets the original data be reconstructed exactly, and lossy, which discards some information to save more space.

The well-known ZIP file format is an application of lossless compression: it reduces a file's size on the file system without losing any data.

The first step of Genomic Spiral Sieve is to compress DNA sequences without any loss.
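As a small illustration of lossless compression in general (this is not the GSS algorithm), the sketch below uses Python's standard `zlib` module to compress a DNA string and recover it exactly. The sample sequence is made up for the example.

```python
import zlib

# Hypothetical sample sequence, repeated so the compressor has patterns to exploit.
sequence = "ATGCGATTACAGGCTA" * 64
raw = sequence.encode("ascii")

# Compress, then decompress and check that nothing was lost.
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed).decode("ascii")

print(len(raw), "bytes before,", len(compressed), "bytes after")
print("lossless:", restored == sequence)
```

The round trip returns the original string unchanged, which is the property GSS relies on.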

What is Genomic Spiral Sieve

A genome is a sequence of nucleotides: A, T, G and C. Spiral sieves are mathematical systems that generate numbers following a specific pattern.

Genomic Spiral Sieve (GSS) is a system that provides a compressed analysis mechanism for quick operations on large DNA sequences using information compression techniques.

The 7-bit ASCII character codes for A, T, G and C are:

A 1000001

T 1010100

G 1000111

C 1000011
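The codes listed above can be reproduced with a few lines of Python. This is only a convenience check; the variable name `ascii_codes` is chosen for the example.

```python
# Map each nucleotide to its 7-bit ASCII code, written as a binary string.
ascii_codes = {n: format(ord(n), "07b") for n in "ATGC"}

for nucleotide, bits in ascii_codes.items():
    print(nucleotide, bits)
```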

Compression Algorithm

A simple hash compression that maps each nucleotide to a fixed 2-bit code:

A 1000001 after compression 00

T 1010100 after compression 01

G 1000111 after compression 10

C 1000011 after compression 11

So a simple ATGC sequence becomes 00011011 after compression.

To decompress, read the bits two at a time: 00 01 10 11 = A T G C
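The mapping above can be sketched in Python as follows. The names `ENCODE`, `DECODE`, `compress` and `decompress` are hypothetical, and the compressed form is kept as a string of `0` and `1` characters for readability rather than as packed bits.

```python
# 2-bit code for each nucleotide, as given in the table above.
ENCODE = {"A": "00", "T": "01", "G": "10", "C": "11"}
DECODE = {bits: nucleotide for nucleotide, bits in ENCODE.items()}


def compress(sequence: str) -> str:
    """Replace every nucleotide with its 2-bit code."""
    return "".join(ENCODE[nucleotide] for nucleotide in sequence)


def decompress(bits: str) -> str:
    """Read the bit string two bits at a time and map each pair back."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(compress("ATGC"))        # 00011011
print(decompress("00011011"))  # ATGC
```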

What's the value?

Less Space

Instead of using 8 bits (one ASCII byte) to represent a nucleotide, GSS uses just 2 bits.

That's 6 bits saved for a single character. One typical human DNA might have close to 6 billion of them.

Now imagine the space saved for a whole DNA database!
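To make the saving concrete, here is a minimal sketch that packs four nucleotides into each byte and compares the sizes. It assumes one 8-bit byte per character for the uncompressed form; the `pack` function and the sample sequence are hypothetical.

```python
CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}


def pack(sequence: str) -> bytes:
    """Pack nucleotides four to a byte, first nucleotide in the highest bits."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, nucleotide in enumerate(sequence):
        shift = 6 - 2 * (i % 4)
        packed[i // 4] |= CODE[nucleotide] << shift
    return bytes(packed)


sequence = "ATGC" * 1000
ascii_size = len(sequence.encode("ascii"))  # one byte per nucleotide
packed_size = len(pack(sequence))           # one byte per four nucleotides
print(ascii_size, "bytes as ASCII,", packed_size, "bytes packed")

# Rough estimate for 6 billion nucleotides under the same assumption.
nucleotides = 6_000_000_000
print(nucleotides, "bytes as ASCII,", nucleotides * 2 // 8, "bytes packed")
```

When the sequence length is not a multiple of four, the last byte is padded with zero bits, which are indistinguishable from `A`. A real format would need to store the sequence length alongside the packed data.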

Less Time

Comparing two ASCII characters, say A and A, means comparing all 8 bits of their character codes.

After GSS compression, only 2 bits per nucleotide need to be compared.

A CPU compares whole machine words rather than single bits, so the saving shows up as throughput: a 64-bit word holds 8 ASCII characters but 32 GSS-encoded nucleotides, which means up to four times as many nucleotides per comparison instruction.

I leave it to the reader to compute the total number of instructions that could be saved if a GSS-based platform is used for DNA searches.
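One way to picture comparison on compressed data is to pack each sequence into an integer and XOR the two values: any non-zero 2-bit group marks a differing nucleotide. The sketch below illustrates the idea only. Python integers are arbitrary precision, so it says nothing about real instruction counts, and the names `pack_int` and `mismatches` are hypothetical.

```python
CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}


def pack_int(sequence: str) -> int:
    """Pack a sequence into one integer, 2 bits per nucleotide."""
    value = 0
    for nucleotide in sequence:
        value = (value << 2) | CODE[nucleotide]
    return value


def mismatches(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    diff = pack_int(a) ^ pack_int(b)  # zero bits wherever the sequences agree
    count = 0
    while diff:
        if diff & 0b11:  # a non-zero 2-bit group is a differing nucleotide
            count += 1
        diff >>= 2
    return count


print(mismatches("ATGC", "ATGC"))  # 0
print(mismatches("ATGC", "TTGG"))  # 2
```

Checking for plain equality needs only `pack_int(a) == pack_int(b)`, a single comparison of the packed values.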

Tip of the iceberg

The GSS compression algorithm is just the base. There is a long way to go.

Saving 6 bits per nucleotide is good. Saving comparison instructions is good. This is just the tip of the iceberg.

Could we create a super-fast, Unix "grep"-like string search algorithm that leverages the high-concurrency abilities of languages like Go?

Would we be able to check a given DNA for all disease-causing genome sequences in seconds? That's GSS's ultimate goal!
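The grep-like search raised above could be structured roughly as follows: split the sequence into chunks, search each chunk concurrently, and let every chunk overlap the next by one less than the pattern length so that matches straddling a boundary are not missed. The source suggests Go for this; the sketch uses Python's standard `concurrent.futures` module only to show the chunking idea, and it works on plain text rather than GSS-compressed data. The function name `find_all` and its parameters are hypothetical, and Python threads will not deliver the parallel speed-up a Go implementation might.

```python
from concurrent.futures import ThreadPoolExecutor


def find_all(sequence: str, pattern: str, chunk_size: int = 1000, workers: int = 4) -> list:
    """Return the start index of every (possibly overlapping) match of pattern."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    overlap = len(pattern) - 1  # lets a match straddle a chunk boundary

    def search(start: int) -> list:
        # The window holds every match that begins inside this chunk.
        window = sequence[start:start + chunk_size + overlap]
        hits = []
        pos = window.find(pattern)
        while pos != -1:
            hits.append(start + pos)
            pos = window.find(pattern, pos + 1)
        return hits

    starts = range(0, len(sequence), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(search, starts))  # results keep chunk order
    return [hit for chunk_hits in results for hit in chunk_hits]


print(find_all("ATGCATGC", "ATG", chunk_size=3))  # [0, 4]
```

Because a window extends only `len(pattern) - 1` characters past its chunk, each match is reported by exactly one chunk and no de-duplication is needed.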

thinkination/genomic-spiral-sieve

Folders and files

Latest commit

History

Repository files navigation

What is Genomic Spiral Seive

Compression Algorithm

What's the value?

Lesser Space

Lesser Time

Tip of the ice-berg

About

Resources

License

Stars

Watchers

Forks

Languages