About K-mers

Nucleotide k-mers are generated by taking a DNA or RNA sequence and breaking it into overlapping frames of the same size, k. The first k-mer starts at the first nucleotide and spans k nucleotides. The second k-mer starts at the second nucleotide and spans the same number of nucleotides. The process terminates when the end of a k-mer reaches the last nucleotide in the sequence, so a sequence of length L yields L - k + 1 k-mers. For example, with k = 3 the sequence ACGTA yields ACG, CGT, and GTA.

K-merized sequences can be treated numerically. Since there are four bases (ACGT in DNA, ACGU in RNA), each k-mer can be treated as a base-4 number with k digits. For k-mer size k, there are 4^k possible k-mers. Each one is an integer, and it can be represented in decimal, binary, hexadecimal, or any other integer numbering system. Computer programs can compare integers very quickly, so representing sequences this way makes comparing and storing patterns very fast.

Given a k-mer size that is large enough, the patterns found in DNA and RNA sequences tend to be specific to their source. If time is taken to build a database of known k-mers, comparing real DNA and RNA sequences to the database is much faster than whole genome alignments. If multiple organisms are searched, this approach may be several orders of magnitude faster.
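
The k-mer generation and base-4 encoding described above can be sketched in a few lines of Java. This is a minimal illustration, not KAnalyze code: the class name KmerSketch, the method names, and the digit assignment (A=0, C=1, G=2, T/U=3) are assumptions made for the example. Each k-mer is packed into a long at two bits per base, which limits this sketch to k of 31 or less.

```java
import java.util.ArrayList;
import java.util.List;

public class KmerSketch {

    // Assumed digit assignment for the example: A=0, C=1, G=2, T or U=3.
    public static int baseCode(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T':
            case 'U': return 3;
            default:
                throw new IllegalArgumentException("Unsupported base: " + base);
        }
    }

    // Return every overlapping k-mer of the sequence as a base-4 integer.
    public static List<Long> encodeKmers(String sequence, int k) {
        if (k < 1 || k > 31) {
            throw new IllegalArgumentException("k must be between 1 and 31");
        }
        List<Long> kmers = new ArrayList<>();
        long mask = (1L << (2 * k)) - 1; // keeps only the last k bases
        long value = 0;
        for (int i = 0; i < sequence.length(); i++) {
            // Shift in the next base and drop the base that left the frame.
            value = ((value << 2) | baseCode(sequence.charAt(i))) & mask;
            if (i >= k - 1) {
                kmers.add(value); // a full k-mer ends at position i
            }
        }
        return kmers;
    }

    public static void main(String[] args) {
        // ACG, CGT, GTA as base-4 integers
        System.out.println(encodeKmers("ACGTA", 3)); // prints [6, 27, 44]
    }
}
```

Because each new k-mer is derived from the previous one with a shift and a mask, the whole sequence is processed in a single pass without creating substrings.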

KAnalyze

KAnalyze is a set of tools designed to do k-mer analysis.

Tools:

  • KWrite: Given a set of sequences, write them to a database file and associate them with an organism. This tool is used to build the k-mer database.
  • KTag: Given a set of reads and a k-mer database, tag each read by matching the k-mer patterns against organisms in the database.
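
The write-then-tag workflow of the two tools can be illustrated with a toy in-memory version. Everything in this sketch is hypothetical: the class TagSketch, its write and tag methods, and the HashMap standing in for the database file are assumptions for the example and do not describe the KAnalyze API or file format. For simplicity, a k-mer seen in more than one organism stays associated with the first organism written.

```java
import java.util.HashMap;
import java.util.Map;

public class TagSketch {

    // Hypothetical in-memory stand-in for the k-mer database:
    // encoded k-mer -> organism label.
    private final Map<Long, String> database = new HashMap<>();
    private final int k;

    public TagSketch(int k) {
        if (k < 1 || k > 31) {
            throw new IllegalArgumentException("k must be between 1 and 31");
        }
        this.k = k;
    }

    // Assumed digit assignment: A=0, C=1, G=2, T or U=3.
    private static int baseCode(char base) {
        char b = Character.toUpperCase(base);
        int code = "ACGT".indexOf(b == 'U' ? 'T' : b);
        if (code < 0) {
            throw new IllegalArgumentException("Unsupported base: " + base);
        }
        return code;
    }

    // Encode all overlapping k-mers of a sequence as base-4 integers.
    private long[] encode(String sequence) {
        long[] out = new long[Math.max(0, sequence.length() - k + 1)];
        long mask = (1L << (2 * k)) - 1;
        long value = 0;
        for (int i = 0; i < sequence.length(); i++) {
            value = ((value << 2) | baseCode(sequence.charAt(i))) & mask;
            if (i >= k - 1) {
                out[i - k + 1] = value;
            }
        }
        return out;
    }

    // Analogous to the database-building step: record each k-mer of a
    // known sequence under its organism (first organism wins here).
    public void write(String sequence, String organism) {
        for (long kmer : encode(sequence)) {
            database.putIfAbsent(kmer, organism);
        }
    }

    // Analogous to the tagging step: count how many k-mers of the read
    // match each organism in the database.
    public Map<String, Integer> tag(String read) {
        Map<String, Integer> hits = new HashMap<>();
        for (long kmer : encode(read)) {
            String organism = database.get(kmer);
            if (organism != null) {
                hits.merge(organism, 1, Integer::sum);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TagSketch sketch = new TagSketch(3);
        sketch.write("ACGTAC", "virusA");
        sketch.write("GGGCCC", "virusB");
        System.out.println(sketch.tag("TACGTG")); // prints {virusA=3}
    }
}
```

Each lookup is a single integer hash probe, which is why matching reads against a prebuilt k-mer database can be much cheaper than aligning them.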

KAnalyze is a Java application. It consists of a common library that can be used independently of the individual tools (listed above). Each tool has a full command line interface as well as an API, so it can be run from the command line, run from a script, or integrated directly into another Java program. The code is fully documented with Javadoc to make the APIs more accessible. All classes, methods, arguments, and exceptions are fully annotated. For all command line tools, a set of standard Linux/UNIX style options is available (via GNU Getopt). Each tool has a "-h" option that prints a full help message.

KAnalyze is under construction and unavailable for download. When it reaches a state of maturity where it can be useful for other research, we intend to make the software and source code available for download.

Research

We are building and using this software suite to do fast viral pattern analysis from metagenomic data. Given a large set of sequences presumably from a human subject, we are looking for ways to k-merize the set and quickly pick out viral patterns.