ksw2c aligner (ksw2_cigar)¶
Documentation¶
Copyright (c) 2021, Huxelerate S.r.l. All rights reserved.
Provides a set of APIs to interact with the HUGenomic ksw2_cigar core that performs accelerated sequence alignment on FPGA.
The core functionality is based on the ksw2_extd method from the ksw2 library: https://github.com/lh3/ksw2
The core computes the optimal scores and the CIGAR for global and extension alignment of two sequences using dual gap affine cost function and a custom scoring matrix. The core also supports banded alignment and the z-drop heuristics: https://doi.org/10.1093/bioinformatics/bty191
Before running the mothods of this library, you must ensure that the target FPGA is properly configured with the ksw2_cigar core through the fpga_configure function.
- Note
If you only need the optimal scores but not the CIGAR it is recommended to use the ksw2_score core since it provides better performance.
-
namespace
hugenomic
Typedefs
-
typedef uint32_t
ksw2c_cigar_op
¶ The data type used for storing a CIGAR operation within a CIGAR (e.g. 10M, 2I, 4D). A CIGAR operation encodes both the operation type and its length.
- Note
We use the same encoding of the ksw2 library.
Functions
-
char
ksw2c_get_cigar_op_type
(ksw2c_cigar_op cigar_op)¶ Retrieves the operation type of a CIGAR operation.
- Return
The operation type, can be one of:
’M’: match/mismatch
’I’: insertion
’D’: deletion
’U’: undefined (only if the cigar operation is incorrect)
- Parameters
cigar_op
: The cigar operation
-
uint32_t
ksw2c_get_cigar_op_length
(ksw2c_cigar_op cigar_op)¶ Retrieves the length of a CIGAR operation.
- Return
The length of the CIGAR operation
- Parameters
cigar_op
: The CIGAR operation
-
std::vector<ksw2c_response>
ksw2c_batch
(uint32_t fpga_id, const std::vector<std::pair<std::vector<uint8_t>, std::vector<uint8_t>>> &pairs_of_sequences, int8_t start_gap_cost, int8_t extend_gap_cost, int8_t start_gap_cost_2, int8_t extend_gap_cost_2, const std::vector<int8_t> scoring_matrix, int32_t bandwidth, int32_t zdrop, int32_t end_bonus, bool extension_only_cigar)¶ Performs a sequence alignment of one or more pairs of query and target sequences on a target FPGA and returns the alignment scores and the CIGAR for each pair of sequences.
The ksw2_cigar core computes at once the score of 4 different types of alignments. These alignments are: global alignment, global query alignment, global target alignment, extension alignment. For each alignment type, each pair of query and target sequences are aligned from their beginning up to an end position (query, target) that depends on the alignment type as follows:
global alignment (query_length - 1, target_length - 1)
global query alignment (query_length - 1,
FREE
)global target alignment (
FREE
, target_length - 1)extension alignment (
FREE
,FREE
)
Where
FREE
means that the algorithm selects the position that maximizes the alignment score.The core returns only the CIGAR of one alignment among: extension, global and global query alignment. To specify which CIGAR to return the user can use the extension_only_cigar and the end_bonus parameters. The following rules determine which CIGAR is returned:
global alignment CIGAR
when extension_only_cigar = false
global query alignment CIGAR
when extension_only_cigar = true AND global_query_max_score + end_bonus > extension_max_score
extension alignment CIGAR
when extension_only_cigar = true AND global_query_max_score + end_bonus <= extension_max_score
However, if the alignment is z-dropped, the extension alignment CIGAR is returned by default.
The cost function for gaps is computed as the minimum of two affine gap functions using the following formula, where l is the length of the gap:
min{start_gap_cost + l * extend_gap_cost, start_gap_cost_2 + l * extend_gap_cost_2}
The cost function associated to a match/mismatch is provided by a scoring matrix. The scoring matrix is an M*M matrix encoded as a linear array that determines the score to assign when comparing an element q of the query to an element t of the target sequence. In details, the score is assigned as:
score = scoring_matrix[t * M + q]
The value M must be in the range [2, KSW2C_MAX_SCORING_MATRIX_SIZE].
Notice that the values stored in the query and target sequences must be between 0 and M-1 otherwise an std::invalid_argument will be thrown.
Furthermore each sequence cannot be longer than KSW2C_MAX_SEQUENCE_SIZE.
PERFORMANCE TIP: you should always try to maximize the number of pairs of sequences in order to obtain maximum performance.
- Return
Vector of ksw2c_response containing the alignment result for all pairs of sequences
- Parameters
fpga_id
: The identifier of the target FPGA to usepairs_of_sequences
: The pairs of query and target sequences to align where the first element of each pair is a query sequence and the second element of each pair is a target sequencestart_gap_cost
: The cost for starting a gapextend_gap_cost
: The cost for extending a gapstart_gap_cost_2
: The cost for starting a gap (second gap function)extend_gap_cost_2
: The cost for extending a gap (second gap function)scoring_matrix
: Scoring matrix of size M*M encoded as a linear vectorbandwidth
: The width of the band for banded alignment, use a negative number for full alignmentzdrop
: The z value for the z-drop heuristic, use a negative number to avoid z-dropend_bonus
: The bonus score to determine if the CIGAR of global query alignment should be returned instead of the CIGAR for extension alignment when extension_only_cigar is trueextension_only_cigar
: Whether to return the CIGAR for extension alignment/global query alignment (true) or the CIGAR for global alignment (false). If the alignment is z-dropped, the flag is not considered and the CIGAR for extension alignment is returned by default.
- Exceptions
std::invalid_argument
: In case of invalid sequences or scoring matrixinvalid_fpga_configuration
: If the target FPGA is not configured with the ksw2c_coreinvalid_fpga_identifier
: If an invalid FPGA identifier is specified
Variables
-
const uint64_t
KSW2C_MAX_SEQUENCE_SIZE
= 65536¶ The maximum supported sequence size
-
const uint64_t
KSW2C_MAX_SCORING_MATRIX_SIZE
= 64¶ The maximum supported size for the custom scoring matrix
-
const int32_t
KSW2C_MAX_NOT_FOUND
= -0x40000000¶ A special value assigned to a max score returned by the aligner when the max score is not found (such as when the alignment is z-dropped)
-
struct
ksw2c_response
¶ - #include <ksw2_cigar.hpp>
A structure that defines the result of the execution of a ksw2_cigar alignment on two pair of sequences
Public Members
-
int32_t
extension_max_score
¶ Max score of extension alignment:
query from 0 to [X]
target from 0 to [Y]
can be equal to KSW2C_MAX_NOT_FOUND if not found
-
int32_t
extension_query_end_position
¶ Query end position ([X]) for extension alignment
-
int32_t
extension_target_end_position
¶ Target end position ([Y]) for extension alignment
-
int32_t
global_query_max_score
¶ Max score of global query alignment:
query from 0 to query_len - 1
target from 0 to [Y]
can be equal to KSW2C_MAX_NOT_FOUND if not found
-
int32_t
global_query_target_end_position
¶ Target end position ([Y]) for global query alignment
-
int32_t
global_target_max_score
¶ Max score of global target alignment:
query from 0 to [X]
target from 0 to target_len - 1
can be equal to KSW2C_MAX_NOT_FOUND if not found
-
int32_t
global_target_query_end_position
¶ Query end position ([X]) for global target alignment
-
int32_t
global_max_score
¶ Max score of global alignment:
query from 0 to query_len - 1
target from 0 to target_len - 1
can be equal to KSW2C_MAX_NOT_FOUND if not found
-
std::vector<ksw2c_cigar_op>
cigar
¶ The CIGAR of one of the following alignments:
global alignment: if extension_only_cigar=false
global query alignment: if extension_only_cigar=true and global_query_max_score + end_bonus > extension_max_score
extension alignment: if extension_only_cigar=true and global_query_max_score + end_bonus <= extension_max_score
- Note
If the alignment has been z-dropped, the CIGAR of extension alignment is returned by default
-
bool
global_query_cigar
¶ Whether the CIGAR for global query alignment is returned
-
bool
zdropped
¶ Whether the alignment has been z-dropped
-
int32_t
-
typedef uint32_t
Usage examples¶
examples/ksw2_cigar/1_cigar_operations_counting¶
/**
* Copyright (c) 2021, Huxelerate S.r.l.
* All rights reserved.
*
* Sample code for computing the average number of matches/mismatches, insertions and deletions
* for global alignment between random DNA reads.
*/
#include <iostream>
#include <iomanip>
#include <string>
#include <sys/time.h>
#include <vector>
#include <utility>
#include "hugenomic.hpp" // HUGenomic library
using namespace hugenomic;
// number of pairs of sequences to generate and their lengths
const unsigned int NUM_PAIRS = 48;
const size_t QUERY_SEQUENCES_LEN = 4000;
const size_t TARGET_SEQUENCES_LEN = 4000;
// scoring system used by the sequence alignment algorithm
const int START_GAP_COST = 4;
const int EXTEND_GAP_COST = 2;
const int START_GAP_COST_2 = 13;
const int EXTEND_GAP_COST_2 = 1;
// in this example we use 5x5 scoring matrix, where we assume the following
// encoding for positions (rows/columns) within the matrix:
// 0 -> A, 1 -> C, 2 -> G, 3 -> T, 4 -> *
std::vector<int8_t> SCORING_MATRIX = {
3, -4, -4, -4, 0,
-4, 3, -4, -4, 0,
-4, -4, 3, -4, 0,
-4, -4, -4, 3, 0,
0, 0, 0, 0, 0
};
const int ZDROP = -1; // we do not use the z-drop heuristics
const int BANDWIDTH = -1; // we do full alignment (not banded)
const int END_BONUS = 0;
const int EXTENSION_ONLY_CIGAR = false;
// utility function to generate a random dna sequence with a given length
std::vector<uint8_t> random_dna_sequence(size_t length);
// utility function to retrieve the current time in microseconds
unsigned long get_time();
int main()
{
// initialize one FPGA with the accelerated alignment core
const int fpga_id = 0;
fpga_configure(ksw2_cigar, fpga_id);
// generate a random set of pairs of DNA reads
std::cout << "Generating " << NUM_PAIRS
<< " random pairs of DNA reads, each pair with read lengths ("
<< QUERY_SEQUENCES_LEN << ", " << TARGET_SEQUENCES_LEN << ")"
<< std::endl;
srand(0);
std::vector< std::pair< std::vector<uint8_t>, std::vector<uint8_t> > > sequences_pairs;
for(int s = 0; s < NUM_PAIRS; s++) {
std::vector<uint8_t> query_sequence = random_dna_sequence(QUERY_SEQUENCES_LEN);
std::vector<uint8_t> target_sequence = random_dna_sequence(TARGET_SEQUENCES_LEN);
sequences_pairs.push_back({query_sequence, target_sequence});
}
std::cout << "Generation complete!" << std::endl << std::endl;
// run the alignment on each pair of random reads
std::cout << "Running alignment..." << std::endl;
int total_score = 0;
int total_length = 0;
int total_match_mismatch = 0;
int total_deletion = 0;
int total_insertion = 0;
unsigned long start_compute_time = get_time();
std::vector<ksw2c_response> alignment_results = ksw2c_batch(fpga_id, sequences_pairs,
START_GAP_COST, EXTEND_GAP_COST, START_GAP_COST_2, EXTEND_GAP_COST_2,
SCORING_MATRIX, BANDWIDTH, ZDROP, END_BONUS, EXTENSION_ONLY_CIGAR);
unsigned long end_compute_time = get_time();
for(const auto &alignment: alignment_results) {
for(ksw2c_cigar_op cigar_op: alignment.cigar) {
char cigar_op_type = ksw2c_get_cigar_op_type(cigar_op);
uint32_t cigar_op_length = ksw2c_get_cigar_op_length(cigar_op);
if(cigar_op_type == 'M')
total_match_mismatch += cigar_op_length;
else if(cigar_op_type == 'I')
total_insertion += cigar_op_length;
else if(cigar_op_type == 'D')
total_deletion += cigar_op_length;
}
}
// print the average number of matches/mismatches, insertions and deletions
std::cout << std::fixed << std::setprecision(2);
std::cout << "Alignment complete!" << std::endl << std::endl;
std::cout << "Average number of matches/mismatches = "
<< total_match_mismatch / (double) NUM_PAIRS << std::endl << std::endl;
std::cout << "Average number of insertions = "
<< total_insertion / (double) NUM_PAIRS << std::endl << std::endl;
std::cout << "Average number of deletions = "
<< total_deletion / (double) NUM_PAIRS << std::endl << std::endl;
// measure execution time and performance
double time_ms = (end_compute_time - start_compute_time) / 1000.0;
size_t cell_updates = QUERY_SEQUENCES_LEN * TARGET_SEQUENCES_LEN * NUM_PAIRS;
std::cout << "Alignment time: " << time_ms << " ms" << std::endl
<< "GCUPS (Giga Cell Update Per Second) = "
<< (cell_updates / time_ms / 1000000.0) << std::endl;
// release the FPGA
fpga_release(fpga_id);
return 0;
}
std::vector<uint8_t> random_dna_sequence(size_t length)
{
std::vector<uint8_t> sequence;
for(size_t i = 0; i < length; i++) {
sequence.push_back( rand() % 5 );
}
return sequence;
}
unsigned long get_time()
{
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec * 1000 * 1000 + tv.tv_usec;
}