2012-03-07/Sébastien Boisvert Given a sequence, a sliding window can be utilised to generated k-mers. Sequence: ------------------------ k-mers: *************** *************** *************** *************** *************** *************** *************** *************** *************** *************** *************** This works for de novo assembly. But let's say we want to search for sequences in the de Bruijn graph. Precisely, we want to search for a sequence that have some mismatches in respect to the sequence used for de novo assembly. So this is the sequence we are searching for: -----------X------------ ***********X*** **********X**** *********X***** ********X****** *******X******* ******X******** *****X********* ****X********** ***X*********** **X************ *X************* Since X is not in the k-mers in the graph, the sequence won't be detected. One way to deal with this is to build an index of l-mers, parts of k-mers. Each MPI rank has an array of all possible l-mers. irb(main):001:0> 4**5 => 1024 irb(main):002:0> 4**10 => 1048576 == Adding a k-mer entry to a l-mer entry == add(l-mer,k-mer) index=convertLmerToInteger(l-mer,lmerLength) table[index].add(k-mer) The table is an array of managed k-mers (with defragmentation). == Searching for the closest k-mer in the graph == If there is more than 1 hit with the same lowest score, take any. search(k-mer) l-mers = GetLmersFromKmer(k-mer) hits={} for l-mer in l-mers k-mers = table[index] for j in k-mers hits[j]++ return best hit The obvious problem here is that an entry in the table will contain a lot of k-mers. Otherwise, a k-mer could be sent to all MPI ranks so that they can search for it. But that would not be efficient at all. Send-to-all communication patterns are just bad.