cgeb2001's DokuWiki!

This is an old revision of the document!

How to extract a single fasta file from a multifasta file

You have a multifasta file with 200 individual fasta files but you only want one of them. How do you extract it?

DO NOT open the file in a text editor and simply cut and paste the one you want. Unless you are very careful this approach can introduce hidden and unwanted characters that will cause programs expecting plain unix type text files to fail.

A better way is to use makeblastdb and blastdbcmd

makeblastdb -in name_of_multifasta_file -dbtype nucl -parse_seqids (for nucleotide files)

makeblastdb -in name_of_multifasta_file -dbtype prot -parse_seqids (for protein files)

identify the name of the entry you want. this will be the text string right after the > and before a space in the header For example, in the header “>m.260160 g.260160 ORF g.260160” the name is m.260160

blastdbcmd -db name_of_multifasta_file -entry name_of_entry_you_want -out name_of_single_entry

Of course there are many ways to extract a single fasta file from a multiple fasta file. What if you want to extract multiple entries from a fasta database?

You could use a single line perl script

perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(TRINITY_DN1246_c0_g1_i1 TRINITY_DN1246_c0_g1_i8)}print if $c' Trinity_sortedsize > tempfile where Trinity_sortedsize is the name of the multifasta file TRINITY_DN1246_c0_g1_i1 is the name of the first entry to get TRINITY_DN1246_c0_g1_i8 is the name of the second entry to get If you have lots of entries, to get them it is cumbersome to put all the names on a single line Better to put them in a file called something appropriate like filestoget with one entry name per line like this TRINITY_DN7501_c0_g1_i2 TRINITY_DN7173_c0_g1_i9 Then perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' filestoget Trinity_sortedsize > myextractedfiles

If you are on a system like perun you could use seqtk seqtk subseq Trinity_sortedsize filestoget > myextractedfiles where Trinity_sortedsize is the name of the multifasta file filestoget is the name of a plain text file that contains entry names, one per line myextractedfiles is the name of the file containing just the entries you want When you look at myextractedfiles you will notice that the entries are no longer nicely formatted but instead the sequence is all on one line. Normally this doesn't matter.

If you are on a system with EMBOSS installed you could use seqret To get just one entry seqret -sequence Trinity_sortedsize:TRINITY_DN7501_c0_g1_i2 -outfile myfileofinterest -auto where Trinity_sortedsize:TRINITY_DN7501_c0_g1_i2 is the name of the database followed by :followed by the name of the entry you want To get multiple entries all at once you need to put the names in a file called something appropriate like filestoget with one entry name per line like this Trinity_sortedsize:TRINITY_DN7501_c0_g1_i2 Trinity_sortedsize:TRINITY_DN7173_c0_g1_i9 Then seqret @filestoget -outseq myfiles -auto