Would it be efficient if the blastp option used the same reference over multiple inputs?

Frankster

New member
I am trying to make my pipeline more efficient.
I have multiple DIAMOND databases that I apply to a list of genomes (protein sequences annotated/found by prodigal) in a for loop. Is there a benefit to looping through the list of genomes for each database, rather than looping through the databases for each genome? My thinking is that setting up the reference could be skipped when the same database is used over and over.

I am not sure if this question makes sense.
 

Benjamin Buchfink

Administrator
Staff member
If you can combine your databases into a larger one, that would be better. DIAMOND becomes more efficient as the query and database files get larger. That only holds within certain limits, though: if your files are already a couple of GB in size, you can also run them separately.
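
As a rough illustration of the advice above, here is a minimal Python sketch that merges several FASTA files into one and builds the corresponding DIAMOND command lines. It assumes the original protein FASTA files behind each database are still available and that sequence IDs are unique across them. The helper names (concatenate_fasta, makedb_command, blastp_command) and all file names are hypothetical. The functions only build the argument lists; they do not run DIAMOND.

```python
from pathlib import Path


def concatenate_fasta(parts, combined):
    """Write all FASTA files in `parts` into one file at `combined`.

    A newline is appended to any part that lacks a trailing one, so the
    last sequence of one file never runs into the header of the next.
    """
    with open(combined, "wb") as out:
        for part in parts:
            data = Path(part).read_bytes()
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return combined


def makedb_command(fasta, db):
    """Argument list for building one DIAMOND database from a FASTA file."""
    return ["diamond", "makedb", "--in", str(fasta), "--db", str(db)]


def blastp_command(db, query, out):
    """Argument list for a single blastp run of `query` against `db`."""
    return ["diamond", "blastp", "--db", str(db),
            "--query", str(query), "--out", str(out)]
```

The same concatenate_fasta helper could be used on the prodigal output of all genomes, so that one large query file is searched against one large database in a single run. Each returned list can be passed to subprocess.run if DIAMOND is installed and on the PATH.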
 

Frankster

New member
If I put all the databases together, it would result in a database of 2.4 GB. Would that still be viable on an average computer? I will play around with the block size. Does the block size affect the accuracy or only the speed?
 

Benjamin Buchfink

Administrator
Staff member
Yes, that would certainly be viable. The block size does not affect the accuracy. Higher values only increase speed, at the cost of higher memory usage.
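
To make the block size trade-off concrete, here is a small Python sketch that adds DIAMOND's --block-size option to a blastp command line. The function name and file names are hypothetical, and the value to pick depends on the RAM of the machine. The DIAMOND documentation gives a rough rule of thumb of about six times the block size in GB of memory; treat that as an approximation rather than a guarantee.

```python
def blastp_with_block_size(db, query, out, block_size):
    """Build a blastp argument list with an explicit block size.

    `block_size` is given in billions of sequence letters. Larger values
    trade more memory for more speed; the results stay the same.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return ["diamond", "blastp", "--db", str(db),
            "--query", str(query), "--out", str(out),
            "--block-size", str(block_size)]
```

For example, blastp_with_block_size("combined", "all_genomes.faa", "hits.tsv", 4.0) yields a command that uses a larger block than the default, which should be faster on a machine with enough memory.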
 