Taxonomy using proteins from TSA

DarrenObbard

New member
Hi!

I would like diamond to report on taxonomy from the TSA. These are nucleotide sequences, so I have generated my own putative translations. My input sequences for the database have headers/names like:

>GAAA01000001.1 1.Latimeria chalumnae

To provide the taxids for the database, I am using NCBI's nucl_wgs.accession2taxid.gz. This seems to be in a different format to (e.g.) acc_taxid_prot.gz, in that it has a header line and four columns. I have therefore modified it to remove the header, and retain only accession.version and taxid columns - so it now looks like acc_taxid_nucl.gz:

zcat corrected_nucl_wgs.accession2taxid.gz | head
AAAA00000000.2 39946
AAAA02000001.1 39946
AAAA02000002.1 39946
AAAA02000003.1 39946
AAAA02000004.1 39946
AAAA02000005.1 39946
AAAA02000006.1 39946
AAAA02000007.1 39946

I have confirmed that GAAA01000001.1 is in this file:

zcat corrected_nucl_wgs.accession2taxid.gz | grep -n GAAA01000001.1
238498520:GAAA01000001.1 7897

However, when I attempt to build the database, I get the following error:

/diamond-0.9.31/diamond makedb --threads 10 --taxonmap corrected_nucl_wgs.accession2taxid.gz --taxonnodes nodes.dmp --taxonnames names.dmp --in All_TSA_Prots.fas.gz -d tsa_protein_diamond
....
Loading taxonomy... [0s]
Error: Invalid taxonomy mapping file format.

One guess is that this is because some accessions (e.g. GAAA01000001.1) appear multiple times in All_TSA_Prots.fas.gz (some nucleotides have multiple proteins associated with them, e.g. >GAAA01000001.1 2.Latimeria chalumnae ) but if this is the cause, then the error message is misleading. I also cannot see why this would be a problem.

How should I attach a taxonomy to these sequences?

Thanks and best wishes,

Darren
 

Benjamin Buchfink

Administrator
Staff member
The file is expected to have at least 3 columns. The first one is ignored, the second one is the accession, and the third one is the taxid. Be aware that accessions can't be longer that 14 characters. If they are, you need to modify the source code slightly (I can help you with that if needed). Also note that multiple entries for the same accession will not be handled correctly.
 

DarrenObbard

New member
Hi!

Thank you - it looks like lengths of 15 and 17 are common, and occasionally 16.

Can this be fixed simply by editing line 45 of taxonomy.h to be
enum { max_accesion_len = 17 };
Or is there something more subtle elsewhere?

Thanks!

D
 

DarrenObbard

New member
And for "Also note that multiple entries for the same accession will not be handled correctly", do you mean multiple entries in the taxid mapping file, or multiple entries in the input sequence list, or both?
 

Benjamin Buchfink

Administrator
Staff member
You also need to change taxonomy.cpp:83. Change %15s accordingly (to length+1).

I meant multiple entries in the accession to taxid mapping file.
 

DarrenObbard

New member
Thank you! I was not going to find taxonomy.cpp:83 easily!

I have been using the precompiled binary, so I might be about to run into a lot of compilation errors ...

Thanks!

D

PS - the current distribution of accession lengths in nucl_wgs.accession2taxid is:

1518641310
38085663314
5180012315
716
2952001017
3472118
91167120
 
Top