1. Tutorial

We will demonstrate an example run of DIAMOND on a small dataset here. For this we assume the tool has been installed according to the installation instructions.

First we will download the SCOPe database of structured domains in FASTA format, containing 14,323 sequences:

wget https://scop.berkeley.edu/downloads/scopeseq-2.07/astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa

The next step is to set up a binary DIAMOND database file that can be used for subsequent searches against the database:

diamond makedb --in astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40

A binary file (astral40.dmnd) containing the database sequences will be created in the current working directory. We will now conduct a search against this database using the same FASTA file of SCOPe domains as queries:

diamond blastp -q astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40 -o out.tsv --very-sensitive

Here we set the output file to be out.tsv and also use the --very-sensitive setting. DIAMOND has a number of sensitivity settings to accommodate different applications. The default mode is the fastest and is tailored towards finding homologies of >70% sequence identity, the --sensitive mode is tailored to hits of >40% identity, while the --very-sensitive and --ultra-sensitive modes provide sensitivity across the whole range of pairwise alignments.
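The rule of thumb above can be written down as a small helper. This is only a sketch: the function names `sensitivity_flags` and `blastp_command` are hypothetical, the identity thresholds are the ones quoted in this tutorial, and the command is only assembled, not executed.

```python
from typing import List


def sensitivity_flags(min_identity: float) -> List[str]:
    """Pick extra diamond flags from the lowest identity (in percent) of interest.

    Hypothetical helper: thresholds follow the description in this tutorial.
    """
    if min_identity > 70:
        return []                   # default mode: fastest, aimed at >70% identity
    if min_identity > 40:
        return ["--sensitive"]      # aimed at hits of >40% identity
    return ["--very-sensitive"]     # sensitivity across the whole identity range


def blastp_command(query: str, db: str, out: str, min_identity: float) -> List[str]:
    """Assemble (but do not run) a diamond blastp command line."""
    base = ["diamond", "blastp", "-q", query, "-d", db, "-o", out]
    return base + sensitivity_flags(min_identity)


fasta = "astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa"
cmd = blastp_command(fasta, "astral40", "out.tsv", min_identity=30)
print(" ".join(cmd))
```

With `min_identity=30` this reproduces the command used in this tutorial. The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where diamond is installed.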

If the run completed successfully, at the end we will see this console output providing some statistics about the number of hits that were found:

Total time = 5.494s
Reported 69284 pairwise alignments, 69284 HSPs.
14323 queries aligned.

We will inspect the beginning of the output file like so:

head out.tsv

Here we will see the first ten lines of the file, showing 10 pairwise alignments of the first 3 query sequences:

d1dlwa_ d1dlwa_ 100.0   116     0       0       1       116     1       116     1.1e-58 220.7
d1dlwa_ d2gkma_ 35.4    113     73      0       1       113     13      125     1.4e-16 80.9
d1dlwa_ d4i0va_ 31.9    119     75      2       1       113     2       120     9.5e-10 58.2
d1dlwa_ d6bmea_ 32.5    114     73      1       1       110     2       115     2.3e-08 53.5
d2gkma_ d2gkma_ 100.0   127     0       0       1       127     1       127     5.4e-67 248.4
d2gkma_ d1dlwa_ 34.8    115     75      0       13      127     1       115     1.4e-17 84.3
d2gkma_ d4i0va_ 33.6    110     69      1       13      118     2       111     2.4e-14 73.6
d2gkma_ d6bmea_ 35.5    110     67      1       13      118     2       111     7.7e-13 68.6
d2gkma_ d2bkma_ 37.3    67      38      2       13      76      5       70      1.7e-04 40.8
d1ngka_ d1ngka_ 100.0   126     0       0       1       126     1       126     1.2e-69 257.3

The file is generated in tab-separated (TSV) format with 12 fields, corresponding to the format generated by BLAST using the option -outfmt 6. The 12 fields are:

  1. Query accession: the accession of the sequence that was the search query against the database, as specified in the input FASTA file after the > character until the first blank.
  2. Target accession: the accession of the target database sequence (also called subject) that the query was aligned against.
  3. Sequence identity: The percentage of identical amino acid residues that were aligned against each other in the local alignment.
  4. Length: The total length of the local alignment, which includes matching and mismatching positions of query and subject, as well as gap positions in the query and subject.
  5. Mismatches: The number of non-identical amino acid residues aligned against each other.
  6. Gap openings: The number of gap openings.
  7. Query start: The starting coordinate of the local alignment in the query (1-based).
  8. Query end: The ending coordinate of the local alignment in the query (1-based).
  9. Target start: The starting coordinate of the local alignment in the target (1-based).
  10. Target end: The ending coordinate of the local alignment in the target (1-based).
  11. E-value: The expected value of the hit quantifies the number of alignments of similar or better quality that you expect to find searching this query against a database of random sequences the same size as the actual target database. This number is most useful for measuring the significance of a hit. By default, DIAMOND will report all alignments with e-value < 0.001, meaning that a hit of this quality will be found by chance on average once per 1,000 queries.
  12. Bit score: The bit score is a scoring-matrix-independent measure of the (local) similarity of the two aligned sequences, with higher numbers meaning more similar. It is always >= 0 for local Smith-Waterman alignments.
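For downstream processing, the 12 fields listed above can be read into typed records. The following is a minimal sketch, not part of DIAMOND itself: the `Hit` class, its attribute names and the `parse_hits` function are hypothetical, and the sample rows are copied from the output shown earlier.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    """One row of the default 12-field tabular output (attribute names are our own)."""
    query: str
    target: str
    identity: float
    length: int
    mismatches: int
    gap_openings: int
    query_start: int
    query_end: int
    target_start: int
    target_end: int
    evalue: float
    bitscore: float


def parse_hits(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield a Hit for every line that has exactly 12 whitespace-separated fields."""
    for line in lines:
        # Accessions end at the first blank, so splitting on whitespace is safe
        # for both tab-separated files and space-aligned listings.
        fields = line.split()
        if len(fields) != 12:
            continue  # skip blank or malformed lines
        yield Hit(fields[0], fields[1], float(fields[2]),
                  *(int(f) for f in fields[3:10]),
                  float(fields[10]), float(fields[11]))


# Three rows taken from the example output in this tutorial.
sample = """\
d1dlwa_\td1dlwa_\t100.0\t116\t0\t0\t1\t116\t1\t116\t1.1e-58\t220.7
d1dlwa_\td2gkma_\t35.4\t113\t73\t0\t1\t113\t13\t125\t1.4e-16\t80.9
d2gkma_\td2bkma_\t37.3\t67\t38\t2\t13\t76\t5\t70\t1.7e-04\t40.8
"""

hits = list(parse_hits(sample.splitlines()))

# Drop self-hits and keep only alignments below a stricter e-value cutoff.
non_self = [h for h in hits if h.query != h.target and h.evalue < 1e-10]
for h in non_self:
    print(h.query, h.target, h.identity, h.evalue)
```

To process the real result file, the same function can be applied to an open file handle, for example `with open("out.tsv") as fh: hits = list(parse_hits(fh))`.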

Note that this output format can be customized with a number of non-default fields that are available. It is generally advisable to customize the format and limit it to the information required by downstream processing, as this may substantially increase performance. If no fields are selected that require alignment traceback (such as coordinates, length, identity, gap openings and mismatches), performance will be gained by omitting traceback computations.
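As an illustration, a reduced format that avoids the traceback-dependent fields could be requested and read back as sketched below. The field keywords `qseqid`, `sseqid`, `evalue` and `bitscore` are the BLAST-style names that DIAMOND accepts after `--outfmt 6`; the available keywords are best checked against the output of `diamond help` for the installed version. The helper names `blastp_custom` and `read_custom` are hypothetical, the command is only assembled rather than run, and the example row is derived from the sample output shown earlier.

```python
import csv
import io
from typing import Dict, List, TextIO

# Fields that, per the note above, do not require alignment traceback.
FIELDS = ["qseqid", "sseqid", "evalue", "bitscore"]


def blastp_custom(query: str, db: str, out: str, fields: List[str]) -> List[str]:
    """Assemble (but do not run) a diamond blastp command with a custom tabular format."""
    return ["diamond", "blastp", "-q", query, "-d", db, "-o", out,
            "--outfmt", "6", *fields]


def read_custom(handle: TextIO, fields: List[str]) -> List[Dict[str, str]]:
    """Read headerless tab-separated output using the requested field names as keys."""
    reader = csv.DictReader(handle, fieldnames=fields, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


fasta = "astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa"
cmd = blastp_custom(fasta, "astral40", "out.tsv", FIELDS)
print(" ".join(cmd))

# Illustrative row: columns 1, 2, 11 and 12 of a hit from the example output.
example = io.StringIO("d1dlwa_\td2gkma_\t1.4e-16\t80.9\n")
rows = read_custom(example, FIELDS)
print(rows[0]["sseqid"], float(rows[0]["bitscore"]))
```

Keeping the field list in a single place (`FIELDS`) means the command line and the parser cannot drift apart when the selection of columns changes.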
