miscellaneous

This page provides additional documentation for AMDirT that is not directly related to the functioning of the AMDirT commands themselves.

viewer

Downloading the sequencing data of selected libraries

AMDirT provides three different methods to download the sequencing data of selected libraries from public archives:

  • direct download from the FTP server using curl

  • direct download via the FASP protocol using ASPERA

  • indirect download via the Nextflow pipeline nf-core/fetchngs

Downloading via curl

cURL is a well-established and popular tool for command-line or script-based data transfer. It is found on most modern UNIX-based operating systems, and is therefore the default downloading tool in AMDirT. However, it is the slowest of the three options, as it runs over a standard HTTP/FTP connection and is not parallelised (files are downloaded one after another).

In most cases you can assume it is already installed on your machine. You can check that cURL is installed by running:

which curl

The output should be something like /usr/bin/curl. If you get no output, you will need to install the tool.

If you select curl in AMDirT viewer or AMDirT convert, you will receive a bash script that contains one or more curl commands.
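The same check can be wrapped in a small function so that a missing tool produces a clear message instead of empty output. This is only a sketch: `check_tool` is a hypothetical helper name and is not part of AMDirT.

```bash
# Report whether a required download tool is available on the PATH.
# `check_tool` is a hypothetical helper, not something AMDirT provides.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

if ! check_tool curl; then
    echo "Install curl before running the download script." >&2
fi
```

The same function can be called with `ascp` or `nextflow` to check the other two download methods.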

It will look like this:

curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR533/006/SRR5332466/SRR5332466.fastq.gz -o SRR5332466.fastq.gz

where you have the URL and the output file to save to.
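To make the relationship between the two parts explicit, the sketch below derives the output file name from the URL by stripping everything up to the last slash. It only prints the resulting command and does not download anything. The variable names are illustrative and are not taken from the script generated by AMDirT.

```bash
# URL taken from the example command above.
url="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR533/006/SRR5332466/SRR5332466.fastq.gz"

# Keep only the part after the last "/" to get the local file name.
outfile="${url##*/}"

# Print the command instead of running it (dry run).
echo "curl -L $url -o $outfile"
```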

By running bash ancientMetagenomeDir_curl_download_script.sh, the script will download each FASTQ file in the script one by one.

Downloading via the FASP protocol using ASPERA

FASP is a specific protocol that allows the download of large data files at a speed that is usually much higher than when downloading from the FTP server. It is particularly suitable when downloading very large data files. While much faster than curl, the aspera bash script generated by AMDirT still runs sequentially.

Before downloading via this method, make sure that you have Aspera Connect installed on your system (check with which ascp). If this is not the case, please refer to this installation guide and download the binary from here. You can also install it via conda (conda create -n aspera -c HCC aspera-cli).

Following the recommendation from ENA, AMDirT viewer/convert will return a script that contains a command like this for each sequencing file:

```bash
ascp -QT -l 300m -P 33001 -i path/to/aspera/installation/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:path/to/sequencing/file local/target/directory
```

AMDirT will automatically replace `path/to/sequencing/file` to match the paths for the libraries that were selected. It will also set the `local/target/directory` to the current directory.

However, you will need to set the `path/to/aspera/installation` prior to running this. To make it more convenient, we opted for using the environment variable `ASPERA_PATH` that has to be set in the shell prior to running the script. Therefore, run:

```bash
# export makes the variable visible to the download script, which runs in a child process
export ASPERA_PATH="$HOME/.aspera/cli"
```
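Before starting a long download, it can help to confirm that the variable points to a real installation. The sketch below assumes that the key file sits at `etc/asperaweb_id_dsa.openssh` under the installation directory, as in the `ascp` command above, and falls back to `$HOME/.aspera/cli` if `ASPERA_PATH` has not been set. The `key` variable name is illustrative.

```bash
# Use the existing value of ASPERA_PATH, or fall back to the default location.
: "${ASPERA_PATH:=$HOME/.aspera/cli}"
export ASPERA_PATH

# Assumed key location, mirroring the -i argument of the ascp command.
key="$ASPERA_PATH/etc/asperaweb_id_dsa.openssh"

if [ -f "$key" ]; then
    echo "Aspera key found: $key"
else
    echo "Aspera key not found at $key; check ASPERA_PATH" >&2
fi
```

If the key is reported as missing, adjust `ASPERA_PATH` to match your installation, which may live elsewhere when installed via conda.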

> ⚠️ In case your institute blocks the port 33001, you will need to change the parameter `-P 33001` to another port that is not blocked.

Downloading via nf-core/fetchngs

nf-core/fetchngs is a Nextflow bioinformatics pipeline to fetch metadata and raw FastQ files from both public and private databases. At present, the pipeline supports SRA / ENA / DDBJ / Synapse IDs. While it still runs over HTTPS, it supports direct download via AWS S3 servers and is highly parallelised, downloading multiple files at once.

You will need to install Nextflow and have it configured for your machine or cluster, as well as a software environment system such as conda, docker, or singularity.

The output from AMDirT viewer/convert will contain a list of accessions in a format compatible with the nf-core/fetchngs input file.

nextflow pull nf-core/fetchngs
nextflow run nf-core/fetchngs --input AncientMetagenomeDir_nf_core_fetchngs_input_table.tsv
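In practice you will usually also tell Nextflow which software environment to use and where to write results. The sketch below only assembles and prints such a command. It assumes Docker as the environment (`-profile docker`) and uses a placeholder output directory name (`fetchngs_results`). Neither value comes from AMDirT, and recent pipeline releases may require `--outdir`, so check the nf-core/fetchngs documentation for the version you pulled.

```bash
# Input table produced by AMDirT viewer/convert (name as used above).
input="AncientMetagenomeDir_nf_core_fetchngs_input_table.tsv"

# Assumptions: swap in conda or singularity, and your own output directory.
profile="docker"
outdir="fetchngs_results"

# Assemble the command and print it for review before running it.
cmd="nextflow run nf-core/fetchngs --input $input --outdir $outdir -profile $profile"
echo "$cmd"
```

Once the printed command looks right for your setup, run it directly in the shell.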