Quick-start guide

Installation

To use mim-tRNAseq, we recommend installing the package with conda, preferably in its own environment. Using the Miniforge distribution of conda, which includes Mamba, can substantially reduce installation time and dependency-resolution problems. We recommend installing Miniforge and then following the steps below:

conda create -n mimseq python=3.7
conda activate mimseq
mamba install -c bioconda mimseq

usearch must also be obtained and installed:

wget https://drive5.com/downloads/usearch10.0.240_i86linux32.gz
gunzip usearch10.0.240_i86linux32.gz
chmod +x usearch10.0.240_i86linux32
mv usearch10.0.240_i86linux32 usearch
cp usearch /usr/local/bin

The last cp command requires root access. If you do not have it, add the directory containing the usearch binary to your PATH instead (replace full/path/to/usearch with the directory that holds your usearch binary from above):

export PATH=$PATH:full/path/to/usearch
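The export above only lasts for the current shell session and appends the directory again each time it is run. A minimal sketch of a guard against duplicate entries follows; the helper name add_to_path and the directory $HOME/bin are illustrative assumptions, not part of mim-tRNAseq.

```shell
# Hypothetical helper: append a directory to PATH only if it is not already there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                  # already present, nothing to do
        *) PATH="$PATH:$1" ;;         # otherwise append it
    esac
    export PATH
}

# Example (assumed location of the usearch binary):
# add_to_path "$HOME/bin"
```

Placing the function and the call in your shell start-up file (for example ~/.bashrc) would make the change persistent across sessions.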

Alternatively, mim-tRNAseq can be installed with pip, in which case all additional non-Python dependencies (including usearch as above, BLAST, INFERNAL, GMAP/GSNAP, and all required R packages) must also be installed manually.

pip install mimseq

The source code is also available on GitHub.

Once installed, mim-tRNAseq should be executable. To confirm this and display the help text, run:

mimseq --help

The package also comes with a data/ folder containing the required tRNAscan-SE input files (and mitochondrial/plastid tRNA inputs where available) for a few species. Note that data folders with “eColitK” in the name contain the E. coli Lys-TTT reference used as a spike-in in the paper. Using this reference in an experiment without the spike-in should not affect the results. The default inputs selected with --species are therefore the references that include the E. coli Lys-TTT sequence.

Dependencies

When using conda to install mim-tRNAseq, all dependencies below are managed and installed automatically. We therefore strongly recommend using conda to install mim-tRNAseq in a separate environment (see Installation above). If you install from source or PyPI, please install all dependencies below before running mim-tRNAseq. In most cases, newer versions of the packages should work, but if you encounter errors at runtime, first try installing the exact versions listed below.

Unix command line dependencies:

Tool          Version       Link
GMAP-GSNAP    2019-02-26    GSNAP
samtools      >=1.11        samtools
usearch       10.0.240      usearch
bedtools      >=2.30.0      bedtools
INFERNAL      >=1.1.4       INFERNAL
BLAST         2.10.1        BLAST
gcc           4.8.5         gcc
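After a pip or source install, it can help to confirm that the command-line tools listed above are reachable before starting a run. The following is a minimal sketch; the function name check_tools is hypothetical, and the executable names in the example call (gsnap for GMAP-GSNAP, cmalign for INFERNAL, blastn for BLAST, Rscript for R) are assumptions about how each package exposes itself on PATH. It checks presence only, not versions.

```shell
# Report which of the given executables are missing from PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'found    %s\n' "$tool"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (executable names are assumptions, see above):
# check_tools gsnap samtools usearch bedtools cmalign blastn gcc Rscript
```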

Required R packages:

Package           Version      Link
R base            >=4          R
DESeq2            >=1.26.0     DESeq2
RColorBrewer      1.1.2        RColorBrewer
pheatmap          >=1.0.12     pheatmap
calibrate         >=1.7.7      calibrate
gridExtra         >=2.3        gridExtra
plyr              >=1.8.6      plyr
dplyr             >=1.0.6      dplyr
reshape2          >=1.4.3      reshape2
circlize          >=0.4.8      circlize
tidyverse         >=1.3.0      tidyverse
ComplexHeatmap    >=2.2.0      ComplexHeatmap
devtools          >=2.4.1      devtools
ggplot2           >=3.3.5      ggplot2
ggpol             >=0.0.7      ggpol

Required Python packages:

Package       Version         Link
Python        =3.7            Python
Biopython     >=1.79          Biopython
pyfiglet      >=0.8.post1     pyfiglet
pysam         >=0.16.0.1      pysam
pandas        >=1.3.1         pandas
numpy         >=1.21.1        NumPy
seaborn       >=0.11.1        seaborn
pybedtools    >=0.8.2         pybedtools
requests      >=2.26.0        requests

Usage

An example command to run mim-tRNAseq may look as follows. This will run an analysis between HEK293T and K562 cells on an example dataset included in the package:

mimseq --species Hsap --cluster-id 0.97 --threads 15 --min-cov 0.0005 --max-mismatches 0.075 --control-condition HEK293T -n hg38_test --out-dir hg38_HEK239vsK562 --max-multi 4 --remap --remap-mismatches 0.05 sampleData_HEKvsK562.txt

The run should take around 15 minutes on a server using 15 processors (--threads 15; please adjust this according to your server's capabilities).
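One way to adjust the thread count to the machine at hand is sketched below. The helper name pick_threads is hypothetical, and getconf _NPROCESSORS_ONLN is assumed to be available (it is on most Linux and macOS systems, but it is not guaranteed everywhere, so the sketch falls back to 1).

```shell
# Print the smaller of the number of online CPUs and a user-supplied cap.
pick_threads() {
    cap=$1
    cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    # Fall back to a single thread if the CPU count is not a plain number.
    case "$cpus" in
        ''|*[!0-9]*) cpus=1 ;;
    esac
    if [ "$cpus" -lt "$cap" ]; then
        echo "$cpus"
    else
        echo "$cap"
    fi
}
```

The result can then be passed in place of the fixed value, for example --threads "$(pick_threads 15)" in the command above.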

Input files

Note: mim-tRNAseq does not require an input file from Modomics for modification indexing; it automatically connects to the Modomics server and retrieves this information. An internet connection is therefore needed to use the most recent Modomics data. However, an offline copy of Modomics is included, so mim-tRNAseq can still run without a connection or if the Modomics database is offline.

mim-tRNAseq requires a few input files, depending on the species of interest. Data for some species are already present in the data/ folder and can be specified easily with the --species parameter (see Pre-built references below for available references). If your species is not included, you may be able to obtain the required files from GtRNAdb, or request new predictions from the maintainers if your species of interest is not there. Failing this, the input files can be generated by running tRNAscan-SE on a genome reference file, but the annotation and naming of tRNAs then become crucial for mim-tRNAseq to function. The tRNAscan-SE ID given in parentheses in each fasta header must match an entry in the “.out” file for proper processing. This kind of manual prediction, annotation, and input can create many issues, because mim-tRNAseq expects files and annotations formatted like those in GtRNAdb files. This functionality has also not been extensively tested.
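For manually generated inputs, a rough consistency check between the fasta headers and the “.out” file might look like the sketch below. It rests on assumed file layouts that are not taken from the mim-tRNAseq documentation: fasta headers of the form ">name (tRNAscan-SE ID: chr1.trna5) ...", and “.out” data rows whose first two whitespace-separated fields are the sequence name and the tRNA number. The function name missing_out_entries is hypothetical. Verify both assumptions against your own files before relying on it.

```shell
# Print tRNAscan-SE IDs found in fasta headers that have no matching row in
# the ".out" file. ASSUMED layouts (check against your own files):
#   fasta header: >name (tRNAscan-SE ID: chr1.trna5) ...
#   .out row:     <sequence name> <tRNA number> ...
missing_out_entries() {
    fasta=$1
    out=$2
    ids_out=$(mktemp)
    # Build "<sequence name>.trna<number>" for each data row; header rows are
    # skipped because their second field is not an integer.
    awk '$2 ~ /^[0-9]+$/ { print $1 ".trna" $2 }' "$out" > "$ids_out"
    # Extract the ID inside the parentheses, then keep only IDs that are
    # absent from the list built from the .out file.
    sed -n 's/^>.*(tRNAscan-SE ID: \([^)]*\)).*/\1/p' "$fasta" |
        grep -F -x -v -f "$ids_out"
    rm -f "$ids_out"
}

# Example (hypothetical file names):
# missing_out_entries my_species-tRNAs.fa my_species-tRNAs.out
```

Empty output would suggest that every fasta ID has a corresponding “.out” row under these assumptions; any printed ID deserves a manual look.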

Input files include:

  • Genomic tRNA sequences: DNA sequences of tRNA loci in genome of interest in fasta format, including introns but excluding trailer and leader sequences.

  • tRNA “.out” file: contains important info about tRNA introns.

  • Experiment sample file: User-generated tab-delimited file with 2 columns. The first is the absolute path to the trimmed tRNAseq reads. The second is the condition name, used to group replicates (e.g. WT or knock-out).

  • OPTIONAL mitochondrial and/or plastid (in case of plant species) tRNA sequences: Can be obtained from the mitotRNAdb if available. First, find the organism of interest in the “Search Database” tab, select all sequences for the organism, choose “Send FASTA” in the drop-down at the bottom of the results, and click “Submit”. Or, for plant species, obtain sequences from PtRNAdb by going to “Search”, choosing “Mitochondrial” and/or “Plastid” in “Search by Genome”, enabling “Search by Plant Name:” and searching for your species of interest. Download the results, and then convert them to the correct format using the example convertPtRNAdbSearch.py script in the Arabidopsis thaliana data folder, making sure to change the file names in the script before running it. Mitochondrial sequences can be specified to mim-tRNAseq with the -m or --mito-trnas parameter. Plastid sequences can be specified to mim-tRNAseq with the -p or --plastid-trnas parameter.
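The experiment sample file described in the list above can be written by hand or generated. The sketch below writes a two-column, tab-delimited file and checks its basic shape; the function names, read paths, and replicate names are hypothetical placeholders, and the check covers only the two requirements stated above (two tab-separated columns, absolute path in the first).

```shell
# Write a two-column, tab-delimited sample file:
#   <absolute path to trimmed reads> <TAB> <condition name>
# All paths below are hypothetical placeholders.
write_sample_file() {
    out=$1
    # printf reuses the format string for each pair of arguments.
    printf '%s\t%s\n' \
        /data/trimmed/HEK293T_rep1.fastq.gz HEK293T \
        /data/trimmed/HEK293T_rep2.fastq.gz HEK293T \
        /data/trimmed/K562_rep1.fastq.gz K562 \
        /data/trimmed/K562_rep2.fastq.gz K562 \
        > "$out"
}

# Succeed only if every line has exactly two tab-separated fields and the
# first field is an absolute path.
validate_sample_file() {
    awk -F '\t' 'NF != 2 || $1 !~ /^\// { bad = 1 } END { exit bad + 0 }' "$1"
}

# Example:
# write_sample_file sampleData.txt && validate_sample_file sampleData.txt
```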

additionalMods.txt is automatically read in by mim-tRNAseq to add modifications to the modification index that may not be in Modomics yet. Some important modifications have already been added for certain species, mainly based on Clark et al., “tRNA base methylation identification and quantification via high-throughput sequencing” (2016), and Rafels-Ybern et al., “Codon adaptation to tRNAs with Inosine modification at position 34 is widespread among Eukaryotes and present in two Bacterial phyla” (2018).

Pre-built references

mimseq contains a few pre-built references which are available to specify at runtime with --species. All of these references include the E. coli tRNA-Lys-TTT spike-in sequence as detailed in the original method (Behrens et al., 2021). Details on these references are given below:

Note:

The Hsap, Hsap19, and Mmus references were built using the bed file supplied in the GtRNAdb downloads, which can be obtained from “Download tRNAscan-SE Results” on a species page. This bed file represents the “High Confidence Set and Top 30 Hits in Each Isotype of Filtered Sets” according to GtRNAdb (for hg38 example see here). These predictions can be reached by clicking “tRNA Predictions” in the left panel of a species page. We opted for this less stringent set of sequences because it includes tRNAs that might be expressed despite being filtered out by tRNAscan-SE, which lets mimseq filter unexpressed genes instead (using --min-cov).

To create these references (since GtRNAdb does not directly supply a fasta file for this set of tRNAs), we extracted the sequences from the genome using bedtools, and subsequently renamed and reformatted the sequence headers with a custom script, FastaHeadersforMimseq.py. This analysis can be recreated for another species or genome by following the README for mouse mm39 as an example. Be sure to edit FastaHeadersforMimseq.py to suit your needs.