The goal of this experiment is to see the differences in global gene expression upon addition of tryptophan.
We have noted that this halts activity of clbP via the activity based probe, but the mechanism is still unclear. Hopefully, this study will allow us to see if this effect is at the transcriptional level.
RNA seq was done in Andy’s lab.
First we need to check to make sure that the RNA seq data produced good reads. We can do that using fastqc as follows.
fastqc Fastq/*
mkdir Fastq/fastqc
mv Fastq/*fastqc* Fastq/fastqc
Next, we need to build an index to use with the bowtie2 aligner. Here the strain of E.coli that we used for our experiments was ATCC_25922, so we are using that genome as the reference
Note that the first time I tried this I used the genome sequencing results from ATCC one codex. I figured that this would be the way to go because they do assemblies with a combination of nanopore and Illumina (i thought the long reads probably lead to a better assembly than the 2013 one provided by NCBI). The problem with this was that the annotation file was in a gbk format, and I couldn’t figure out how to get it into a gtf file to use with featureCounts.
I thought that I would be able to use bp_genbank2gff.pl, but this didn’t format the file correctly.
As a result I decided to just start over using files downloaded from NCBI (refseq). Hopefully this will fix the problems that I was having.
bowtie2-build --threads 8 -f GCF_000401755.1_Escherichia_coli_ATCC_25922_genomic.fna ATCC_25922_bowtie2_index/ATCC_25922
conda activate rnaseq
cd Fastq
#could have done this in a loop but it was easier to just do it for each sample independently.
prefix="DFE_AAOSRB_1_"
postfix="_HFNCCDRX2.UDI"
ext="_noribo_clean.fastq.gz"
bowtie2 -x ../ATCC_25922_bowtie2_index/ATCC_25922 -1 $prefix"1"$postfix$"425"$ext -2 $prefix"2"$postfix$"425"$ext -p 8 -S AAO.sam
prefix="DFE_ABOSRB_1_"
postfix="_HFNCCDRX2.UDI"
ext="_noribo_clean.fastq.gz"
bowtie2 -x ../ATCC_25922_bowtie2_index/ATCC_25922 -1 $prefix"1"$postfix$"426"$ext -2 $prefix"2"$postfix$"426"$ext -p 8 -S ABO.sam
prefix="DFE_ACOSRB_1_"
postfix="_HFNCCDRX2.UDI"
ext="_noribo_clean.fastq.gz"
bowtie2 -x ../ATCC_25922_bowtie2_index/ATCC_25922 -1 $prefix"1"$postfix$"427"$ext -2 $prefix"2"$postfix$"427"$ext -p 8 -S ACO.sam
prefix="DFE_ADOSRB_1_"
postfix="_HFNCCDRX2.UDI"
ext="_noribo_clean.fastq.gz"
bowtie2 -x ../ATCC_25922_bowtie2_index/ATCC_25922 -1 $prefix"1"$postfix$"468"$ext -2 $prefix"2"$postfix$"468"$ext -p 8 -S ADO.sam
prefix="DFE_AEOSRB_1_"
postfix="_HFNCCDRX2.UDI"
ext="_noribo_clean.fastq.gz"
bowtie2 -x ../ATCC_25922_bowtie2_index/ATCC_25922 -1 $prefix"1"$postfix$"429"$ext -2 $prefix"2"$postfix$"429"$ext -p 8 -S AEO.sam
prefix="DFE_AFOSRB_1_"
postfix="_HFNCCDRX2.UDI"
ext="_noribo_clean.fastq.gz"
bowtie2 -x ../ATCC_25922_bowtie2_index/ATCC_25922 -1 $prefix"1"$postfix$"430"$ext -2 $prefix"2"$postfix$"430"$ext -p 8 -S AFO.sam
Okay, now we have .SAM files for each of the samples where the reads have been mapped to the reference genome. Now we need to convert these to binary (BAM) files using SAM tools. We can do that as follows:
cd Fastq
samtools view -S -b -@ 8 AAO.sam > AAO.bam
samtools view -S -b -@ 8 ABO.sam > ABO.bam
samtools view -S -b -@ 8 ACO.sam > ACO.bam
samtools view -S -b -@ 8 ADO.sam > ADO.bam
samtools view -S -b -@ 8 AEO.sam > AEO.bam
samtools view -S -b -@ 8 AFO.sam > AFO.bam
Now that we have the BAM files, we need to sort them by chromosome position so they are compatible with genome viewers. We can do that as described below:
cd Fastq
samtools sort AAO.bam -o AAO.sorted.bam -@ 8
samtools sort ABO.bam -o ABO.sorted.bam -@ 8
samtools sort ACO.bam -o ACO.sorted.bam -@ 8
samtools sort ADO.bam -o ADO.sorted.bam -@ 8
samtools sort AEO.bam -o AEO.sorted.bam -@ 8
samtools sort AFO.bam -o AFO.sorted.bam -@ 8
Now that they are sorted, we need to make indexes so that we can extract alignments overlapping particular genomic regions. Also, this will be required to view these files in genome browsers like IGV.
cd Fastq
samtools index -@ 8 AAO.sorted.bam
samtools index -@ 8 ABO.sorted.bam
samtools index -@ 8 ACO.sorted.bam
samtools index -@ 8 ADO.sorted.bam
samtools index -@ 8 AEO.sorted.bam
samtools index -@ 8 AFO.sorted.bam
Now we need to actually extract the genes and the counts from each sample. We can do this using featureCounts from subread
Is this data stranded? I asked Andy and it seems like it is probably not.
Note that this data is paired end so the -p parameter is included.
cd Fastq
featureCounts -a ../GCF_000401755.1_Escherichia_coli_ATCC_25922_genomic.gtf -o rna_seq.counts *.sorted.bam -T 8 -p -t gene
Note that the first time I tired this command I didn’t specify -t gene, and that meant that it defaulted to looking for exons in my gtf file. Only 72 features were labeled as exons, looking at the file showed that the features tended to be labeled as gene, hence the -t gene flag. This resulted in ~5000 features and > 90% successfully assigned alignments. See RNA_sea.counts.summary for more information.
I looked at IGV to see the read pileups around the colibactin island. Unfortunately, it looks like the colibactin island is split across different contigs, so I will plan to reprocess the RNA seq data using a different reference genome.
I decided to rerun the data using the ATCC reference genome. I think this is the better way to go because it seems to be a higher quality genome (generated more recently using a combination of short and long read sequencing technology-ultimately resulting in only 5 contigs for a strain that has 2 x plasmids). This should allow us to look at the genomic features without splitting them across contigs.
#Activating rnaseq environment
conda activate rnaseq
cd ATCC
#Building the index using ATCC genome
bowtie2-build --threads 8 -f Escherichia_coli_ATCC_25922.fasta bowtie_index/ATCC_25922
#building the .sam files
for i in {425,426,427,468,429,420}
do bowtie2 -x bowtie_index/ATCC_25922 -1 ../Fastq*_1_1*$i* -2 ../Fastq*_1_2*$i* -p 8 -S $i.sam;
done
#renaming
mv 425.sam AAO.sam
mv 426.sam ABO.sam
mv 427.sam ACO.sam
mv 468.sam ADO.sam
mv 429.sam AEO.sam
mv 430.sam AFO.sam
#Converting to bam
samtools view -S -b -@ 8 AAO.sam > AAO.bam
samtools view -S -b -@ 8 ABO.sam > ABO.bam
samtools view -S -b -@ 8 ACO.sam > ACO.bam
samtools view -S -b -@ 8 ADO.sam > ADO.bam
samtools view -S -b -@ 8 AEO.sam > AEO.bam
samtools view -S -b -@ 8 AFO.sam > AFO.bam
#Sorting .bam
samtools sort AAO.bam -o AAO.sorted.bam -@ 8
samtools sort ABO.bam -o ABO.sorted.bam -@ 8
samtools sort ACO.bam -o ACO.sorted.bam -@ 8
samtools sort ADO.bam -o ADO.sorted.bam -@ 8
samtools sort AEO.bam -o AEO.sorted.bam -@ 8
samtools sort AFO.bam -o AFO.sorted.bam -@ 8
#Indexing .bam
samtools index -@ 8 AAO.sorted.bam
samtools index -@ 8 ABO.sorted.bam
samtools index -@ 8 ACO.sorted.bam
samtools index -@ 8 ADO.sorted.bam
samtools index -@ 8 AEO.sorted.bam
samtools index -@ 8 AFO.sorted.bam
ATCC only offered a .gbk annotation file, and I couldn’t figure out how to use it with feature counts so I decided to run the genome annotation my self. First I attempted this using PROKKA
#Running prokka
prokka Escherichia_coli_ATCC_25922.fasta
cd PROKKA_08082022
#Converting gff to gtf for feature counts. -F flag to retain all info
gffread *.gff -T -F -o PROKKA_08082022_all.gtf
cd ../
#running featureCounts
featureCounts -a PROKKA_08082022/PROKKA_08082022_all.gtf -o PROKKA_08082022/rna_seq_annotated.counts *.sorted.bam -T 8 -p -t transcript
Looking closer at this data, I found that prokka didn’t annotate the colibactin genes with the correct names. As I read more, I learned that prokka is a pretty good solution for annotation, but it’s primary focus is speed. Since we are only annotating one genome here, I decided to use the NCBI’s official genome annotation pipeline - pgap. From all accounts this offers superior annotation.
To do this I had to make an input.yaml file, and a submol.yaml file that contained information about the files to use in the pipeline and metadata respectively. See the pgap wiki for more information.
cd ATCC/pgap
sudo ~/shared/settings/pgap/pgap.py -n -o annotation input.yaml --ignore-all-errors
# Changing from gff3 to gtf for feature count comparability
gffread annotation/annot.gff -T -F -o annot.gtf
cd ../
#Running featureCounts
featureCounts -a pgap/annot.gtf -o pgap/rna_seq.counts *.sorted.bam -T 8 -p -t transcript
Now we have the final processed data! Let’s analyze using DESEQ2.
library(dplyr)
library(DESeq2)
= readr::read_delim("ATCC/pgap/rna_seq.counts",skip = 1) %>%
cts rename_with(~ gsub(".sorted.bam","", .x)) %>%
::column_to_rownames("Geneid") %>%
tibbleselect(6:length(.)) %>%
as.matrix()
= readxl::read_excel("metadata.xlsx") %>%
md as.matrix()
= DESeqDataSetFromMatrix(countData = cts,
dds colData = md,
design= ~ Condition)
= DESeq(dds)
dds
= results(dds, contrast = c("Condition", "50mM_W","control"))
res
= tibble(gene_id = res@rownames, padj = res$padj, log2FoldChange = res$log2FoldChange) df
First let’s look at a PCA to get a idea of the overall data. This should let us know if the experiment generally worked as we would expect it to.
library(ggplot2)
library(ggprism)
# Input is a matrix of log transformed values
= rlog(dds, blind=T)
rld
plotPCA(rld, intgroup="Condition")+
theme_prism()+
theme(legend.position = "bottom")+
geom_label(aes(label = rld@colData$Sample_ID))
Figure 1. Here we can see that the data clusters as we would expect. The 50mM_W replicates cluster close together as do the control replicates. Notably the 50mM W treated group is more variable than the control group, especially across PC2.
Now let’s look at a volcano plot to see how many differentially abundant genes there are.
library(ggplot2)
library(ggprism)
= df %>%
p1 ggplot(aes(log2FoldChange,-log10(padj)))+
geom_point()+
geom_hline(yintercept = -log10(0.05),linetype = "dashed",color = "red")+
geom_vline(xintercept = 2,linetype = "dashed", color = "red")+
geom_vline(xintercept = -2,linetype = "dashed", color = "red")+
theme_prism()+
xlab("Log2FC(50mM_W/Control)")+
ggtitle("Volcano plot W/Control")
p1
Figure 2. Here we can see that there are many genes that show a significant difference between the two conditions and have a log2FC >= =1.
What are these genes though? Let’s look at a table of them sorted by those most up-regulated in response to the addition of tryptophan.
library(stringr)
= df %>%
sig filter(padj <= 0.005) %>%
filter(log2FoldChange >= 2 | log2FoldChange <= -2)
library(rtracklayer)
= import("ATCC/pgap/annot.gtf") %>%
gtf as.data.frame() %>%
filter(type == "CDS")
= inner_join(gtf,sig,by = c("transcript_id" = "gene_id")) %>%
up select(gene_id,gene,product,go_process,padj,log2FoldChange) %>%
arrange(-log2FoldChange)
::datatable(up) DT
Table 1 Here we can see the list sorted by genes most up-regulated when 50mM tryptophan was added. We can see that tryptophanase is the third most up-regulated gene- unsurprisingly when you have more tryptophan present E.coli needs to break it down more. This suggests that the experiment and RNA seq were technical successes.
Interestingly, genes in the cus operon are most up-regulated. This is particularly interesting because copper is required for colibactin to form cross-links with DNA. Perhaps there could be some connection there?
Copper could be a co-factor for clbP activity. If this was the case, than one would expect that an increase of copper efflux in the periplasm would reduce the amount of copper available for clbP activity, thus leading the the observed results of reduced activity via the activity based probe.
Alternatively, perhaps the Cus copper efflux system could be exporting pre-colibactin/colibactin out of the periplasm (including the activity based probe). From the literature I have read it seems like the Cus system is very specific for Cu and Ag, but the way that colibactin is exported from the periplasm to extra-cellular space has not been described yet.
The logic behind why the addition of tryptophan would be increasing the need for copper efflux is not entirely clear to me. It has been reported that under limiting amino acid conditions copper efflux is increased in order to to protect Iron-Sulfur cluster enzymes here. They report that this is due to the absence of methionine which is a key intracellular Cu(I) chelator. Methionine’s absence contributed to the elevation of Cu(I) under a reduced environment of anaerobic growth. I am wondering if it would be possible that increased tryptophan concentrations led to a corresponding increase in methionine uptake and the correlated reduction in intracellular concentrations ultimately leading to copper toxicity and the need to increase efflux.
I think to see whether the cus operon is involved in the effects of tryptophan on clbP activity we need to knock out cusA and see if we still get a reduction in clbP activity upon adding tryptophan.
Now lets see what genes were more up-regulated in the controls (down-regulated upon tryptophan )
= up%>%
down arrange(log2FoldChange)
::datatable(down) DT
Table 2. Notably, the tryptophan operon is significantly down-regulated in response to W, unsurprisingly when tryptophan is added in access the cells don’t need to make it as much(even though it isn’t the most down regulated i.e. trpD log2FC of -4 - 19th most down). Once again, this shows that the experiment was a technical success.
What about the genes directly related to colibactin production? We didn’t see any of those on the lists. Since the activity based probe should be measuring clbP activity I would expect clbP gene expression to be downregulated if the effect of tryptophan is at the transcriptional level.
To see the differences in gene expression, let’s take a look at the read pileups across the entire island.
library("Gviz")
options(ucscChromosomeNames=FALSE)
= GeneRegionTrack("ATCC/pgap/annot.gtf", chromosome="1dad9b1ccf124c63_1",genome = "ATCC/pgap/Escherichia_coli_ATCC_25922.fasta",gene = "gene_name",geneSymbol = TRUE,symbol = "gene_name",name = "Features")
gm
= GenomeAxisTrack(showTitle=TRUE,name = "Colibactin pks island")
gtrack
= gm[feature(gm) %in% c("CDS")]
gm
= "ATCC/AAO.sorted.bam"
AAO = "ATCC/ABO.sorted.bam"
ABO = "ATCC/ACO.sorted.bam"
ACO = "ATCC/ADO.sorted.bam"
ADO = "ATCC/AEO.sorted.bam"
AEO = "ATCC/AAO.sorted.bam"
AFO
= DataTrack(range = AAO,
AAO_track type = "heatmap",
name = "W_AAO",
window = -1,
chromosome = "1dad9b1ccf124c63_1")
= DataTrack(range = ABO,
ABO_track type = "heatmap",
name = "W_ABO",
window = -1,
chromosome = "1dad9b1ccf124c63_1")
= DataTrack(range = ACO,
ACO_track type = "heatmap",
name = "W_ACO",
window = -1,
chromosome = "1dad9b1ccf124c63_1")
= DataTrack(range = ADO,
ADO_tracktype = "heatmap",
name = "Ctrl_ADO",
window = -1,
chromosome = "1dad9b1ccf124c63_1")
= DataTrack(range = AEO,
AEO_track type = "heatmap",
name = "Ctrl_AEO",
window = -1,
chromosome = "1dad9b1ccf124c63_1")
= DataTrack(range = AFO,
AFO_track type = "heatmap",
name = "Ctrl_AFO",
window = -1,
chromosome = "1dad9b1ccf124c63_1")
plotTracks(list(gtrack,gm,AAO_track,ABO_track,ACO_track,ADO_track,AEO_track,AFO_track),from=1980000, to=2035000,transcriptAnnotation="symbol",ylim =c(0,5000),just.group = "above",title.width = 2,main = "Colibactin pks island")
Figure 3. Here we can see that there are no notable differences in the colibactin island expression levels between the tryptophan treated and control cultures. Interestingly, clbR seems to be the most expressed gene in the island. ClbR is the key transcriptional activator of colibactin. Of note, AEO seems to have differences in expression compared to the other two replicates, clbR is less expressed and clbB is more expressed in that case. Looking back at the PCA in figure 1 we can see that AEO seemed to cluster farther away from the other two replicates. This highlights how there is more variability in the t
Let’s make a function to make a plot like the one above easily and then use it to look at the Cus operon in a similar way.
= function(gtf_file,start_gene,end_gene,sorted.bam_files,scale_height,title){
plot_genome_features #Setting options
options(ucscChromosomeNames=FALSE)
#importing gtf file to extract coordinates from
= rtracklayer::import(gtf_file) %>%
gtf_data as.data.frame() %>%
filter(type == "CDS")
# determining start and end coords
= gtf_data %>%
df filter(gene_name %in% c(start_gene,end_gene))
= min(df$start)
start = max(df$end)
end #What chromosome?
= unique(as.vector(df$seqnames))
chromosome #Genome annotation
= GeneRegionTrack(gtf_file,
ga chromosome=chromosome,
gene = "gene_name",
geneSymbol = TRUE,
ymbol = "gene_name",
name = "Features",
labelPos="above")
#gene axis
= GenomeAxisTrack(name = title)
gtrack
#bamlist
= purrr::map(sorted.bam_files, .f = function(.x){unlist(DataTrack(range = .x,
bam_list type = "heatmap",
name = gsub(".*/","",.x) %>% gsub(".sorted.bam","",.),
window = -1,
chromosome = "1dad9b1ccf124c63_1"))})
= list(gtrack,ga,unlist(bam_list))
tracks
plotTracks(c(gtrack,ga,bam_list),from=start, to=end,transcriptAnnotation="symbol",ylim =c(0,scale_height),just.group = "above",main = title)
}
plot_genome_features("ATCC/pgap/annot.gtf",
start_gene = "cusA",
end_gene = "cusS",
sorted.bam_files = c("ATCC/AAO.sorted.bam",
"ATCC/ABO.sorted.bam",
"ATCC/ACO.sorted.bam",
"ATCC/ADO.sorted.bam",
"ATCC/AEO.sorted.bam",
"ATCC/AFO.sorted.bam"),
scale_height = 3000,
title = "Cus operon")
Figure 4. Here we can see that the cus operon is up-regulated after addition of tryptophan. Appears to be across all genes except for cusS which is a sensor histidine kinase that functions as a copper/silver ion sensor. It then activates CusR by phosphorylation. It is interesting that clbS is not up-regulated here but other parts of the operon are.
Now lets look at functional profiles by looking at gene functions that were up regulated in response to tryptophan.
= df %>%
up filter(padj <= 0.005) %>%
filter(log2FoldChange > 0)
library(rtracklayer)
= import("ATCC/pgap/annot.gtf") %>%
gtf as.data.frame() %>%
filter(type == "CDS")
= inner_join(gtf,up,by = c("transcript_id" = "gene_id"))
diff
library(clusterProfiler)
library("org.EcK12.eg.db")
= diff$gene_name
gene
= bitr(gene, fromType = "SYMBOL",
gene.df toType = "ENTREZID",
OrgDb = org.EcK12.eg.db)
= enrichGO(gene = gene.df$ENTREZID,
ego OrgDb = org.EcK12.eg.db,
keyType = 'ENTREZID',
ont = "ALL",
pAdjustMethod = "fdr",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
library(enrichplot)
dotplot(ego)+
ggtitle("Upregulated in Response to Tryptophan")
Figure 5. Here we can see that ion transport is most up-regulated function. This is interesting because it seems related to the copper efflux up-regulation. Once again, why tryptophan would be increasing expression of genes with these functions is not immediately clear to me.
Now let’s do a network analysis. This will allow us to visualize all of the genes
= setReadable(ego, 'org.EcK12.eg.db', 'ENTREZID')
edox
= cnetplot(edox)+
p1 ggtitle("Upregulated in Response to Tryptophan")+
theme(plot.title = element_text(hjust = 0.5))
p1
Figure 6. Here we can see the network of up-regulated genes by function. There are two discrete clusters, arginine bio-synthetic process and then ion transport cluster (composed of several sub-clusters).
Now let’s look at the genes that were most downregulated in tryptophan treated cells relative to control.
= df %>%
down filter(padj <= 0.005) %>%
filter(log2FoldChange < 0)
library(rtracklayer)
= import("ATCC/pgap/annot.gtf") %>%
gtf as.data.frame() %>%
filter(type == "CDS")
= inner_join(gtf,down,by = c("transcript_id" = "gene_id"))
diff
library(clusterProfiler)
library("org.EcK12.eg.db")
= diff$gene_name
gene
= bitr(gene, fromType = "SYMBOL",
gene.df toType = "ENTREZID",
OrgDb = org.EcK12.eg.db)
= enrichGO(gene = gene.df$ENTREZID,
ego OrgDb = org.EcK12.eg.db,
keyType = 'ENTREZID',
ont = "BP",
pAdjustMethod = "fdr",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
library(enrichplot)
dotplot(ego)+
ggtitle("Upregulated in Control")
Figure 7. Here we can see that the mot changed thing is cellular amino acids. This makes sense as the addition of tryptophan is bound to cause the cell to need to change its cellular amino acid pools. Interestingly there are also many more protein secretion systems that seem to be up-regulated.
Now let’s look at a network.
= setReadable(ego, 'org.EcK12.eg.db', 'ENTREZID')
edox
= cnetplot(edox)+
p1 ggtitle("Upregulated in Control")+
theme(plot.title = element_text(hjust = 0.5))
p1
Figure 8. Here we can see the genes. There ware two main clusters, a cluster having to do with aromatic amino acid biosynthesis, and another having to do with protein secretion.