First, we need to generate a BIOM file for each of the genes. This can be done in Qiita by setting up a Woltka processing workflow.

Next, we need to download the resulting BIOM file from our analysis. We can do that by clicking the Add All button and downloading it within Qiita, as shown below.

This .biom file contains a frequency table of gene counts labeled with the convention genome#_ORF# (e.g., G000123456_789, meaning the 789th ORF on genome G000123456). In this example, it can be found within the qiita_output directory.

Our goal is to take this .biom file and turn it into a database of FASTA protein sequences. To do this, we need to map each genome#_ORF# ID to its corresponding UniRef accession ID.
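To make the naming convention concrete, here is a small sketch in Python (the `split_feature_id` helper is hypothetical, purely for illustration) of how one of these feature IDs decomposes:

```python
# Hypothetical helper illustrating the genome#_ORF# naming convention
# used by the per-gene Woltka table (not part of the pipeline itself).
def split_feature_id(feature_id):
    """Split an ID like 'G000123456_789' into (genome_id, orf_number)."""
    genome_id, orf = feature_id.rsplit("_", 1)
    return genome_id, int(orf)

print(split_feature_id("G000123456_789"))  # ('G000123456', 789)
```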

If it hasn't been done already, let's make sure we have a conda environment set up with the packages needed for this analysis. We can do that as follows:

conda create -n woltka python=3
conda activate woltka
conda install -c conda-forge biom-format
conda install -c bioconda woltka

Next, we can use woltka to map each genome_ORF ID to its corresponding UniRef ID. To do this, we need to download the mapping files associated with WoL. We can do that here.

Once these files are downloaded and placed in the same directory as our .biom file, we can use the following command to conduct the mapping.

woltka tools collapse -i qiita_output/BIOM/153675/53069_analysis_Metagenomic_Woltkav011DatabaseoptgenomeqiitadbsqpwoltkawolWoLr1BIOMpergenebiom.biom -m wolr1_maps/function/uniref/uniref.map.xz -n wolr1_maps/function/uniref/uniref.name.xz -o protein_uniref.biom

This will output another binary .biom file. To convert it to a tab-separated .txt file, we can run the following command:

biom convert -i protein_uniref.biom -o protein_uniref.txt --to-tsv
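For reference, `biom convert --to-tsv` writes a leading comment line followed by a header row whose first column is `#OTU ID`; that leading comment line is why we skip one line when reading the file into R in the next step. A minimal Python sketch with placeholder data (the UniRef accessions and sample names here are made up) illustrating that layout:

```python
import csv
import io

# Tiny stand-in for the TSV that `biom convert --to-tsv` produces:
# one comment line, then a header row starting with '#OTU ID'.
example = """# Constructed from biom file
#OTU ID\tSample1\tSample2
UniRef90_A0A024R161\t10.0\t0.0
UniRef90_P12345\t3.0\t7.0
"""

with io.StringIO(example) as handle:
    handle.readline()  # skip the leading comment line
    rows = list(csv.reader(handle, delimiter="\t"))

header, records = rows[0], rows[1:]
print(header[0])     # '#OTU ID'
print(len(records))  # 2
```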

Now that we have this .txt file mapped to UniRef protein IDs, we can read it into R.

library(readr)

d1 = read_delim("protein_uniref.txt", skip = 1)

Using the GLabR package, we can now use the annotate_proteins() function to retrieve the protein sequence for each of the UniRef protein IDs by querying the UniRef API.

Note that this pulls ~350,000 sequences, so it takes a while to run.

library(GLabR)
pl = annotate_proteins(d1$`#OTU ID`,columns ="accession,id,protein_name,sequence")

write_delim(pl,"pl.txt")

This data is returned as a data frame, which is not the format we need. We can convert it into a .fasta file using the following code.

library(readr)
library(dplyr)

df = read_delim("pl.txt") %>% 
  filter(!is.na(Entry), !is.na(Sequence)) %>% 
  select(seq.name = Entry, seq.text = Sequence) %>% 
  as.data.frame()

library(phylotools)

dat2fasta(df,outfile = "out.fasta")
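dat2fasta() handles the conversion for us, but the FASTA format itself is simple: a `>` header line followed by the sequence. A minimal Python sketch of the same conversion, using placeholder records rather than our real data:

```python
# Minimal sketch of the data-frame-to-FASTA conversion that
# phylotools::dat2fasta performs; records here are placeholders.
records = [
    ("UniRef90_A0A024R161", "MKTAYIAKQR"),
    ("UniRef90_P12345", "GAVLILMFWP"),
]

with open("out.fasta", "w") as fh:
    for name, seq in records:
        fh.write(f">{name}\n{seq}\n")

print(open("out.fasta").read().splitlines()[0])  # '>UniRef90_A0A024R161'
```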

The code detailed above should be sufficient for making the necessary .fasta files!