Bacterial gene prediction as well as pseudogene prediction!

3 minute read

Introduction

Now, we will genome annotate the bacterial genome, obtain protein coding sequences.

These data can then be used for functional gene annotation, phylogenetic analysis, etc.

Data preparation

Obtain the Genome Sequence

First, we can download the bacterial genome sequence from NCBI or other sources.
Secondly, through high-throughput sequencing data (HTS) technology, we can obtain the sequence of the bacterial genome.

Preparing genome name files

ls Genome/ | cut -d '.' -f 1 > head.txt

Genome prediction workflow

#!/bin/bash/
# **************************************************
# 共生菌基因预测以及去除假基因
# Date   : 2023-09-25
# Author : 孟庆瑶
# **************************************************

#  发生错误停止脚本
set -e

#  获取脚本名称并输出
name=$(basename $0) 
echo 
echo "   脚本名称: $name"

#  设置参数以及处理
OUTPREFIX="Bacterial_results"

while getopts hi:g:o: opt
do
    case "$opt" in 
    h)
        echo
        echo "脚本说明: 基因预测、假基因预测和获得去除假基因的CDs序列，(*￣︶￣)，东西不多，一字一字看！"
        echo
        echo "使用说明: bash $name -i genome_file_name -g genome_fasta -o out_prefix"
        echo -e "详细使用：https://mengqy2022.github.io/genomics/genome-prediction/"
        echo "------------------------------------------------------------------------------------"
        echo -e "\t-i: 输入文本文件、包含基因文件名称、不带后缀、单列文件，一列为一个基因组；"
        echo -e "\t-g: 基因文件夹名称，文件夹中可以包含一个或者多个基因组；"
        echo -e "\t-o: 文件前缀, 默认:去除假基因序列；"
        echo "`date '+Date: %D %T'`"
        echo -e "\n             >>>> mqy <<<< \n "
        echo "     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
        echo "     !!!!   保证所有文件都在一个工作目录下   !!!!"
        echo "     !!!!          -g:为文件夹名称           !!!!"
        echo "     !!!!    -i内容和-g中的文件名称要相同    !!!!"
        echo "     !!!!-g文件夹种存放的物种核苷酸基因组文件!!!!"
        echo "     !!!!，文件名称与-i文件中相同。          !!!!"
        echo "     !!!! 输出结果中包含:                    !!!!"
        echo "     !!!!               基因预测文件;        !!!!"
        echo "     !!!!               假基因预测文件;      !!!!"
        echo "     !!!!               去除假基因序列文件。 !!!!"
        echo "     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
        echo
        exit 0
        ;;
    i) 
        echo "输入-i文件为: $OPTARG"
        FILENAME=$OPTARG
        ;;
    g) 
        echo "输入-g文件为: $OPTARG"
        FOLDERNAME=$OPTARG
        ;;
    o) 
        echo "设置输出文件名前缀为: $OPTARG"
        OUTPREFIX=$OPTARG
        ;;
    *) 
        echo "未知参数: $opt"
        ;;
    esac
done

#####################-z 判断字符串是否为0

if [ -z $FILENAME ]
then
    echo -e "\n请输入-h，查看帮助文档！"
    exit 1
fi

############设置单个物种文件夹
for i in `cat ./$FILENAME` ;do mkdir ./$FOLDERNAME/${i}  &&  mv ./$FOLDERNAME/${i}.* ./$FOLDERNAME/${i} ;done

############Prokka
source /home/mengqingyao/miniconda3/bin/activate prokka

if [ -d "prokka" ]; then
    rm -rf prokka
fi

mkdir prokka && cd prokka

for i in `cat ../$FILENAME`; do prokka \
 ../$FOLDERNAME/${i}/${i}.* --cpus 0 --force --outdir ${i}_prokka --prefix ${i} --addmrna --compliant 200; done 

############pseudofinder

conda activate pseudofinder

if [ -d "../pseudogene" ]; then
    rm -rf ../pseudogene
fi

mkdir ../pseudogene && cd ../pseudogene

for i in `cat ../$FILENAME` ; do /home/mengqingyao/biosoftware/pseudofinder-master/pseudofinder.py \
annotate --genome ../prokka/${i}_prokka/${i}.gb* --outprefix ${i} \
-di --database /data/mengqy/database/diamond_nr/diamond_makedb.dmnd --threads 16 -e 1e-15 ;done

#############获得去除假基因的序

if [ -d "../remove_pseudogene" ]; then
    rm -rf ../remove_pseudogene
fi

mkdir ../remove_pseudogene && cd ../remove_pseudogene

for i in `cat ../$FILENAME` ; do cat ../prokka/${i}_prokka/${i}.ffn | grep ">" | cut -c 2- | cut -d " " -f 1 > ${i}_cds.ids && \
 awk '{if(match($0,"old_locus_tag=")) {print substr($0,RSTART+RLENGTH) }}' ../pseudogene/${i}_pseudos.gff \
 | sed 's/,/\t/g' | awk '{print $1}' | sed '/^$/d' >> ${i}_pseudofene.ids && comm -23 <(sort ${i}_cds.ids) <(sort ${i}_pseudofene.ids) > ${i}_remove.ids && \
 python3 /home/mengqingyao/Script_mqy/get_ids_fasta.py ../prokka/${i}_prokka/${i}.faa ${i}_remove.ids ${i}_remove_ids.faa && \
 python3 /home/mengqingyao/Script_mqy/get_ids_fasta.py ../prokka/${i}_prokka/${i}.ffn ${i}_remove.ids ${i}_remove_ids.ffn;done

if [ -d "../$OUTPREFIX" ]; then
    rm -rf ../$OUTPREFIX
fi

mkdir ../$OUTPREFIX 

for i in `cat ../$FILENAME` ; do mv ${i}_remove_ids.faa ../$OUTPREFIX && mv ${i}_remove_ids.ffn ../$OUTPREFIX; done

#  运行结束
echo "   运行结束！"
echo "   感谢使用本脚本！"
echo "   版本信息: v1.0"
echo "   日期: `date '+%Y-%m-%d %H:%M:%S'`"
echo "   作者: 孟庆瑶"
echo "   如有疑问请联系: <15877464851@163.com>"

If you want to use this script, you need to have the following software installed:

Prokka

Pseudofinder

Since the software is installed in a different way and in a different location, we need to make changes.

A python script in the process. get_ids_fasta.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

script_name = sys.argv[0]

def usage():
    print('Usage: python {} [fasta_file] [namelist_file] [outfile_name]'.format(script_name))


def main():
    outf = open(sys.argv[3], 'w')
    dict = {}
    with open(sys.argv[1], 'r') as fastaf:
        for line in fastaf:
            if line.startswith('>'):
                name = line
                dict[name] = ''
            else:
                dict[name] += line.replace('\n', '')  # 读取整个fasta文件构成字典

    with open(sys.argv[2], 'r') as listf:
        for row in listf:
            row = row.strip()
            for key in dict.keys():  # 选取包含所需名称的基因名和序列
                if row in key:
                    outf.write(key)
                    outf.write(dict[key] + '\n')
    outf.close()


try:
    main()
except IndexError:
    usage()

The script is used to extract the sequences of the specified gene names from the fasta file.

Running the workflow

nohup bash /home/mengqy/ws/No_pseudogene_faa.sh -i head.txt -g Genome/ &

Wait for the completion of the job.

Result

We won’t explain the results of the different software in detail here, but if you’re interested, you can visit Prokka and Pseudofinder to find out.

Quote

Prokka

Pseudofinder

Email me with more questions! 584338215@qq.com

Share on

Twitter Facebook LinkedIn

Mengqingyao

Bacterial gene prediction as well as pseudogene prediction!

Introduction

Data preparation

Obtain the Genome Sequence

Preparing genome name files

Genome prediction workflow

Since the software is installed in a different way and in a different location, we need to make changes.

Running the workflow

Result

Quote

Share on

You May Also Enjoy

Gene prediction in eukaryotes is exemplified by Euplotes!

GenBank database download tool！

Extract Reads With Telomeric Reads!

KEGG annotation information collection and update！