3 minute read

Introduction

Gene function annotation, in simple terms, is the comparison of protein sequences extracted from the genome based on existing databases to obtain the corresponding information.

  • The Gene Ontology GO database was created to resolve the confusion of the definition of the same gene in different databases and the confusion of the functional definition of the same gene in different species.
  • GO is an internationally standardized gene function classification system, which provides a set of dynamically updated controlled vocabulary to comprehensively describe the attributes of genes and gene products in organisms.
  • GO has three ontologies, including, Molecular Function (MF), Cellular Component (CC) and Biological Process (BP), which describe the molecular function and cellular location of the gene product. molecular function, cellular location, and biological process involved.

Previously we obtained protein coding sequences through the genome Click, and now we want to construct a single-copy genome phylogenetic tree to analyze the evolutionary relationships among bacteria.

Begin

Obtain the protein coding sequences

  • First, we can download the bacterial protein coding sequences from any public databases.
  • Secondly, genome annotation was performed to obtain protein coding sequence

GO annotation

  1. Download NCBI GenBank Nr database.
  2. Reads alignment using blast or diamond.
  3. Nr and GO data association files.
  4. Corresponding GO terms to alignment results.

NCBI GenBank Nr database

Reads alignment using blast or diamond

diamond blastx --db diamond_makedb.dmnd --out result_fmt0.blastx --outfmt 0 --query query.fasta --threads 16 --evalue 1e-5 --max-target-seqs 1 --threads 12 --salltitles

Nr and GO data association files

In fact, many databases are already associated with GO. The associations and mappings between the gene ids in these databases are included in a file called idmapping.tb. Download

Corresponding GO terms to alignment results

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import os
from datetime import datetime

def usage():
    print("------------------------------------------------------------------------------------------")
    print('Description: 将blast结果与GO信息关联。\n')
    print('Usage: python3 go_ann_setp_1.py [diamond_blast.txt] [diamond_blast.add_go.txt] [idmapping.tb]\n')
    print('Usage: python3 go_ann_setp_1.py [INPUT.TXT] [OUTP.TXT] [idmapping.tb]\n')
    print(" >>>> mqy <<<<")

def main():
    input = sys.argv[1]
    output = sys.argv[2]
    go_db = sys.argv[3]
    
    #唯一蛋白id
    protein = {}
    
    xls = open(input, 'r')
    xls.readline()
    for line in xls:
            #  NR id 位于第 6 列(即“Hit_name”)
            id = line.split('\t')[5]
            if id not in protein:
                    protein[id] = 1
    
    xls.close()
    print(len(protein))
    
    #选择go
    tmp = open('go_tmp.xls', 'w')
    go = open(go_db, 'r')
    for line in go:
            line = line.strip().split('\t')
            #  NR id 位于第 4 列(即“RefSeq”)
            #  GO将用;分割
            nr = line[3].split('; ')
            for id in nr:
                    #  GO:0046782
                    #  len(line) > 8 列数大于8
                    if id in protein and len(line) > 8:
                            #  GO id 位于第 8 列
                            print(f'{id}\t{line[7]}', file = tmp)
                            #  RefSeq \t GO id
    
    tmp.close()
    
    #合并
    go = {}
    tmp = open('go_tmp.xls', 'r')
    for line in tmp:
            line = line.strip().split('\t')
            if len(line) == 2:
                    go[line[0]] = line[1]
            elif len(line) == 1:
                    go[line[0]] = ''
    
    tmp.close()
    
    xls2 = open(output, 'w')
    xls = open(input, 'r')
    print(f'{xls.readline().strip()}\tGO', file = xls2)
    
    for line in xls:
            line = line.strip()
            id = line.split('\t')[5]
            if id in go:
                    print(f'{line.strip()}\t{go[id]}', file = xls2)
    
    xls.close()
    xls2.close()
    os.system('rm go_tmp.xls')

    now = datetime.now()
    print('Done!\n')
    print('Time: '+ now.strftime('%Y-%m-%d %H:%M:%S')+'\n')
    print('Output file: '+sys.argv[2]+'\n')
    print('\t>>> mqy <<<<\n')
    print('\t如果有任何问题请及时联系\n')
    print('\t邮箱:<15877464851@163.com>\n')

try:
    main()
except IndexError:
    usage()
  • run step 1
python3 go_ann_setp_1.py diamond_blast.txt diamond_blast.add_go.txt idmapping_selected.tb
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import os
from datetime import datetime

def usage():
    print("------------------------------------------------------------------------------------------")
    print('Description: 将关联后的结果简化,并添加GO_trem。\n')
    print('Usage: python3 go_ann_setp_2.py [diamond_blast.add_go.txt] [go_anno.txt] [go.obo]\n')
    print('Usage: python3 go_ann_setp_2.py [INPUT.TXT] [OUTP.TXT] [go.obo]\n')
    print(" >>>> mqy <<<<")

def main():
    input = sys.argv[1]
    output = sys.argv[2]
    go_db = sys.argv[3]
    
    #
    go = {}
    go_db = open(go_db, 'r')
    for line in go_db:
            line = line.strip()
            if line[0:4] == 'id: ':
                    id = line.split('id: ')[1]
            if line[0:6] == 'name: ':
                    name = line.split('name: ')[1]
            if line[0:11] == 'namespace: ':
                    namespace = line.split('namespace: ')[1]
                    go[id] = [name, namespace]
    
    go_db.close()
    
    #
    output = open(output, 'w')
    print('Query_name\tGO_id\tGO_name\tGO_ontology', file = output)
    
    input = open(input, 'r')
    input.readline()
    for line in input:
            line = line.strip().split('\t')
            if len(line) == 14:
                    #  line[13]是GO号
                    for id in line[13].split('; '):
                            if id in go:
                                    print(f'{line[0]}\t{id}\t{go[id][0]}\t{go[id][1]}', file = output)
    
    input.close()
    output.close()

    now = datetime.now()
    print('Done!\n')
    print('Time: '+ now.strftime('%Y-%m-%d %H:%M:%S')+'\n')
    print('Output file: '+sys.argv[2]+'\n')
    print('\t>>> mqy <<<<\n')
    print('\t如果有任何问题请及时联系\n')
    print('\t邮箱:<15877464851@163.com>\n')

try:
    main()
except IndexError:
    usage()
  • run step 2
python3 go_ann_setp_2.py diamond_blast.add_go.txt go_anno.txt go.obo

Quote

Email me with more questions! 584338215@qq.com