Biopython Tutorial

Biopython Entrez Database

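Entrez is the online search system provided by NCBI. Through an integrated global query that supports Boolean operators and field search, it gives access to nearly all known molecular biology databases. It returns results from every database, together with information such as the number of hits per database and records that link back to their source database.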

Some of the popular databases that can be accessed through Entrez are listed below:

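- PubMed
- PubMed Central
- Nucleotide (GenBank sequence database)
- Protein (sequence database)
- Genome (whole genome database)
- Structure (three-dimensional macromolecular structures)
- Taxonomy (organisms in GenBank)
- SNP (single nucleotide polymorphisms)
- UniGene (gene-oriented clusters of transcript sequences)
- CDD (Conserved Protein Domain Database)
- 3D Domains (domains from Entrez Structure)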

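Besides the databases above, Entrez offers many more databases that support field search. Biopython provides a dedicated module, Bio.Entrez, for accessing them. The sections below show how to use Biopython to access Entrez: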

1. Connecting to the Database

To use the Entrez functionality, import the following module:

# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-25
>>> from Bio import Entrez

Next, set your email address so that NCBI can identify who is making the requests, as shown below:

>>> Entrez.email = '<youremail>'

Then set the Entrez tool parameter, which defaults to Biopython.

>>> Entrez.tool = 'Demoscript'
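
The email and tool values identify your script to NCBI on each request. As a rough illustration only, the sketch below builds the kind of URL that such a request might use, assuming the standard E-utilities base URL and its `tool` and `email` query parameters. The helper name `build_eutils_url` and the example address are hypothetical; Bio.Entrez constructs its own URLs internally, and this is not its actual code.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities (an assumption made for this illustration).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, email, tool="biopython", **params):
    """Return an E-utilities URL that carries the tool and email identifiers.

    Hypothetical helper: it only assembles a string and sends nothing.
    """
    query = {"tool": tool, "email": email}
    query.update(params)  # e.g. db="pubmed", term="genome"
    return EUTILS_BASE + utility + ".fcgi?" + urlencode(query)


# Placeholder address; use your own when contacting NCBI.
url = build_eutils_url("einfo", email="user@example.org", tool="Demoscript")
print(url)
```

Because nothing is sent over the network, the sketch can be run offline to see how the identifying parameters travel with a query.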

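Now call the einfo function, which reports the index term counts, last update, and available links for each database. Called without arguments, as below, it returns the list of available database names: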

>>> info = Entrez.einfo()

The einfo function returns a handle whose read method gives access to the raw response, as shown below:

>>> data = info.read()
>>> print(data)
<?xml version = "1.0" encoding = "UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN"
   "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
   <DbList>
      <DbName>pubmed</DbName>
      <DbName>protein</DbName>
      <DbName>nuccore</DbName>
      <DbName>ipg</DbName>
      <DbName>nucleotide</DbName>
      <DbName>nucgss</DbName>
      <DbName>nucest</DbName>
      <DbName>structure</DbName>
      <DbName>sparcle</DbName>
      <DbName>genome</DbName>
      <DbName>annotinfo</DbName>
      <DbName>assembly</DbName>
      <DbName>bioproject</DbName>
      <DbName>biosample</DbName>
      <DbName>blastdbinfo</DbName>
      <DbName>books</DbName>
      <DbName>cdd</DbName>
      <DbName>clinvar</DbName>
      <DbName>clone</DbName>
      <DbName>gap</DbName>
      <DbName>gapplus</DbName>
      <DbName>grasp</DbName>
      <DbName>dbvar</DbName>
      <DbName>gene</DbName>
      <DbName>gds</DbName>
      <DbName>geoprofiles</DbName>
      <DbName>homologene</DbName>
      <DbName>medgen</DbName>
      <DbName>mesh</DbName>
      <DbName>ncbisearch</DbName>
      <DbName>nlmcatalog</DbName>
      <DbName>omim</DbName>
      <DbName>orgtrack</DbName>
      <DbName>pmc</DbName>
      <DbName>popset</DbName>
      <DbName>probe</DbName>
      <DbName>proteinclusters</DbName>
      <DbName>pcassay</DbName>
      <DbName>biosystems</DbName>
      <DbName>pccompound</DbName>
      <DbName>pcsubstance</DbName>
      <DbName>pubmedhealth</DbName>
      <DbName>seqannot</DbName>
      <DbName>snp</DbName>
      <DbName>sra</DbName>
      <DbName>taxonomy</DbName>
      <DbName>biocollections</DbName>
      <DbName>unigene</DbName>
      <DbName>gencoll</DbName>
      <DbName>gtr</DbName>
   </DbList>
</eInfoResult>

The data is in XML format. To get it as a Python object instead, pass the handle to Entrez.read right after calling Entrez.einfo():

>>> info = Entrez.einfo()
>>> record = Entrez.read(info)

Here, record is a dictionary with a single key, DbList, as shown below:

>>> record.keys()
[u'DbList']

Accessing the DbList key returns the list of database names, as shown below:

>>> record[u'DbList']
['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss',
   'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly',
   'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar',
   'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles',
   'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim',
   'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay',
   'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot',
   'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']
>>>
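
The per-database details (index term counts, last update, links) come back when einfo is called with a db argument, for example `Entrez.einfo(db="pubmed")`. The sketch below summarizes such a parsed record. It assumes the result has a `DbInfo` entry with `DbName`, `Count`, `LastUpdate`, `FieldList`, and `LinkList` keys. The helper `describe_db` is hypothetical, and the sample record is hand-written with made-up values so that the code runs offline.

```python
def describe_db(record):
    """Summarize a parsed einfo(db=...) record (hypothetical helper)."""
    info = record["DbInfo"]
    return {
        "name": info["DbName"],
        "records": int(info["Count"]),  # counts arrive as strings
        "last_update": info["LastUpdate"],
        "fields": [field["Name"] for field in info["FieldList"]],
        "links": [link["Name"] for link in info["LinkList"]],
    }


# Hand-written stand-in for Entrez.read(Entrez.einfo(db="pubmed")).
# All values below are placeholders, not real NCBI data.
sample_record = {
    "DbInfo": {
        "DbName": "pubmed",
        "Count": "100",
        "LastUpdate": "2020/01/01 00:00",
        "FieldList": [{"Name": "ALL"}, {"Name": "TITL"}],
        "LinkList": [{"Name": "pubmed_protein"}],
    }
}

summary = describe_db(sample_record)
print(summary["name"], summary["records"], summary["fields"])
```

With a live connection, `sample_record` would be replaced by the object returned from `Entrez.read`.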

In essence, the Entrez module parses the XML returned by the Entrez search system and exposes it as Python dictionaries and lists.
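
To make the idea of turning XML into dictionaries and lists concrete, here is a minimal standard-library sketch that extracts the database names from a shortened copy of the einfo response. It is not how Bio.Entrez is implemented (its parser is DTD-driven and far more general); the function name `parse_db_list` is hypothetical, and the XML declaration and DOCTYPE are omitted from the sample for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened sample of the einfo XML (only three databases kept).
SAMPLE_XML = """<eInfoResult>
   <DbList>
      <DbName>pubmed</DbName>
      <DbName>protein</DbName>
      <DbName>nuccore</DbName>
   </DbList>
</eInfoResult>"""


def parse_db_list(xml_text):
    """Return {'DbList': [...]} built from an eInfoResult document."""
    root = ET.fromstring(xml_text)
    # Every <DbName> under <DbList> becomes one list entry.
    names = [node.text for node in root.findall("./DbList/DbName")]
    return {"DbList": names}


record = parse_db_list(SAMPLE_XML)
print(record["DbList"])
```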

2. Searching the Database

To search any single Entrez database, use the Bio.Entrez.esearch() function. It is used as follows:

>>> info = Entrez.esearch(db = "pubmed",term = "genome")
>>> record = Entrez.read(info)
>>> print(record)
DictElement({u'Count': '1146113', u'RetMax': '20', u'IdList':
['30347444', '30347404', '30347317', '30347292',
'30347286', '30347249', '30347194', '30347187',
'30347172', '30347088', '30347075', '30346992',
'30346990', '30346982', '30346980', '30346969',
'30346962', '30346954', '30346941', '30346939'],
u'TranslationStack': [DictElement({u'Count':
'927819', u'Field': 'MeSH Terms', u'Term': '"genome"[MeSH Terms]',
u'Explode': 'Y'}, attributes = {})
, DictElement({u'Count': '422712', u'Field':
'All Fields', u'Term': '"genome"[All Fields]', u'Explode': 'N'}, attributes = {}),
'OR', 'GROUP'], u'TranslationSet': [DictElement({u'To': '"genome"[MeSH Terms]
or "genome"[All Fields]', u'From': 'genome'}, attributes = {})], u'RetStart': '0',
u'QueryTranslation': '"genome"[MeSH Terms] or "genome"[All Fields]'},
attributes = {})
>>>
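
The QueryTranslation value in the output shows how Entrez expands a plain keyword into field-tagged terms joined by a Boolean operator. The sketch below builds such terms by hand and pulls the hit count and ID list out of a search record. The helpers `tag`, `combine`, and `summarize_search` are hypothetical names, and the sample record is a trimmed copy of the output above rather than a live result.

```python
def tag(term, field):
    """Quote a term and attach an Entrez field tag, e.g. "genome"[Title]."""
    return '"{}"[{}]'.format(term, field)


def combine(operator, *clauses):
    """Join clauses with a Boolean operator (AND, OR, NOT) in parentheses."""
    return "(" + " {} ".format(operator).join(clauses) + ")"


def summarize_search(record):
    """Return (total hit count, list of returned IDs) from an esearch record."""
    return int(record["Count"]), list(record["IdList"])


# Same shape as the translated query shown in the output above.
broad = combine("OR", tag("genome", "MeSH Terms"), tag("genome", "All Fields"))
# A narrower query: both conditions must hold.
narrow = combine("AND", tag("genome", "Title"), tag("entrez", "All Fields"))

# Trimmed stand-in for Entrez.read(Entrez.esearch(...)).
sample_record = {"Count": "1146113", "RetMax": "20",
                 "IdList": ["30347444", "30347404"]}

print(broad)
print(narrow)
print(summarize_search(sample_record))
```

A string such as `narrow` would be passed as the `term` argument of `Entrez.esearch`. Note that Count is the total number of matches, while IdList holds only the first RetMax IDs.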

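If the search term is not found in the chosen database (for example, because the wrong database was specified), the result looks like this: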

>>> info = Entrez.esearch(db = "blastdbinfo",term = "books")
>>> record = Entrez.read(info)
>>> print(record)
DictElement({u'Count': '0', u'RetMax': '0', u'IdList': [],
u'WarningList': DictElement({u'OutputMessage': ['No items found.'],
   u'PhraseIgnored': [], u'QuotedPhraseNotFound': []}, attributes = {}),
   u'ErrorList': DictElement({u'FieldNotFound': [], u'PhraseNotFound':
      ['books']}, attributes = {}), u'TranslationSet': [], u'RetStart': '0',
      u'QueryTranslation': '(books[All Fields])'}, attributes = {})
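
A script can check for this situation instead of printing the whole record. The following sketch collects the messages from the ErrorList and WarningList entries seen in the output above. The helper `search_problems` is a hypothetical name, and the sample record is a plain-dict copy of that output, so the code runs without a network connection.

```python
def search_problems(record):
    """Collect human-readable problems reported in an esearch record."""
    problems = []
    errors = record.get("ErrorList", {})
    for phrase in errors.get("PhraseNotFound", []):
        problems.append("phrase not found: " + phrase)
    for field in errors.get("FieldNotFound", []):
        problems.append("field not found: " + field)
    warnings = record.get("WarningList", {})
    for message in warnings.get("OutputMessage", []):
        problems.append("warning: " + message)
    return problems


# Plain-dict copy of the empty search result shown above.
empty_record = {
    "Count": "0", "RetMax": "0", "IdList": [],
    "WarningList": {"OutputMessage": ["No items found."],
                    "PhraseIgnored": [], "QuotedPhraseNotFound": []},
    "ErrorList": {"FieldNotFound": [], "PhraseNotFound": ["books"]},
    "TranslationSet": [], "RetStart": "0",
    "QueryTranslation": "(books[All Fields])",
}

for problem in search_problems(empty_record):
    print(problem)
```

Records from successful searches have no ErrorList or WarningList keys in the earlier output, which is why the helper uses `.get` with empty defaults.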

To search across all databases, use Entrez.egquery. It is similar to Entrez.esearch, except that you only specify the keyword and omit the database parameter.

>>> info = Entrez.egquery(term = "entrez")
>>> record = Entrez.read(info)
>>> for row in record["eGQueryResult"]:
...     print(row["DbName"], row["Count"])
...
pubmed 458
pmc 12779
mesh 1
...
...
...
biosample 7
biocollections 0
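
The per-database counts are easier to compare once they are converted to integers and sorted. The sketch below does this for rows shaped like those in record["eGQueryResult"]. The helper `hits_by_database` is a hypothetical name, the rows are copied from the output above, and skipping non-numeric counts is a defensive assumption rather than documented behavior.

```python
def hits_by_database(rows):
    """Return (database, hit count) pairs sorted by descending count."""
    counts = {}
    for row in rows:
        count = row["Count"]
        # Assumption: skip any row whose count is not a plain number.
        if count.isdigit():
            counts[row["DbName"]] = int(count)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Rows copied from the egquery output above.
rows = [
    {"DbName": "pubmed", "Count": "458"},
    {"DbName": "pmc", "Count": "12779"},
    {"DbName": "mesh", "Count": "1"},
    {"DbName": "biosample", "Count": "7"},
    {"DbName": "biocollections", "Count": "0"},
]

for name, count in hits_by_database(rows):
    print(name, count)
```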

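3. Fetching Records

Entrez provides a dedicated function, efetch, for retrieving and downloading the full details of a record. Consider the following simple example: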

>>> handle = Entrez.efetch(
...     db = "nucleotide", id = "EU490707", rettype = "fasta")

The record can now be read with the SeqIO module:

>>> from Bio import SeqIO
>>> record = SeqIO.read(handle, "fasta")
>>> record
SeqRecord(seq = Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA',
SingleLetterAlphabet()), id = 'EU490707.1', name = 'EU490707.1',
description = 'EU490707.1
Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast',
dbxrefs = [])
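
To show what the handle contains when rettype is "fasta", the sketch below reads a single FASTA record from a text handle using only the standard library. It is a simplified illustration, not a replacement for SeqIO. The function `read_single_fasta` is a hypothetical name, and the sample sequence is only the truncated prefix visible in the output above, not the full EU490707 record.

```python
import io


def read_single_fasta(handle):
    """Return (id, description, sequence) for a handle holding one record."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("more than one record found")
            header = line[1:]
        elif header is None:
            raise ValueError("sequence data found before a header line")
        else:
            chunks.append(line)
    if header is None:
        raise ValueError("no record found")
    parts = header.split(None, 1)
    record_id = parts[0] if parts else ""  # first word of the header
    return record_id, header, "".join(chunks)


# Truncated sample: only the sequence prefix printed above, split over two lines.
sample = io.StringIO(
    ">EU490707.1 Selenipedium aequinoctiale maturase K (matK) gene, "
    "partial cds; chloroplast\n"
    "ATTTTTTACGAACCTGTGGAAATTTT\n"
    "TGGTTATGACAATAAATCTAGTTTAGTA\n"
)

record_id, description, sequence = read_single_fasta(sample)
print(record_id)
print(sequence)
```

As in the SeqRecord shown above, the first word of the header line becomes the record ID and the full header becomes the description.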
