DNA Sequence Data Mining Technique∗
ZHU Yang-Yong1,2+, XIONG Yun1
1(Department of Computer and Information Technology, Fudan University, Shanghai 200433, China)
2(Shanghai Center for Bioinformation Technology, Shanghai 201203, China)
+ Corresponding author: Phn: +86-21-65642831, Fax: +86-21-65642219, E-mail: yunx@fudan.edu.cn, http://www.dmgroup.org.cn
Zhu YY, Xiong Y. DNA sequence data mining technique. Journal of Software, 2007,18(11):2766−2781. http://www.jos.org.cn/1000-9825/18/2766.htm
Abstract: DNA sequences are among the most basic and important kinds of biological data. Studying DNA sequence data in order to understand the essence of life is a necessary task in the post-genomic era. Data mining is currently one of the most effective means of data analysis: it uncovers information hidden in data, and it has become a main data analysis technique in bioinformatics. Its application to DNA sequence analysis has attracted wide attention and developed rapidly, and considerable research results have emerged. This paper provides an overview of research progress in the field of DNA sequence data mining. In more detail, it identifies three research phases: the application of statistics-based data mining methods, the application of general data mining methods, and the design of specialized DNA sequence-oriented data mining methods. It then explains that sequence similarity is the foundation of DNA sequence data mining. Drawing on the biological background, it also analyzes and comments on key techniques in this field, such as DNA sequential pattern, association, clustering, classification, and outlier mining. Finally, it presents future work and open issues, including research on novel storage models and index methods and the design of data mining algorithms based on biological domain knowledge.
Key words: DNA sequence; data mining; bioinformatics; sequential pattern; sequence similarity
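Abstract (Chinese version, translated): DNA sequence data is an important class of biological data. Studying DNA sequence data to interpret its meaning is the main research task of the post-genomic era. Data mining is currently one of the most effective means of data analysis; it is used to discover the various regularities hidden in large amounts of data, and it is also the main data analysis technique adopted in bioinformatics. The application of data mining to DNA sequence data analysis has received wide attention, developed rapidly, and produced many research results. This paper surveys the state and progress of research in DNA sequence data mining and proposes three research phases: the application of statistics-based mining methods, the application of general mining methods, and the design of specialized DNA sequence data mining methods. It explains that the foundation of DNA sequence data mining is sequence similarity, and reviews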
∗ Supported by the National Natural Science Foundation of China under Grant No.60573093; the National High-Tech Research and Development Plan of China (863 Program) under Grant No.2006AA02Z329
Received 2007-01-23; Accepted 2007-04-25
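the key techniques adopted in DNA sequence data mining, including DNA sequential pattern, association, clustering, classification, and outlier mining, and discusses their biological application background and significance. Finally, it presents the key open problems for further research in DNA sequence data mining, including DNA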