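Information Retrieval: Algorithms and Heuristics (English Edition, 2nd Edition) is an excellent textbook for courses on information retrieval. It gives a detailed introduction to the concepts, principles, and algorithms of information retrieval. Its main topics are retrieval strategies, retrieval utilities, cross-language information retrieval, query processing, integrating structured data and text, parallel information retrieval, and distributed information retrieval, and it provides many worked examples that illustrate the algorithms.
The book offers both depth and breadth, and all of its material is presented in terms of current technology. It is an ideal textbook for undergraduate and graduate students in computer science, information management, and related programs, and a useful reference for researchers and practitioners in information retrieval.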
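David A. Grossman holds a Ph.D. from George Mason University and now teaches in the Department of Computer Science at the Illinois Institute of Technology. He previously served as a project manager in a senior technical services center and an office of research and development within the U.S. government. His main research areas include information retrieval, the integration of structured and unstructured data, and data mining.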
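Ophir Frieder is a chair professor in the Department of Computer Science at the Illinois Institute of Technology and director of the school's data retrieval laboratory. He is a member of the ACM and a senior member of the IEEE and of the American Academy of Arts and Sciences. His research covers data retrieval systems, communication systems, and high-performance system architectures.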
1. INTRODUCTION
2. RETRIEVAL STRATEGIES
2.1 Vector Space Model
2.2 Probabilistic Retrieval Strategies
2.3 Language Models
2.4 Inference Networks
2.5 Extended Boolean Retrieval
2.6 Latent Semantic Indexing
2.7 Neural Networks
2.8 Genetic Algorithms
2.9 Fuzzy Set Retrieval
2.10 Summary
2.11 Exercises
3. RETRIEVAL UTILITIES
3.1 Relevance Feedback
3.2 Clustering
3.3 Passage-based Retrieval
3.4 N-grams
3.5 Regression Analysis
3.6 Thesauri
3.7 Semantic Networks
3.8 Parsing
3.9 Summary
3.10 Exercises
4. CROSS-LANGUAGE INFORMATION RETRIEVAL
4.1 Introduction
4.2 Crossing the Language Barrier
4.3 Cross-Language Retrieval Strategies
4.4 Cross Language Utilities
4.5 Summary
4.6 Exercises
5. EFFICIENCY
5.1 Inverted Index
5.2 Query Processing
5.3 Signature Files
5.4 Duplicate Document Detection
5.5 Summary
5.6 Exercises
6. INTEGRATING STRUCTURED DATA AND TEXT
6.1 Review of the Relational Model
6.2 A Historical Progression
6.3 Information Retrieval as a Relational Application
6.4 Semi-Structured Search using a Relational Schema
6.5 Multi-dimensional Data Model
6.6 Mediators
6.7 Summary
6.8 Exercises
7. PARALLEL INFORMATION RETRIEVAL
7.1 Parallel Text Scanning
7.2 Parallel Indexing
7.3 Clustering and Classification
7.4 Large Parallel Systems
7.5 Summary
7.6 Exercises
8. DISTRIBUTED INFORMATION RETRIEVAL
8.1 A Theoretical Model of Distributed Retrieval
8.2 Web Search
8.3 Result Fusion
8.4 Peer-to-Peer Information Systems
8.5 Other Architectures
8.6 Summary
8.7 Exercises
9. SUMMARY AND FUTURE DIRECTIONS
References
Index
3.4.1 D'Amore and Mah
Initial information retrieval research focused on n-grams as presented in [D'Amore and Mah, 1985]. The motivation behind their work was that it is difficult to develop mathematical models for terms, since the number of potential terms that have not been seen before is infinite. With n-grams, only a fixed number of n-grams can exist for a given value of n. A mathematical model was developed to estimate the noise in indexing and to determine appropriate document similarity measures.
D'Amore and Mah's method replaces terms with n-grams in the vector space model. The only remaining issue is computing the weights for each n-gram. Instead of simply using n-gram frequencies, a scaling method is used to normalize for the length of the document. D'Amore and Mah's contention was that a large document contains more n-grams than a small document, so it should be scaled based on its length.
To compute the weights for a given n-gram, D'Amore and Mah estimated the number of occurrences of an n-gram in a document. The first simplifying assumption was that n-grams occur with equal likelihood and follow a binomial distribution. Hence, it was no more likely for n-gram "ABC" to occur than "DEE". The Zipfian distribution that is widely accepted for terms does not hold for n-grams. D'Amore and Mah noted that n-grams are not equally likely to occur, but the removal of frequently occurring terms from the document collection resulted in n-grams that follow a more binomial distribution than the terms.
D'Amore and Mah computed the expected number of occurrences of an n-gram in a particular document. This is the product of the number of n-grams in the document (the document length) and the probability that the n-gram occurs. The n-gram's probability of occurrence is computed as the ratio of its number of occurrences to the total number of n-grams in the document. D'Amore and Mah continued their application of the binomial distribution to derive an expected variance and, subsequently ...
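The expected-count calculation described in this excerpt can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not D'Amore and Mah's actual implementation: the function names are hypothetical, the n-grams are overlapping character trigrams, and the occurrence probability is taken from the equal-likelihood simplification with an assumed 27-symbol alphabet (26 letters plus space). The variance is the standard binomial variance n·p·(1 − p).

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def uniform_probability(alphabet_size: int, n: int) -> float:
    """Probability of one n-gram if all alphabet_size**n n-grams are equally likely."""
    return 1.0 / (alphabet_size ** n)


def binomial_expectation(doc_length: int, p: float) -> tuple[float, float]:
    """Expected count and variance for doc_length trials with probability p."""
    expected = doc_length * p                # document length times probability
    variance = doc_length * p * (1.0 - p)    # standard binomial variance
    return expected, variance


doc = "information retrieval"
grams = char_ngrams(doc, 3)
counts = Counter(grams)                      # observed n-gram frequencies
doc_length = len(grams)                      # number of n-grams in the document

# Assumption: 27 equally likely symbols (hypothetical alphabet for illustration).
p = uniform_probability(27, 3)
expected, variance = binomial_expectation(doc_length, p)

print("document length in n-grams:", doc_length)
print("observed count of 'ion':", counts["ion"])
print("expected count under the uniform assumption:", expected)
print("variance under the binomial model:", variance)
```

Comparing the observed count of an n-gram with its expected count and variance is one way to judge whether the n-gram occurs more often than chance would suggest. How those quantities are turned into a final weight is not covered by the excerpt and is left out here.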