Revision 17 as of 2021-10-26 16:08:37
Back to ComputerTerms
Terms
- ClusteringAlgorithms
- ControlledVocabulary
- InvertedFile
- LexicalAnalysis
- Precision
- Recall
- RelevanceFeedback
- SignatureFile
- StemmingConflation
- StopWords
- SuperImposedCoding
- TruncationConflation
- ThesaurusConflation
This link provides a nice glossary of terms http://www.cs.jhu.edu/~weiss/glossary.html
Description
Examples: Library catalogs
Generally the data are organized as a collection of documents.
Querying
Querying of unstructured textual data is referred to as Information Retrieval. It covers the following areas:
- Querying based on keywords
- The relevance of documents to the query
- The analysis, classification and indexing of documents.
Queries are formed using keywords and the logical connectives and, or, and not, where the and connective is implicit.
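A minimal sketch of this style of boolean keyword matching (the function and parameter names are illustrative, not from the page); bare keywords are combined with the implicit and:

```python
def matches(doc_words, required=(), any_of=(), excluded=()):
    """True if the document contains all `required` terms (implicit AND),
    at least one `any_of` term (OR, when given), and no `excluded` term (NOT)."""
    words = set(doc_words)
    if not all(t in words for t in required):
        return False
    if any_of and not any(t in words for t in any_of):
        return False
    if any(t in words for t in excluded):
        return False
    return True

doc = "information retrieval covers querying of unstructured text".split()
print(matches(doc, required=("information", "retrieval")))            # implicit and
print(matches(doc, required=("retrieval",), excluded=("database",)))  # and not
```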
Full Text --> all words in a document are keywords. Since every word is a keyword, we use term to refer to the words of a document.
Given a document d and a term t, one way of defining the relevance r is
$$$r(d,t)=\log\left(1+\frac{n(d,t)}{n(d)}\right)$$$
n(d) denotes the number of terms in the document, and n(d,t) denotes the number of occurrences of term t in the document d.
KEY: In the information retrieval community, the relevance of a document to a term is referred to as term frequency, regardless of the exact formula used.
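This term-frequency formula can be sketched directly (a toy example, assuming the document is given as a list of its terms):

```python
import math

# r(d, t) = log(1 + n(d,t) / n(d)), where n(d) is the number of terms
# in the document and n(d,t) is the number of occurrences of t in d.
def tf_relevance(doc_terms, term):
    n_d = len(doc_terms)           # n(d): total terms in the document
    n_dt = doc_terms.count(term)   # n(d,t): occurrences of the term
    return math.log(1 + n_dt / n_d)

doc = ["full", "text", "retrieval", "treats", "every",
       "word", "as", "a", "term", "retrieval"]
print(tf_relevance(doc, "retrieval"))  # log(1 + 2/10)
```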
Inverse document frequency (IDF) is defined as:
$$$IDF = \frac{1}{n(t)}$$$
where n(t) denotes the number of documents that contain the term t.
A term found in many of the documents has a low IDF. A term found in only a few documents is probably a good term to use!
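A small sketch of this IDF definition, treating each document as a set of its terms (the collection here is made up for illustration):

```python
# IDF(t) = 1 / n(t), where n(t) is the number of documents containing t.
def idf(docs, term):
    n_t = sum(1 for d in docs if term in d)  # n(t)
    return 1 / n_t if n_t else 0.0           # common term -> low IDF

docs = [{"library", "catalog"}, {"library", "index"}, {"signature", "file"}]
print(idf(docs, "library"))    # in 2 of 3 documents -> low IDF
print(idf(docs, "signature"))  # in only 1 document -> a more selective term
```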
Thus the relevance of a document d to a set of terms Q is defined as
$$$r(d,Q)=\sum_{t \in Q}\frac{r(d,t)}{n(t)}$$$
or, more generally,
$$$r(d,Q)=\sum_{t \in Q}\frac{w(t)\,r(d,t)}{n(t)}$$$
where w(t) is a weight specified by the user.
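The pieces combine into a sketch of the weighted ranking formula above (names and the sample collection are illustrative; with all weights equal to 1 this reduces to the unweighted formula):

```python
import math

def relevance(doc_terms, docs, query, weights=None):
    """r(d, Q) = sum over t in Q of w(t) * r(d,t) / n(t)."""
    score = 0.0
    for t in query:
        n_dt = doc_terms.count(t)
        r_dt = math.log(1 + n_dt / len(doc_terms))  # term frequency r(d,t)
        n_t = sum(1 for d in docs if t in d)        # document frequency n(t)
        if n_t == 0:
            continue                                # term absent from collection
        w = weights.get(t, 1.0) if weights else 1.0  # user weight w(t)
        score += w * r_dt / n_t
    return score

docs = [["library", "catalog", "index"],
        ["signature", "file", "index"],
        ["library", "catalog", "query"]]
print(relevance(docs[0], docs, ["library", "index"]))
```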
KEY: Stop words are words that are not indexed, such as and, or, the, a, etc.
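Stop-word removal before indexing can be sketched as follows (the stop list here is just an illustrative fragment, not a standard list):

```python
STOP_WORDS = {"and", "or", "the", "a", "of", "to", "in"}

def index_terms(text):
    """Return the indexable terms of a text, with stop words dropped."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(index_terms("The analysis and indexing of documents"))
# ['analysis', 'indexing', 'documents']
```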
Proximity: if the terms occur close to each other in the document, the document is ranked higher than if they occur far apart. We could (although we don't here) modify the formula $$r(d,Q)$$ to take proximity into account.
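One hedged sketch of how such a proximity signal might be measured (this is one possible choice, not the page's method): take the minimum distance between occurrences of any pair of query terms, so a smaller distance could boost the score.

```python
def min_pair_distance(doc_terms, query):
    """Smallest position gap between occurrences of two distinct query
    terms, or None if fewer than two query terms appear in the document."""
    positions = {t: [i for i, w in enumerate(doc_terms) if w == t] for t in query}
    present = [t for t in query if positions[t]]
    best = None
    for i in range(len(present)):
        for j in range(i + 1, len(present)):
            for p in positions[present[i]]:
                for q in positions[present[j]]:
                    d = abs(p - q)
                    best = d if best is None else min(best, d)
    return best

doc = "library catalogs organize data as a collection of documents".split()
print(min_pair_distance(doc, ["library", "documents"]))  # 8
```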
Back to ComputerTerms