Information Retrieval System
Definition
An Information Retrieval System is a system that is capable of storage, retrieval and maintenance of information.
Information here can be composed of text (including numeric and date data), images, audio, video and other multimedia objects.
An Information Retrieval System consists of a software program that facilitates a user in finding the information the user needs.
Measures:
Precision = Number_Retrieved_Relevant / Number_Total_Retrieved
Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
Where
Number_Possible_Relevant is the number of relevant items in the database.
Number_Total_Retrieved is the total number of items retrieved from the query.
Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user’s search need.
Precision measures one aspect of the overhead a user incurs in a particular search: the lower the precision, the more of the retrieved items the user has to review that turn out not to be relevant.
Recall measures how well a system processing a particular query is able to retrieve the relevant items that the user is interested in seeing.
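As a small illustration of the two measures, the following sketch computes precision and recall for a single query. The function name `precision_recall` and the document ids are hypothetical, and returning 0.0 for an empty set is an assumption made only to avoid division by zero.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the items the system returned (Number_Total_Retrieved)
    relevant:  ids of all relevant items in the database (Number_Possible_Relevant)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    # Items that are both retrieved and relevant (Number_Retrieved_Relevant).
    number_retrieved_relevant = len(retrieved & relevant)
    # Assumption: report 0.0 instead of dividing by zero on empty sets.
    precision = number_retrieved_relevant / len(retrieved) if retrieved else 0.0
    recall = number_retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 items retrieved, 5 relevant items exist, 3 are in both sets.
p, r = precision_recall(["d1", "d2", "d3", "d7"], ["d1", "d2", "d3", "d4", "d5"])
print(p, r)  # 0.75 0.6
```

Here one of the four retrieved items is overhead for the user (precision 3/4), and two of the five relevant items were missed (recall 3/5).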
DBMS vs IRS
The integration of DBMS and Information Retrieval systems is very important.
Database systems work with relational databases (well-structured data), while information retrieval systems work on unstructured data (raw text files/documents).
In database systems, the main data structure is the relational table, with well-defined values for each row and column. An IR system, on the other hand, uses an inverted index, which is an index of {term, docIDs} entries: for each term there is a corresponding postings list (the list of documents in which the term is present).
Database systems work on data that are related to each other and have a well-defined domain, while IR systems may or may not have such a luxury.
Inverted file
The name “inverted file” comes from its underlying methodology of storing an inversion of the documents: for each word, a list of the documents in which the word is found is stored (the inversion list for that word).
Each document in the system is given a unique numerical identifier. It is that identifier that is stored in the inversion list.
The way to locate the inversion list for a particular word is via the dictionary.
The dictionary is typically a sorted list of all unique words (processing tokens) in the system, each with a pointer to the location of its inversion list (as shown in the figure below).
Dictionaries can also store other information used in query optimization, such as the length of inversion lists.
Additional information may be used from the item to increase precision and provide a more optimum inversion list file structure.
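The inverted file structure described above can be sketched roughly as follows. This is a minimal in-memory illustration, not a production layout: the names `build_inverted_file` and `lookup` are hypothetical, the whitespace tokenizer is an assumption, and the "pointer" is simply a position in a Python list.

```python
from bisect import bisect_left
from collections import defaultdict


def build_inverted_file(documents):
    """documents: dict mapping a unique numerical document id to its text.

    Returns (dictionary, inversion_lists), where dictionary is a sorted list of
    (word, pointer, length) entries and pointer indexes into inversion_lists.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: lowercase + whitespace split stands in for real tokenizing.
        for token in text.lower().split():
            postings[token].add(doc_id)
    words = sorted(postings)
    inversion_lists = [sorted(postings[w]) for w in words]
    # Each dictionary entry also stores the inversion list length, the kind of
    # extra information that can be used for query optimization.
    dictionary = [(w, i, len(inversion_lists[i])) for i, w in enumerate(words)]
    return dictionary, inversion_lists


def lookup(dictionary, inversion_lists, word):
    """Binary-search the sorted dictionary, then follow the pointer."""
    i = bisect_left(dictionary, (word,))
    if i < len(dictionary) and dictionary[i][0] == word:
        return inversion_lists[dictionary[i][1]]
    return []


# Hypothetical three-document collection.
docs = {1: "computer memory", 2: "memory bit", 3: "computer bit byte"}
dictionary, inversion_lists = build_inverted_file(docs)
print(lookup(dictionary, inversion_lists, "computer"))  # [1, 3]
print(lookup(dictionary, inversion_lists, "disk"))      # []
```

Keeping the dictionary sorted is what allows the binary search; a hash map would also work for exact lookups but would not support prefix or range scans over words.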
One of the first commercial databases to integrate the two systems into a single view is the INQUIRE DBMS (past 15 years).
A more current example is the ORACLE DBMS, which offers an embedded capability called CONVECTIS.
The INFORMIX DBMS has the ability to link to RetrievalWare to provide integration of structured data and information, along with functions associated with information retrieval systems.
Want to learn more?
1. The process of extracting information from unstructured textual data is referred to as information retrieval (IR).
2. The main emphasis in an IR system is on querying based on keywords and ranking documents on the basis of their relevance to the query.
3. The users of IR systems request a particular set of documents by providing a set of words or keywords. The IR system locates and returns the documents whose associated keywords match the given keywords.
4. One of the main features of an IR system is the ranking of retrieved documents in terms of their relevance to the query. The relevance of a document to a query is estimated using information such as term occurrence. The top-ranked documents are shown to the user as the query result.
5. IR systems also support full-text retrieval, in which each word in the document is considered to be a keyword. In full-text retrieval, where each word is a keyword, we use "term" to refer to the words in a document.
6. To handle the billions of documents on the Web, the vector space model is used. In this model, each document is transformed from its full-text version to a document vector, which describes the contents of the document. Each word that appears in the collection of documents corresponds to a dimension in the vector, and each document is represented as a vector with one entry per word.
7. In a document vector, the value for a term is the number of occurrences of that word in the document, or simply the term frequency (TF). Term frequency can be used in measuring the relevance of a document to a given term.
8. Not all terms used as keywords are equally significant; thus, we should give more importance to significant terms to get better results. This can be achieved by assigning weights to terms in the document vector by using inverse document frequency (IDF), which lowers the weight of terms that occur in many documents.
9. Certain IR systems allow similarity-based retrieval, whereby users can retrieve documents that are similar to a given document d. The similarity between two documents d1 and d2 is defined by the cosine similarity metric.
10. The two widely used measures for evaluating the performance of IR systems are recall and precision. Generally, it is relatively easy for an IR system to achieve either high recall or high precision; however, it is challenging to achieve both.
11. An efficient index structure plays an important role in the efficient processing of queries in an IR system. Two widely used indexing techniques are the inverted index and the signature file index.
12. Documents have links (hyperlinks) to other documents, which help in locating relevant documents for a given search.
13. Web search engines crawl the Web to locate and gather information found in documents into a combined index. Crawling consists of several processes that run on multiple machines.
14. To keep the index up to date, pages are re-fetched periodically to obtain updated information.
15. Just as the index of a book keeps links to related topics in the book, good web pages are linked too. In fact, web pages are classified into two types: authorities and hubs.
16. A hub itself may not contain actual information on a topic, but stores links to many related pages that contain actual information. An authority page, on the other hand, is one that contains actual information on a topic; however, it may not contain links to other related pages.
17. The relationship between hubs and authorities is the basis of a link-based search algorithm, the HITS algorithm, which retrieves highly relevant pages in response to a given query.
18. The HITS algorithm proceeds in two steps: a sampling step and an iteration step.
19. In the sampling step, the algorithm collects the pages that are most relevant to the given query using some traditional method. The resultant set of pages is called the base set. A web page in the base set is referred to as a base page.
20. Each base page is associated with a hub-prestige value and an authority-prestige value. The hub-prestige value indicates the quality of the page as a hub, and the authority-prestige value indicates the quality of the page as an authority.
21. The pages with good authority-prestige and hub-prestige values among the base pages are located in the iteration step.
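Points 6 to 9 above (document vectors, TF, IDF and cosine similarity) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, documents are assumed to be already tokenized, and IDF is taken as log(N / df), which is one common variant among several.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of TF-IDF vectors)."""
    # One vector dimension per distinct word in the collection.
    vocabulary = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in docs for t in set(doc))
    # Assumption: IDF = log(N / df); terms found in every document get weight 0.
    idf = {t: math.log(n_docs / df[t]) for t in vocabulary}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vectors.append([tf[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already tokenized documents d1, d2, d3.
docs = [
    ["information", "retrieval", "system"],
    ["information", "retrieval", "index"],
    ["database", "system"],
]
vocabulary, vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # d1 vs d2: share two terms
print(cosine_similarity(vectors[1], vectors[2]))  # d2 vs d3: share no terms
```

Documents that share weighted terms score above zero, while documents with no terms in common score exactly zero, which is the behaviour similarity-based retrieval relies on.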
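The iteration step of HITS described in points 18 to 21 can be sketched as below. The sampling step is not shown: the base set is assumed to be given as a hypothetical link graph. The function names, the fixed iteration count and the Euclidean normalization are assumptions chosen for the sketch, not details taken from the text above.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length so values do not grow without bound."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=20):
    """links: dict mapping each base page to the list of base pages it links to.

    Returns (hub, authority): dicts of hub-prestige and authority-prestige values.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    authority = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority-prestige: sum of hub-prestige of the pages linking to a page.
        new_authority = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_authority[q] += hub[p]
        # Hub-prestige: sum of authority-prestige of the pages a page links to.
        new_hub = {p: sum(new_authority[q] for q in links.get(p, [])) for p in pages}
        authority = normalize(new_authority)
        hub = normalize(new_hub)
    return hub, authority


# Hypothetical base set: h1 and h2 act as hubs, a1 and a2 as authorities.
base_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, authority = hits(base_links)
print(max(hub, key=hub.get))              # h1
print(max(authority, key=authority.get))  # a1
```

In this toy graph h1 links to both authorities and so ends up with the highest hub-prestige, while a1 is pointed to by both hubs and ends up with the highest authority-prestige.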