Information Retrieval System

Definition

An Information Retrieval System (IRS) is a system capable of storing, retrieving and maintaining information.

Information here can be composed of text (including numeric and date data), images, audio, video and other multimedia objects.

An Information Retrieval System consists of a software program that helps a user find the information the user needs.

Measures:

Precision = Number_Retrieved_Relevant / Number_Total_Retrieved

Recall = Number_Retrieved_Relevant / Number_Possible_Relevant

Where

Number_Possible_Relevant is the number of relevant items in the database.

Number_Total_Retrieved is the total number of items retrieved by the query.

Number_Retrieved_Relevant is the number of items retrieved that are relevant to the user’s search need.

Precision measures one aspect of the information retrieval overhead a user bears for a particular search: the lower the precision, the more non-relevant items the user has to review.

Recall measures how well a system processing a particular query is able to retrieve the relevant items that the user is interested in seeing.
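The two measures can be sketched in a few lines of Python. This is a minimal illustration, assuming the retrieved and relevant items are given as sets of item IDs; the function name precision_recall and the sample IDs are hypothetical and not part of the original notes.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: IDs of the items the system returned for the query
    relevant:  IDs of all items in the database that are relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    # Number_Retrieved_Relevant: items that are both retrieved and relevant.
    number_retrieved_relevant = len(retrieved & relevant)
    # Guard against division by zero for empty result or relevance sets.
    precision = number_retrieved_relevant / len(retrieved) if retrieved else 0.0
    recall = number_retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 items retrieved, 5 relevant items exist, 2 overlap.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10})
print(precision, recall)  # 0.5 0.4
```

Here half of the retrieved items are relevant (precision 0.5), but only two of the five relevant items were found (recall 0.4).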

DBMS vs IRS

The integration of DBMS and Information Retrieval systems is very important.

Database systems work with relational databases (well-structured data), while information retrieval systems work on unstructured data (raw text files and documents).

In database systems, the main data structure is the relational table, with well-defined values for each row and column. An IR system, on the other hand, uses an inverted index, which is an index of {term, docIDs} entries: for each term there is a corresponding postings list (the list of documents in which the term is present).

Database systems work on data items that are related to each other and have a well-defined domain, while IR systems may or may not have such a luxury.

Inverted file

The name “inverted file” comes from its underlying methodology of storing an inversion of the documents: for each word, the system stores a list of the documents in which that word is found (the inversion list for that word).

Each document in the system is given a unique numerical identifier.

It is that identifier that is stored in the inversion list.

The inversion list for a particular word is located via the dictionary.

The dictionary is typically a sorted list of all unique words (processing tokens) in the system, each with a pointer to the location of its inversion list (as shown in the figure below).

Dictionaries can also store other information used in query optimization, such as the length of the inversion lists.

Additional information from the item may be used to increase precision and to provide a more optimal inversion list file structure.
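The dictionary and inversion lists described above can be sketched as follows. This is a small in-memory illustration, not a production file structure: the function names, the tokenization rule (lowercased runs of letters and digits) and the three sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_inverted_file(documents):
    """Build the dictionary: each unique word mapped to its inversion list.

    documents: {doc_id: text}, where doc_id is the unique numerical identifier.
    Returns {word: sorted list of doc_ids}, with the words in sorted order.
    """
    inversion_lists = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumed tokenization: lowercase runs of letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            inversion_lists[word].add(doc_id)
    return {word: sorted(inversion_lists[word]) for word in sorted(inversion_lists)}


def search_all(dictionary, words):
    """Return the IDs of documents that contain every query word (Boolean AND)."""
    lists = [set(dictionary.get(word.lower(), ())) for word in words]
    return sorted(set.intersection(*lists)) if lists else []


docs = {
    1: "computer retrieval systems",
    2: "information retrieval",
    3: "database systems store information",
}
dictionary = build_inverted_file(docs)
print(dictionary["retrieval"])                              # [1, 2]
print(search_all(dictionary, ["information", "systems"]))   # [3]
```

The length of each inversion list, which the notes mention as useful for query optimization, is simply len(dictionary[word]) in this sketch; a query processor could intersect the shortest lists first.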

One of the first commercial databases to integrate the two systems into a single view is the INQUIRE DBMS (over the past 15 years).

A more current example is the ORACLE DBMS, which offers an embedded capability called CONVECTIS.

The INFORMIX DBMS can link to RetrievalWare to integrate structured data and information, along with functions associated with information retrieval systems.

Want to learn more?

1. The process of extracting information from unstructured textual data is referred to as information retrieval (IR).

2. The main emphasis in an IR system is on querying by keywords and ranking documents by their relevance to the query.

3. Users of an IR system request a particular set of documents by providing a set of words or keywords. The IR system locates and returns the documents whose associated keywords match the given keywords.

4. One of the main features of an IR system is the ranking of retrieved documents by their relevance to the query. The relevance of a document to a query is estimated using information such as term occurrence. The top-ranked documents are shown to the user as the query result.

5. IR systems also support full-text retrieval, in which every word in the document is considered to be a keyword. In full-text retrieval, the word "term" therefore refers to any word in a document.

6. To handle the billions of documents on the Web, the vector space model is used. In this model, each document is transformed from its full-text version into a document vector, which describes the contents of the document. Each word that appears in the collection of documents corresponds to a dimension of the vector, and each document is represented as a vector with one entry per word.

7. In a document vector, the value for a term is the number of occurrences of that word in the document, or simply the term frequency (TF). Term frequency can be used to measure the relevance of a document to a given term.

8. Not all terms used as keywords are equally significant, so more importance should be given to significant terms to get better results. This can be achieved by weighting the terms in the document vector by inverse document frequency (IDF), which gives a higher weight to terms that occur in fewer documents.

9. Certain IR systems allow similarity-based retrieval, whereby users can retrieve documents that are similar to a given document d. The similarity between two documents d₁ and d₂ is commonly measured by the cosine similarity metric, the cosine of the angle between their document vectors.

10. The two widely used measures for evaluating the performance of IR systems are recall and precision. Generally, it is relatively easy for an IR system to achieve either high recall or high precision; however, it is challenging to achieve both.

11. An efficient index structure plays an important role in the efficient processing of queries in an IR system. Two widely used indexing techniques are the inverted index and the signature file index.

12. Documents have links (hyperlinks) to other documents, which help in locating relevant documents for a given search.

13. Web search engines crawl the Web to locate documents and gather the information found in them into a combined index. Crawling consists of several processes that run on multiple machines.

14. To keep the index up to date, the pages are re-fetched periodically to obtain the updated information.

15. Just as the index of a book keeps links to related topics in the book, good web pages link to related pages. In fact, web pages can be classified into two types: authorities and hubs.

16. A hub may not itself contain actual information on a topic, but it stores links to many related pages that do. An authority, on the other hand, is a page that contains actual information on a topic; however, it may not contain links to other related pages.

17. The relationship between hubs and authorities influences a link-based search algorithm, the HITS algorithm, which retrieves highly relevant pages in response to a given query.

18. The HITS algorithm proceeds in two steps: a sampling step and an iteration step.

19. In the sampling step, the algorithm collects the pages that are most relevant to the given query using some traditional method. The resulting set of pages is called the base set. A web page in the base set is referred to as a base page.

20. Each base page is associated with a hub-prestige value and an authority-prestige value. The hub-prestige value indicates the quality of the page as a hub, and the authority-prestige value indicates the quality of the page as an authority.

21. In the iteration step, the pages with good authority-prestige and hub-prestige values are located among the base pages.
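Points 6 to 9 above (document vectors, TF, IDF and cosine similarity) can be illustrated with a short sketch. It assumes one common weighting, TF × log(N / DF), and whitespace tokenization; real systems use many variants, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return {doc_id: {term: weight}} with weight = TF * log(N / DF)."""
    # Term frequency: number of occurrences of each term in each document.
    term_counts = {doc_id: Counter(text.lower().split())
                   for doc_id, text in documents.items()}
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    return {
        doc_id: {term: tf * math.log(n_docs / doc_freq[term])
                 for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v2.get(term, 0.0) for term, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


docs = {
    1: "inverted index retrieval",
    2: "inverted index structure",
    3: "relational database tables",
}
vectors = tf_idf_vectors(docs)
# Documents 1 and 2 share terms, so they are more similar than 1 and 3.
print(cosine_similarity(vectors[1], vectors[2]))
print(cosine_similarity(vectors[1], vectors[3]))
```

With this particular weighting, a term that appears in every document gets weight log(1) = 0, which reflects the idea in point 8 that such terms carry little significance.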
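The iteration step of HITS (points 17 to 21 above) can be sketched as below. The notes do not spell out the update rule, so this assumes the usual mutual-reinforcement form: a page's authority-prestige is the sum of the hub-prestige of the pages linking to it, its hub-prestige is the sum of the authority-prestige of the pages it links to, and both are normalized after each round. The tiny link graph, the function name and the fixed iteration count are hypothetical.

```python
import math


def hits(links, iterations=20):
    """Run the HITS iteration step on a base set.

    links: {page: set of pages it links to}, restricted to the base set.
    Returns (hub, auth): two dicts of prestige values per page.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub values of all pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum the authority values of all pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical base set: two hub-like pages pointing at two content pages.
base_links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
hub, auth = hits(base_links)
print(max(hub, key=hub.get), max(auth, key=auth.get))  # h1 a1
```

In this example h1 ends up as the best hub because it links to both content pages, and a1 as the best authority because both hubs point to it.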
