Abstract
The exponential growth of the World Wide Web has transformed it into an ocean of knowledge in which highly diverse information is linked in an extremely complex and arbitrary manner. It has become the repository that users turn to for answers to their diverse information needs. Entrusted with serving this ever-increasing pool of internet users, search engines must, in order to survive in a huge and highly competitive market, mine this vast pool of information and serve results that are relevant, accurate and trustworthy. Typically, a search engine today returns
hundreds of results in response to a user query. The user is left with the painful task of sifting through these results to find the information needed. Sometimes the results do not contain that information at all, and the user has to rephrase the query and restart the search session. A search engine that minimizes
this process and makes it more user-friendly is bound to capture a larger share of the market. Traditional search engines returned generalized results for a given query, i.e., they ranked the results based on the query alone, irrespective of the preferences of the
searcher. But such generalized ranking may not satisfy all searchers, since each searcher is unique and has different information needs (sometimes even for similar queries) and different tastes and preferences. Hence, of late, search engines have started concentrating on fetching personalized results according to the taste of each individual user. For example, for the query “Sourav Ganguly”, a user who
likes both “Sourav Ganguly” and “Shahrukh Khan” may be more interested in news about the “IPL”, which involves both of them, than a user who does not like “Shahrukh Khan”. Similarly, for the query “Apple”, a computer professional may want news about “Apple products”, whereas a botanist or a dietician may want to know about the “Apple” fruit. Thus search engines
started focussing on learning the tastes and preferences of users and retrieving results that they might find more relevant.
Search can be personalized using information about the user, such as their topics of interest, their background knowledge of different topics and their
browsing model. One way to obtain this information is to pose a generalized set of questions to the user; this is called explicit feedback. But a generalized set of questions limits how much we can learn about the searcher, and such feedback is expensive to collect because of the effort it demands from the user.
Hence researchers have recently turned to a different type of feedback, in which we record the user's interactions with the search engine without any additional effort from the user. Such feedback is called implicit relevance feedback.
Implicit relevance feedback has received wide attention recently as a means of capturing the search context and improving search results. However, implicit feedback is usually not available to the public, or even to the research community at large, for
reasons such as the potential threat it poses to the privacy of web users. Hence, researchers in search personalization are compelled to generate their own datasets in lab environments. These datasets are generally domain-centric, uniform and restricted to
a small set of users. Feedback data from a real-world commercial search engine, in contrast, may range over many domains, vary in subject and content, and involve millions of users. Thus, personalization algorithms trained on lab-generated datasets may not satisfy a real-world search user. Hence, it is difficult to experiment with and evaluate web search research, especially web
search personalization algorithms.
Given these problems, we are motivated to create synthetic user relevance feedback data, based on insights from query log analysis. We call this Simulated feedback. We believe that simulated feedback can be immensely beneficial to the web search and personalization research communities by greatly reducing the effort involved in collecting user feedback. The benefit of “Simulated feedback”
is that it is easy to obtain, and the process of obtaining the feedback data is repeatable, customizable and requires no interaction with the user. In this thesis, we describe a simple yet effective approach that uses a Prognostic Search
process for creating simulated feedback. In this approach, we create virtual users from the user profile data of real-world users and simulate the search sessions of these users. We propose two models to represent the user profile of a virtual user, viz.,
the “Skewed Mean Model” and the “Probabilistic Bayesian Model”. We achieved high accuracy levels with both models. But the