Abstract
With more than a trillion web pages, there is a plethora of content available for consumption. Search engine queries invariably return an overwhelming amount of information, some of it relevant and some irrelevant. The information provided is often conflicting, ambiguous, or inconsistent, which can have serious consequences for people who increasingly rely on web sources for information related to security, health, academia, etc. Prior research stresses that traditional Search Engine Optimization techniques focus on making top-ranked results more relevant and depend mostly on the user’s personal information and site popularity. Regardless, people often use two divergent terms, credibility and popularity, interchangeably.
Credibility, an important quality characteristic of web pages, is questionable in many cases and tends to be non-uniform. Credibility refers to the degree to which a website can be relied upon. In principle, credibility can be thought of as a compass that guides us safely through a world of uncertainty, risk, and moral hazards. Novice users of search engines do not know where to start and lack sufficient knowledge to find the best possible results. For a novice user, surface features such as fonts, colours, images, and other layout elements of the web page create the first impression of credibility [19]. For most regular users of the internet, in contrast, content relevance, information source, evolution of content, and other fine-grained features of credibility determine web page usage. In the past, researchers have proposed approaches for credibility assessment and enumerated features that influence the credibility of web pages. The assessment of a few of those features can be automated using existing literature and contemporary knowledge, while others still need human intelligence. The Web has been expanding since its inception, and with it various kinds of web pages have emerged, categorized by genre, for example Help, Article, Discussion, etc. Depending on the genre of a web page, the importance of credibility features such as the date and time the page was last modified, grammar, image-to-text ratio, in-links and out-links, and other web page features may differ. Therefore, assessing credibility without factoring in the genre of a web page can lead to an incorrect assessment. We conducted a crowdsourced survey over multiple channels, in which we asked participants to mark the individual importance of web page elements (features) across different web genres on a Likert scale of 1 to 4. The survey results implied that the importance of each feature varies across genres, which
therefore supported our argument about the need for genre-aware credibility assessment of web pages.
In this work, we propose an automated approach for credibility assessment of a web page, in which the genre is also identified so that the assessment results are comparable to those of human experts. We design a framework (called WEBCred) based on our proposed approach, which accommodates individual components such as crawling, genre classification, normalization, and scoring, and keeps them independent of each other to facilitate further extensibility. The proposed framework allows new genres and features to be added and weightages to be altered, providing flexibility for user intervention. To validate our proposed approach, we developed an open-source tool that is capable of genre identification along with extraction and normalization of selected feature instance values to calculate a credibility score (called GCS) for every web page. A few of these features are new, and we define their extraction methodologies ourselves, as they are not explicit. Our tool is fully automated: it assesses the Genre Credibility Score (GCS) of a given web page without any human aid. The source code [9] of the developed tool is available on GitHub for further extension, and the tool is deployed [7] on the web for testing.
We carried out extensive experiments to establish the effectiveness of our approach. We evaluated our approach on an ‘Information Security’ dataset of 8,550 URLs with 171 features
across 7 genres. This dataset was used for the crowdsourced survey, for training the genre classification model, and for normalizing extracted feature instance values. The supervised learning algorithm, Gradient Boosted Decision Tree, classified genres with 88.75% testing accuracy over 10-fold cross-validation, which exceeds the current benchmark (about 80%). The GCS calculated from the genres identified by our trained model correlated 69% with the crowdsourced Web of Trust (WOT) score and 13% with the algorithm-based Alexa ranking for selected ‘Information Security’ web pages. As a further validation of our trained model and overall approach, we tested the trained model on separate ‘Health’ domain web pages, where it classified genres with 82.26% accuracy. Furthermore, the GCS calculated for the ‘Health’ web pages correlated 59% with WOT and 23% with Alexa ranking. C