Internet forums are important platforms where users can request and exchange information with others. For example, the TripAdvisor Travel Board is a place where people can ask for and share travel tips. Due to the richness of information in forums, researchers are increasingly interested in mining knowledge from them. Prior work has tried to mine business intelligence from forum data, proposed algorithms to extract expertise networks in forums, and identified question-and-answer pairs in forum threads. According to an eMarketer article, "Where Are Social Media Marketers Seeing the Most Success?", forums are still part of the global social media strategy of the Top 500 Companies, which continue to report high marketing success with them.

To harvest knowledge from forums, their contents have to be downloaded first. Generic crawlers, which adopt a breadth-first traversal strategy, are usually ineffective and inefficient for forum crawling. This is mainly due to two non-crawler-friendly characteristics of forums: (1) duplicate links and uninformative pages, and (2) page-flipping links. A forum usually has many duplicate links that point to a common page but with different URLs, e.g., shortcut links pointing to the latest posts or URLs for user-experience functions such as "view by title". A generic crawler that blindly follows these links will fetch many duplicate pages, which makes it inefficient. A forum also typically has many uninformative pages, such as login pages that protect users' privacy. Following these links, a crawler will fetch many uninformative pages. Though there are standards-based methods by which forum operators can instruct web crawlers on how to crawl a site effectively, such as specifying the "rel" attribute with the "nofollow" value, the Robots Exclusion Standard, and Sitemaps, we found that, over a set of 9 test forums, more than 47% of the pages fetched by a generic crawler following these protocols were duplicate or uninformative.
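To make the two protocols concrete, the following is a minimal sketch, using only Python's standard library, of how a compliant generic crawler filters links: it drops anchors marked rel="nofollow" and URLs disallowed by robots.txt. The forum URL, page markup, and robots rules are hypothetical examples, not taken from the test forums above.

```python
from html.parser import HTMLParser
from urllib import robotparser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect hrefs from <a> tags, skipping links marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # A compliant crawler must not follow rel="nofollow" links.
        if "nofollow" in (attrs.get("rel") or "").lower().split():
            return
        if attrs.get("href"):
            self.links.append(attrs["href"])


def crawlable_links(base_url, html, robots_txt, agent="*"):
    """Return the absolute URLs on a page that a compliant crawler may fetch,
    honoring both rel="nofollow" annotations and robots.txt rules."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    extractor = LinkExtractor()
    extractor.feed(html)
    urls = [urljoin(base_url, href) for href in extractor.links]
    return [u for u in urls if rules.can_fetch(agent, u)]


# Hypothetical forum page: a thread link, a nofollow'd login link,
# and a link into a directory the robots.txt disallows.
page = ('<a href="/thread/1">Thread</a>'
        '<a rel="nofollow" href="/login">Log in</a>'
        '<a href="/private/admin">Admin</a>')
robots = "User-agent: *\nDisallow: /private/"
allowed = crawlable_links("http://forum.example.com/", page, robots)
```

Even with both protocols honored, only operator-annotated links are filtered out; duplicate URLs that reach the same content through different query strings pass through untouched, which is why the measured duplicate rate stays high.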
This 47% figure is a little higher than the previously reported 40%, but both show the inefficiency of generic crawlers. Besides duplicate links and uninformative pages, a long forum board or thread is usually divided into multiple pages that are linked together by page-flipping links. Generic crawlers process each page individually and ignore the relationships between such pages. These relationships should be preserved during crawling to facilitate downstream tasks such as page wrapping and content indexing. For example, the multiple pages belonging to a thread should be concatenated together in order to extract all posts of that thread as well as the reply relationships between posts.
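The concatenation step above can be sketched as follows. This is a minimal illustration, not the paper's method: the thread URLs and post contents are made up, and the `pages` mapping stands in for pages that a real crawler would have fetched and parsed, with each page exposing its posts and its page-flipping ("next") link.

```python
def concatenate_thread(pages, start_url):
    """Follow page-flipping links from the first page of a thread and
    concatenate the posts of all its pages in reading order.

    `pages` maps URL -> (posts_on_page, next_page_url_or_None); it is a
    stand-in for pages already fetched by a crawler."""
    posts, url, seen = [], start_url, set()
    while url and url not in seen:  # `seen` guards against flip-link cycles
        seen.add(url)
        page_posts, url = pages[url]
        posts.extend(page_posts)
    return posts


# Hypothetical two-page thread held together by a page-flipping link.
pages = {
    "http://forum.example.com/thread/42?page=1":
        (["post 1", "post 2"], "http://forum.example.com/thread/42?page=2"),
    "http://forum.example.com/thread/42?page=2":
        (["post 3"], None),
}
thread = concatenate_thread(pages, "http://forum.example.com/thread/42?page=1")
```

A breadth-first crawler would instead store the two pages as unrelated documents, so a downstream wrapper extracting posts from page 2 alone would lose the reply context established on page 1.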