FoCUS Modules of Index and Thread URL Detection

Internet forums are important platforms where users can request and exchange information with others. For example, the TripAdvisor Travel Board is a place where people can ask for and share travel tips. Due to the richness of information in forums, researchers are increasingly interested in mining knowledge from them. To harvest knowledge from forums, their contents must first be downloaded.
Generic crawlers, which adopt a breadth-first traversal strategy, are usually ineffective and inefficient for forum crawling. This is mainly due to two non-crawler-friendly characteristics of forums: (1) duplicate links and uninformative pages, and (2) page-flipping links. A forum usually has many duplicate links that point to a common page but have different URLs, e.g., shortcut links pointing to the latest posts or URLs for user-experience functions such as “view by title”. A generic crawler that blindly follows these links will trawl many duplicate pages, which makes it inefficient. A forum also typically has many uninformative pages, such as login-control pages that protect users’ privacy. In addition, a long forum board or thread is usually divided into multiple pages that are linked by page-flipping links. Generic crawlers process each page individually and ignore the relationships between such pages. These relationships should be preserved while crawling to facilitate downstream tasks such as page wrapping and content indexing.
There is also the problem of entry URL discovery. A forum’s entry URL points to its home page, which is the lowest common ancestor page of all threads. A crawler starting from an entry URL can achieve much higher performance than one starting from other URLs.
FoCUS (Forum Crawler Under Supervision) is a supervised web-scale forum crawler designed to address these challenges. The goal of FoCUS is to trawl relevant content, i.e., user posts, from forums with minimal overhead.
Forums exist in many different layouts or styles and are powered by a variety of forum software packages, but they always have implicit navigation paths that lead users from entry pages to thread pages.
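To make the idea of a navigation path concrete, the following is a minimal sketch, not the FoCUS implementation. It assumes a toy in-memory link graph (`TOY_FORUM`) in place of downloaded pages, and two hand-written URL patterns (`INDEX_RE`, `THREAD_RE`) that are hypothetical and hard-coded purely for illustration. The crawl starts at the entry URL, follows only links that look like index or thread pages, skips uninformative and duplicate links, and groups page-flipping pages under their board or thread.

```python
import re
from collections import deque

# Hypothetical URL patterns for a toy forum; real forums differ per software package.
INDEX_RE = re.compile(r"^/forum/\d+(\?page=\d+)?$")
THREAD_RE = re.compile(r"^/thread/\d+(\?page=\d+)?$")

# Toy link graph standing in for downloaded pages: URL -> outgoing links.
TOY_FORUM = {
    "/": ["/forum/1", "/login", "/thread/10?view=title"],
    "/forum/1": ["/thread/10", "/thread/11", "/forum/1?page=2", "/login"],
    "/forum/1?page=2": ["/thread/12", "/forum/1"],
    "/thread/10": ["/thread/10?page=2", "/login", "/"],
    "/thread/10?page=2": ["/thread/10"],
    "/thread/11": ["/forum/1"],
    "/thread/12": [],
}


def classify(url):
    """Return 'entry', 'index', 'thread', or None for URLs off the navigation path."""
    if url == "/":
        return "entry"
    if INDEX_RE.match(url):
        return "index"
    if THREAD_RE.match(url):
        return "thread"
    return None  # e.g. login pages or "view by title" duplicates


def base_url(url):
    """Strip the query string so all pages of one board or thread share a key."""
    return url.split("?", 1)[0]


def guided_crawl(entry, links):
    """Breadth-first crawl from the entry URL that follows only index/thread links.

    Returns a dict mapping each board or thread to its pages in discovery order,
    so the relationship created by page-flipping links is kept.
    """
    seen = {entry}
    queue = deque([entry])
    pages = {}
    while queue:
        url = queue.popleft()
        if classify(url) in ("index", "thread"):
            pages.setdefault(base_url(url), []).append(url)
        for nxt in links.get(url, []):
            # Skip already-visited URLs and anything off the navigation path.
            if nxt not in seen and classify(nxt) is not None:
                seen.add(nxt)
                queue.append(nxt)
    return pages


if __name__ == "__main__":
    for key, group in guided_crawl("/", TOY_FORUM).items():
        print(key, group)
```

In this toy graph the login page and the “view by title” shortcut are never fetched, and the two pages of board `/forum/1` and of thread `/thread/10` stay grouped together. A generic breadth-first crawler would instead visit every link in the graph and treat each page in isolation.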