Prior work proposed a method for learning regular expression patterns of URLs that lead a crawler from an entry page to target pages. Target pages were found by comparing the DOM trees of pages with that of a pre-selected sample target page. The method is very effective, but it works only for the specific site from which the sample page is drawn, and the same process has to be repeated for every new site. Therefore, it is not suitable for large-scale crawling. In contrast, FoCUS learns URL patterns across multiple sites and automatically finds a forum's entry page given a page from that forum. Experimental results show that FoCUS is effective in large-scale forum crawling by leveraging crawling knowledge learned from a few annotated forum sites.

On discovering and traversing URLs, other work developed some heuristic rules to discover URLs, but these rules are too specific: they can only be applied to forums powered by the particular forum software package for which the heuristics were conceived. Unfortunately, according to ForumMatrix, there are hundreds of different forum software packages on the Internet. In addition, many forums use their own customized software.

A recent and more comprehensive work on forum crawling is iRobot. iRobot aims to automatically learn a forum crawler with minimum human intervention by sampling forum pages, clustering them, selecting informative clusters via an informativeness measure, and finding a traversal path with a spanning tree algorithm.
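
To make the notion of URL patterns discussed above more concrete, the following is a minimal sketch of how a crawler might use regular expressions to decide which links to follow. It is not the implementation of FoCUS or of any system mentioned here. The URL layout, the page-type names, and the helpers `classify_url` and `select_links` are all hypothetical, and the patterns are written by hand, whereas the methods above learn them.

```python
import re

# Hypothetical, hand-written URL patterns for one imaginary forum.
# The systems discussed above would learn such patterns; here they are assumed.
URL_PATTERNS = {
    "index": re.compile(r"^/forum/?$"),
    "board": re.compile(r"^/forum/board-\d+(?:/page-\d+)?$"),
    "thread": re.compile(r"^/forum/thread-\d+(?:/page-\d+)?$"),
}


def classify_url(path):
    """Return the page type whose pattern matches the path, or None."""
    for page_type, pattern in URL_PATTERNS.items():
        if pattern.match(path):
            return page_type
    return None


def select_links(paths):
    """Keep only the paths that match a known pattern, paired with their type."""
    selected = []
    for path in paths:
        page_type = classify_url(path)
        if page_type is not None:
            selected.append((path, page_type))
    return selected


if __name__ == "__main__":
    sample_links = [
        "/forum/",
        "/forum/board-7",
        "/forum/thread-42/page-3",
        "/forum/login.php",
    ]
    for path, page_type in select_links(sample_links):
        print(page_type, path)
```

Patterns written for a single site, as in this sketch, have to be rewritten for every new forum, which is the limitation that learning patterns across multiple sites is meant to address.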