Local Versus Integrated Interface Schemas

Web information extraction and annotation has been an active research area in recent years. Many systems rely on human users to mark the desired information on sample pages and label the marked data at the same time; the system then induces a set of rules (a wrapper) to extract the same kind of information from other pages of the same source. Such systems are often referred to as wrapper induction systems. Because of the supervised training and learning process, they can usually achieve high extraction accuracy. However, they scale poorly and are not suitable for applications that need to extract information from a large number of web sources. One line of work utilizes ontologies together with several heuristics to automatically extract data from multirecord documents and label them; however, the ontologies for different domains must be constructed manually. Another approach exploits the presentation styles and the spatial locality of semantically related items, but its learning process for annotation is domain dependent, and a seed set of instances of semantic concepts in a set of HTML documents needs to be hand labeled. These methods are therefore not fully automatic. There have also been efforts to construct wrappers automatically, but those wrappers are used for data extraction only (not for annotation). We are aware of several works that aim at automatically assigning meaningful labels to the data units in SRRs. One of them basically annotates data units with the closest labels on result pages. This method has limited applicability because many WDBs do not encode data units with their labels on result pages. In the ODE system, ontologies are first constructed using query interfaces and result pages from WDBs in the same domain. The domain ontology is then used to assign labels to each data unit on a result page. After labeling, the data values with the same label are naturally aligned. This method is sensitive to the quality and completeness of the generated ontologies.
A further approach uses HTML tags to align data units by filling them into a table through a regular-expression-based data tree algorithm.
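To make the ontology-based idea concrete, the following is a minimal sketch of labeling data units and then aligning them by label. It is not the ODE system's actual algorithm: the toy `ONTOLOGY` (a mapping from label to a regular expression), the helper names `label_unit` and `align_by_label`, and the sample records are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical toy "ontology": each label maps to a pattern that
# recognizes values of that concept. A real domain ontology is far richer.
ONTOLOGY = {
    "Price": re.compile(r"^\$\d+(\.\d{2})?$"),
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "ISBN": re.compile(r"^\d{10}$|^\d{13}$"),
}


def label_unit(text):
    """Return the first ontology label whose pattern matches, else None."""
    for label, pattern in ONTOLOGY.items():
        if pattern.match(text.strip()):
            return label
    return None


def align_by_label(records):
    """Build one column per label; units sharing a label end up aligned."""
    columns = defaultdict(list)
    for record in records:
        row = {}
        for unit in record:
            label = label_unit(unit)
            # Keep only the first unit per label within a record.
            if label is not None and label not in row:
                row[label] = unit
        for label in ONTOLOGY:
            # Missing values are recorded as None to keep rows aligned.
            columns[label].append(row.get(label))
    return dict(columns)


# Two hypothetical search result records with units in different orders.
records = [
    ["Database Systems", "$59.99", "2008"],
    ["2011", "Web Data Mining", "$72.50", "0123456789"],
]
table = align_by_label(records)
```

In this sketch the book titles receive no label because the toy ontology has no matching concept, which illustrates why such a method is sensitive to the completeness of the ontology.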