Improving Robustness Against Websites’ Changes During Web Data Extraction

The general approach to dealing with changes in a website’s structure during web data extraction is to optimize the XPath expressions before executing the wrapper. We propose a novel approach to wrapper robustness based on machine learning, applied during or, more precisely, after the extraction. When an XPath expression fails because the web page’s structure has changed, we apply binary classification to identify the desired HTML element. Based on this element, a new XPath expression is generated. We will evaluate our method on a series of snapshots of selected web pages, measuring not only the accuracy of our classifier but also how long our self-repairing wrapper keeps working before it definitively fails.
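
The repair loop described above can be illustrated with a minimal sketch. It is not the method's actual implementation: the hand-written rule in `is_target` only stands in for the learned binary classifier, and the two page snapshots, the function names, and the positional XPath generator are assumptions made for illustration. The sketch uses only Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same page, before and after a redesign.
OLD_PAGE = """<html><body>
<div class="header">Shop</div>
<div class="product"><span class="price">19.99 EUR</span></div>
</body></html>"""

NEW_PAGE = """<html><body>
<div class="header">Shop</div>
<div class="banner">Sale</div>
<section class="item"><p class="name">Lamp</p><p class="cost">19.99 EUR</p></section>
</body></html>"""


def is_target(elem):
    """Placeholder for the binary classifier: does this element look like a price?

    A trained model would go here; this rule is an assumption for the sketch.
    """
    text = (elem.text or "").strip()
    has_digit = any(ch.isdigit() for ch in text)
    return has_digit and text.endswith("EUR") and len(elem) == 0


def build_xpath(root, target):
    """Generate a positional XPath from the root element down to the target."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = parent.findall(node.tag)
        # Position among siblings with the same tag (XPath indices start at 1).
        index = next(i for i, e in enumerate(same_tag, 1) if e is node)
        steps.append(f"{node.tag}[{index}]")
        node = parent
    return "./" + "/".join(reversed(steps))


def extract_with_repair(root, xpath):
    """Try the stored XPath; if it fails, classify elements and repair it."""
    elem = root.find(xpath)
    if elem is not None:
        return elem.text, xpath
    candidates = [e for e in root.iter() if is_target(e)]
    if not candidates:
        return None, xpath  # repair failed: the wrapper definitively fails
    repaired = build_xpath(root, candidates[0])
    return candidates[0].text, repaired
```

Calling `extract_with_repair(ET.fromstring(NEW_PAGE), "./body/div[2]/span")` finds no match for the stored expression, falls back to the classifier stand-in, and returns the price text together with the repaired expression `./body[1]/section[1]/p[2]`. A purely positional expression is the simplest thing to generate, but it is also brittle, so a real wrapper would likely prefer a more robust form.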