OXPath is a language designed for scalable web data extraction (scraping), crawling and automation.
OXPath extends XPath with actions (e.g., click, form filling), Kleene star for iteration, and markers for data extraction.
Simple OXPath example:

    doc("http://dblp.dagstuhl.de/pers/hd/g/Gottlob:Georg")
      /.:<collection>
        [ .:<author="Georg Gottlob"> :<articles>
          [ //ul[@class="publ-list"]/li[@class~="entry"]
            /div[@class="data"]:<article>
              [./span[@itemprop="author"]:<author=normalize-space(.)>]
              [./span[@itemprop="name"]:<title=normalize-space(.)>]
              [.//span[@itemprop="isPartOf"] :<publication=normalize-space(.)>]
              [?./span[@itemprop="pagination"]:<pages=normalize-space(.)>] ] ]
OXPath expressions can be evaluated with the OXPath command line interface client, OXPath CLI (version 1.0.1), for OXPath 2.2.0. OXPath runs in a real browser, so any web page that a modern browser can render can be interacted with and extracted from exactly as the browser renders it. To do so, OXPath 2.2.0, which is part of OXPath Project 1.0.3, relies on WebDriver (Selenium 2.53.1 with Firefox 47.0.1). All OXPath components, except the WebAPI module that integrates WebDriver, are licensed under the 3-Clause BSD License.

OXPath CLI provides a command line interface for executing OXPath wrappers and saving the extracted data either on the file system, in formats such as XML, CSV, and JSON, or in a relational database. OXPath 2.2.0 and OXPath CLI 1.0.1 currently run on Linux only; other platforms might be supported in future releases.
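Once a wrapper has saved its results as XML, the data can be post-processed with ordinary tooling. The Python sketch below is only an illustration: it assumes the XML output nests elements the same way the markers nest in the example wrapper (collection, articles, article, then author, title, publication, pages). The actual layout produced by OXPath CLI may differ. The sample document, its placeholder values, and the helper name `articles_to_rows` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, shaped after the markers of the example wrapper.
# This layout is an assumption; check real OXPath CLI output before relying on it.
SAMPLE_XML = """\
<collection>
  <author>Georg Gottlob</author>
  <articles>
    <article>
      <author>Georg Gottlob</author>
      <author>A. N. Other</author>
      <title>First placeholder title</title>
      <publication>Placeholder Journal</publication>
      <pages>1-10</pages>
    </article>
    <article>
      <author>Georg Gottlob</author>
      <title>Second placeholder title</title>
      <publication>Placeholder Conference</publication>
    </article>
  </articles>
</collection>
"""


def articles_to_rows(xml_text):
    """Flatten each <article> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("article"):
        rows.append({
            # An article may carry several <author> children.
            "authors": [a.text for a in article.findall("author")],
            "title": article.findtext("title"),
            "publication": article.findtext("publication"),
            # The wrapper extracts pages inside a [? ...] predicate, so the
            # element may be missing; findtext then returns None.
            "pages": article.findtext("pages"),
        })
    return rows


if __name__ == "__main__":
    for row in articles_to_rows(SAMPLE_XML):
        print(row)
```

Keeping `pages` as `None` when the element is absent, rather than dropping the article, lets downstream code treat the field as optional.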
The tutorial "Introduction to OXPath" gives a detailed description of OXPath with various examples.
For citation and a full description of OXPath and its semantics, please refer to the article "OXPath: A language for scalable data extraction, automation, and crawling on the deep web" or to our tutorial, "Introduction to OXPath".