Next Generation Semantics-Aware Data Management Systems

Sources of information and our dependence on them are increasing at a phenomenal rate. The most obvious example is the explosive growth and rapid evolution of the World Wide Web, but other projects in research, industry, healthcare and government also exhibit a critical dependence on the effective management and exploitation of large-scale data. The kind of information available is also changing rapidly, and often includes unstructured and semi-structured data, streaming data, noisy and incomplete data, and linked datasets. Simultaneously dealing with the rapidly increasing size, complexity and heterogeneity of data presents a grand challenge for information systems research, and has created an urgent need for more capable information systems. Meeting this need will be critical to the UK's future competitiveness.

Information systems clearly have a key role to play in addressing these extremely complex problems, but they need to evolve to reflect the rapidly changing information landscape. This evolution is the basis for the emerging field of semantics-aware data management, which involves a synthesis of ontological reasoning and database management principles. Semantics-aware systems employ rich schemas (also known as ontologies) that allow them to deal with incomplete and semi-structured information from heterogeneous sources, and to answer queries in a way that reflects both knowledge and data, i.e., to deliver understanding from information.

We believe, however, that if such systems are to be widely applicable, then their enhanced capabilities must be in addition to, and not instead of, the well-established features and high performance of existing database systems. Moreover, we believe that they will need to incorporate techniques from many other areas of computer science, particularly those that give a complementary view of "Big Data" management, such as algorithms and machine learning, stream processing, and information retrieval. The goal of the Oxford Information Systems Group (ISG) is to develop next generation semantics-aware data management systems that fully realise the desired synthesis.

Research Contributions

Contributions to the iBench and gMark projects

Radu Ciucanu, a postdoc funded by DBOnto and working with Dan Olteanu, is collaborating with the Toronto DB group led by Renée J. Miller on the iBench project. iBench is a metadata generator that can be used to evaluate a wide range of integration tasks, such as data exchange, mapping creation, mapping composition, and schema evolution. iBench permits control over the size and characteristics of the metadata that it generates (schemas, constraints, and mappings), and has already been used successfully for several empirical evaluations of data integration systems. iBench will be presented in a VLDB'16 research paper and was also the basis of a VLDB'15 demonstration.

Moreover, Radu is currently contributing to the development of gMark, an open-source domain- and query language-independent graph benchmark that is able to target and control the diversity of properties of both the generated graph instances and the generated query workloads coupled to these instances. The gMark project is done in collaboration with research groups from Lyon, Lille and Eindhoven, and is led by Angela Bonifati and George Fletcher.

A project that Radu started during his PhD in Lille and recently completed during his postdoc addresses the problem of query specification for non-expert users, and proposes a novel approach for learning relational join queries from simple user interactions. A journal paper on this topic has recently been accepted to appear in ACM TODS; it is coauthored with Angela Bonifati (Lyon) and Slawek Staworko (Lille/Edinburgh).
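To give a flavour of what learning a join query from user interactions can mean, here is a minimal sketch. It is not the algorithm of the TODS paper: it assumes the user labels pairs of rows as "should join" or "should not join", and it looks for the most specific equi-join predicate consistent with those labels. The function names, relations and attribute names are all hypothetical.

```python
from itertools import product


def agreeing_pairs(r_row, s_row):
    """Attribute pairs (a, b) on which the two rows hold equal values."""
    return {(a, b) for a, b in product(r_row, s_row) if r_row[a] == s_row[b]}


def learn_join(positives, negatives):
    """Most specific equi-join predicate consistent with the labels, or None.

    positives / negatives are lists of (r_row, s_row) pairs labelled by the user.
    """
    if not positives:
        return None
    # Keep only the equalities that hold in every accepted pair of rows.
    predicate = set.intersection(*(agreeing_pairs(r, s) for r, s in positives))
    # Reject the predicate if it would still join a pair the user refused.
    for r, s in negatives:
        if predicate <= agreeing_pairs(r, s):
            return None
    return predicate


# Hypothetical toy relations.
employees = [{"id": 1, "dept": "cs"}, {"id": 2, "dept": "math"}]
departments = [{"name": "cs", "floor": 1}, {"name": "math", "floor": 3}]

positives = [(employees[0], departments[0]), (employees[1], departments[1])]
negatives = [(employees[0], departments[1])]

predicate = learn_join(positives, negatives)
print(predicate)  # {('dept', 'name')}
```

With only the first positive example, the coincidental equality between id and floor would also survive; the second example removes it. This illustrates why a few well-chosen interactions can be enough to pin down the intended join.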

Declarative meta-data management in the LogicBlox smart database management system

Dan Olteanu collaborated in 2014 and 2015 with LogicBlox, Inc. on declarative meta-data management in the LogicBlox smart database management system. His work, carried out jointly with the LogicBlox runtime team, has been featured in two recent publications, at SIGMOD and VLDB, that Dan co-authored.

Datalog is central to ongoing research in the Department on query processing, reasoning, and static program analysis, particularly in the context of the DBOnto project. During his recent sabbatical at LogicBlox, Dan contributed to a smart database system that currently has scores of commercial applications. The system handles mixed transactional and analytical workloads, graph analyses, and predictive workloads that involve mathematical optimisation and machine learning, all expressed in a declarative datalog-like language. An overview of the system is given in the ACM SIGMOD 2015 paper.

The VLDB 2015 paper details one technical challenge addressed by Dan and his LogicBlox colleague Dr TJ Green: efficient handling of updates to datalog programs (rather than to source data) on running database servers. This is essential for live programming in the existing LogicBlox applications, where users can alter the program and expect its result to change quickly, on the fly. Their solution uses declarative programming to improve the implementation of the declarative system itself: they introduced a meta-data engine that supports declarative rules written in an object-oriented language that is, again, datalog-like. Incremental view maintenance in the meta-engine takes care of propagating the effects of program updates correctly and efficiently.
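The idea of updating a program on a live server without recomputing everything can be illustrated with a small sketch. It is not the LogicBlox meta-engine: it assumes a tiny datalog fragment with binary relations and rule bodies of one or two chained atoms, it handles rule additions only (retracting a rule needs further machinery), and every name in it is hypothetical.

```python
from collections import defaultdict

# A rule is (head, body): head(x, y) :- r(x, y)   or   head(x, z) :- r1(x, y), r2(y, z).


def derive(rule, full, delta):
    """Facts derivable by `rule` using at least one fact taken from `delta`."""
    head, body = rule
    if len(body) == 1:
        return {(head, x, y) for (x, y) in delta[body[0]]}
    r1, r2 = body
    out = set()
    for left, right in ((delta[r1], full[r2]), (full[r1], delta[r2])):
        for (x, y) in left:
            for (y2, z) in right:
                if y == y2:
                    out.add((head, x, z))
    return out


def propagate(rules, full, delta):
    """Semi-naive loop: push `delta` through `rules` until nothing new appears."""
    while any(delta.values()):
        for rel, pairs in delta.items():
            full[rel] |= pairs
        nxt = defaultdict(set)
        for rule in rules:
            for (rel, x, y) in derive(rule, full, delta):
                if (x, y) not in full[rel]:
                    nxt[rel].add((x, y))
        delta = nxt
    return full


def add_rule(rules, full, new_rule):
    """Install `new_rule` on an existing materialisation without rebuilding it."""
    rules.append(new_rule)
    seed = defaultdict(set)
    # Only the new rule is evaluated against everything already derived.
    for (rel, x, y) in derive(new_rule, full, full):
        if (x, y) not in full[rel]:
            seed[rel].add((x, y))
    return propagate(rules, full, seed)


# Initial program: path(x, y) :- edge(x, y).
rules = [("path", ["edge"])]
edges = defaultdict(set, {"edge": {("a", "b"), ("b", "c"), ("c", "d")}})
db = propagate(rules, defaultdict(set), edges)

# Live update: add path(x, z) :- path(x, y), edge(y, z).
db = add_rule(rules, db, ("path", ["path", "edge"]))
print(sorted(db["path"]))
```

After the update, only the consequences of the new rule are computed and propagated; the facts already derived by the old program are reused as they are.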

Kaiser Permanente project

Robert Piro, a postdoc supported by DBOnto and working with Ian Horrocks and Boris Motik, has been collaborating on a project with the U.S. health care provider Kaiser Permanente. The aim of the project was to conduct data analysis in health care using RDFox, the RDF triple store and parallel SWRL reasoner developed over the past four years in the KRR Group of the Computer Science Department.

The data analysis task was to compute benchmark measures for health care providers in the U.S.; these measures are issued by a U.S. government body for quality assurance. Accuracy of these measures is important, as they are entry requirements for billing health care services against government-funded schemes such as Medicare, which is the national insurance program for pensioners in the U.S.

The specification of these benchmark measures is given as natural language statements of varying precision. These statements are rendered into machine-processable programs, which are then evaluated on the clinical data. Traditional approaches to this rendering, however, produce programs that are complex and difficult to maintain.

Our project follows a novel approach by rendering these statements in the rule language SWRL. Each SWRL rule is a straightforward if-then statement, which makes it possible to implement the specification in a form much closer to the original natural language. This increases maintainability and makes it much easier to judge the correctness of the rendering.

Moreover, clarity is improved as SWRL is a declarative language; in contrast to a procedural language, a declarative language puts the emphasis on "what" is to be computed and not "how" it is computed. This reduces length and increases transparency of the code.
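The following minimal sketch illustrates the declarative if-then style. It is neither SWRL syntax nor the RDFox API: facts are simplified to (subject, predicate, object) triples, rules are plain Python data, and a small forward-chaining loop applies them until nothing new follows. The two rules and the patient data are invented for illustration and do not correspond to any actual quality measure.

```python
def match(atom, fact, binding):
    """Try to extend `binding` so that `atom` equals `fact`; None on failure."""
    new = dict(binding)
    for term, value in zip(atom, fact):
        if term.startswith("?"):              # variable
            if new.setdefault(term, value) != value:
                return None
        elif term != value:                   # constant mismatch
            return None
    return new


def apply_rule(rule, facts):
    """All head instances obtained by matching every body atom against `facts`."""
    body, head = rule
    bindings = [{}]
    for atom in body:
        extended = []
        for binding in bindings:
            for fact in facts:
                new = match(atom, fact, binding)
                if new is not None:
                    extended.append(new)
        bindings = extended
    return {tuple(b.get(term, term) for term in head) for b in bindings}


def saturate(rules, facts):
    """Apply the rules repeatedly until no new fact can be derived."""
    facts = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= apply_rule(rule, facts)
        if derived <= facts:
            return facts
        facts |= derived


# Hypothetical rules, each read as "if body then head".
rules = [
    ([("?p", "hasDiagnosis", "Diabetes"), ("?p", "ageBand", "18-75")],
     ("?p", "inGroup", "Target")),
    ([("?p", "inGroup", "Target"), ("?p", "hadTest", "HbA1c")],
     ("?p", "meets", "Measure")),
]

# Hypothetical patient data.
facts = {
    ("alice", "hasDiagnosis", "Diabetes"), ("alice", "ageBand", "18-75"),
    ("alice", "hadTest", "HbA1c"),
    ("bob", "hasDiagnosis", "Diabetes"), ("bob", "ageBand", "18-75"),
    ("carol", "hadTest", "HbA1c"),
}

result = saturate(rules, facts)
print(sorted(f for f in result if f[1] in ("inGroup", "meets")))
```

The rules state only what holds; the order in which they are tried and the way matches are found are left entirely to the evaluation loop, which is the sense in which the "how" is hidden from the rule author.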

Thanks to Kaiser Permanente's involvement, the result could be evaluated on real patient data and compared to the current implementation. The initial comparison showed 350 differences in a set of 265,000 patients, of which 11,000 belong to the target group. Experts judged this number to be very low. Moreover, differences do not necessarily imply a fault in the SWRL rendering of the specification; they may imply a fault in the existing implementation.
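To put the reported figures in proportion, the short sketch below computes the disagreement rate against both populations mentioned above. The text does not say which denominator the experts had in mind, nor whether each difference corresponds to exactly one patient, so both ratios are shown and the one-difference-per-patient reading is an assumption. The comparison helper and its toy data are hypothetical.

```python
def differing_patients(current, candidate):
    """Patient ids on which the two implementations give different answers."""
    return {pid for pid in current.keys() | candidate.keys()
            if current.get(pid) != candidate.get(pid)}


# Hypothetical toy outputs of the two implementations (patient id -> measure met?).
current = {"p1": True, "p2": False, "p3": True}
candidate = {"p1": True, "p2": True, "p3": True}
print(differing_patients(current, candidate))  # {'p2'}

# Figures reported in the text.
differences, all_patients, target_group = 350, 265_000, 11_000
rate_all = differences / all_patients
rate_target = differences / target_group
print(f"{rate_all:.2%} of all patients")          # 0.13% of all patients
print(f"{rate_target:.2%} of the target group")   # 3.18% of the target group
```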

Further collaboration with Kaiser Permanente is envisaged, both to demonstrate that development times for such specifications are shorter when SWRL is used, and to translate future machine-processable specifications (ECQMs) directly into SWRL rules.