An Empirical Analysis of the Co-evolution of Schema and Code in Database Applications
This is a supplementary web page to accompany our paper which was accepted in the Proceedings of ESEC/FSE 2013.
Overview
Modern database applications are among the most widely used and complex software systems. They constantly evolve,responding to changes to data, database schemas, and code. It is challenging to manage these changes and ensure that everything co-evolves consistently. For example, when a database schema is modified, all the code that interacts with the database must be changed accordingly. Although database evolution and software evolution have been extensively studied in isolation, the co-evolution of schema and code has largely been unexplored.
This paper presents the first comprehensive empirical analysis of the co-evolution of database schemas and code in ten popular large open-source database applications, totaling over 160K revisions. Our major findings include: 1) Database schemas evolve frequently during the application lifecycle, exhibiting a variety of change types with similar distributions across the studied applications; 2) Overall,schema changes induce significant code-level modifications, while certain change types have more impact on code than others; and 3) Co-change analyses can be viable to automate or assist with database application evolution. We have also observed that: 1) 80% of the schema changes happened in 20-30% of the tables, while nearly 40% of the tables did not change; and 2) Referential integrity constraints and stored procedures are rarely used in our studied subjects. We believe that our study reveals new insights into how database applications evolve and useful guidelines for designing assistive tools to aid their evolution.
More details are shown in the full version of this study.
Data
We studied ten poplular database applications, the data collected from these projects are listed as follows:
Coppermine - a multi-purpose fully-featured and integrated web picture gallery script written in PHP.
SVN Repo schema files (116 versions) schema changes code changesDotproject - a volunteer supported Project Management application.
SVN Repo schema files (163 versions) schema changes code changese107 - is a free (open-source) content management system allowing you to easily manage and publish your content.
SVN Repo schema files (76 versions) schema changes code changesJoomla! - is a content management system (CMS), which enables you to build Web sites and powerful online applications.
SVN Repo schema files (532 versions) schema changes code changesMediaWiki - is a free software open source wiki package written in PHP, originally for use on Wikipedia.
SVN Repo schema files (377 versions) schema changes code changesPrestaShop - is a free, open source e-commerce solution.
SVN Repo schema files (211 versions) schema changes code changesRoundCube - is a free and open source webmail software for the masses, written in PHP.
SVN Repo schema files (56 versions) schema changes code changesTikiWiki - a free and open source web application with the most built-in features.
SVN Repo schema files (941 versions) schema changes code changesTYPO3 - is a free and open source web content management framework based on PHP.
SVN Repo schema files (73 versions) schema changes code changeswebERP - webERP is a complete web based accounting and business management system.
SVN Repo schema files (189 versions) schema changes code changes
Results
RQ1: How frequently and extensively do schemas evolve?
(1) The evolution trend of tables/columns in the studied projects.
dataset gnuplot script graph
(2) Growth and change rates of schema size.
tablesRQ2: How do database schemas evolve?
(1) Distribution of atomic changes w.r.t. the high-level schema change categories.
dataset gnuplot script graph
(2) Distribution of atomic changes w.r.t. the low-level schema change categories.
tables
(3) Distribution of addition/deletion/change on schema.
dataset gnuplot script graphRQ3: How much application code has co-changed with a schema change?
(1) Manual study on the valididy of co-change analysis.
(i) How many valid DB revisions contain the co-change information of schema and code?
dataset gnuplot script graph
(ii) How much code-level change is truly caused by schema changes?
dataset gnuplot script graph
(2) Estimated co-changed code size w.r.t. high-level schema changes.
dataset R script
Discussions
(1) Coverage of schema changes by tables.
dataset gnuplot script graph
(2) Distribution on the number of atomic schema changes across different tables.
dataset gnuplot script graph
Code
DAChangeAnalyzer - an analyzer that can extract changed lines, functions, classes and files from the selected revisions in SVN repository.
Tools
-
SVNKit - A pure Java toolkit to work with subversion repositories.
-
Java-diff-utils - an opensource library for performing the comparison operations between texts.
-
Mysqldiff - is suite of Perl modules and accompanying CLI script for comparing the schema (table structures) of two MySQL databases.
-
R - is a software environment for statistical computing and graphics.
-
Gnuplot - is portable command-line driven graphing utility.
Authors
-
Dong Qiu - Southest University, China
E-mail: dongqiu at seu dot edu dot cn -
Bixin Li - Southest University, China
E-mail: bx.li at seu dot edu dot cn -
Zhendong Su - University of California, Davis, USA
E-mail: su at cs dot ucdavis dot edu