
Z2-environment Version 2.8 is Available

Finally, version 2.8 is available for download and use. Version 2.8 comes with some useful improvements.

Please check out the wiki and online documentation.

Support for Java 13

Version 2.8 requires Java 9, runs with Java versions up to 13, and supports a language level up to Java 13 based on the Eclipse Java Compiler (ECJ) 4.14 (#2035).

Upgraded to Jetty 9.4.24

While this was largely a necessity to run on Java 13, it was also nice to be up to date again (#2052).

Follow HEAD!

Previously it was rather cumbersome to change Git Component Repository declarations when working with feature branches, or to make a connected system switch branches when implementing a system-local repository.

At this point, I probably lost you. Anyway, as z2 serves its own code and configuration, it would be really cool if switching branches were just that: switching branches and (of course) synchronizing z2. That is now the case.

A Git Component Repository declaration may now use “HEAD” as a reference. In that case, whatever the current branch of the repository is, Z2 will follow.

Remote Management Goodies

Z2 exposes metrics and basic operations via JMX. Via JMX or via the simple admin user interface, you can check on runtime health and, for example, trigger synchronization or a reload of worker processes. Some things were, in practice, still rather user-unfriendly:

  • Synchronizing a remote installation from the command line;
  • Accessing the main log remotely.

There is now a simple-to-use command-line tool integrated with Z2 that can be used to do just that: trigger some JMX-implemented function and stream back the log, or simply stream the log continuously to a remote console.
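
Under the hood this is plain JMX remoting. As a rough, generic Java sketch of what such a remote trigger looks like (the service URL, MBean name, and operation name below are hypothetical placeholders, not the actual names used by z2):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RemoteSync {
        public static void main(String[] args) throws Exception {
            // Connect to the remote JVM's JMX connector server (standard JDK API).
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://test-system.example.com:7777/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Hypothetical MBean name and operation; the real names are defined by z2.
                ObjectName home = new ObjectName("zfabrik:type=Home");
                mbs.invoke(home, "synchronize", new Object[0], new String[0]);
            }
        }
    }

The integrated command line wraps this kind of interaction and adds the log streaming described above.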

Remote log streaming is also available from the admin user interface:

More…

Check out the version page for more details. Go to the download page and get started in five minutes, or check out some samples.


A Model for Distributed System-Centric Development

It’s been a while and fall has been very busy. I am working on z2 version 2.8, which will bring some very nice remote management additions to simplify managing a distributed application setup. That was the motivation behind this post.

This post is about a deployment approach for distributed software systems that is particularly useful for maintenance and debugging.

But let’s start from the beginning – let’s start from the development process.

Getting Started

The basic model of any but the most trivial software development is based on checking out code and configuration from some remotely managed version control system (or Software Configuration Management system, SCM) to a local file system, updating it as needed and testing it on a local execution environment:

At least for the kind of application I care about, various versions for development, testing, and productive use are stored in version control. In whatever way, be it build-and-deploy or pull, the different execution environments get updated from changes in the shared SCM. Tagging and branching are used to make sure that latest changes are separated from released changes. Schematically, the real situation is more like this:

There are good reasons we want to have permanent deployments for testing and staging: In large and complex environments a pre-production staging system may consist of a complex distributed setup that integrates with surrounding legacy or mocked third-party systems and has corresponding configurations. In order to collaboratively test workflows, check system configurations, and test with historic data, it is not only convenient but natural to have a named installation to turn to. We call that a test system. But then:

How do you collaboratively debug and hotfix a distributed test system?

For compile-package-deploy technologies, you could set up a build pipeline and a distributed deployment mechanism that allows you to push changes you applied locally on your PC to the test system installation. But that would only be you. In order to share and collaborate with other developers on the test system, you need some collaborative change tracking. In other words, you should use an SCM for that.

Better yet, you should have an SCM as an integral part of the test system!

Using an SCM as Integral Part of the System

Here is one such approach. We are assuming that our test system has a mechanism to either pull changes from an SCM or that there is a custom build and deploy pipeline to update the test system from a named branch. Using the z2-Environment, we strongly prefer a pull approach due to its inherently better robustness.

From a test system’s perspective we would see this:

Here “test-system” is the branch defining the current code and configuration of the test system deployment. We simply assume there is a master development branch and a release branch that is still in testing.

So, any push to “test-system” and a following pull by the test system leads to a consistently tracked system update.

Let’s assume we are using a distributed version control system (DVCS) like Git. In that case, there is not only an SCM centrally and on the test system, but your development environment has just as capable an SCM. We are going to make use of that.

Overall we are here now:

What we added in this picture is a remote reference to the test-system branch of the test system’s SCM from our local development SCM. That will be important for the workflows we discuss next.

The essence of our approach is that a DVCS like Git provides us a common versioning graph spanning multiple repositories.

Example Workflows

Let’s play through two main workflows:

  • Consistent and team-enabled update of the test system without polluting the main code line
  • Extracting fix commits from test commits and consolidating the test system

Assume we are in the following situation: In our initial setup, we have a main code line (master) and a release branch. Both have been pushed to our central repository (origin). The test system is supposed to run the release branch but received one extra commit (e.g. for configuration). We omitted the master branch from the test-system repository for clarity. In our development repository (local), we have the master branch, the release branch, as well as the test-system branch, the latter two from different remotes respectively. We have remote branches origin/master, origin/release, and test-system/test-system to reflect that. We will however not show those here unless that adds information:

In order to test changes on the test system, we develop locally, push to the test system repo, and have the test system be updated from there. None of that affects the origin repository. Let’s say we need two rounds:

We are done testing our change with the test system. We want to have the same change in the release and eventually in the master branch.

The most straightforward way of getting there would be to merge the changes back into release and then into master. However, we did not write particularly helpful commit messages during testing. For the history of the release and the development branch we prefer some better commit log content. That is why we squash-merge the test commits onto the release branch and then merge the resulting, well-described commit on into master.
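
Expressed as a rough sketch using the JGit library (repository path and branch names follow the example; the commit message is a placeholder for the well-described summary):

    import java.io.File;
    import org.eclipse.jgit.api.Git;

    public class ConsolidateRelease {
        public static void main(String[] args) throws Exception {
            try (Git git = Git.open(new File("/path/to/local/clone"))) {
                // Squash-merge the test commits onto the release branch ...
                git.checkout().setName("release").call();
                git.merge()
                   .setSquash(true)
                   .include(git.getRepository().resolve("test-system"))
                   .call();
                // ... and record them as one well-described commit.
                git.commit().setMessage("Well-described summary of the tested change").call();
                // Then merge the consolidated release state on into master.
                git.checkout().setName("master").call();
                git.merge().include(git.getRepository().resolve("release")).call();
            }
        }
    }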

After that we can push the release branch and master changes to origin:

While this leads to a clean history centrally, it puts our test system into an unfortunate state. The downside of a squash merge is that there is no relationship between the resulting commit and the originating history anymore. If we now merged the “brown” commit into the test-system branch, we would most likely end up with merge conflicts. That may still be the best way forward, as it gets you a consistent relationship with the release branch and includes testing information.

At times, however, we may want to “reset” the test system into a clean state again. In that case, we can do something that we would not allow on the origin repository: overwrite the test-system history with a new, clean history, starting at where we left off initially. That is, we reset the test-system branch, merge the release commit, and finally force-push the new history.
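
As a rough JGit sketch of these steps (branch and remote names as in the example above; the reset target is a placeholder for the commit the test system originally started from):

    import java.io.File;
    import org.eclipse.jgit.api.Git;
    import org.eclipse.jgit.api.ResetCommand.ResetType;

    public class ResetTestSystem {
        public static void main(String[] args) throws Exception {
            try (Git git = Git.open(new File("/path/to/local/clone"))) {
                git.checkout().setName("test-system").call();
                // Reset hard to the commit the test system started from
                // (placeholder ref; in practice the initial configuration commit).
                git.reset().setMode(ResetType.HARD).setRef("initial-commit-id").call();
                // Merge the consolidated release commit to get a clean, meaningful history.
                git.merge().include(git.getRepository().resolve("release")).call();
                // Force-push the rewritten history to the test system's repository.
                git.push().setRemote("test-system").setForce(true).call();
            }
        }
    }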

After this, the test system has a clean history, just as we would have it when updating from the release branch normally. None of what we did had any impact on the origin repository until we decided on meaningful changes.

Summary

What looked rather complicated was actually not more than equipping a runtime environment with its own change history and using some ordinary Git versioning “trickery” to walk through a code and configuration maintenance scenario. We turned an execution environment into a long-living system with a configuration history.

The crucial pre-requisite for any such scenario is the ability of the runtime environment to be updated automatically and easily from a defining configuration repository implemented over Git or a similar DVCS.

That is a capability the z2-environment has. With version 2.8 we intend to introduce much better support for distributed update scenarios.

Docker is essentially a Linux Installation Format

When you are designing a software system and you start worrying about how to get it installed on a machine to run, it is time to think about places to put your code, configuration, and supporting resources.

In reality, however, you will have thought about that already during development as, I presume, you ran and tested your software. You just did not call it installation.

And when you came up with a concept for keeping the various artifacts required by your solution in a sound place, you most likely made sure that there is a cohesive whole: in most cases a folder structure that holds everything needed to run, configure, and customize your software.

Wouldn’t it be nice if that was what installation was all about: regardless of your hosting operating system, installing your software would mean unpacking/copying/pulling a folder structure that provides your software, maybe adapting the configuration a little, and being done?

Application in folder

That of course is not all there is to it. In many cases you need other supplementary software. Third-party libraries that come with the operating system. Or a database system that should run on the same host OS. It is here that we find crucially different philosophies of re-use depending on what technology you use.

If you are a Java developer and your dependencies are Java libraries, you will typically bring them all with your application. In that case, if you even include the Java runtime, you are pretty much there already.

JAVA application in folder

If you are developing applications on the LAMP stack, to go for the other extreme, you typically depend strongly on third-party software packages that are (typically again) installed using the OS-defined package manager. That is, you blend in with the OS way of installing software.

LAMP application “in” Linux

Going back to the Java case and one step further: Suppose you come up with an extension model for your solution. Additional modules that can be deployed with your application. They would need configuration, and to have that and be good citizens, they should adhere to your installation layout.

That is exactly what Linux does. If you want to be a good citizen on your Linux distribution, you use the software packaging style of your target distribution, install in /opt, keep your data in /var/lib, and your configuration in /etc.

But should you? Think about it: This is probably not the structure you use during development, as you want the freedom to use different versions and variations without switching the OS. More dramatically, if you want to support multiple operating system distributions, styles of configuration and scripts may vary. In fact, unlike what the drawing of installed packages above suggests, the artifacts of packages are spread out in the file system structure – sometimes in distribution-specific ways.

Everything can get messy easily.

Things get messy and complicated anyway, because in that approach you rely on 3rd party software packages that are not distributed with the application but are expected to be provided via the distribution.

From an end user perspective, using today’s Linux package managers is great. From a developer perspective it is the classical dependency hell:

Every to-be-supported distribution and version will require a different dependency graph to be qualified for your solution!

Application with dependencies in Linux

Docker as a Solution Container

Many people look at Docker from the perspective of virtualization – with a focus on isolation of runtimes. But it is actually the opposite. Docker is a means to share operating-system-managed resources between applications that are packaged with their dependencies in a distribution-independent way. From the packaged application’s perspective, its execution environment “looks” like a reduced installation of a Linux distribution that, by means of building Dockerfiles, was completely defined at development time:

Dockerized Applications on Linux

By providing means to map shared resources like ports from the hosting OS to Docker containers, Docker even allows internal configuration (e.g. of the database port) to be “tricked” into a shared execution.

Looking at it from Higher

If we take one step back again, what we actually see is a way of deploying a statically linked solution that includes everything except for the actual OS kernel. That is great and it solves the dependency problems noted above at the expense of somewhat higher resource consumption.

However, if there were a better standardized Linux base layout, better defined ways of including rather than referencing libraries, and well-defined “extension points” – e.g. if the Apache Web Server could discover web applications in “application folders”, if databases would discover database schemas and organize storage within the deployment, if port mapping was a deployment descriptor feature, and so on – we would need none of it and have much more flexibility. If we had it on the level of the OS, we would have a huge eco-system opportunity.

It is this extensibility problem that any application server environment needs to solve as well – but never does (see for example Modularization is more than cutting it into pieces, Extend me Maybe, and Dependency Management for Modular Applications).

Summary

Creating a Docker image is not quite as simple as building a folder hierarchy. In essence, however, Docker provides a way to have our own solution layout on a Linux system while keeping strong control over third-party dependencies and still being easily installable and runnable on a variety of hosting environments. It is a cross-platform installation medium.

That is great. But it is really the result of wrong turns in the past. Docker found a dry spot in a swamp.

If it were safely possible to reliably contain required dependencies and configuration, a simple folder-based software installation mechanism would have saved the world a lot of trouble.

Scrum Should Indeed Be Run Like Multiple Parallel Waterfall Projects

Normally I am not writing about processes and methodologies. Not my preferred subject really.

Lately, however, I read an article (see below) that restated that agile is not like doing small waterfalls. I think that claim is misleading.

Over time, I have been working on all kinds of projects, ranging from proof of concept work to large systems for 24/7 production, from ongoing maintenance to custom extensions to existing solutions.

Each of those seemed to respond best to a different process approach.

For simple projects, it can be best to only have a rough outline or simply start from an existing example and just get going.

For maintenance projects, a Kanban approach, essentially a work stream composed of work items of limited conceptual impact, can be best.

It gets more interesting when considering projects that are clearly beyond a few days of work and do have a perfectly clear objective. For example consider a specialized front end for some user group over an existing backend service.

As a paying customer, you would want to define (and understand) what should be the very specific result of the development effort as well as how much that will cost you. Therefore, as a customer, you naturally want development to follow a Waterfall Model:

It starts with a (joint) requirement analysis (the “why”) and a specification and design phase (the “how”). Let’s just call this the Definition Phase.

After Definition a time plan is made (implying costs) and the actual implementation can commence.

Once implementation completes the development result is verified and put into use – ideally on time and on budget. Or, as a simplified flow chart:

As we all know, this approach does not work all too well for all projects.

Why is that?

Think of a project as a set of design decisions and work packages that have some interdependence, or, more simply, as a sequence of N work packages, where a single work package is always assumed to be doable by your average developer in one day. So, effectively, there is some prophecy, N steps deep, that after step X all prerequisites for step X+1 are fulfilled and that after step N the specification is implemented.

For very simple tasks, or tasks that have been done many times, the average probability of failure, that is, the probability that the invariant above does not hold, can be sufficiently small so that simple measures like adding extra time buffers will make sure things still work out overall.

In software projects, in particular those that are not highly repetitive (think non-maintenance development projects), we typically find lots of non-repetitive tasks mixed with use of new technologies and designs that are implemented for the first time. In a situation like that, the probability of any sort of accurate project prediction from the start decreases rapidly with the “depth” of planning.
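
As a back-of-the-envelope illustration (my own numbers, purely for the sake of the argument): if every single work package goes as planned with probability p, an N-step plan holds end to end with probability p^N, which shrinks quickly even for p close to 1.

    public class PlanningOdds {
        public static void main(String[] args) {
            double p = 0.99; // assumed chance that one work package goes exactly as planned
            for (int n : new int[] {10, 50, 100, 200}) {
                // chance that all n consecutive work packages go as planned: p^n
                System.out.printf("N = %3d -> plan holds with probability %.2f%n", n, Math.pow(p, n));
            }
        }
    }

Even at 99% per-step confidence, a 100-step plan holds with only about a 37% chance.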

There are ways to counter this risk. Most notably by continuously validating progress and adapting planning in short iterations, for example in the form of a Scrum managed process.

While that may sound as if we are discussing opposing, alternative process approaches, each having a sweet spot at some different point on the scale of project complexity, that is not so.

Execute Parallel Waterfalls

In fact: The gist of this post is that an agile process like Scrum is best run when considering it a parallel execution of multiple smaller waterfall projects.

Here is why: Many projects use Scrum as an excuse not to plan and design ahead of time, but instead only focus on short-term feature goals – leaving design decisions as an implementation detail of a small increment. That is not only a great source of frustration, as it increases the risk that even small increments end up brutally mis-estimated; it also leads to superficially designed architectures that – at best – require frequent and costly re-design.

Instead we should look for a combination of the two that, on the one hand, makes sure we produce an upfront design for those aspects of the overall project where we feel certain they can be done and estimated reliably, and yet, on the other hand, preserves the flexibility to adapt to changed requirements when needed.

As a result we run multiple parallel waterfall projects – let’s call them part-projects – that span one to several sprints, while using resources smartly when we need to adapt or, for example, work on bugs introduced by previous work.

Visualized simply as parallel execution lanes processing several planned-ahead part-projects, at sprint n we work on some subset

(B denoting a bug ticket), while at sprint n+1 we proceed and take in the next tasks:

The sprint cycle forces us to re-assess frequently and enables us to make predictions on work throughput and hence helps in planning of resource assignments. Our actual design and estimation process for part-projects is not part of sprint planning but serves as crucial input to sprint planning.

References

Z2-environment Version 2.7 is Available

I am happy to declare version 2.7 ready for download and use. Version 2.7 comes with a lot of small improvements and some notable albeit rather internal changes.

Please check out the wiki and online documentation.

Support for Java 11

Version 2.7 requires Java 9, runs with Java versions up to 12, and supports a language level up to Java 11 based on the Eclipse Java Compiler (ECJ) 4.10 (#2021).

Well… note the use of var in lambda parameters. The most noteworthy changes in Java 11, however, concern its support and licensing model. Please visit the Oracle website for more details.
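
For the curious, a tiny sketch of that language addition (Java 11 allows var for lambda parameters, which mainly matters when you want to annotate them):

    import java.util.List;
    import java.util.function.BinaryOperator;

    public class VarInLambdas {
        public static void main(String[] args) {
            // Java 11: lambda parameters may be declared with var.
            BinaryOperator<Integer> sum = (var a, var b) -> a + b;
            System.out.println(List.of(1, 2, 3).stream().reduce(0, sum)); // prints 6
        }
    }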

Updated Jetty Version

As the integrated Jetty Web container required an upgrade to run with Java 11 as well, z2 2.7 now includes Jetty 9.4.14 (#2027).

Robust Multi-Instance Operation

The one core feature of z2 is that any small installation of the core runtime is a full-blown representation of a potentially huge code base.

This is frequently used when running not only an application server environment, but instead, from the same installation, command line tools that can make direct use of possibly heavy-weight backend operations.

In development scenarios, however, code may have changed between executions, and z2 previously sometimes created resource conflicts between long-running tasks and freshly started executions. This has been fixed with #1491.

No more Home Layouts

The essential feature that makes it easy to set up a system with execution nodes serving different but well-defined purposes of a greater, coherent whole is the concept of system states. System states allow expressing the grouping of features to be enabled in a given configuration and extend naturally into the component dependency chain that is a backbone of z2’s modularization scheme.

Unfortunately, Home Layouts, which defined what worker processes to run in a given application server configuration, duplicated parts of this logic but did not integrate with it. That has been fixed with issue #1981. Now, worker processes are simply components that are part of a home process dependency graph. In essence, while the documentation still mentions Home Layouts, a home layout is now simply a system state that serves as a home layout by convention.

More…

Check out the version page for more details. Go to the download page and get started in five minutes, or check out some samples.

 


Microservices Once More

Imagine you were running a small company. For one line of products certain skills and some amount of working hours per unit are required. The people you work with have worked on many products previously and adopted a wide range of skills so that, as customer demand changes, responding to a change of workload is no problem.

Imagine now you split teams by skill and turned each team into an independent company of its own, to which you outsource all work requiring that skill.


Would you imagine that to work better than your previous setup? Do you think that would be more efficient while equally responsive to changing needs? Would it make good use of resources?

I certainly would not. But that is what the Microservices Approach claims is the thing to do:

Microservices are a software development technique—a variant of the service-oriented architecture (SOA) architectural style that structures an application as a collection of loosely coupled services. In a microservices architecture, services are fine-grained and the protocols are lightweight. The benefit of decomposing an application into different smaller services is that it improves modularity. This makes the application easier to understand, develop, test, and become more resilient to architecture erosion.

(https://en.wikipedia.org/wiki/Microservices)

Apart from the fact that this definition is overly packed with feel-good terms, it also gets causality upside-down.

Let’s read it in reverse: Yes, good modularity helps preserve architectural integrity and simplifies understanding, developing, and maintaining a solution. But while good modularization helps identifying useful service interfaces, having service interfaces as such does not imply good or easily achieved modularization.

In fact, the definition of modules or software components is best driven by responsibility for some aspect of the solution or – and that is really important for this discussion – by non-functional requirements in the first place. The ability to identify service interfaces in this mix is mostly a result of the modularization at hand rather than the other way around.

Next, the fact that you identified service interfaces does in no way mean that it is even remotely useful to distribute them in any loosely coupled way (meaning via remote invocation interfaces). In particular, the more fine-grained services are defined, the harder and less meaningful it becomes to distribute them. Imagine services that rely on services that rely on other services.

Any remote interface introduced comes at a tremendous cost in complexity, as you lose transactionality and simple refactoring but introduce remote invocation performance and security problems, complex deployments, and complex management and monitoring operations.

The thing is: As for outsourcing of business functions, there can be very good reasons to distribute application functions. Those are however never driven by discovery of some API that qualifies as service boundary but almost exclusively by non-functional requirements on components of the solution. For example:

You want to separate some expensive asynchronous load from the user interfacing parts of your application to avoid harming the user experience.

Your database system will be separated from your application server as it requires a single point of data ownership.

Some function requires specialized hardware or has license and security restrictions that prevents it from being embedded into an application directly.

Some parts of your application have much stricter robustness constraints and should be isolated from application failures. And very prominently:

Your system is integrating with some legacy system that is technology different or is not to be touched at all.

To Summarize…

  1. Do not use service interfaces as a driver of modularization – take a look from higher up to identify responsibilities.
  2. Responsibilities drive good modularity, not technological artifacts.
  3. Avoid the complexity of distributed deployments unless for clear non-functional requirements.

 

Simply Fooled

It is always good to start out with a diagram. Here’s one:


It tries to say:

Tools that invest in “getting you started quickly” seem to always fail to stay productive or particularly useful in the mid and long term.

The typical example for this is visual programming tools or process designers of any kind. I have yet to see one that is useful beyond the most simplistic demos.

I believe however the same problem is inherent with “scaffolding generation frameworks” like Ruby on Rails, Grails, and the like – possibly anything that claims “convention over configuration”.

While these approaches deliver some result quickly, eventually you will need to understand the framework, that is, a pile of lots and lots of stuff that, once you lift the lid, can be totally overwhelming.

If you had instead started with fewer, well-understood ingredients, you would be moving more slowly but at least on solid ground.

Not that I ever had an urge to use any of this kind of toolset and platform – I rather build from more essential building blocks that I understand and can keep under control – convinced that that leads to more robust, understandable, and maintainable solutions.

There is one example however that I use a lot and have a bit of a love-hate relationship with: the Java Persistence API (JPA).

And yet…

We use JPA a lot in most projects. It is rather convenient for simple tasks like storing records in the database and retrieving them for transactional data processing. The session cache is handy most of the time. Updating entities simply by setting values is nice. But you had better not have complex queries, large transactions, or non-trivial relations. And better not ask for too much flexibility in combining modular domain models (a.k.a. persistence units in JPA speak). There has been so much written on JPA that there is no need for me to repeat it (The Vietnam of Computer Science, however, is a must-read).

Anyway, the point is: When you reach the limits of JPA, you had better have your persistence tier already abstracted in a way that it does not hurt too much to replace the actual data access with other techniques where needed – which comes at a price.
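
As a rough sketch of what such an abstraction can look like (all names here are made up for illustration): a narrow, application-level repository interface that the rest of the code depends on, with JPA hidden behind one of possibly several implementations.

    import java.util.List;
    import java.util.Optional;

    // Minimal domain type, for illustration only.
    class Equipment {
        String id;
        String type;
    }

    // Application-level persistence abstraction; callers depend only on this interface.
    interface EquipmentRepository {
        Optional<Equipment> findById(String id);
        List<Equipment> findByType(String type);
        void save(Equipment equipment);
        void delete(String id);
    }

    // One implementation can use JPA today; another can use plain JDBC or a bean mapper
    // where JPA reaches its limits, without touching the callers.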

Let’s put it this way: if I had not invested in JPA previously, it would probably feel out of place and unwieldy now. Given the style of clean persistence API that we tend to implement on the application level these days, I would rather use some more accessible “bean mapper” tool.

Meaning…

Here is my little theory: Putting your main focus on solving 70% of the problem quickly compromises so hard on the completeness of the approach that the remaining 30% of the problem gets overly expensive for its users. Worse, the “left-overs” may require breaking your architecture, pulling in unexpected third-party stuff, etc.

In addition, if you have made a choice for the early win, the price to pay later on may be due at the worst point in time, just when your architecture has to stand the test of growth.

Hence:

  • Don’t be lured by promises of simplification of what is inherently not simple
  • Always try to look under the hood. It is not necessary to understand everything down to the bare metal – but knowing how much there is helps getting an idea
  • Get into it and avoid what cannot be mastered in time – it will have to be mastered later!

Wish you a great holiday season!

Z2-environment Version 2.6 is ready – for download!

Finally. I am happy to declare version 2.6 ready for download and use.

Version 2.6 comes with a lot of small improvements and some due follow up on what changed in the Java world.

Aside from regular software maintenance, there is one bigger change: The z2-base distribution.

If you are wondering what you might be missing, read on!

Check out the updated wiki and online documentation.

Java 9 and Java 10 Support

One of the more obvious changes is that v2.6 requires Java 9 and runs with Java 9, 10, and 11 as well. Language-wise, this release is on Java 9 or Java 10, depending on the runtime used.

Java 9 introduced a new module system feature into the Java core model (see the Jigsaw Project). Unfortunately this system has even more flaws than OSGi had, as far as its usefulness for actual application development is concerned. Let’s put it this way: You will absolutely not be bothered with Java 9 modularity when using Z2 modularity.

Z2 Distribution for Download

Beyond many useful upgrades and minor improvements, some shown below, the one important “innovation” is that z2 now has a download page and a downloadable distribution. One of Z2’s defining features is to pull from local and online resources and prepare anything required for running all by itself. That philosophy was very visible in the way we promoted the use of Z2 previously: only check out the core and pull anything else from (our) repositories. Problematic about that approach is that it makes creating your own system an unnecessarily complicated procedure. Providing a distribution simplifies getting started on your own and gives you a clean set of assets to import. Last but not least, we provide you with a comprehensive overview of all included 3rd-party licenses:

More Desktop for the GUI

The Z2 GUI, which is really just a log container with buttons for the few interactions you need with Z2, is now a little friendlier to the eye: it offers font scaling by pressing Ctrl and hitting “+” or “-” or turning the mouse wheel.

Expert Features

Linked Components

Link components work like symbolic file system links in that they move the visibility of a component definition to a different module and component name. The link may actually add additional information, such as dependencies, state participation, and more. (documentation)

Parameterized tests and suites in z2Unit

Z2Unit, the integrated JUnit-based testing kit for seamless, as-easy-as-ever in-container testing, had some gaps due to some omissions in JUnit’s internal APIs. Z2Unit strives to bring the convenience of a local class test to deeply integrated, painless in-system testing. (blog)

Finer Control on Compile Order and Source-Jar filtering

Some optimizations have been added to provide finer control on use of extension compilers (e.g. for AspectJ) for single Java component facets (API, impl, test). Previously z2 also made source JAR files visible to the application. This was neat for development but expensive and in some cases outright problematic at runtime. (documentation)

Clean Up of Third Party Libraries and Various Upgrades

Z2 now integrates Jetty 9.4.8 and JTA 1.2 for its built-in transaction support. Samples and sample dependencies have generally been upgraded to recent versions of, e.g., Spring and Hibernate. Check out the version page.


Updates in Modular Data

In a modular system, data also tends to be modularized. In the post Modularization And Data Relationships we looked at cross-subsystem and cross-domain data dependencies and how to make those available for efficient querying.

In this second part we discuss aspects of update, in particular deletion, of data that other data may depend on. Remember, we are considering a modular domain in which some central piece of data (we will call that dependency data) is referenced by extension modules and domain data (we call that dependent data) that were not specifically considered in the conception of the shared data.

In our example we chose a domain model of some Equipment management core system (the dependency data) that is referenced by extension modules and data definitions, in our case Inspections and Schedules (the dependent data), of a health care extension to that core system:

When equipment data changes or gets deleted, dependent inspection data may easily become invalid. For example, a change of equipment type may mean that inspection types or schedules need to be updated as well. And of course, inspections for deleted equipment will be obsolete altogether.

There is also the technical aspect of foreign key relationships between the inspection database table and the equipment database table. If the foreign key has become obsolete and unresolvable, all equipment information that was previously resolved by joining database tables via the foreign key relationship would be gone and cannot be presented to users anymore.

Simply put, there are two choices for handling deletions:

  • Do actually not delete records but only mark records as unavailable.
    In that case, for dependent data the original data set is still available and follow up actions may be offered to users. But the domain model becomes more complicated (e.g. w.r.t. uniqueness constraints).
  • Delete and simply cope with it.
    That means: The inspection application needs to be completely coded towards the case that dependency data may be gone. And whatever is needed to supply users with necessary information to plan follow-up action needs to be stored with the dependent data.

Similarly for updates:

  • When data is updated to an extent that dependent data becomes invalid, this situation needs to be discovered when the dependent data is visited again, so that whatever corrections are required can be planned and performed.

While these approaches look theoretically sound, there is much to be improved. Firstly, developing an application to cope with any kind of inconsistency implied by changes of dependency data will be rather complex in the general case.

Secondly, from a user perspective, it will typically be highly undesirable to even be allowed to perform updates that will lead to situations that require follow-up actions and repairs without being told beforehand.

Hence:

How can we generically handle updates of dependency data not only consistently but also in a user-friendly way?

Here is at least one approach:

A Data Lease System.

With that approach we use a shared persistent data structure that expresses the relationships between dependent and dependency data explicitly and that is known to all subsystems.

In the simplest case, a lease taken by a dependent data record on a dependency data record contains:

  1. The dependent data record key and a unique domain type identifier for which that key is valid.
    No other present or future domain use of that identifier should be possible.
  2. Likewise the dependency data record key with a corresponding domain type identifier,

( <dependent type>, <dependent key>, <dependency type>, <dependency key>)

In our example this could be something like this:

( “inspection.schedule”, “11ed12f2-8c94-41a7-9143-8d4ff6070f”, “equipment.equipment”, “c5f92006-dc40-41aa-97d0-3ae709b4aca6”)

In other words: From a database perspective, the lease is simply a shared join table structure that is annotated with additional meta-information on “what” is being talked about.
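
A minimal sketch of such a lease as a JPA entity (table, column, and class names are made up for illustration):

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.Id;
    import javax.persistence.Table;

    // Shared join table holding one lease per (dependent, dependency) pair.
    @Entity
    @Table(name = "DATA_LEASE")
    public class DataLease {
        @Id
        @GeneratedValue
        private Long id;

        // e.g. "inspection.schedule"
        @Column(name = "DEPENDENT_TYPE", nullable = false)
        private String dependentType;

        // key of the dependent record, e.g. a UUID
        @Column(name = "DEPENDENT_KEY", nullable = false)
        private String dependentKey;

        // e.g. "equipment.equipment"
        @Column(name = "DEPENDENCY_TYPE", nullable = false)
        private String dependencyType;

        // key of the dependency record
        @Column(name = "DEPENDENCY_KEY", nullable = false)
        private String dependencyKey;

        // getters and setters omitted for brevity
    }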

While updating lease information is delegated to a shared service – in order to provide additional handling for one-to-many, many-to-one, or many-to-many relationships as well as cleanup and generic check methods – from a read-access perspective the principles of Modularization And Data Relationships apply in full, and the table structure should indeed be used as a join structure establishing the actual relationship.

That is, in the example, the health care to equipment relationship is now expressed via the data lease.

Now, if we have this system in place, numerous improvements are readily available.

Giving Users a Choice

First of all, the equipment application can now trivially determine whether there are dependent data sets, how many, and of what kind (via the additional domain type identifier).

Based on that information the equipment application could offer choice to users before applying an update. For example, upon a request for deletion, the application may inform the user that the piece of equipment is still referenced by dependent data and refuse the deletion.

For updates, however, this does not remove the need to understand the implications for dependent data.

The approach becomes most useful if we pair it up with an extension point approach, in which the domain type identification proposed above is used to look up implementations of callback interfaces provided by the extension modules of the subsystem owning the dependent data.

Giving Power to the Extension

Once we can generically identify the owner of the lease, we can pass on some decision support to the dependent subsystem. In particular: Upon deletion or update, the dependent subsystem may be made aware and analyze exactly what the implications would be.

We can remove all need for lazy responses to changes of dependency data by involving the dependent subsystem in the update processing, both in validating the update and eventually in handling it.
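
A hedged sketch of what such a callback contract might look like (interface, enum, and method names are made up; implementations would be looked up via the dependent domain type identifier of the lease):

    import java.util.List;

    // Implemented by the subsystem owning the dependent data (e.g. the health care extension)
    // and looked up by the dependency data's owner via the dependent domain type identifier.
    public interface DependentDataCallback {

        // Result of validating a pending change to dependency data.
        enum Verdict { OK, OK_WITH_CHANGES, REJECT }

        // Called before a dependency record is deleted; keys identify the affected dependent records.
        Verdict validateDeletion(String dependencyType, String dependencyKey, List<String> dependentKeys);

        // Called before a dependency record is updated.
        Verdict validateUpdate(String dependencyType, String dependencyKey, List<String> dependentKeys);

        // Called after the change went through, so the dependent data can be adjusted or removed.
        void apply(String dependencyType, String dependencyKey, List<String> dependentKeys);
    }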

In our example, a change to an equipment type would be validated by the inspection application. The type change may be unacceptable, upon which the user should be informed that the change is not possible or that the update would disable an inspection schedule. Or, in some other cases, the user would be told that it is OK to proceed, but that some settings of dependent inspection data would be altered automatically.

Similarly, a deletion of an equipment record may be rejected, or the user be told that all related inspection data would also be deleted as a consequence.

In any case: Inspection data would be consistent and no lazy checks required anymore.

There is still much to be defined for the specific software system of course. Enjoy designing!

Modularization And Data Relationships

A lot of posts in this blog are about structuring largish applications via one or another modularization approach. Size is not the only reason to modularize though. Another important reason to split into independent subsystems is optionality. For example, given a core product, industry-specific extensions may exist that rely on the core system’s APIs and data models but comprise significant code and data structures of their own, enough to justify an independent software life cycle. While code dependencies have been treated extensively in previous posts, we have not looked much at data relationships.

This post addresses data model dependencies between software subsystems of a modularized software system.

Data Model != Code

When we talk about data model dependencies between subsystems, we are talking about data types, in particular data types representing persistent data, that are exposed by one subsystem to be used by other subsystems.

Exposing data types, i.e. making the knowledge of their definition available between subsystems, is typically done using an API contract – be it in a regular programming language or in some data description language such as XML Schema.

Making data types available between subsystems is only one side of the story however. Eventually data described by the model will need to be exchanged and combined with other data. Typically describing data and providing access to data is part of a subsystem API and combining data from different subsystems is done by the caller:

In many cases, however, this is not good enough: If subsystems share the same database, losing the ability to query data across domain definitions of subsystems can be a real showstopper for modularization.

An Example

Consider the following hypothetical example: An application system’s core functionality is to manage equipment records, such as for computers, printers, and some machinery. An extension to the core system provides functionality specific to the health care industry. In health care, we need to adhere to more regulation than the core system offers. For example, we need to observe schedules for inspections by third-party inspectors.

The health care extension hence has its own database structures that refer to data in the core system. It uses database foreign keys for that. That is, in its database tables we find fields that identify data records in the core system’s database tables.

Now, for a single inspection schedule that refers to one or more pieces of equipment, looking up the individual equipment via the core system poses no problem. Answering a question such as “what are the top ten pieces of equipment of some given health care inspection schedule with the most service incidents” is a different case. Providing efficient, sortable access to any such combined data for end users via independent data retrieval is hard, if not a pointless goal, in the general case. This is what joins in the relational database world are for, after all.

Joining Data between subsystems

Given the construction above, we want to extend the core system’s API to support not only single data lookups but also a query abstraction for more clever data retrieval, in particular so that combined data queries, i.e. SQL joins, are computed on the database rather than in memory.

Unfortunately, a query interface that would include data defined outside of the core application’s data model can typically not be expressed easily.

We have a few choices for providing meaningful access to joinable data between subsystems, however.

Exposing Data

For one, the core application may simply document database table or database view structures to be used by other subsystems for querying its data.

This way, the health care extension would extend its own database model by those portions of the core system’s database model that are a) of relevance to the extension and b) part of the new API contract that includes these data access definitions. This would only be used for read-only access, as there is no natural way for the extension to know what other data update and validation logic may be implemented in the core system.

Something more integrated is possible using the Java Persistence API, and should be similarly (if not better) available in Microsoft’s .NET framework via the LINQ and LINQ to SQL features.

Views in Modular JPA

When using the Java Persistence API (JPA), an underlying relational data model is mapped onto a set of Java class definitions and some connecting relationships. Together, the mapping information and some more provider- and database-specific configuration are called a persistence unit. At runtime, all operations of a JPA entity manager, such as performing queries and updates, run within the scope of a persistence unit.

In our example, we would have a core persistence unit for equipment management and one persistence unit for the health care extension. As a principle, both would be private matters of the respective subsystem. We would not want the health care extension to make use of non-public persistence definitions of the core model as that would break any hope for a stable contract between core and extension.

As such, there is no simple sharing mechanism for persistence units that would allow exposing a subset of the core persistence unit to other subsystems as part of an API contract.

A close equivalent of a simple database view that still stays within the JPA model, however, is read-only JPA entity classes that can be included in an extension’s persistence unit definition.

That is, we would have the same data types used in different persistence units. For the extending subsystem, those types appear as a natural extension of its own data model and can hence be used in joining queries, while being defined in, and hence naturally fitting, the core’s domain model.
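
As a hedged sketch (entity, table, and column names are made up for illustration), such a read-only mapping of the core’s equipment table, included in the health care extension’s persistence unit, could look like this:

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    // Read-only view of the core system's EQUIPMENT table, included in the
    // health care extension's persistence unit purely for querying and joining.
    @Entity
    @Table(name = "EQUIPMENT")
    public class EquipmentView {
        @Id
        @Column(name = "ID")
        private String id;

        // Mapped read-only: the extension never writes core data.
        @Column(name = "TYPE", insertable = false, updatable = false)
        private String type;

        @Column(name = "SERVICE_INCIDENTS", insertable = false, updatable = false)
        private int serviceIncidents;

        public String getId() { return id; }
        public String getType() { return type; }
        public int getServiceIncidents() { return serviceIncidents; }
    }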

As a mix of Entity-Relationship and Class Diagram this would look like this:

Highlighting the scope of persistence units:

Now we are at a point where data access and sharing between subsystems is well-defined and as efficient as if the data was not split into separate domain models.
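
For example, the “top pieces of equipment with the most service incidents for a given schedule” question from above can now be answered with a single JPQL join query in the extension’s persistence unit (the persistence unit name, the InspectionSchedule entity and its equipment relation, and the attribute names are assumptions following the sketch above):

    import java.util.List;
    import javax.persistence.EntityManager;
    import javax.persistence.Persistence;

    public class TopEquipmentQuery {
        public static void main(String[] args) {
            // Open the extension's persistence unit (name is illustrative).
            EntityManager em = Persistence
                .createEntityManagerFactory("healthcare")
                .createEntityManager();
            // Join the extension's own entity with the read-only core mapping
            // and let the database do the sorting.
            List<EquipmentView> top = em.createQuery(
                    "select e from InspectionSchedule s join s.equipment e " +
                    "where s.id = :scheduleId order by e.serviceIncidents desc",
                    EquipmentView.class)
                .setParameter("scheduleId", "some-schedule-id")
                .setMaxResults(10)
                .getResultList();
            top.forEach(e -> System.out.println(e.getId() + " (" + e.getServiceIncidents() + ")"));
            em.close();
        }
    }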

It’s time to move on to the next level.

Data Consistency – or what if data gets deleted?

When a system is split into subsystems, responsibility for data management gets split as well. Let’s take a look at our example.

In the health care extension to the equipment management system, inspection schedules, stored in the extension’s domain model, refer to equipment data stored in the core application’s domain model. Based on the ideas above, the health care extension can efficiently integrate the core data model into its own.

But then, what happens on updates or on deletions issued by the core application’s equipment management user interface? It would be simplistic to assume that there are no restrictions imposed by the health care extension. Here are some possible considerations:

  • Deletion of equipment should only be possible if some state has been cleared in the extension.
  • Updating equipment data might be subject to extended validation in the extension.
  • Can the extension subsystem be inactive and miss changes by the core application, so that it would end up with logically inconsistent data?

We will look into these exciting topics in a next post: Updates in Modular Data