spring batch quartz example

An example which is meant to show how Spring Batch and Quartz can address the key issues of batch processing and job scheduling in a clustered environment.

ilya40umov

Java

spring-batch-quartz-example

Summary

An example which is meant to show how Spring Batch and Quartz can address the key issues of batch processing and job scheduling in a clustered
environment.

Used Technologies

Spring Framework (IoC, JDBC, Transactions, etc.), Quartz Scheduler, Spring Batch, MySQL.

Project details

How to run

The application contains two “main” classes:

org.sbq.batch.mains.ActivityEmulator emulates user activity. It is only needed to keep adding new data to the DB.
Run only one instance of this class (this part does not deal with clustering).
org.sbq.batch.mains.SchedulerRunner is meant to be run in multiple instances concurrently in order to simulate several nodes in a cluster.

Simulated environment

The example is meant to test the following environment: several servers (at least 2 nodes) running in a cluster against an RDBMS (hopefully
clustered as well), which have to perform certain batch tasks periodically and need fail-over, work distribution, etc.

Implemented Jobs

CalculateEventMetricsScheduledJob: calculates the number of occurrences of each event type since the last job run and updates the site
statistic entry (which holds the metrics of the site since its start); triggered every 5 minutes; saves the point where it finished (the time
up to which it has processed events); misfire policy ‘FireOnceAndProceed’, meaning that a single call to the job is enough for any number of
misfires.

CalculateOnlineMetricsScheduledJob: calculates the total number of users online and the number of users jogging, chatting,
dancing and idle at a certain point in time; triggered every 15 seconds;
uses ScheduledFireTime from Quartz to identify the point in time for which it should calculate the metrics; misfire policy ‘IgnoreMisfire’,
meaning that all missed executions are fired as soon as Quartz identifies them (so that the job can catch up);
this job randomly (in ~1/3 of cases) throws an exception (TransientException) in order to emulate network issues.
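
The difference between the two misfire policies can be illustrated with a small plain-Java simulation. This is only a simplified model of the behaviour described above, not Quartz code: the class MisfireSimulation, its method names and the use of epoch milliseconds are assumptions made for this sketch, and details such as the Quartz misfire threshold are ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class MisfireSimulation {

    /** Scheduled fire times that fall in the window (lastFired, now]. */
    static List<Long> missedFireTimes(long lastFired, long intervalMs, long now) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        List<Long> missed = new ArrayList<>();
        for (long t = lastFired + intervalMs; t <= now; t += intervalMs) {
            missed.add(t);
        }
        return missed;
    }

    /** 'FireOnceAndProceed' style: one execution, however many firings were missed. */
    static List<Long> fireOnce(long lastFired, long intervalMs, long now) {
        List<Long> missed = missedFireTimes(lastFired, intervalMs, now);
        return missed.isEmpty() ? missed : Collections.singletonList(now);
    }

    /** 'IgnoreMisfire' style: every missed firing is executed with its own scheduled time. */
    static List<Long> catchUp(long lastFired, long intervalMs, long now) {
        return missedFireTimes(lastFired, intervalMs, now);
    }

    public static void main(String[] args) {
        // All nodes were down for one minute; the trigger interval is 15 seconds.
        System.out.println("fire once: " + fireOnce(0L, 15_000L, 60_000L));
        System.out.println("catch up:  " + catchUp(0L, 15_000L, 60_000L));
    }
}
```

In this model the event-metrics job would need only the single call, because it remembers where it finished, while the online-metrics job needs every scheduled time, because each execution calculates metrics for one specific point in time.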

Addressed Scenarios

Scheduler: No single point of failure

Use case: Make sure that if one node goes down, the scheduled tasks are still being executed by the rest of the nodes.

How supported/implemented: Quartz should be running on each machine in the cluster.
Each Quartz instance should be configured to work with a DB-backed JobStore, and clustering should be enabled in the Quartz properties.
As long as at least 1 node with Quartz is up, the scheduled tasks keep being executed (guaranteed by the Quartz architecture).
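
As a rough illustration of the configuration described above, the sketch below collects the Quartz properties that typically enable a clustered, JDBC-backed JobStore. The property keys are standard Quartz settings, but the scheduler name, the check-in interval and the class name ClusteredQuartzProps are assumptions made for this example; the project's actual configuration may differ, and the data source settings are omitted.

```java
import java.util.Properties;

final class ClusteredQuartzProps {

    /** Builds the Quartz settings that every node of the cluster would share. */
    static Properties build(String schedulerName) {
        Properties p = new Properties();
        // All nodes must use the same scheduler name to form one cluster.
        p.setProperty("org.quartz.scheduler.instanceName", schedulerName);
        // AUTO lets Quartz generate a unique instance id for each node.
        p.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        // JDBC-backed job store instead of the default in-memory RAMJobStore.
        p.setProperty("org.quartz.jobStore.class",
                "org.quartz.impl.jdbcjobstore.JobStoreTX");
        p.setProperty("org.quartz.jobStore.driverDelegateClass",
                "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
        p.setProperty("org.quartz.jobStore.tablePrefix", "QRTZ_");
        // Turns on clustering; nodes coordinate through the shared database.
        p.setProperty("org.quartz.jobStore.isClustered", "true");
        // How often (ms) a node checks in with the cluster; assumed value.
        p.setProperty("org.quartz.jobStore.clusterCheckinInterval", "20000");
        return p;
    }

    public static void main(String[] args) {
        // "sbq-cluster" is a hypothetical scheduler name.
        build("sbq-cluster").forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Such a Properties object could be handed to Quartz's StdSchedulerFactory or, in a Spring setup, to SchedulerFactoryBean via setQuartzProperties.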

Steps to verify: Run init.sql. Start one instance of ActivityEmulator (optional). Start several instances of SchedulerRunner.
Watch them execute jobs. Kill some of them. See how the load is spread between the nodes that are left running.

Scheduler: Work distribution

Use case: Make sure that the tasks are getting distributed among nodes in the cluster.
(This is important because beyond a certain load a single node won’t be able to handle all tasks.)

How supported/implemented: Quartz with a DB-backed JobStore performs work distribution automatically.

Steps to verify: Run init.sql. Start one instance of ActivityEmulator (optional). Start several instances of SchedulerRunner.
Check the log of each SchedulerRunner instance to verify that tasks are executed on every node (the distribution is not guaranteed to
be even).

Scheduler: Misfire Support

Use case: Make sure that if all nodes go down and, after a while, at least one comes back online,
all missed job executions (for the particular jobs which are sensitive to misfires) are invoked.

How supported/implemented: Quartz with a DB-backed JobStore detects misfired jobs automatically when the first node of the cluster
starts up.

Steps to verify: Run init.sql. Start one instance of ActivityEmulator (optional).
Start several instances of SchedulerRunner. Stop all instances of SchedulerRunner. Wait for some time.
Start at least one instance of SchedulerRunner. See how the misfired executions are detected and executed.

Scheduler: Task Recovery

Use case: Make sure that if a node executing a certain job goes down, the job is automatically repeated/re-started.

How supported/implemented: This use case is tricky because a server crash is likely to leave the job in an unknown state (especially if it
writes data into non-transactional storage like Mongo). For now I assume the simplest case, where the job just has to be restarted and
possible data collisions can be ignored. By using the requestRecovery feature of Quartz together with a SYNCHRONOUS executor (which uses the
Quartz thread for batch processing), we can rely on Quartz to identify crashed jobs and re-invoke them on a different node (or on the same one
if it’s up and is the first one to identify the problem).

NOTE: I think that a smoother transition for job recovery can be achieved by storing the job state in the ExecutionContext, which is picked
up by Spring Batch when you create a new execution for the same job instance.
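
The checkpointing idea from the note can be sketched in plain Java. Here an in-memory Map stands in for the Spring Batch ExecutionContext, which is an assumption made purely for illustration: the real ExecutionContext is persisted by Spring Batch, whereas this map would not survive an actual crash. The class CheckpointedJob, the key name and the crashAfter parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CheckpointedJob {

    /** Hypothetical context key holding the index of the next unprocessed item. */
    static final String NEXT_INDEX = "next.index";

    /**
     * Processes items starting from the saved checkpoint and returns how many
     * items were handled in this run. After crashAfter items the run stops
     * early, which simulates the node going down.
     */
    static int run(List<String> items, Map<String, Object> context, int crashAfter) {
        int start = (Integer) context.getOrDefault(NEXT_INDEX, 0);
        int processed = 0;
        for (int i = start; i < items.size(); i++) {
            if (processed == crashAfter) {
                return processed; // simulated crash: the checkpoint stays in the context
            }
            // ... the real work on items.get(i) would happen here ...
            context.put(NEXT_INDEX, i + 1); // save the checkpoint after each item
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c", "d", "e");
        Map<String, Object> context = new HashMap<>();
        System.out.println("first run handled  " + run(items, context, 2));
        System.out.println("second run handled " + run(items, context, Integer.MAX_VALUE));
    }
}
```

The second run resumes at the third item instead of starting over, which is the behaviour a restarted execution of the same job instance would ideally have.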

Steps to verify: Run init.sql. Start one instance of ActivityEmulator (optional). Start several instances of SchedulerRunner.
Look at the logs to find out which SchedulerRunner is running LongRunningBatchScheduledJob and kill it. After a while another instance logs a
message that it has picked up the job (this can also be verified in the DB by looking at the executions table).

Spring Batch: Retries Support

Use case: Retry a job if it fails due to a transient problem (such as a network connectivity issue, or the DB being down for a couple of minutes).

How supported/implemented: Spring Batch provides RetryTemplate and RetryOperationsInterceptor for this purpose,
which allow you to specify the number of retries, the back-off policy and the types of exceptions which are considered retry-able.
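
The idea behind retrying a transient failure can be sketched in plain Java. This is not RetryTemplate or RetryOperationsInterceptor, only a hand-written stand-in with a fixed back-off; SimpleRetry, its parameters and the local TransientException class are hypothetical names introduced for illustration.

```java
import java.util.function.Supplier;

final class SimpleRetry {

    /** Local stand-in for the project's TransientException; assumed to be unchecked. */
    static class TransientException extends RuntimeException {
        TransientException(String message) {
            super(message);
        }
    }

    /**
     * Calls the action up to maxAttempts times, sleeping backOffMs between
     * attempts. Only TransientException is treated as retry-able; any other
     * exception propagates immediately.
     */
    static <T> T execute(Supplier<T> action, int maxAttempts, long backOffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        TransientException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (TransientException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(backOffMs); // fixed back-off before the next attempt
                }
            }
        }
        throw last; // all attempts failed with a transient error
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = execute(() -> {
            if (++calls[0] < 3) {
                throw new TransientException("simulated network issue");
            }
            return "metrics calculated";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```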

Steps to verify: Run init.sql. Start one instance of ActivityEmulator (optional). Start several instances of SchedulerRunner.
In the logs you should see “calculateOnlineMetrics() - TRANSIENT EXCEPTION…”, which indicates that an exception was thrown but the method of
the Service class was retried by RetryOperationsInterceptor.

General: Monitoring

Use case: There should be an easy way to get the following info at any point in time:
a list of all jobs which are being executed at the moment, the history of all job executions (with parameters and execution results, i.e.
success/failure), and a list of all scheduled jobs (e.g. the next time a particular job runs).

How supported/implemented: All of this information can be obtained from the Quartz and Spring Batch abstractions in Java code. In some
cases you can also look into the DB to find out the status of running jobs, the history, etc. There is also the Spring Batch Admin web-app,
which can be used for this purpose.

Steps to verify: see ‘How supported/implemented’ section.

Un-Addressed Scenarios

General: Execution Management

Q: How do I manually re-execute a particular job (with given parameters) if it fails completely (i.e. no luck after N auto-retries)?

A: Not implemented at the moment. We should consider using JMS to deliver a command to the cluster of batch processing nodes.
A JMS listener would then trigger the specified Spring Batch job.

General: Graceful Halt

Q: How can I signal to all nodes to stop, so that I can deploy a new version of software, do maintenance etc.?

A: I think this should also be done via a JMS message (sent to a topic!). Upon receiving the message each node should: a) stop Quartz; b)
wait for all jobs which don’t support re-start to finish; c) stop all jobs which support re-start (the jobs which can save the point where they
left off and resume from that point). Also see http://numberformat.wordpress.com/tag/batch/ for some info on graceful stop.