Thursday, 20 June 2013

Oracle Java: a small success

With the Oracle version of Java installed, the error no longer seems to occur.
I will test this more thoroughly in a run that will go on overnight.

EDIT: The tests confirm that this is indeed a success, which leads us to conclude that the problem lies in OpenJDK.

Some more results

The error only appeared when dmtcp_restart_script.sh was run through a Java ProcessBuilder, and I never got any error message or other indication of what went wrong.
I therefore decided to build a small test system that mimics this behavior without having to start up the complete CBAS system.
This system consists of the ProcessBuilder and the S3StreamGobblers running the restart script.
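A minimal sketch of what such a test harness could look like. The names RestartHarness, StreamGobbler and Result are hypothetical stand-ins (the real S3StreamGobbler is not shown here), and the script location in main is an assumption. The gobblers drain stdout and stderr on their own threads, so the child process cannot block on a full pipe and any error output is captured.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RestartHarness {

    /** Drains one stream on its own thread and keeps the lines it read. */
    static final class StreamGobbler extends Thread {
        private final InputStream in;
        private final List<String> lines = Collections.synchronizedList(new ArrayList<>());

        StreamGobbler(InputStream in) {
            this.in = in;
        }

        @Override
        public void run() {
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            } catch (IOException e) {
                lines.add("[gobbler failed: " + e + "]");
            }
        }

        List<String> lines() {
            return new ArrayList<>(lines);
        }
    }

    /** Exit code plus everything the child wrote to stdout and stderr. */
    static final class Result {
        final int exitCode;
        final List<String> stdout;
        final List<String> stderr;

        Result(int exitCode, List<String> stdout, List<String> stderr) {
            this.exitCode = exitCode;
            this.stdout = stdout;
            this.stderr = stderr;
        }
    }

    /** Starts the command, gobbles both streams, and waits for the child to exit. */
    static Result run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command).start();
            StreamGobbler out = new StreamGobbler(process.getInputStream());
            StreamGobbler err = new StreamGobbler(process.getErrorStream());
            out.start();
            err.start();
            int exitCode = process.waitFor();
            out.join();
            err.join();
            return new Result(exitCode, out.lines(), err.lines());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the restart script sits in the working directory.
        Result result = run(List.of("sh", "./dmtcp_restart_script.sh"));
        System.out.println("exit code: " + result.exitCode);
        result.stderr.forEach(line -> System.out.println("stderr: " + line));
    }
}
```

Keeping stderr separate from stdout makes it easier to spot output such as the JVM crash report among the regular counter lines.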

And lo and behold, the error appeared again, but now with additional error output!
Unfortunately it seems to be a JVM problem:

20/06 17:36:43 :: 418
20/06 17:36:48 :: 419
20/06 17:36:53 :: 420
#

[error occurred during error reporting (printing fatal error message), id 0x4]

#
#  SIGILL (0x4) at pc=0x00007f2960897dd8, pid=40000
[error occurred during error reporting (printing current thread and pid), id 0x4]

#

[error occurred during error reporting (printing Java version string), id 0x4]


[error occurred during error reporting (printing problematic frame), id 0x4]

#
[error occurred during error reporting (printing core file information), id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[Too many errors, abort]

I am currently testing with the Oracle JVM to see whether it shows the same behavior.
I also want to compare this with a regular ProcessBuilder without the S3StreamGobbler.

Wednesday, 19 June 2013

Manual snapshotting fails

As a final attempt to get some trustworthy behavior from DMTCP, I changed the checkpointing system to use dmtcp_command.
Instead of letting DMTCP checkpoint automatically after a given period of time, I now time it myself and issue the checkpoint command, which can be done with dmtcp_command -c.
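A rough sketch of how the self-timed checkpoint could be driven from Java. The class ManualCheckpointer and its methods are hypothetical, and the sketch assumes dmtcp_command is on the PATH and can reach the coordinator with its default settings. Only the -c flag mentioned above is used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class ManualCheckpointer {

    /** The checkpoint request as issued on the command line. */
    static final List<String> CHECKPOINT_COMMAND = List.of("dmtcp_command", "-c");

    /** Runs the command once and returns its exit code. */
    static int runOnce(List<String> command) {
        try {
            // inheritIO() forwards the child's output, so no pipe can fill up unread.
            Process process = new ProcessBuilder(command).inheritIO().start();
            return process.waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for command", e);
        }
    }

    /** Issues the command repeatedly, waiting `period` between the end of one run and the next. */
    static ScheduledExecutorService schedule(List<String> command, long period, TimeUnit unit) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(() -> {
            try {
                int exitCode = runOnce(command);
                if (exitCode != 0) {
                    System.err.println("checkpoint command exited with " + exitCode);
                }
            } catch (RuntimeException e) {
                // Swallow the failure so the timer keeps running for the next attempt.
                System.err.println("checkpoint command failed: " + e);
            }
        }, period, period, unit);
        return timer;
    }
}
```

A call such as `ManualCheckpointer.schedule(ManualCheckpointer.CHECKPOINT_COMMAND, 30, TimeUnit.MINUTES)` would request a checkpoint every 30 minutes until the returned executor is shut down.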

With this change, restarts sometimes crash with exit code 134.
The lack of consistent behavior is very frustrating, and I have no explanation for it: in some cases the restart works, in others it fails with the exit code above.

I also noticed that whenever a restart does work, no second snapshot is taken anymore.
So I will remove the new dmtcp_command-based method again.

Whenever I try it manually it seems to work, so perhaps it has something to do with output not being read?
I have not found any information about this causing the child process to crash.

Tuesday, 18 June 2013

All Java tests failed once again

For some inexplicable reason all of my Java tests have failed, with both the VM checkpointing and the DIRECTORY archiving strategy.

They simply stopped the checkpointing procedure for some unknown reason.
Examination of the output shows a lot of restarted calculations.
But something is very odd about them: they always restart from the same point and continue to the same point, after which they restart once more.

For example: a counter that should be going from 0 to 3600 is now stuck between 360 and around 838.
Then it starts again from 360, with no checkpoint output in between either.
This is very strange, because counting from 360 to 838 with 5 seconds between counts takes almost 40 minutes (478 × 5 s = 2390 s, or 39 minutes and 50 seconds), while a snapshot should start after 30 minutes.
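The timing argument can be checked with a few lines of Java. The class name CounterWindow is hypothetical; the values 360, 838 and the 5-second step come from the observation above.

```java
import java.time.Duration;

final class CounterWindow {

    /** Time needed to count from `from` to `to` with one `step` between consecutive counts. */
    static Duration elapsed(int from, int to, Duration step) {
        return step.multipliedBy((long) to - from);
    }

    public static void main(String[] args) {
        Duration window = elapsed(360, 838, Duration.ofSeconds(5));
        Duration checkpointInterval = Duration.ofMinutes(30);
        System.out.println("window: " + window.toMinutes() + " min "
                + (window.getSeconds() % 60) + " s");
        // True here: the counter ran well past the point where a snapshot was due.
        System.out.println("longer than checkpoint interval: "
                + (window.compareTo(checkpointInterval) > 0));
    }
}
```

The window is just under 40 minutes, so at least one checkpoint should have been taken in every cycle.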


Whenever a start and restart are executed manually, there seem to be no problems.

More tests have shown that checkpointing indeed somehow stops working after a restart.
I want to declare DMTCP too buggy for further use.
With some luck you can get your application fully operational, but this unreliability is very annoying.
No clear explanation was found for the observed behavior.
This requires either more thorough knowledge of DMTCP's internals or simply waiting for a more stable build.

Sunday, 16 June 2013

New test versions

The backup system is in place and has already gone through several test runs.
Unfortunately the results are not very good: the problems keep appearing.
I also noticed by accident that taking multiple checkpoints before a single restart seems to greatly reduce the number of errors.
In fact, when taking snapshots every 5 minutes, there was not a single error in the restart procedure. (This interval is nearly problematic for the VM checkpointing strategy, since that checkpoint takes around 2.5 minutes and could take even longer in the worst case.)
However, the runtime with the VM checkpointing strategy doubled.

I also discovered by coincidence that the DMTCP developers have changed their tests to incorporate my proposed solution, but I have not heard much more about this. My current Java tests show no sign of problems with DMTCP when my fix (-XX:-UsePerfData) is in place.

Tuesday, 11 June 2013

Test problems

Errors keep occurring during the different tests.
The latest problem is that a started worker seems unable to inform the Master.
This creates a loop in which the Master keeps trying to restart the worker and the worker remains unable to inform the Master.
The strange thing is that the ping messages keep arriving at the Master, so we can be sure that the worker has started successfully. So either something inside the Master is blocking or the SNS system has stopped working. Since I suspect the Master is the problem, I have moved the message handling into its own thread to prevent further blocking.
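A minimal sketch of threaded message handling along these lines. The class MessageDispatcher and the String message type are hypothetical; the actual Master code and its SNS integration are not shown in this log. The receive loop only hands each message to a thread pool, so one slow or stuck handler cannot hold up the messages behind it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class MessageDispatcher {
    private final ExecutorService pool;
    private final Consumer<String> handler;

    MessageDispatcher(int threads, Consumer<String> handler) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.handler = handler;
    }

    /** Called from the receive loop; queues the message and returns immediately. */
    void dispatch(String message) {
        pool.submit(() -> {
            try {
                handler.accept(message);
            } catch (RuntimeException e) {
                // Report the failure instead of losing it silently inside the pool.
                System.err.println("handler failed for '" + message + "': " + e);
            }
        });
    }

    /** Stops accepting messages and waits for running handlers; true if all finished in time. */
    boolean shutdownAndWait(long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With more than one thread in the pool, handlers can run concurrently, so any state they share (for example the list of known workers) would need to be thread-safe.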

I have also noticed that restart errors occur with the directory checkpointing method as well, albeit less frequently.

The VM snapshotting method is still just as unreliable. It would be nice if we could check whether a given snapshot is "good". Perhaps this is something to suggest to the DMTCP developers?

The developers have also contacted me again about the Java issues and have acknowledged that there are indeed problems, not only with Java but in some other cases as well. They are currently trying to fix those.

I have been thinking about using the next-to-last snapshot and will start implementing this method.
Until now I have been hesitant to use it because of the additional management difficulties and the large overhead of going back so far. But the VM checkpointing is just not stable enough and really requires it.
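One possible shape for the bookkeeping this needs, as a sketch only. The class SnapshotHistory and the use of plain String identifiers are assumptions; how snapshots are actually named and stored is not described here. The history keeps a bounded number of snapshots, newest first, so the next-to-last one is still available when the latest turns out to be unusable.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.Optional;

final class SnapshotHistory {
    private final int capacity;
    private final Deque<String> snapshots = new ArrayDeque<>(); // newest first

    SnapshotHistory(int capacity) {
        if (capacity < 2) {
            throw new IllegalArgumentException("need room for at least two snapshots");
        }
        this.capacity = capacity;
    }

    /** Registers a finished snapshot and drops the oldest ones beyond the capacity. */
    void record(String snapshotId) {
        snapshots.addFirst(snapshotId);
        while (snapshots.size() > capacity) {
            snapshots.removeLast();
        }
    }

    /** The most recent snapshot, if any. */
    Optional<String> latest() {
        return Optional.ofNullable(snapshots.peekFirst());
    }

    /** The snapshot taken just before the latest one, if at least two are known. */
    Optional<String> nextToLast() {
        if (snapshots.size() < 2) {
            return Optional.empty();
        }
        Iterator<String> it = snapshots.iterator();
        it.next(); // skip the latest
        return Optional.of(it.next());
    }
}
```

The cost mentioned above shows up directly: every dropped entry still has to be deleted from storage, and a restart from nextToLast() loses one extra checkpoint interval of work.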

Friday, 7 June 2013

Scheduler

The scheduler that will be used is the one written by Kurt Vermeersch.
This scheduler uses static data about Amazon EC2 and information about a task in order to schedule it.
It is mainly suited for scheduling multiple tasks that need to be run at the same time.
An initial integration of the scheduler will not be able to use this feature though.
Some new considerations should be made:

- In the current CBAS system single tasks are supplied to be executed.
- These are then scheduled to execute with a minimal cost.
- This scheduling is done independently from the other tasks.

This is actually a basic implementation and should be reconsidered.
I would suggest a system where the master waits for a given amount of time, collects all the jobs that arrive, and then combines them.
Even better would be to use all the current tasks in the scheduling process, even those that are already running.
But this would require changes to the scheduler.
If it could take into account how long a given task has already been running, the proximity to the next hourly payment and the proximity to the deadline, it should be possible to temporarily halt a task because the scheduler might know about a cheaper period that is approaching.
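Two of the inputs mentioned here are simple to compute, as the sketch below shows. It is not part of Kurt's scheduler: the class BillingWindow, both method names and the assumption that an instance is charged per started hour of running time are mine.

```java
import java.time.Duration;

final class BillingWindow {

    /** Assumption: the instance is charged once per started hour of running time. */
    static final Duration BILLING_PERIOD = Duration.ofHours(1);

    /** Time left in the hour that has already been paid for. */
    static Duration untilNextCharge(Duration running) {
        if (running.isNegative()) {
            throw new IllegalArgumentException("running time must not be negative");
        }
        long periodSeconds = BILLING_PERIOD.getSeconds();
        long usedSeconds = running.getSeconds() % periodSeconds;
        return Duration.ofSeconds(periodSeconds - usedSeconds);
    }

    /** How long a task could be halted and still finish before its deadline. */
    static Duration slack(Duration untilDeadline, Duration remainingWork) {
        return untilDeadline.minus(remainingWork);
    }

    public static void main(String[] args) {
        // A task that has run for 50 minutes still has 10 paid minutes left.
        System.out.println(untilNextCharge(Duration.ofMinutes(50)).toMinutes());
        // With 3 hours to the deadline and 90 minutes of work left, it can pause for 90 minutes.
        System.out.println(slack(Duration.ofHours(3), Duration.ofMinutes(90)).toMinutes());
    }
}
```

A scheduler could halt a task only when the slack is positive, and prefer to do so near the end of a paid hour so that little already-paid time is wasted.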

Another remark concerns the 2 workload models that are supported by Kurt's scheduler.
Only the first one is supported at the moment.


A first attempt at integrating the scheduler has been implemented.
Some more test cases are running, with Java tests included.