After looking at the various applications across the company, whether product or service applications, there is currently no mechanism to capture and record failures, performance issues, and statistics that could help improve the product. We only learn of problems when customers report them or someone on the team finds them by accident. There is also no mechanism to trigger automatic repair/warm-up scripts and notify product owners in the event of an application failover. I am therefore proposing a product similar to Zabbix, so that we can manage applications/services proactively rather than waiting until they fail.
Let's say we release a product/service to various clients/customers after all the testing (regression test, performance test, pen test, load test, UAT, etc.), and we are happy that the customers have not reported any issues. After a couple of months, the clients/customers push for a patch with a bunch of features to be implemented as soon as possible. Since it's a patch, we decide to skip the load/performance tests for this release, and the customers are happy. In the background, a month later, one of the administrators logs in to the server and notices that CPU/memory usage is spiking heavily, but thinks this is quite usual because it does not break the server/application services.
Problem (not until something goes wrong)
As a product owner, I won't know until the server or application crashes. When this happens, the big panic button gets pushed and we run around trying to find out why it happened.
Do we really have the data to go back to when this problem started?
Is this because of the patch?
This might raise many such questions in the war room.
We look at various logs and ask customers/users whether they noticed any performance issues and, if so, when. Then we identify the patches and Windows updates that were applied, along with all the related application code changes to various components, and a lot more. We end up investing a great deal of time and resources to figure out that one change in a third-party component, made for application X, has affected application Y.
WHAS (When Happened Act Smart)
With Zabbix or a similar tool in place, and the right parameters/checkpoints configured, we would have been notified about the CPU/memory spikes and could have taken appropriate action; at the same time, it would be easy to identify the changes that happened around that time frame. Even if we missed the notification, we would be able to go back and check the metrics history, because the data is already in place. Quick response, easy troubleshooting, and the customer/user never knows there was a problem.
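To make the idea concrete, here is a minimal sketch in plain Python of what such a check amounts to. It is not Zabbix configuration or any Zabbix API; the function names, the 90% threshold, the one-day window, and the sample data are all made up for illustration. It finds when CPU usage first stayed above a threshold, then lists the recorded changes made close to that time.

```python
from datetime import datetime, timedelta


def find_spike_start(samples, threshold=90.0, min_consecutive=3):
    """Return the timestamp at which usage first stays above `threshold`
    for `min_consecutive` samples in a row, or None if it never does."""
    run_start = None
    run_length = 0
    for timestamp, value in samples:
        if value > threshold:
            if run_length == 0:
                run_start = timestamp
            run_length += 1
            if run_length >= min_consecutive:
                return run_start
        else:
            run_length = 0  # a single dip resets the run
    return None


def changes_near(moment, changes, window=timedelta(days=1)):
    """Return the names of changes recorded within `window` of `moment`."""
    return [name for when, name in changes if abs(when - moment) <= window]


# Hypothetical data: CPU usage (%) sampled every 5 minutes, plus a change log.
start = datetime(2024, 1, 1, 12, 0)
readings = [35.0, 40.0, 95.0, 42.0, 93.0, 96.0, 97.0, 94.0]
cpu_samples = [(start + timedelta(minutes=5 * i), v) for i, v in enumerate(readings)]
change_log = [
    (datetime(2023, 12, 1, 9, 0), "Windows update"),
    (datetime(2024, 1, 1, 8, 0), "feature patch"),
]

spike_start = find_spike_start(cpu_samples)
suspects = changes_near(spike_start, change_log) if spike_start else []
print(spike_start, suspects)
```

With the sample data above, the single reading of 95% is ignored as a one-off, the sustained spike is dated to 12:20, and only the feature patch falls inside the one-day window. A real monitoring tool would do this continuously and send the notification for us; the sketch only shows why having the history stored makes the war room questions answerable.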
If a similar tool is already in place, please promote its use more widely across teams, so that we can take advantage of it and put the company at the top in delivering products/services efficiently. If no such tool is in place, let's get started with Zabbix (or a similar tool). The time for action is now; it's never too late to do something.