It has been a little while since the last blog post, and I have been debating what topic I should write about. The other night this topic suddenly came to mind, and I think there are a number of reasons for that. Some of the reasons are as follows:
I will talk about some concepts that might or might not be supported by software. I am not seeking to promote any specific software, but rather a concept. Bear in mind that even though most or all of the software required to achieve the unattended operation of AutoSys is available from CA, other vendors also have products in that space. The one really big advantage of using CA software is the close integration between the CA products.
For clarity, I will share what I mean by dark site operations. Dark site operations is also known as lights-out operation. The idea is to automate as much of your data center operations as possible, so that the data center can operate with little or no human intervention. The term lights-out operation came from the idea of automating your data center operations to such a degree that you can turn off the lights and lock the data center without having to return. We all know that this is a rather ambitious desire, but I believe that we can get very close.
I will focus on achieving an unattended AutoSys operation, not your entire data center. To achieve the unattended operation of AutoSys I have broken it into several sections below.
Job Failure monitoring
The first function or section we need to consider is job failure monitoring. This is possibly one of the easier tasks to do. Firstly, we need a systems or network monitoring solution such as CA Unicenter NSM or Netcool, to which we can send the SNMP traps for job failures. We also need a helpdesk system such as CA Service Desk or Remedy. Again, the advantage of the CA solutions is their out-of-the-box integration. When a job fails, the systems monitoring tool needs to be notified. Ideally, at that point you want to enrich the data with information such as re-run instructions, call-out procedures and helpdesk ticket queues. Once the data has been enriched, we want to automatically raise an issue on the helpdesk system and have it assigned to the correct queue for resolution. The helpdesk system will naturally take care of escalation and resolution SLA monitoring. Once the relevant support team has fixed the error that caused the job to fail, we need a mechanism for restarting the job. The ideal solution would be a process automation tool in which an automated process is triggered, acquires approval for the restart and, once approval has been obtained, automatically issues the FORCE_STARTJOB event to re-run the job. The circle is closed by resolving the helpdesk ticket when the job completes successfully. The process should also allow for the job not being re-run but the next job being force started instead, or for the job simply being changed to a success status.
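As a rough illustration of the enrichment and restart steps, here is a minimal Python sketch. It assumes the AutoSys `sendevent` command is available on the host that runs the automation. The `RUNBOOK` table, the queue names, the job names and the function names are all hypothetical and stand in for whatever your monitoring and helpdesk tools provide. The sketch only builds the command; approval and execution are left to the process automation tool.

```python
from dataclasses import dataclass

# Hypothetical enrichment data keyed by job name. In practice this would
# come from a runbook store or CMDB, not a hard-coded dictionary.
RUNBOOK = {
    "pay_load_daily": {"queue": "Payments-L2", "rerun": "Safe to re-run from the start"},
}
DEFAULT_ENTRY = {"queue": "Batch-Ops", "rerun": "Check with the application owner"}


@dataclass
class Ticket:
    job: str
    queue: str
    rerun_info: str


def enrich_failure(job):
    """Attach the helpdesk queue and re-run notes to a job-failure alert."""
    entry = RUNBOOK.get(job, DEFAULT_ENTRY)
    return Ticket(job=job, queue=entry["queue"], rerun_info=entry["rerun"])


def resolution_command(action, job, next_job=None):
    """Build the sendevent argument list for an approved resolution.

    action is one of "rerun", "skip_to_next" or "mark_success".
    """
    if action == "rerun":
        return ["sendevent", "-E", "FORCE_STARTJOB", "-J", job]
    if action == "skip_to_next":
        if not next_job:
            raise ValueError("skip_to_next needs the name of the next job")
        return ["sendevent", "-E", "FORCE_STARTJOB", "-J", next_job]
    if action == "mark_success":
        return ["sendevent", "-E", "CHANGE_STATUS", "-s", "SUCCESS", "-J", job]
    raise ValueError(f"unknown action: {action!r}")
```

Returning an argument list rather than running the command keeps the approval step outside this code, and the list can be passed to `subprocess.run` once approval has been recorded.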
AutoSys Maintenance
Now we need to consider the maintenance of AutoSys itself. Firstly, there are the DBMAINT tasks that need to be performed; by default they are run automatically by AutoSys. Because DBMaint is run internally by AutoSys, there is no notification of any failure other than log entries. So the first thing that should be done is to split the DBMAINT script into a number of AutoSys jobs, so that if there is a failure the normal job failure process takes over and someone is notified to resolve the problem. Some additional tasks should also be added to the maintenance jobs, such as the clean_files utility and the chase utility. There are also some log files that need to be archived and/or deleted, and a script should be developed to do this.
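One way to picture the split is a small generator that turns a list of maintenance steps into JIL job definitions. This is a sketch only: the job names, the machine name and the retention values are hypothetical, and the exact commands and options (`dbstatistics`, `archive_events -n`, `clean_files -d`, `chase -A -E`) should be checked against the DBMaint script and reference guide for your release.

```python
# Hypothetical maintenance steps: (job name, command). The commands and
# their options are assumptions to verify against your own DBMaint script.
MAINTENANCE_STEPS = [
    ("maint_dbstats", "dbstatistics"),
    ("maint_archive_events", "archive_events -n 7"),
    ("maint_clean_files", "clean_files -d 7"),
    ("maint_chase", "chase -A -E"),
]


def maintenance_jil(steps, machine):
    """Return JIL text with one command job per maintenance step.

    Each job raises an alarm on failure and starts only after the
    previous step has succeeded.
    """
    blocks = []
    previous = None
    for name, command in steps:
        lines = [
            f"insert_job: {name}",
            "job_type: c",
            f"command: {command}",
            f"machine: {machine}",
            "alarm_if_fail: 1",
        ]
        if previous:
            lines.append(f"condition: s({previous})")
        blocks.append("\n".join(lines))
        previous = name
    return "\n\n".join(blocks) + "\n"
```

Chaining the jobs on success is a design choice: a failed step stops the rest of the run and shows up through the normal job failure process, rather than being buried in a log.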
AutoSys error monitoring
Firstly, AutoSys-related errors that are sent via SNMP traps to the systems management tool need an automated process similar to the one that exists for job failures. Where possible, automated recovery processes can be created as the resolution. An example would be an AutoSys environment that is running in HA and experiences a failover. When the failing component is repaired, you want an automated process to run autobcp to re-synchronise your databases and start AutoSys back up. The AutoSys log files should also be monitored for error or failure messages that do not generate SNMP traps. You might also want to monitor and generate alerts for any agent machines that go offline, as that could cause delays to the batch.
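The log monitoring part could look something like the following Python sketch. It assumes that error and warning lines in the scheduler log carry message codes of the form `CAUAJM_E_nnnnn` and `CAUAJM_W_nnnnn`; if your release logs in a different format, the pattern needs to change. The function name and the shape of the returned records are hypothetical.

```python
import re

# Assumed message-code format: CAUAJM_E_<digits> for errors and
# CAUAJM_W_<digits> for warnings. Adjust to match your actual logs.
MESSAGE_CODE = re.compile(r"\bCAUAJM_([EW])_(\d+)\b")


def scan_log_lines(lines):
    """Return one record per log line that carries an error or warning code."""
    findings = []
    for number, line in enumerate(lines, start=1):
        match = MESSAGE_CODE.search(line)
        if match:
            severity = "error" if match.group(1) == "E" else "warning"
            findings.append(
                {
                    "line": number,
                    "severity": severity,
                    "code": match.group(0),
                    "text": line.strip(),
                }
            )
    return findings
```

Each record can then be forwarded to the same enrichment and ticketing flow that handles SNMP traps, so that log-only errors are not treated differently.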
AutoSys performance monitoring
You need to gather performance metrics from AutoSys and trend them automatically, so that alerts are created if performance goes outside your acceptable range. The performance metrics I am talking about here are average latency, unprocessed queue length and so on. In most instances, automated data gathering can be triggered when performance is outside the accepted boundaries, which will assist the person who gets the ticket assigned. In fewer cases there might be automated mitigation processes that could be run.
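A very simple form of trending is a rolling average compared against a threshold. The sketch below is plain Python and does not depend on AutoSys; how the latency samples are collected is left open, and the class name, window size and threshold are hypothetical.

```python
from collections import deque
from statistics import mean


class LatencyMonitor:
    """Rolling-average check for a stream of latency samples (in seconds)."""

    def __init__(self, window, threshold_seconds):
        self.samples = deque(maxlen=window)
        self.threshold = threshold_seconds

    def add(self, latency_seconds):
        """Record a sample and report whether an alert should be raised.

        An alert is raised only once the window is full and the average
        of the samples in it is above the threshold.
        """
        self.samples.append(latency_seconds)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and mean(self.samples) > self.threshold
```

Waiting for a full window avoids alerting on a single slow event; the same pattern works for unprocessed queue length.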
Automated SLA monitoring
For the SLA monitoring you would use a tool like JAWS which not only does the SLA monitoring and reporting, but can also generate alerts for SLA breaches or even possible SLA breaches. The alerts can be sent to an SNMP manager and the whole automated ticketing process can be utilised.
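To make the difference between an SLA breach and a possible SLA breach concrete, here is a small Python sketch of the idea. It is not how JAWS works internally; it simply projects the end time of a running job from its start time and average runtime and compares that with the deadline. The function name and the status labels are hypothetical.

```python
from datetime import datetime, timedelta


def sla_status(start, average_runtime, deadline, now):
    """Classify a running job against its SLA deadline.

    Returns "breached" if the deadline has passed, "at_risk" if the
    projected end time is after the deadline, otherwise "on_track".
    """
    # A job that has already overrun its average cannot finish before now.
    projected_end = max(start + average_runtime, now)
    if now > deadline:
        return "breached"
    if projected_end > deadline:
        return "at_risk"
    return "on_track"
```

An "at_risk" result is the kind of early warning that can be raised as an alert before the deadline is actually missed.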
Automated reporting
The Business Objects reporting server provided with AutoSys R11 allows you to schedule the running and delivery of defined reports. Reports can be published to a website or to a SharePoint server. Alternatively, reports can be saved to a central location or emailed to a user or distribution list. Users who require reports can be granted access so that they can run them as and when required.
Automated job promotion between environments
The next big hurdle is automating the promotion of jobs from one environment to the next. Normally there would be three or four environments that jobs need to migrate between. The typical environments would be Test or Dev, UAT or Staging, and Production. There might also be an integration testing environment between Test/Dev and UAT/Staging. The ideal way to automate the promotion process is through a business or IT process automation tool such as CA IT Process Automation Manager (ITPAM). Using an IT process automation tool makes the process more structured and provides an audit trail of each promotion. Ideally you would want to include version control for the JIL files, so that you can roll back to any known working version of the JIL. If you do not have an IT process automation tool, it can be done using a script and AutoSys jobs. I have actually implemented such a system, which I hope to migrate to an IT process automation tool in the not too distant future.
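One small piece of a script-based promotion is rewriting environment-specific values in the JIL before it is loaded into the next environment. The Python sketch below handles only the `machine` attribute and assumes each attribute sits on its own line; the machine names and the mapping are hypothetical, and this is not the system I implemented, only an illustration of the idea.

```python
import re

# Matches a line such as "machine: uat01" with the attribute on its own line.
MACHINE_LINE = re.compile(r"^(\s*machine:\s*)(\S+)\s*$")


def promote_jil(jil_text, machine_map):
    """Return JIL text with machine names swapped for the target environment.

    Raises KeyError if a machine has no mapping, so that a job is never
    promoted while still pointing at a machine from the source environment.
    """
    promoted = []
    for line in jil_text.splitlines():
        match = MACHINE_LINE.match(line)
        if match:
            source = match.group(2)
            if source not in machine_map:
                raise KeyError(f"no target machine mapped for {source!r}")
            line = match.group(1) + machine_map[source]
        promoted.append(line)
    return "\n".join(promoted) + "\n"
```

Keeping both the original and the promoted JIL under version control gives you the rollback point and part of the audit trail described above.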
Automated take-on
This section follows along similar lines to the automated job promotion section above. When I talk about automated take-on, I am referring to new applications being added to an AutoSys environment. Here again I would suggest that an IT process automation tool is the best way to achieve this. Some of the tasks you would need to automate here are defining the new machines to AutoSys via JIL, adding Windows functional accounts to autosys_secure, and defining all the EEM policies required for the new application.
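As a sketch of what the automation would work from, the Python below turns a take-on request into machine definitions in JIL plus an ordered task list for the remaining steps. The function name, the application name and the task wording are hypothetical, and the `insert_machine` definitions are deliberately minimal; any further machine attributes your release needs would have to be added. The autosys_secure and EEM steps are left as tasks because how they are scripted depends on your environment.

```python
def take_on_plan(application, machines, windows_accounts):
    """Return (machine JIL text, ordered task list) for a new application."""
    # De-duplicate and sort so the generated JIL is stable between runs.
    unique_machines = sorted(set(machines))
    machine_jil = "".join(f"insert_machine: {name}\n\n" for name in unique_machines)

    tasks = [f"Load {len(unique_machines)} machine definition(s) for {application} via jil"]
    tasks += [f"Add {account} to autosys_secure" for account in windows_accounts]
    tasks.append(f"Define EEM policies for {application}")
    return machine_jil, tasks
```

The task list maps naturally onto steps in a process automation tool, each with its own approval and audit entry.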
Automated security management
This follows on from the automated take-on section. Here we want to automate all the processes around EEM policy changes. An example would be a user who moves to a different department and will therefore be working with a different set of AutoSys jobs. The EEM policy needs to change so that the user is removed from the old jobs and added to the new ones.
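The department move example boils down to a difference between two sets of access. A minimal Python sketch of that calculation follows; it does not call EEM, and the function name, user name and group names are hypothetical. Applying the result to EEM would be a separate, approved step.

```python
def policy_changes(user, old_groups, new_groups):
    """Work out which job groups a user must be removed from and added to."""
    old, new = set(old_groups), set(new_groups)
    return {
        "user": user,
        "remove_from": sorted(old - new),  # access the user no longer needs
        "add_to": sorted(new - old),       # access the user now needs
    }
```

For example, `policy_changes("jsmith", ["finance_jobs", "shared_jobs"], ["hr_jobs", "shared_jobs"])` leaves `shared_jobs` untouched, removes `finance_jobs` and adds `hr_jobs`.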
Conclusion
All of the above automation processes should include audit trails, automated approval systems and reporting. If we achieved all of the above, we would not need any AutoSys BAU staff, only AutoSys support staff for when there is a problem with AutoSys itself. The reason for not needing BAU staff is that job failures would automatically be routed to and resolved by the relevant support teams, and all other BAU work would be automated. Your BAU staff can then start working with the users to optimise their batch processing and take full advantage of the abilities of AutoSys.
I do not know of anyone who has gone the whole way, but some environments are certainly on their way and have achieved a fair amount of automation.