IT-Checklists.com - The eBook-Shop with Templates for Professionals
2022-May-02

      When did you last ....

When did you last check for inactive backups?

Example:

[1] Initial Situation

  • A fully automated backup system is in place.
  • Backup schedules for all servers have been defined and activated, and have been working fine for several months or years.
  • A professional monitoring system monitors the backup system.

[2] Due to planned maintenance work on server Uxxx, the nightly automated backup schedule for this server was manually deactivated.

  • The maintenance work was completed successfully,
  • but nobody remembered to re-enable the backup schedule for server Uxxx.
  • This was detected only 2 weeks later, when the author of these lines asked this question.
  • The monitoring system did not raise an alarm, because that backup schedule did not raise an error. A job that is never started cannot fail and raise an error.
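
The observation that a job which never starts cannot fail suggests checking for the absence of a recent successful backup rather than for the presence of an error. The following is a minimal Python sketch of that idea, not the author's method. The function name, the dictionary of last successful backup times, the list of expected servers and the 26-hour tolerance are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def find_stale_backups(last_success, expected_servers, now,
                       max_age=timedelta(hours=26)):
    """Return servers whose last successful backup is missing or too old.

    last_success:     dict server name -> datetime of last successful backup
                      (hypothetical data source, e.g. exported from the backup system)
    expected_servers: servers that must be backed up (hypothetical inventory)
    max_age:          assumed tolerance for a nightly schedule
    """
    stale = []
    for server in sorted(expected_servers):
        finished = last_success.get(server)
        # A server with no record at all is as suspicious as an old backup.
        if finished is None or now - finished > max_age:
            stale.append(server)
    return stale

now = datetime(2022, 5, 2, 8, 0)
last_success = {
    "U001": datetime(2022, 5, 2, 1, 30),   # backed up last night
    "Uxxx": datetime(2022, 4, 18, 1, 30),  # schedule deactivated two weeks ago
}
stale = find_stale_backups(last_success, ["U001", "Uxxx", "U003"], now)
print(stale)
```

Here `Uxxx` is reported because its last success is two weeks old, and `U003` because it has no recorded backup at all. Neither case would produce a job error.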

    [3] Conclusions

    The root cause of this operational mistake was an incomplete (or nonexistent?) checklist for maintenance work. But just ensuring that the next version of the maintenance work checklist is complete, and is followed, is not enough. An additional way to detect backups that were never started is not easy to build, but it needs to be done, "whatever it takes". An inability to recover when needed would reveal the gap, but by then it is too late.
    Approach: Raise awareness.
    Comments: This alone is not sufficient, but it is one small additional contribution.
    If you already have a generic "Post Implementation Review" document or checklist, add the questions:
    "What (jobs, backups) has been temporarily deactivated?"
    "Who re-activated those?" Please confirm for each of them.

    Approach: Don't deactivate the backup schedule; just change the start time to a later time.
    Comments: This can be very dangerous if the maintenance work takes longer than planned.

    Approach: Create a report like

    SELECT count(*)
    FROM
    (
      -- servers whose backup ended 7 days ago
      SELECT servername
      FROM backup_jobs
      WHERE backup_end_time
      BETWEEN trunc(sysdate)-7 AND trunc(sysdate)-6
      MINUS
      -- servers whose backup ended yesterday
      SELECT servername
      FROM backup_jobs
      WHERE backup_end_time
      BETWEEN trunc(sysdate)-1 AND trunc(sysdate)
    );
    If count(*) > 0, raise an alarm.

    Comments: The inner query lists the servers that were backed up 1 week ago but not yesterday.

    Problems:

    Not all backup systems support this type of custom report.

    If an old server has been decommissioned, this report will raise alarms for the following week.
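
The same set difference can be sketched outside the database, which may help when the backup system cannot run such a report itself. The Python sketch below also shows one possible way to soften the decommissioning problem: subtracting a list of retired servers. That list, the function name and the sample server names are assumptions for illustration and are not part of the report above.

```python
def missing_backups(backed_up_last_week, backed_up_yesterday,
                    decommissioned=frozenset()):
    """Servers backed up a week ago but not yesterday, minus known retirements."""
    missing = set(backed_up_last_week) - set(backed_up_yesterday)
    # Hypothetical mitigation: ignore servers recorded as decommissioned.
    return sorted(missing - set(decommissioned))

last_week = {"U001", "U002", "Uxxx", "Uold"}
yesterday = {"U001", "U002"}
retired = {"Uold"}  # hypothetical list maintained during decommissioning

alarms = missing_backups(last_week, yesterday, retired)
print(alarms)
```

Without the `retired` set, `Uold` would be reported alongside `Uxxx`. With it, only the genuinely forgotten schedule remains.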

    Approach: Count the number of servers backed up in the last 24 hours and compare that number with the expected count.
    Comments: A manual backup of a server that is usually not backed up (e.g. a test server) would offset a backup that was not started.
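
A small Python sketch of why a pure count comparison can be fooled: the count matches, while the set of servers does not. All server names are hypothetical.

```python
scheduled = {"U001", "U002", "Uxxx"}   # servers that should be backed up nightly
backed_up = {"U001", "U002", "T900"}   # T900: test server backed up manually

# The count comparison sees 3 == 3 and stays silent.
count_check_passes = len(backed_up) == len(scheduled)

# Comparing names instead of counts exposes both deviations.
missing = sorted(scheduled - backed_up)
unexpected = sorted(backed_up - scheduled)

print(count_check_passes, missing, unexpected)
```

The count check passes although `Uxxx` was not backed up, because the manual backup of `T900` fills the gap in the total.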
    Approach: Daily / weekly statistics on total backup volume.
    Comments: A missing server with a small backup volume would not be detected.
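
A short worked example in Python of how little a small server moves the total. The volumes and the 5% alarm threshold are invented for illustration.

```python
# Hypothetical nightly backup volumes in GB.
volumes_gb = {"DB01": 4000.0, "FILE01": 5500.0, "APP01": 480.0, "Uxxx": 20.0}

total_normal = sum(volumes_gb.values())
total_missing = total_normal - volumes_gb["Uxxx"]  # Uxxx was not backed up

drop_percent = 100.0 * (total_normal - total_missing) / total_normal
threshold_percent = 5.0  # assumed alarm threshold on total volume
alarm = drop_percent >= threshold_percent

print(f"drop: {drop_percent:.2f}%, alarm: {alarm}")
```

With these numbers the total falls by 0.2%, far below any threshold that would not also fire on normal day-to-day variation.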
    Approach: Deviation of daily / last rolling 7 days backup volume PER SERVER.
    Comments: For most servers the 7-day rolling backup volume should be quite constant, and one missing full backup would already show a 15% drop and should raise an alarm. However, one or a few missing differential backups might not be visible, but at the latest after the first missing full backup the statistics are expected to raise an alarm.
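
The per-server comparison can be sketched in Python as follows. The sketch assumes at least 14 days of daily volumes per server and compares the latest 7-day sum with the 7 days before it. The 10% threshold and the sample volumes are assumptions. With daily full backups of equal size, one missing backup removes 1/7 (about 14%) of the 7-day volume, which is in line with the roughly 15% drop mentioned above.

```python
def window_drop(daily_gb):
    """Relative drop of the latest 7-day volume versus the 7 days before it.

    daily_gb: list of at least 14 daily backup volumes for ONE server,
              oldest first, with 0.0 for a day without a backup.
    """
    previous = sum(daily_gb[-14:-7])
    current = sum(daily_gb[-7:])
    return (previous - current) / previous

THRESHOLD = 0.10  # assumed alarm threshold

# Case 1: daily full backups of 100 GB; the latest one did not run.
daily_fulls = [100.0] * 13 + [0.0]

# Case 2: weekly full (100 GB) plus daily differentials (5 GB);
# the latest differential did not run.
week = [100.0] + [5.0] * 6
missing_differential = week + week[:-1] + [0.0]

full_drop = window_drop(daily_fulls)
diff_drop = window_drop(missing_differential)

print(f"missing full: {full_drop:.1%}, alarm: {full_drop >= THRESHOLD}")
print(f"missing differential: {diff_drop:.1%}, alarm: {diff_drop >= THRESHOLD}")
```

The missing full backup crosses the threshold, while the single missing differential stays below it, which matches the limitation described in the comments.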