Abstract
While there is a multitude of monitoring systems available, interpreting
the monitored data becomes ever more difficult. Integrating the data into
meaningful information, using this information to predict your environment's
needs, and thus proactively preventing disasters is what makes the difference
between the herd of systems administrators and the leaders of the pack.
Since you cannot predict which system parts will show problematic behavior,
all parts have to be monitored. Depending on the actual job, different system
loads will stress different subsystems. Systematically identifying system
parts to be monitored, as well as deciding on the granularity of your probes,
is the key to successful data acquisition.
Simply dumping the acquired data into log files or onto a remote syslog
server may place additional stress on already loaded components. Choosing a
sensible data acquisition mechanism as well as an unobtrusive data store
prevents choking the machines.
Sifting through the collected data proves difficult due to its sheer mass.
Various sampling techniques help to spot sudden outages. Identifying
long-term trends helps to prevent resource shortages long before they become
acute. Ultimately, a picture is worth more than a thousand words. Choosing
the best graphing function helps to recognize trends in time to avert problems.
Unfortunately, there are factors outside the box which have to be monitored
too, e.g. holiday seasons or advertising campaigns causing unusual impact
on your web servers. It is up to you to identify which data is missing from
your set and to come up with methods to get it logged. Integrating all
the monitored data, building an information history, and dropping
irrelevant noise becomes the key to seeing the whole picture.
Wielding the distilled information helps to decide which investment -- be
it time, hardware, personnel, education, or new methods -- will further your
environment. Use it to guide upper management to the most efficient
decision.