How Not to be a Sysadmin

My dear employer purchased another company about six years ago, adding ground & car bookings to our business travel portfolio. Of all of the acquisitions we've done over the years, this was practically the only one which made sense, the only one which has been financially worthwhile and the only one still operating, but that's another story.

This particular car service division has been largely independent of the larger firm: our travel systems make calls into the car service systems, but we haven't tried a full integration of their services or their staff. Our core travel systems are all based on Linux, with Solaris/Oracle handling the backend databases, while the car service machines are all Windows Server-based, with Microsoft products and some cloud-based services to supplement.

In the past month, two of the primary people from the car services division have left the firm, and because we have no other staffing, care & feeding of their systems has fallen to my systems engineering team. And now we're seeing the true nature of the nightmare...

These car service systems require constant care. Constant. We've learned that the core database has received manual maintenance daily for the eight years it has been in service. We've learned the logging system has been manually restarted every 48 hours or so for years. There's a stack of little things like this which have been consuming the full attention of two full-time staff on a daily basis.

I'm horrified by the amount of work that has been required daily, if not hourly, to maintain uptime for these systems. I'm horrified that no one in management seems to have noticed and thought it odd. I'm horrified that no one has seen fit to fix any of these problems, especially the guys who have been doing the work. And I'm horrified that, even if the guys couldn't correct the root problems, they didn't even attempt to automate the required recovery steps. Seriously?!

My team is now trying to pick up the pieces, but I have little Windows experience, and the training hand-off occurred while I was on vacation, so I'm missing huge chunks of knowledge about their architecture, single points of failure, and other gems one could collect from those who built & maintained these things. It doesn't take great knowledge, though, to know that This Isn't Right.

Remember your training, young padawan:
1. Automate everything.
2. Automate recoveries as much as possible.
3. If something breaks daily, fix it.
4. Document everything so the people coming after you have a guide.
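
To make rules 1 and 2 concrete, here is a minimal sketch of what automating a recovery could look like for something like that logging restart. It is not what these systems actually run. The function names, the service name in the usage note, and the retry numbers are all made up for illustration, and it assumes the fix really is "restart the service", done here with PowerShell's Restart-Service cmdlet.

```python
import subprocess
import time
from typing import Callable


def restart_windows_service(name: str) -> None:
    """Restart a Windows service by name using PowerShell's Restart-Service.

    Assumption: the script runs on the Windows host with enough rights to
    restart the service. The name passed in is whatever the real service is
    called; nothing here knows about the actual car service systems.
    """
    subprocess.run(
        ["powershell", "-NoProfile", "-Command", f"Restart-Service -Name '{name}'"],
        check=True,  # raise CalledProcessError if the restart itself fails
    )


def ensure_healthy(
    is_healthy: Callable[[], bool],
    recover: Callable[[], None],
    max_attempts: int = 3,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Check health; if unhealthy, run the recovery step and check again.

    Tries the recovery at most max_attempts times and returns True as soon as
    the health check passes. Returns False if the service is still unhealthy
    after the last attempt, which is the cue to page a human.
    """
    for _ in range(max_attempts):
        if is_healthy():
            return True
        recover()
        sleep(wait_seconds)  # give the service time to come back up
    return is_healthy()
```

Run from a scheduled task every few minutes, something like `ensure_healthy(check_logging, lambda: restart_windows_service("HypotheticalLogSvc"))` would replace the every-48-hours manual restart, where `check_logging` is whatever test tells you the logger is alive. It still only automates the recovery. Rule 3 says the root cause needs fixing too, and a False return should go to a person rather than be ignored.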

Comments

( 2 comments — Leave a comment )
jkusters
Sep. 29th, 2016 06:12 pm (UTC)
Sounds quite typical of an IT shop that never has time to do things right, but always has time to do the next thing (shoddily).
apparentparadox
Sep. 29th, 2016 09:07 pm (UTC)
There's an old saying in programming: it was hard to write, it should be hard to understand :-)