MV Communications: 2004 outages/events
- Dec 23/24 2004: everything
- There was a major power outage in downtown Manchester starting
at about 7:30 December 23 lasting much longer than our UPS backups
are expected to hold. Power came back a few hours later but
went out again as we were finishing bringing systems back up.
Things are back up again as of about 1:40 AM Dec 24.
- Dec 21 2004: web server
- The web server had a crash this morning at about the same time
a router was being rebooted (next item).
- Dec 21 2004: router
- A router needed to be rebooted this morning at about 9AM.
Some routing issues may have been seen prior to this (specifically
around 1AM and around 7AM).
- Nov 5 and Nov 7, 2004: router
- A core router was misbehaving and needed to be rebooted. The
cause is believed to be memory-related; we've made some changes to
address that problem (and rebooted to take better effect). Users
may have seen sluggishness and unresponsiveness during the periods
leading up to the reboots.
- Aug 30, 2004: News server
- The news server has been having intermittent troubles over
the weekend and today. We've tracked it down (or so we believe)
to a non-working fan on one of the CPUs in the server. We're
swapping that out and will bring the server back up once that's
done. ETA 13:00 or earlier.
Update: Looks like it will be longer than that; no ETA.
Status: Up as of about 17:15. We had put in a faster
CPU as well as a new fan; the operating system didn't care for the
faster CPU, and we had to put the original back in.
- Friday August 27, 2004: web server
- The web server copper.mv.net took a timeout starting around
23:40 tonight; it was up but not responding. It was nudged and back
up at about 23:20.
- Sunday August 22, 2004: servers
- shell server: As previously announced, we will be putting the new UNIX shell server
into place this weekend. The shell server will be unavailable while this
process is ongoing. This will take quite some time while all user accounts
and data are transferred, ending any time from Sunday evening to Monday
morning.
Status: Complete as of 20:45
mail server: We are also going to take this opportunity to
start an upgrade process on the mail server. This will involve replacing
a SCSI card and enabling another CPU on the server. The mail server
should be down for a half hour or less while this is being done.
Status: complete as of about 17:15
secondary dialup: Also, we will be moving some of the
equipment that accepts calls on the secondary dialup numbers
(not the main group). The secondary numbers will be unavailable
for a brief period while this is happening.
Status: complete as of 15:30
- Aug 18, 2004: various power-related
- Shortly after noon we had a power failure in our server room, caused by
somebody (the company president, who should know better) overloading one of
the UPS (battery backup power) units. The main effect was that two servers
were out of operation for about 45 minutes; however, a couple of non-core
routers were also down for a much shorter period.
- Aug 7, 2004: shell server
- The shell server crashed at about 5:10 due to unknown causes.
Back up again at about 9:15.
Note: The shell server is being replaced (see the recent
News Items and notes in the
July newsletter).
- July 1, 9:30, Paetec T3 move
- Per the problem reported below (See June 28) our Paetec T3
circuit is scheduled to be moved to another router at 9:30 July 1.
This will result in the T3 being down for a short while, with the
load carried on our other backbone circuits. (This will probably
not be noticeable to most people.)
Status 9:50: Move completed.
- June 28, Paetec link
- Once again, routing issues on our Paetec link made it necessary
for us to disable that circuit until the issues can be resolved.
This began around 5:30 PM.
Status June 29 3pm: Paetec has tracked the problem down
to a software bug on their router that we are connected to. The
issue will be corrected by a future upgrade; in the meantime we
will be scheduling a move of our circuit to a different router
that does not have this bug (different model with different software
set).
- June 28, networking
- We had several periods today (first around 6AM, next around
8:30AM, and finally near 8PM) where one of our routers was not
performing correctly. Specifically, it was being flooded and unable
to keep up. During the day, by looking at performance graphs, we
were able to form a theory about the source of the flood. When it
recurred the final time (in the evening), we disconnected the
suspect, which confirmed that it was indeed the cause.
- June 9, routing
- There are some routing problems in one of our backbones,
outside of our network. We've turned that backbone off until
the problem can be investigated further.
Status June 10, 12:30: The carrier resolved an issue
overnight; this circuit is back up and carrying traffic again.
- June 9, router
- A central router crashed (unexpectedly) at around 11:15.
It was back up after about 10 minutes.
- May 27, MNH POP
- At about 12:45 a power transformer serving the area near our
MNH POP blew out. PSNH is on site; the repair is expected to take
at least until tonight if not later. Affected are the small analog
dialin group that we have there (645-4986), as well as some tenants
in the building served by high speed access.
Status: May 28 02:30: Power is back
- Saturday May 22, web server, scheduled
- We'll be adding some disk space to the web server and rebuilding the
root disk on the server as well. Most of this will be done "live" and won't
necessitate much downtime, other than about three 5-10 minute outages to
swap hardware in and out. The first reboot should occur around noontime,
with the second and third occurring during the day as the operation
progresses. Some of the disk space will be marked read-only during the
course of the afternoon, in order to prevent changes to data that is being
transferred to the new locations.
Status May 22: Now in progress
Status 15:45: Completed
- May 12, routing
- We're again (see the May 3 entry) seeing problems reaching a number of
destinations via one of our backbone connections. We've shut down this
circuit until it can be resolved. This may mean slower response times
than normal reaching some sites, as the other backbone circuits have
to pick up some extra load.
Status: The routing had stabilized when we tested it again
shortly after midnight (May 13) and so that circuit is back up again.
We'll continue to watch it.
- May 6, router
- Today we had a reboot of a core router to upgrade its software to take
advantage of some improvements in various areas. There were a couple of
problems with the upgrade, requiring several additional reboots to
correct. Each reboot took only a couple of minutes; however, some DSL DHCP
clients again had trouble re-obtaining leases. This should be an automatic
function on the part of DHCP clients but we have seen that many clients fall
short there. (DHCP server/client issues are, in fact, one of the areas
motivating us to apply this upgrade.)
- May 3, routing
- Starting around noon we began seeing routing issues to some
destinations from some sources (i.e., not everyone had a problem with a
given destination). The problem is only occurring for some paths on one of
our backbone connections. We had turned off that backbone for a while to
route around the problems; however, that circuit is back on as of about 16:50
so that the provider can troubleshoot it.
17:30 Traffic on that circuit has been shut down again.
May 4 11:00 Circuit is being used for outbound traffic only.
14:00 Routing problems are no longer occurring; the circuit is
now back up fully.
- April 20, 2004: router
- We had to reboot our core router due to a memory issue.
While we had it down we took the opportunity to add some memory
to it. Downtime was less than 10 minutes.
Note: We briefly considered using this opportunity
to schedule an impromptu software upgrade but thought better of it.
- April 15, 2004: news server
- The news server was down from about 4AM through 9:45AM
due to an unknown cause (perhaps it was overtaxed?).
- April 2, 2004: news server
- The news server is down as of about 7:30pm. It's expected back
up by midnight.
Status: Up at about 11:40pm
- March 24, 2004: backbone
- Starting at 11AM we'll be doing some work on reconfiguring one of our
backbone circuits (the Paetec T3). The circuit will be down for a short
while during this process; we expect any impact to be minimal as the traffic
will be distributed to our other backbone circuits.
Status: Finished by noon.
- March 17, 2004: router
- Our core router was unreachable for about 10 minutes, ending
around 6pm tonight, due to a faulty cable connector.
- Feb 28, 2004: news server
- The usenet news server (news.mv.net) is down with a RAID/hardware
failure. Coincidentally we have been on the verge of installing
new news hardware/software (any week now). We will be spending some
time today attempting to recover from the RAID failure and bring the
existing system back up. If that cannot be completed today, we'll
simply work on bringing up the new system instead (a process that
will take a day or two). More status will be here when we know.
Status: The news server is up as of about 15:15
- Feb 15 2004: Router reboot
- A core router had to be rebooted at around 4PM today to clear up a
memory problem. The problem was affecting the router's behaviour. DSL
customers having trouble after this reboot may need to release/renew their
DHCP lease.
- Jan 30 2004: frame relay
- Our frame relay connectivity is down as of about 12:20.
All frame customers are affected. Verizon has been notified and
is looking into the trouble.
Status: Back up as of about 18:00 (Verizon repair)
- Jan 16 2004: Mail server
- The mail server is experiencing a temporary outage as of about 23:10.
We expect to have it remedied within the hour.
Status: back up before 24:00. Cause of the outage was operator
error upgrading a software package.
Note: the original posting incorrectly stated the outage start
as 22:10; it was later corrected to the actual time (23:10).
Copyright © 1998 thru 2008 MV Communications, Inc.