Dependent Origination

cannot start up mongos process

Posted on: July 4, 2012

Amazon’s power outage last weekend took all our machines down. By Saturday afternoon everything was fine except our database instance, which in the end was never recovered 😦 I thought our MongoDB was working on Saturday, but on Monday I discovered that a weird mongod process was running and that I couldn’t start up our usual setup of config servers, shard servers and the mongos process. mongos would always complain about ‘cannot upgrade from 3 to 2’ and just quit.

It puzzled me for quite a few hours, and we tried many things to identify the root cause. At first I thought the process couldn’t find ‘localhost’, but telnet localhost worked fine. Then we dug into the logs looking for more useful error messages. I started the config servers and mongos on a different machine talking to the same shard server, and that worked fine, so I suspected that the machine itself had become the problem after the reboot. Finally, googling the error message solidified my suspicion that the package I had installed on the machine might have undergone an incomplete update or something similar, so I removed the package and used the binaries downloaded directly from mongo’s own website. Everything has worked fine since. Phew.
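One quick way to test a suspicion like a half-updated package is to check which mongod and mongos binaries are actually on the PATH and what version each reports. The sketch below is only an illustration: the helper name binary_version is made up here, and it assumes both binaries accept the --version flag.

```python
import shutil
import subprocess


def binary_version(name):
    """Return (path, first line of `<name> --version`), or None if not on PATH.

    `binary_version` is a hypothetical helper written for this post.
    """
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, timeout=10
    )
    # Some tools print their version to stderr, so fall back to it.
    lines = (result.stdout or result.stderr).strip().splitlines()
    return path, (lines[0] if lines else "")


if __name__ == "__main__":
    for name in ("mongod", "mongos"):
        print(name, "->", binary_version(name) or "not found on PATH")
```

If the two binaries come from different directories or report different versions, that is a hint that more than one installation is in play.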

This page has a complete list of commands for manipulating packages.

To remove an installed package with apt-get, run sudo apt-get remove PackageName.

apt-cache search SearchTerm [search for a package in the package sources] isn’t very useful in our case but is listed here for the sake of completeness.

Note: the package I removed is the mongodb package installed with apt-get. The mongo site has instructions for installing mongodb-10gen, which I haven’t tried, so I don’t know whether mongodb-10gen is a better package.

The benefits of installing a package are (1) easy removal and (2) it installs mongo as a service for you. Running mongo as a service gives you automatic restart for free: you can edit a handful of configuration files so that the mongod and mongos processes start up the way you want. Here are someone’s configuration files for a sharded cluster with replica sets.

I haven’t gone down that route for now either, since I have the startup script in rc.local and we have a one-shard setup, which is really simple. I will probably change the setup soon, so more updates on that front later.
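For reference, here is a rough sketch of the startup order a one-shard setup needs: config server first, then the shard server, then mongos. It is written in Python purely for illustration and is not the actual rc.local script. The data and log paths are hypothetical, the ports are the usual defaults for each role, and the plain host:port form of --configdb is the one from this era of MongoDB (newer releases expect a replica-set form).

```python
import subprocess

# Hypothetical locations; adjust to the real layout of the machine.
DATA = "/data"
LOGS = "/var/log/mongodb"


def build_commands(data=DATA, logs=LOGS):
    """Return the command lines in the order they should be started."""
    return [
        # 1. Config server: holds the cluster metadata.
        ["mongod", "--configsvr", "--dbpath", f"{data}/configdb",
         "--port", "27019", "--fork", "--logpath", f"{logs}/configsvr.log"],
        # 2. The single shard server: holds the actual data.
        ["mongod", "--shardsvr", "--dbpath", f"{data}/shard0",
         "--port", "27018", "--fork", "--logpath", f"{logs}/shard0.log"],
        # 3. mongos: the router, pointed at the config server.
        ["mongos", "--configdb", "localhost:27019",
         "--port", "27017", "--fork", "--logpath", f"{logs}/mongos.log"],
    ]


def start_all(dry_run=True):
    """Print the commands (dry run) or run them one after another."""
    for cmd in build_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    start_all(dry_run=True)
```

Running it as is only prints the three commands, which makes it easy to compare them against whatever the real startup script does.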

Major lessons learned:

1. think more during an outage: don’t assume everything is fine just because things look fine on the surface

2. always try to prove your own conclusions — like if you think localhost isn’t being recognized, there are plenty of ways of verifying that speculation.

3. read the logs — read all the logs you can find

4. be persistent and ask for help — one person has limitations and other people can offer helpful and different ways of thinking about the problem
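As an example of lesson 2, the ‘localhost isn’t being recognized’ theory can be checked in a few lines instead of by guesswork. The sketch below is one possible way to do it, using only Python’s standard socket module. The function names are made up for this post, and the ports are an assumption (the usual defaults for mongos, a shard server and a config server).

```python
import socket


def resolve(host):
    """Return the sorted list of IP addresses `host` resolves to ([] on failure)."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("localhost resolves to:", resolve("localhost"))
    # Assumed ports: 27017 (mongos), 27018 (shard), 27019 (config server).
    for port in (27017, 27018, 27019):
        state = "open" if port_open("localhost", port) else "closed"
        print(f"port {port}: {state}")
```

An empty list from resolve would support the theory; a normal loopback address would rule it out, which is what telnet localhost effectively told me.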

