cannot start up mongos process
Posted on: July 4, 2012
The Amazon power outage last weekend took down all our machines. By Saturday afternoon everything was fine except our database instance, which never recovered in the end 😦 I thought our MongoDB was working on Saturday, but only on Monday did I discover that a stray mongod process was running and that I couldn’t start our usual setup of config servers, shard servers and the mongos process. mongos would always complain about ‘cannot upgrade from 3 to 2’ and then quit.
It puzzled me for quite a few hours, and we tried many things to identify the root cause. At first I thought the process could not find ‘localhost’, but telnet localhost worked fine. Then we dug into the logs looking for more useful error messages. I started the config servers and mongos on a different machine talking to the same shard server and that worked fine, so I suspected that the machine itself had become the problem after the reboot. Finally, googling the error message solidified my suspicion that the package I had installed on the machine might have undergone an incomplete update or something similar, so I removed the package and used the binaries downloaded directly from mongo’s own website. Everything has worked fine since. Phew.
This page has a complete list of commands for manipulating packages.
Here is the command line for removing an installed package using apt-get.
apt-cache search SearchTerm [searches the package repositories for a package], which isn’t very useful in our case but is listed here for the sake of completeness.
Note: the package I removed is the mongodb package installed via apt-get. The mongo site has instructions for installing mongodb-10gen, which I haven’t tried, so I don’t know whether mongodb-10gen is a better package.
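As a rough sketch of the removal step, assuming the package name is mongodb as in the note above: the `run` helper and the `DRY_RUN` and `PKG` variables are my own additions for illustration, so the commands are only printed unless you explicitly ask for them to be executed.

```shell
#!/bin/sh
# Package name taken from the note above; override with PKG=... if yours differs.
PKG="${PKG:-mongodb}"

# Hypothetical helper: print each command by default.
# Set DRY_RUN=0 (and run as root) to really execute the commands.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$@"; fi
}

run dpkg -l "$PKG"                 # show whether the package is installed, and which version
run apt-get remove "$PKG"          # remove the package but keep its configuration files
run apt-get --purge remove "$PKG"  # remove the package together with its configuration files
```

Checking with dpkg first makes it obvious which package and version you are about to remove. Whether you want the plain remove or the purge depends on whether you plan to reuse the old configuration files.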
The benefits of installing a package are (1) easy removal and (2) mongo gets installed as a service for you. Running mongo as a service gives you automatic restart for free: you can edit a handful of configuration files so that the mongod and mongos processes start up the way you want them to. Here are someone’s configuration files for a sharded cluster with replica sets.
I haven’t gone down that route for now either, since I have the startup script in rc.local and we have a really simple one-shard setup. I will probably change the setup soon, so expect more updates on that front later.
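For illustration, a startup script for a one-shard cluster on a single machine might look roughly like the sketch below. This is not my actual rc.local: the directories, the ports, the single config server, and the `run`, `start_cluster` and `DRY_RUN` names are all assumptions. The mongod and mongos flags are the ones from the MongoDB releases of this era, and newer releases may set up config servers differently.

```shell
#!/bin/sh
# Sketch of a startup script for a one-shard cluster on a single machine.
# Every path and port below is an assumption, not the original setup.
MONGO_BIN="${MONGO_BIN:-/opt/mongodb/bin}"
DATA_DIR="${DATA_DIR:-/data}"
LOG_DIR="${LOG_DIR:-/var/log/mongodb}"

# Hypothetical helper: print each command by default.
# Set DRY_RUN=0 to really start the processes.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$@"; fi
}

start_cluster() {
    # 1. Config server: holds the cluster metadata.
    run "$MONGO_BIN/mongod" --configsvr --dbpath "$DATA_DIR/configdb" --port 27019 \
        --fork --logpath "$LOG_DIR/configsvr.log" --logappend
    # 2. Shard server: holds the actual data.
    run "$MONGO_BIN/mongod" --shardsvr --dbpath "$DATA_DIR/shard0" --port 27018 \
        --fork --logpath "$LOG_DIR/shard0.log" --logappend
    # 3. mongos: the router, started last and pointed at the config server.
    run "$MONGO_BIN/mongos" --configdb localhost:27019 --port 27017 \
        --fork --logpath "$LOG_DIR/mongos.log" --logappend
}

start_cluster
```

The order matters: mongos reads the cluster metadata from the config server, so the config server has to be started before it. Running the script in its default dry-run mode first is a cheap way to check that the paths point at the binaries you think they do.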
Major lessons learned:
1. Think more during an outage: don’t assume everything is fine just because things look fine on the surface.
2. Always try to prove your own conclusions: if you think localhost isn’t being recognized, there are plenty of ways to verify that speculation.
3. Read the logs, all the logs you can find.
4. Be persistent and ask for help: one person has limitations, and other people can offer different and helpful ways of thinking about the problem.
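Lessons 2 and 3 are easy to turn into a couple of small shell helpers. This is only a sketch: the function names and the log directory in the usage comment are made up for illustration, and getent is just one of several ways to test name resolution.

```shell
#!/bin/sh
# Lesson 2: test the "localhost is not recognized" theory directly.
# Prints a verdict and returns non-zero when the name does not resolve.
check_host() {
    if getent hosts "$1" >/dev/null 2>&1; then
        echo "$1 resolves"
    else
        echo "$1 does not resolve"
        return 1
    fi
}

# Lesson 3: search every log file under a directory for a message.
# Prints each matching line prefixed with its file name and line number.
search_logs() {
    grep -rn -- "$1" "$2"
}

# Example usage (the log directory is an assumption):
#   check_host localhost
#   search_logs 'cannot upgrade from' /var/log/mongodb
```

Searching for a fragment of the error message rather than the whole line helps when different processes word the same failure slightly differently.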