Dangers of not closing a db Connection in time: Story of a Bug at PPA website
It was a weird weekend for the tech team at Programming Pathshala. Saturday morning started with a series of messages on our student groups saying that they were unable to access the course content. We tried opening the website and indeed the course contents were not loading.
We went ahead to check the microservice logs — if it tells us about some API calls failing or if the microservice itself was down — Nothing there! So, basically no error had occurred at the application level.
We went ahead to look at our monitoring services — CPU Usage, Memory Usage, Disk Usage, Throughput - everything was normal.
The only thing spiking up was latency in a particular microservice — the one responsible for serving the course content.
Suddenly, almost ALL the APIs instead of taking some 4–5s were taking 30s and then timing out.
It almost seemed like a ghost had crept into the system slowing the Google machines down(We use GCP for deployment) where our server was hosted because nothing else had changed. No deployments, etc.
Next, we decided to look at the logs of one such transaction — maybe looking at the timestamps of different logs will tell us which part of the application had slowed down.
This stumped us even further. We saw that we were getting the request from the frontend fine. The request was contacting another microservice to check if the user is valid and subscribed. We got the response from this microservice in reasonable time.
After that, this API went to a function “getTagsFromDB” where it had to fetch the tags from the DB. There were no logs line after this — Neither of the DB call succeeding nor failing.
Similar were the logs of other such calls — They all started fine but as soon as they had to fetch something from the database, they just stopped and the timeout happened.
Another strange thing was the behaviour happening on restart of the server: As soon as we restarted the service, it worked fine and then started becoming slow after some time.
All this lead to 2 conclusions —
- It had something to do with the DB. API waits for something before it wants to read/write from DB. This wait happens for > 30s. In the meantime, express times out the API call. That is why the logs stop suddenly in the middle as soon as it goes to the DB call.
- It is an issue which takes some time to occur. So, initially the API is able to access DB fine. This also meant that the DB itself is fine but the issue is at application level in accessing it.
So, we came up with a hypothesis that maybe the API is unable to acquire a connection with the database. And it is somehow not able to do it for atleast 30s before timing out. This pointed out to the max number of connections we can make with the DB and somehow that limit being reached.
We checked up our sequelize code and saw these lines:
Basically, we are saying our pool of DB connections should allow at max 10 connections at a time. Also, that a connection should be released only if it is inactive for 30s.
To summarize the issue that had occurred — We were acquiring some connections to access and query the DB from the pool. In some cases, after using it, it might not have been released back to the pool. Now, slowly these unreleased connections will start accumulating. Though some of them might release due to timeout of 30s, but still there can be a point where there were enough active connections to exhaust this limit of 10. Then, the next API call which would like to acquire a connection with DB (the 11th one) — sequelize will not allow it and stall this request. Meanwhile, there will be a timeout from express side.
Why did this occur suddenly? We saw that there were some peaks in the past as well but we chose to ignore it because they were there for a very small time. It was this time that the load was consistent enough for a long time and did not allow connections to be released on time.
Moral: Never forget to close the connections you acquire when the job is done — Be it with a Database or with a file or a network/socket connection! It can cost a downtime later. Also, be clear about how an external library that you are using is configuring stuff and not just blindly put random params.