Optimizing Kallithea for heavy traffic

Mads Kiilerich mads at kiilerich.com
Mon Apr 20 14:13:50 EDT 2015


On 04/20/2015 12:52 PM, Jean-Francois Beaumont wrote:
> I've been running stress tests on my install and so far I've not been 
> able to make Kallithea responsive enough for the traffic I have 
> (mostly cloning).
>
> After doing some preliminary setup, I've configured postgresql and 
> added the 1.7k repositories to it. The browsing is not that bad given
> the 80GB our repositories weigh. However, this server sees heavy
> cloning activity, as some repositories are used for fetching
> configurations (small) and some others for knowledge resources (big).
> CI servers make life difficult for this server: when we hit midnight,
> they all start their test jobs and must fetch a lot of data from the
> HG server.
>
> The current server uses a simple hgweb and it works well. All queries
> are successfully processed. With Kallithea, however, I get all sorts
> of proxy errors. I've tried increasing the timeout a bit, but it is
> clearly not scaling. Since both services are installed on the same
> machine and I ran my tests while hgweb was quiet, I was expecting the
> same kind of performance as with hgweb, but that's not what's
> happening.
>
> So I've been looking around and it seems people are talking about
> Celery, but I'm not sure I need to pursue this. Moreover, I was
> wondering if it would help to run Kallithea as a WSGI script inside
> Apache's configuration instead of under 'paster serve' with
> processes/threads.

Kallithea is not just a normal web app serving pages, and the
bottlenecks seem to depend a lot on the repositories and the
development process.

My bottleneck was cloning huge repositories over slow connections, which
meant that each user kept a worker process busy. The bundles were also
so big that buffering them in memory or on disk added too much latency
before the first data was sent, and it used too much memory or caused
too much disk traffic ... plus we had workers that sat idle while other
parts of the stack were spooling data. With static_files = false we now
just stream everything and run 20 worker processes - that works just
fine for our 100+ users.
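
For reference, a minimal sketch of the relevant .ini bits. I'm using
gunicorn's PasteDeploy entry point as an example of a multi-process
server; the numbers are illustrative, not recommendations:

    [server:main]
    # example: gunicorn as a multi-process WSGI server
    use = egg:gunicorn#main
    host = 127.0.0.1
    port = 5000
    # roughly one process per concurrent clone you want to sustain
    workers = 20

    [app:main]
    # serve static files from the front end instead of Kallithea;
    # with this off, responses can be streamed rather than buffered
    static_files = false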

Another thing to watch out for is browser users who open 8 simultaneous
connections and keep them alive for "many" seconds. That can cause
problems for the front-end servers ... and in some setups they forward
the problem to the workers.
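
If the front end is Apache with mod_proxy, something along these lines
can keep idle browser connections from pinning backend workers while
still allowing long-running clones. The directives are standard Apache
ones; the values are only illustrative:

    # drop idle keepalive connections quickly at the front end
    KeepAlive On
    KeepAliveTimeout 2
    MaxKeepAliveRequests 100

    # proxy to the Kallithea workers; big clones can take a long
    # time, so the backend timeout must be generous
    ProxyPass        / http://127.0.0.1:5000/ retry=0
    ProxyPassReverse / http://127.0.0.1:5000/
    ProxyTimeout 3600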

So ... make sure you profile exactly where the bottleneck is so you can 
fix that.
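
As a starting point, something as simple as watching the workers while
generating CI-like load can show whether you are bound by CPU, memory,
or idle-but-occupied workers. Illustrative commands only, assuming
gunicorn workers and a test repository:

    # watch per-worker memory and CPU every 2 seconds
    watch -n 2 'ps -C gunicorn -o pid,rss,pcpu,etime,args'

    # from another machine, generate a handful of parallel clones
    for i in 1 2 3 4; do
        hg clone http://kallithea.example.com/repo /tmp/clone$i &
    done
    wait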

You can perhaps also get more advice here on the list if you describe
your setup in detail - front-end servers/proxies and workers, their
configuration, and the load and memory usage of the machines.

ps: I'm in Montreal until tomorrow evening ... ping me if you are where
your address indicates and you want to meet up ;-)

/Mads
