Bug: ssl error with postgres and celery

Valentin Kleibel (Sysadmin) valentin at vrvis.at
Wed Nov 4 14:01:32 UTC 2020


On 04/11/2020 12.30, Thomas De Schampheleire wrote:
> On Wed, Nov 4, 2020 at 11:36, Valentin Kleibel
> (<valentin at vrvis.at>) wrote:
>>
>> Hi Thomas,
>>
>>> I don't have answers but several questions.
>>>
>>> Can you say more about how you set up the database connection? Do you do
>>> anything special here? Why is it different with SSL than without?
>> About the setup:
>> the pg_hba.conf for the database contains the line:
>> hostssl all         all         1.2.3.0/24          md5
>> which enforces the automatic and transparent use of ssl connections to
>> our database server (which is in the subnet 1.2.3.0/24). To reproduce
>> this on a local database you could set:
>> hostssl all         all         127.0.0.1/32          md5
>> to use ssl with it.
>> I'm sure of one difference: with SSL, a connection reused by a forked
>> process will certainly lead to an error, because the connection is
>> stateful and will, for security reasons, fail if reused.
>>
>>> As far as I can see you would just need to give some extra parameters in
>>> the sqlalchemy.url setting in your ini file.
>>
>> We were looking through all documented sqlalchemy.url settings and
>> couldn't find anything of use here.
> 
> 
> See for example the description here:
> https://github.com/sqlalchemy/sqlalchemy/issues/4146
> and a reference here:
> https://www.postgresql.org/docs/10/libpq-connect.html#LIBPQ-CONNSTRING
> 
> As far as I understand it, you should be able to add ?sslmode=verify-full
> to the database connection string, possibly with references to
> certificate or key files.
> 
We took a look there.
The setting you refer to is needed if the certificate chain of the 
database server can't be validated; it can be used when e.g. testing 
with a local db and a self-signed cert. Our certificate chain validates 
and we don't need this; the ssl connection also works for calls coming 
from uwsgi.

>>
>>
>>> But your mail seems to suggest that you have a connection created upfront
>>> and would like this one to be reused?
>>
>> We do not want to reuse a previously created connection but this is how
>> the celery workers seem to behave.
>>
>>> Which are the uwsgi-related settings in your ini file? The 'lazy=true'
>>> setting is not a choice for Kallithea and should be enabled in any case.
>>> Our uwsgi template for ini files gives this comment as clarification for
>>> 'lazy=true': "App *must* be loaded in workers - db connections can't be
>>> shared"
>>
>> The only related setting is lazy=true. As you say, the database
>> connections can't be shared. This can be achieved in multiple ways, and
>> one of them (although maybe not the most efficient one) is to use the
>> lazy setting in uwsgi.
>> Access through uwsgi works without issues in our setup.
> 
> 
> My understanding was that Kallithea creates the db connection in each
> worker anyway, and that it shouldn't matter whether SSL is used or
> not.
> 
> But if I understand things correctly, celery may have multiple
> workers, and each worker may create additional forks. If the db is set
> up for each worker, it would still be shared between the forks. Is
> that matching your understanding and situation?

My understanding is that the database connection is only established 
once and each celery worker sets up its own Session.
As the underlying SSL connection is not newly established for the 
worker, it is shared between the forked processes. This leads to the 
SSL error, which is triggered as soon as both processes do anything 
with this connection, even a keepalive.

> I still wonder where the impact of SSL is. This sounds like a generic
> problem, but perhaps it works most of the time if not using SSL, and
> so we just don't see it?

The database connection itself might even be safe to use this way (I 
honestly don't know) because of the use of sessions, but if SSL is used 
at a lower level, the SSL connection will certainly die.

>>
>>
>>> Did you use Kallithea in this mode successfully before, perhaps on an older
>>> version? Your mail seems to suggest that but I'm not sure.
>>
>> Yes, we did use Kallithea before successfully with this setup, with
>> version 0.4.1. We think the problem arises with the upgrade due to the
>> way celery workers seem to behave with the new prefork worker model.
>> This change came with the upgrade of celery to version 4.
> 
> 
> As far as I can see, prefork was also used in celery 3.
> Could it be that other changes make the difference? For example, I see:
> 
> https://docs.celeryproject.org/en/stable/history/whatsnew-4.0.html?highlight=prefork#ofair-is-now-the-default-scheduling-strategy
> 
> "-Ofair is now the default scheduling strategy
> 
> To re-enable the default behavior in 3.1 use the -Ofast command-line option.
> ..."
> 

It might be that this change brought the problem to the surface, but it 
is not a fix.
My understanding is that the problem existed before, but the fast 
scheduling strategy almost always reused the same process, so the ssl 
context was almost always fine.
The fair scheduling strategy might not expose the problem as quickly, 
but in the end no two workers will be able to run concurrently in this 
case either.


>>> Note: I found this related question:
>>> https://stackoverflow.com/questions/51466007/how-to-use-psycopg2-properly-in-a-prefork-celery-environment
>>> but it does not involve sqlalchemy and does not really give immediate
>>> solutions.
>>
>> While looking around for solutions we found this blogpost:
>> https://virtualandy.wordpress.com/2019/09/04/a-fix-for-operationalerror-psycopg2-operationalerror-ssl-error-decryption-failed-or-bad-record-mac/
>> which is about uwsgi but also describes an alternative solution to
>> setting lazy=true, which is to call engine.dispose() in order to make
>> sure a new connection is used.
>> This is the same solution we also found in the sqlalchemy
>> documentation, in the section about using connection pools with
>> multiprocessing or forked processes:
>> https://docs.sqlalchemy.org/en/13/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork
>>
>> We implemented this in kallithea/lib/celerylib/__init__.py for the
>> celery workers.
>> Attached is a diff for the current stable release which works well in
>> our setup, although we are not sure whether it has unwanted side
>> effects that we simply haven't encountered yet.
> 
> Thanks for the patch and links.
> 
> So with that patch, you no longer see issues and you can have
> 'celery.worker_concurrency' set higher than 1 ?

Yes, we have set:
celery.worker_concurrency = 2
celery.worker_max_tasks_per_child = 1
without any further changes, and we no longer see those issues.
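
In outline, the idea of the patch is something like the following. This 
is a stdlib-only sketch with a toy stand-in for the SQLAlchemy engine, 
not the attached diff (the real change hooks into 
kallithea/lib/celerylib/__init__.py and calls dispose() on Kallithea's 
actual engine):

```python
import os

class ToyEngine:
    """Toy stand-in for a SQLAlchemy Engine with a connection pool."""
    def __init__(self):
        self.pool_owner = os.getpid()  # process that opened the pooled connection

    def dispose(self):
        # Drop the inherited pool; a real Engine would then lazily open
        # fresh (SSL) connections in this process on next use.
        self.pool_owner = os.getpid()

engine = ToyEngine()

pid = os.fork()
if pid == 0:
    # Child process (like a forked celery worker): without dispose()
    # it would keep using the parent's stateful SSL connection and
    # corrupt the shared TLS record state.
    engine.dispose()
    assert engine.pool_owner == os.getpid()
    os._exit(0)
os.waitpid(pid, 0)
```

The parent's pool is untouched; only the fork discards and re-creates 
its connections, which is exactly what avoids the shared SSL state.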

> I would like to await the analysis of Mads Kiilerich on this.
I'm looking forward to it,

Thanks,
Valentin



More information about the kallithea-general mailing list