
Gerrit SSH: Intermittent key_verify failed for server_host_key and 'hash mismatch'
Closed, ResolvedPublic

Description

On translatewiki.net, while running the repoupdate script, the script randomly bails out with:

hash mismatch
key_verify failed for server_host_key
fatal: The remote end hung up unexpectedly
error: Could not fetch origin

This has been happening since the migration of Gerrit to the new server two days ago.


Version: wmf-deployment
Severity: normal

Details

Reference
bz53895

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:59 AM
bzimport added a project: Gerrit.
bzimport set Reference to bz53895.
bzimport added a subscriber: Unknown Object (MLST).

We retained the same key for exactly this reason...

So, this sounds like you've got an old entry in your known_hosts files pointing to the old box.

We changed IP addresses when moving servers (shouldn't have to ever happen again), so please check your known_hosts for any outdated entries that you can remove.

(In reply to comment #2)

We changed IP addresses when moving servers (shouldn't have to ever happen
again), so please check your known_hosts for any outdated entries that you
can remove.

How can I identify outdated entries? There are no IP addresses in known_hosts. Sample entry:

|1|umKi+qzw6pf8uXi/Z6/KtqlisCw=|YFoX/CdDjXhcVUVJ803EiP9nyro= ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA2JmNg8ir9QvWwmS/C2k0PEqty1O26D0Nq24YGKC5jq1cr/0a92Pk7wa9FMMM/2O88bbe6rXZUPBKzDX1vVtYD+5vR4/c1XTnHWlNJ9sd6xSYjHhznqYs81VnjGMCLMPV1GhlIfUZsnQ+w1FaQUvJe39TEtwADA7ZOFAfT0M/Oqk=
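
Entries like the sample above appear to be hashed (as OpenSSH does when HashKnownHosts is enabled), so host names and IP addresses cannot be read directly from the file. A minimal sketch of how matching entries could still be looked up and removed with ssh-keygen follows. The helper names known_hosts_name, find_entries and remove_entries are made up for this illustration, and the old server's IP address is not given in this thread, so it would have to be supplied by whoever knows it.

```shell
# known_hosts stores hosts on non-default ports as "[host]:port";
# hosts on port 22 are stored under the bare host name.
known_hosts_name() {
  host=$1
  port=${2:-22}
  if [ "$port" = 22 ]; then
    printf '%s\n' "$host"
  else
    printf '[%s]:%s\n' "$host" "$port"
  fi
}

# Print entries matching host/port; ssh-keygen -F also matches hashed entries.
find_entries() {
  ssh-keygen -F "$(known_hosts_name "$1" "$2")" -f "${3:-$HOME/.ssh/known_hosts}"
}

# Remove all entries for host/port; ssh-keygen keeps a backup in known_hosts.old.
remove_entries() {
  ssh-keygen -R "$(known_hosts_name "$1" "$2")" -f "${3:-$HOME/.ssh/known_hosts}"
}
```

For example, `find_entries gerrit.wikimedia.org 29418` would list the entries recorded for the Gerrit SSH port, and the same call with an old IP address in place of the host name would show whether a stale entry exists.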

Still seeing this error randomly.

Several reports of this in the last few days. Reporters include Krenair, YuviPanda, and Krinkle.

(Worked for me when I tried again)

Just happened to me w/operations/puppet.

$ git pull
hash mismatch
key_verify failed for server_host_key
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

(In reply to comment #7)

Just happened to me w/operations/puppet.

$ git pull
hash mismatch
key_verify failed for server_host_key
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

We see similar errors very regularly when updating 600 or so extension repos at translatewiki.net. I'm pretty certain that we have the correct access rights with L10n-bot, have the correct access rights at the local machine, and have consistent scripting to update the repos.

A run I did just now resulted in the following errors:

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/CategoryMagicWords failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/ReplaceSet failed to update

Just to make sure that it wasn't me configuring the two above repos incorrectly, I ran the updates again. This time with the following result:

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/DidYouKnow failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/FormatDates failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/GoogleDocTag failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/InviteSignup failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/LightweightRDFa failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/Numbertext failed to update

/resources/siebrand/mediawiki-extensions/extensions/NumberOfWikis failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/PageLanguage failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/SidebarDonateBox failed to update

hash mismatch
key_verify failed for server_host_key
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/UserStatus failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/VersionView failed to update

To compare, when updating repos on localhost from GitHub, I've not seen a similar error once.
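
Since a second attempt usually succeeds, a batch updater could retry each fetch a few times before reporting a failure. The following is only a workaround sketch under that assumption, not part of the repoupdate script; the function name retry and the variable RETRY_DELAY are hypothetical.

```shell
# Run a command up to $1 times, stopping at the first success.
# Returns 0 if any attempt succeeded, 1 if all attempts failed.
retry() {
  max=$1
  shift
  attempt=1
  while :; do
    "$@" && return 0
    # Give up once the allowed number of attempts is used up.
    [ "$attempt" -ge "$max" ] && return 1
    attempt=$((attempt + 1))
    # Short pause between attempts (seconds); defaults to 1.
    sleep "${RETRY_DELAY:-1}"
  done
}
```

A call such as `retry 3 git fetch origin` would then only count a repository as failed after three consecutive errors. This hides the symptom but does not address the cause.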

*** Bug 57483 has been marked as a duplicate of this bug. ***

That does happen once or twice per day on Zuul. Usually they are "hash mismatch" errors, though we had some "host key verification failed" errors on Nov 20th.

Also got one today on the command line.

FYI: the -1s in Jenkins caused by this are very confusing.

(In reply to comment #12)

Is this related: https://gerrit.wikimedia.org/r/#/c/107036/ ?

Looking at the Zuul debug log on gallium.wikimedia.org, it is a different issue. Filed another bug, 59991, for it. It seems to be an issue in the Python git module.

Another example, this time with the job that syncs VisualEditor in mediawiki/extensions.git. The merge of https://gerrit.wikimedia.org/r/#/c/111608/ triggered job http://integration.wikimedia.org/ci/job/mwext-VisualEditor-sync-gerrit/61/console which shows:

ssh -i /var/lib/jenkins/.ssh/jenkins-mwext-sync_id_rsa \
  -p 29418 jenkins-mwext-sync@gerrit.wikimedia.org \
  'gerrit review --code-review +2 --verified +2 --submit b519550809bba725b017281fe6c33c4c2fd123c1'
hash mismatch
key_verify failed for server_host_key

This continues to happen, nearly daily.

You could probably get a good list of affected changesets by grepping logs of #wikimedia-dev for my name and "ignore jenkins" :/

Subsided for a while, then started happening a bit more often for me locally. Example in Gerrit from today: https://gerrit.wikimedia.org/r/#/c/138992/1

Could somebody tcpdump it? It seems to me more like a broken (suddenly terminated) connection, probably occurring (mostly) early in the SSH negotiation phase.

Bartosz: there is no need for more examples. We have traces of those errors in the Zuul log, and it happens a couple of times per day.

Marcin: we could tcpdump it if only we had a way to reliably reproduce the issue :-(

paul.bourke wrote:

Hi, I've been able to reproduce this on a local Gerrit instance quite reliably by running the following:

while true; do ssh <gerrit> -p 29418; done

A workaround that does work is to use the Bouncy Castle crypto library. See the following thread for more info: https://groups.google.com/forum/#!topic/repo-discuss/JE7OM6o7DMs

The google group topic mentioned this issue in Apache mina-sshd (upstream from Gerrit):

https://issues.apache.org/jira/browse/SSHD-330

Which has been fixed in https://git-wip-us.apache.org/repos/asf?p=mina-sshd.git;a=commit;h=2aed686bdb21681a421033c6ee5997e5cd8a9a83

If that is indeed the root issue, we need them to make a minor release and Gerrit to upgrade to it.

The description of the SSHD-330 issue explains pretty much every aspect of
the bug that we experienced, from its sporadic nature to the way some people
could reproduce it but others couldn't.

I'll see to preparing a new gerrit release ... hopefully we can get something
deployed around that.

Change 143388 had a related patch set uploaded by QChris:
Upgrade sshd to include the fix for hash mismatch

https://gerrit.wikimedia.org/r/143388

Christian, could you possibly provide a gerrit.war that has the patch? I would like to test it out on the labs instance I am using for CI dev. Thanks!

(In reply to Antoine "hashar" Musso from comment #25)

Christian could you possibly provide a gerrit.war that has the patch ?

Sure. For the next 2 weeks, you can fetch it from

http://quelltextlich.at/gerrit-2.8.1-4-ga1048ce.war

I would like to test it out on the labs instance I am using for CI dev.

Seeing the description of SSHD-330 allowed me to come up with an environment
that reproduces the bug. There, our deployed gerrit war failed for
14 of 10000 connection attempts. The war I linked above showed 0 failures for
10000 connection attempts.
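
A count like the one above could be gathered with a small loop along the following lines. This is a sketch under assumptions, not the environment actually used for the measurement: the function name count_hash_mismatches is hypothetical, and it simply counts attempts whose combined output contains the text "hash mismatch".

```shell
# Run a command $1 times and print how many runs reported "hash mismatch".
count_hash_mismatches() {
  attempts=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # The SSH client prints the error on stderr, so merge it into stdout.
    # stdin is redirected so the command cannot consume the caller's input.
    if "$@" 2>&1 </dev/null | grep -q 'hash mismatch'; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$failures"
}
```

Against a test instance this might be invoked as `count_hash_mismatches 10000 ssh -o BatchMode=yes -p 29418 localhost gerrit version`, where the host and the choice of `gerrit version` as a harmless command are assumptions for the example.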

^d already said he'll discuss deploying the war with greg-g. So we'll
hopefully see it live soon.

I have upgraded Gerrit on my test instance integration-dev.eqiad.wmflabs. No hash mismatch is triggered any more when running the following for a while:

while true; do ssh -p 29418 localhost; done;

Since this bug has been around for a while and has affected quite some
people, I've been asked to give a short explanation of the root issue
and what SSHD-330 does.

Gerrit uses Apache Mina's SSHD [1] as ssh server. When connecting to
gerrit through ssh, this ssh server uses Java's own crypto/security
implementations to negotiate session keys (i.e.: different for each
connection attempt) with the client. Java's default provider yielded
those session keys without leading zero bytes, and Apache Mina's SSHD
relied on no leading zero bytes being present.

But at some point Java [2] changed behaviour and is no longer
stripping leading zero bytes, but Apache Mina SSHD still relied on no
leading zero bytes being present. Hence assumptions mismatched and
caused the issue.

The Java we use at gerrit.wikimedia.org is recent enough to no longer
strip leading zero bytes. So when connecting to our gerrit through
ssh, either

  • the negotiated session key starts with a non-zero byte, and everything works nicely. This case happens most of the time.
  • the negotiated session key starts with a zero byte. Then gerrit's built-in Apache Mina SSHD falsely treats the key as if there were no leading zero bytes and the connection setup with the client fails.

SSHD-330 adds stripping of leading zero bytes from the session key to
Apache Mina SSHD and thereby fixes the issue we are seeing.
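
To make the zero-byte case concrete, here is a small sketch of the stripping step, using hex strings as stand-ins for the key bytes. The function name strip_leading_zero_bytes is made up for this illustration; the real fix is in Apache Mina SSHD's Java code and is not reproduced here.

```shell
# Strip leading zero bytes from a hex-encoded byte string
# (two hex digits per byte), keeping at least one byte.
strip_leading_zero_bytes() {
  hex=$1
  while :; do
    case $hex in
      00??*) hex=${hex#00} ;;  # drop one leading zero byte
      *) break ;;              # first byte is non-zero, or only one byte is left
    esac
  done
  printf '%s\n' "$hex"
}

strip_leading_zero_bytes 00a1b2c3   # prints a1b2c3
strip_leading_zero_bytes a1b2c3     # prints a1b2c3 (unchanged)
```

If one side feeds 00a1b2c3 into its hash and the other side feeds a1b2c3, the two hashes differ, which is consistent with the "hash mismatch" message seen by the client.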


There was recently some FUD around OpenSSL generated keys not being
affected. That did not work for me, and I do not see in code how this
would make a difference.

Also, there was some recent discussion around extracting the keys from
the keystore to proper files. I did not get a chance to try that, but
that could do the trick too ... indirectly.
Because in order to get gerrit to use keys from separate files, one
needs to install BouncyCastle libraries to gerrit. BouncyCastle will
act as provider for the needed security/crypto functionality and
get used instead of Java's default providers. As the BouncyCastle
providers (for now) also strip the leading zero bytes, that could
work out.

Regardless, having Apache Mina SSHD strip leading zero bytes seems
most reliable, so we backported Apache Mina SSHD's upstream fix to
the version used in our gerrit, and rebuilt gerrit using that
custom-built Apache Mina SSHD.

[1] https://mina.apache.org/sshd-project/
[2] I know that OpenJDK versions up to

OpenJDK Runtime Environment (IcedTea7 2.2.1) (Gentoo build 1.7.0_05-b21)

work and the default providers strip the leading zeros, while the ones from

OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1~0.12.04.2)

do not strip them.

Thanks Krinkle for the pointer to SSHD-330!

Change 143388 merged by Chad:
Upgrade sshd to include the fix for hash mismatch

https://gerrit.wikimedia.org/r/143388

The fix has been deployed at gerrit.wikimedia.org.

(In reply to christian from comment #28)


And thank you for the analysis and the informative summary -- well done!