Bug #23455

closed

Upgrade and migration 1.15.6 to 1.16.1 (with Katello) - Foreman unstable and unusable

Added by Josh Pavel about 6 years ago. Updated 9 months ago.

Status: Rejected
Priority: Urgent
Assignee:
Category: Upgrades
Target version: -
Difficulty:
Triaged: No
Fixed in Releases:
Found in Releases:

Description

We are moving/migrating/upgrading our Foreman installation.

The backup/restore/upgrade went well (I hit https://bugzilla.redhat.com/show_bug.cgi?id=1556819, but the workaround offered there worked).

As I began attaching the smart-proxies back to the new Foreman, the first few went without issue. Eventually, though, I began to get flooded with messages like this in the log:

Apr 30 12:41:40 foreman-01 qpidd: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.c0500867-d215-408e-a1ed-f8ed8b5e4075
Apr 30 12:41:40 foreman-01 qpidd32429: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.c0500867-d215-408e-a1ed-f8ed8b5e4075
Apr 30 12:41:40 foreman-01 qpidd32429: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.8278fc78-0b10-4ef5-b2de-b22b3e502007
Apr 30 12:41:40 foreman-01 qpidd: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.8278fc78-0b10-4ef5-b2de-b22b3e502007
Apr 30 12:41:44 foreman-01 qpidd32429: 2018-04-30 12:41:44 [Protocol] error Error on attach: Node not found: pulp.agent.a3754ba6-6067-4c46-899a-248cabc4a2d8
Apr 30 12:41:44 foreman-01 qpidd: 2018-04-30 12:41:44 [Protocol] error Error on attach: Node not found: pulp.agent.a3754ba6-6067-4c46-899a-248cabc4a2d8
Apr 30 12:41:45 foreman-01 qpidd: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.033fe747-c608-4207-9176-a17cdedbff49
Apr 30 12:41:45 foreman-01 qpidd32429: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.033fe747-c608-4207-9176-a17cdedbff49
Apr 30 12:41:45 foreman-01 qpidd32429: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.af48ce1d-a58d-4b58-bf0f-c005e2c754fc
Apr 30 12:41:45 foreman-01 qpidd: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.af48ce1d-a58d-4b58-bf0f-c005e2c754fc
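
To gauge how many distinct agent queues are behind a flood like this, the log lines can be summarized per queue. The following is a minimal sketch, assuming the syslog format shown above; the function name `count_missing_nodes` is hypothetical and not part of any Foreman, Katello, or Qpid tooling.

```python
import re
from collections import Counter

# Matches the queue name in qpidd "Node not found" errors, as seen in the log above.
NODE_NOT_FOUND = re.compile(
    r"Error on attach: Node not found: (pulp\.agent\.[0-9a-f-]+)"
)


def count_missing_nodes(lines):
    """Return a Counter mapping each missing pulp.agent queue to its error count."""
    counts = Counter()
    for line in lines:
        match = NODE_NOT_FOUND.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the log excerpt above, used as sample input.
sample = [
    "Apr 30 12:41:40 foreman-01 qpidd: 2018-04-30 12:41:40 [Protocol] error "
    "Error on attach: Node not found: pulp.agent.c0500867-d215-408e-a1ed-f8ed8b5e4075",
    "Apr 30 12:41:44 foreman-01 qpidd: 2018-04-30 12:41:44 [Protocol] error "
    "Error on attach: Node not found: pulp.agent.a3754ba6-6067-4c46-899a-248cabc4a2d8",
]
for queue, hits in count_missing_nodes(sample).most_common():
    print(hits, queue)
```

A small number of distinct queues repeated many times would point at a few agents retrying, while a large number would suggest the broker lost its queue definitions wholesale.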

...and I notice that the UI locks up and stays unresponsive until a restart, after which it is only operable for a few minutes.

There are constantly 5 or 6 postgres tasks taking up a full core that do not seem to resolve themselves, even when left to run for several hours.

33848 | -00:00:01.183384 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
33963 | -00:00:03.247328 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
33928 | -00:00:04.361622 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
33889 | -00:00:04.430823 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
33780 | -00:00:04.430876 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
33814 | -00:00:04.431819 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
Actions #1

Updated by Josh Pavel about 6 years ago

I enabled sql logging, and it appeared that the postgres queries that were holding up everything were related to processing OpenSCAP. I turned off OpenSCAP and ran vacuum analyze, and things seem to be much better. I am still getting the qpidd/pulp.agent errors.

Actions #2

Updated by Andrew Kofink about 6 years ago

  • Tracker changed from Support to Bug

Can you confirm that your backup/restore restored /var/lib/qpidd?

Actions #3

Updated by Josh Pavel about 6 years ago

I believe it was not:

[root@foreman-01 katello-backup-20180428152442]# ls -la
total 249345816
drwxrwx---. 2 root postgres 4096 Apr 28 20:29 .
drwxrwx---. 6 root postgres 12288 Apr 30 07:57 ..
-rw-r--r--. 1 root root 352057984 Apr 28 15:42 config_files.tar.gz
-rw-r--r--. 1 root root 25013972 Apr 28 15:42 .config.snar
-rw-r--r--. 1 root root 41008 Apr 28 15:24 metadata.yml
-rw-r--r--. 1 root root 14166466469 Apr 28 18:29 mongo_data.tar.gz
-rw-r--r--. 1 root root 126 Apr 28 18:29 .mongo.snar
-rw-r--r--. 1 root root 37676653124 Apr 28 18:08 pgsql_data.tar.gz
-rw-r--r--. 1 root root 44445 Apr 28 18:08 .postgres.snar
-rw-r--r--. 1 root root 202906142720 Apr 28 17:10 pulp_data.tar
-rw-r--r--. 1 root root 203443453 Apr 28 17:10 .pulp.snar

Actions #4

Updated by John Mitsch about 6 years ago

  • Assignee set to Christine Fouant

The directory /var/lib/qpidd should be in config_files.tar.gz. You can list the directories the archive contains with vim or with tar -tf config_files.tar.gz
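
The same check can be scripted so it gives a yes/no answer instead of a long listing. This is a minimal sketch using Python's standard `tarfile` module; the helper name `archive_contains` is hypothetical, and it assumes member names are stored relative to the filesystem root (for example `var/lib/qpidd/...`), which may not match how every backup tool writes its archives.

```python
import tarfile


def _normalize(name):
    """Drop leading './' segments and surrounding slashes from a member name."""
    while name.startswith("./"):
        name = name[2:]
    return name.strip("/")


def archive_contains(archive_path, prefix):
    """Return True if any member of the tar archive is `prefix` or lies under it."""
    prefix = _normalize(prefix)
    # "r:*" lets tarfile detect gzip/bzip2/xz compression on its own.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            name = _normalize(member.name)
            if name == prefix or name.startswith(prefix + "/"):
                return True
    return False


# Example call (the path is whatever your backup directory contains):
# archive_contains("config_files.tar.gz", "/var/lib/qpidd")
```

Matching on `prefix + "/"` rather than the bare prefix avoids a false positive from a sibling directory whose name merely starts with `qpidd`.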

Actions #5

Updated by Justin Sherrill about 6 years ago

  • Status changed from New to Need more information

Josh, were you able to check that tar file?

Actions #6

Updated by Josh Pavel almost 6 years ago

Yes - I extracted config_files.tar.gz, and the only directories it had under /var/lib were "candlepin" and "puppet" - nothing for qpidd.

The ongoing issue I have is that qdrouterd does not seem to be functional; even the Foreman server itself can't connect. I have this repeatedly in the log:
2018-05-21 20:57:29.586236 +0000 SERVER (info) Connection from <IP>:47878 (to 0.0.0.0:5646) failed: amqp:connection:framing-error SSL Failure: Unknown error

Actions #7

Updated by Andrew Kofink almost 6 years ago

  • Assignee changed from Christine Fouant to John Mitsch

Actions #8

Updated by John Mitsch almost 6 years ago

Discussed this off-thread, where I suggested tarring /var/lib/qpidd on the old machine and untarring it on the new machine. I'm not sure why qpidd was missed in the backup; according to the code logic, it should be included.

Actions #9

Updated by Josh Pavel almost 6 years ago

As John stated, I manually copied the .qpidd data over from the old server to the new one. That is now in place, but I am still having issues which, from what I can tell, are related to qdrouterd.

As I see qdrouterd SSL errors, I looked at the config:
ssl-profile {
    name: server
    cert-db: /etc/pki/katello/certs/katello-default-ca.crt
    cert-file: /etc/pki/katello/qpid_router_server.crt
    key-file: /etc/pki/katello/qpid_router_server.key
}

If I look at /etc/pki/katello-certs-tools/certs/katello-default-ca.crt, I see the old hostname in there (and it has a date from 2017, before the server rename/migration).

The errors in qdrouterd.log that I see are:
2018-05-31 15:04:46.277338 +0000 SERVER (info) Connection from <redacted IP>:58952 (to 0.0.0.0:5646) failed: amqp:connection:framing-error SSL Failure: Unknown error
and
2018-05-31 15:04:56.347872 +0000 SERVER (info) Connection from <redacted IP>:41064 (to 0.0.0.0:5646) failed: amqp:connection:framing-error No valid protocol header found

Those seem coupled with:
2018-05-31 15:04:46.277869 +0000 ROUTER_LS (info) Link to Neighbor Router Lost - link_tag=8

Actions #10

Updated by John Mitsch almost 6 years ago

Josh,

Were you able to resolve this issue?

-John

Actions #11

Updated by Josh Pavel almost 6 years ago

Unfortunately no. qdrouterd is still full of errors, and all hosts report "unknown" status, even with katello-agent running. I believe the certs are mismatched with the hostname. How can I regenerate those?

Actions #12

Updated by Justin Sherrill almost 6 years ago

You could try using katello-change-hostname to change to a new hostname (and then potentially change it back). That should force the regeneration of all the certs.

Do you have multiple smart proxies running qdrouterd? There is currently a bug in proton that causes all qdrouterd instances to start erroring if any one smart proxy has an SSL issue: https://issues.apache.org/jira/browse/PROTON-1587

What made you think there is some cert hostname mismatch? Have you found any evidence of that?

Actions #13

Updated by Josh Pavel almost 6 years ago

I suspect it because the CA certificate's Issuer, Subject, and X509v3 Authority Key Identifier all contain the original hostname.
/etc/pki/katello/qpid_router_server.crt's Issuer is the old hostname; the subject is the new name, and the X509v3 Authority Key Identifier is the old name.
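
For anyone wanting to repeat this comparison across several certificate files, the Issuer and Subject names can be pulled out programmatically. This is a minimal sketch that shells out to the `openssl x509` command and extracts the CN from each field; the function names are hypothetical, the hostnames in the comments are placeholders, and the parser assumes the CN is present in both fields. It only reports what the certificate says and does not by itself decide whether a mismatch is the cause of the qdrouterd errors.

```python
import re
import subprocess

# CN as printed by openssl, in either "/CN=host" or ", CN = host" style.
CN_PATTERN = re.compile(r"(?<![A-Za-z])CN\s*=\s*([^,/\n]+)")


def common_name(distinguished_name):
    """Return the CN from a distinguished-name string, or None if absent."""
    match = CN_PATTERN.search(distinguished_name)
    return match.group(1).strip() if match else None


def parse_issuer_subject(openssl_output):
    """Map 'issuer' and 'subject' to their CNs from `openssl x509` text output."""
    names = {}
    for line in openssl_output.splitlines():
        field, _, value = line.partition("=")
        field = field.strip().lower()
        if field in ("issuer", "subject"):
            names[field] = common_name(value)
    return names


def read_cert_names(cert_path):
    """Run openssl on a PEM certificate; requires the openssl CLI on PATH."""
    result = subprocess.run(
        ["openssl", "x509", "-in", cert_path, "-noout", "-issuer", "-subject"],
        check=True, capture_output=True, text=True,
    )
    return parse_issuer_subject(result.stdout)


# Example call, using the path from the ssl-profile quoted earlier:
# read_cert_names("/etc/pki/katello/qpid_router_server.crt")
# might return {"issuer": "old-host.example.com", "subject": "new-host.example.com"}
```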

Actions #14

Updated by Jonathon Turel almost 6 years ago

Did you observe that behavior after running katello-change-hostname? You may want to give it a try!

Actions #15

Updated by Ewoud Kohl van Wijngaarden 9 months ago

  • Status changed from Need more information to Rejected

Closing because of its age.
