Bug #23455: Upgrade and migration 1.15.6 to 1.16.1 (with Katello) - Foreman unstable and unusable - Katello - Foreman

closed

Added by Josh Pavel about 6 years ago. Updated 9 months ago.

Status: Rejected
Priority: Urgent
Assignee: John Mitsch
Category: Upgrades

Description

We are moving/migrating/upgrading our Foreman installation.

The backup/restore/upgrade went well (I hit https://bugzilla.redhat.com/show_bug.cgi?id=1556819, but the workaround offered there worked).

As I began attaching the smart-proxies back to the new Foreman, the first few went without issue. Eventually, though, I began to get flooded with messages like this in the log:

Apr 30 12:41:40 foreman-01 qpidd: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.c0500867-d215-408e-a1ed-f8ed8b5e4075
Apr 30 12:41:40 foreman-01 qpidd[32429]: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.c0500867-d215-408e-a1ed-f8ed8b5e4075
Apr 30 12:41:40 foreman-01 qpidd[32429]: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.8278fc78-0b10-4ef5-b2de-b22b3e502007
Apr 30 12:41:40 foreman-01 qpidd: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.8278fc78-0b10-4ef5-b2de-b22b3e502007
Apr 30 12:41:44 foreman-01 qpidd[32429]: 2018-04-30 12:41:44 [Protocol] error Error on attach: Node not found: pulp.agent.a3754ba6-6067-4c46-899a-248cabc4a2d8
Apr 30 12:41:44 foreman-01 qpidd: 2018-04-30 12:41:44 [Protocol] error Error on attach: Node not found: pulp.agent.a3754ba6-6067-4c46-899a-248cabc4a2d8
Apr 30 12:41:45 foreman-01 qpidd: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.033fe747-c608-4207-9176-a17cdedbff49
Apr 30 12:41:45 foreman-01 qpidd[32429]: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.033fe747-c608-4207-9176-a17cdedbff49
Apr 30 12:41:45 foreman-01 qpidd[32429]: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.af48ce1d-a58d-4b58-bf0f-c005e2c754fc
Apr 30 12:41:45 foreman-01 qpidd: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.af48ce1d-a58d-4b58-bf0f-c005e2c754fc

...and I notice that the UI locks up and stays unresponsive until a restart, after which it is only usable for a few minutes.
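For anyone triaging a similar flood, a small script can reduce it to the distinct agent queues involved. This is only a sketch: it assumes the log lines look exactly like the ones quoted above, and `missing_agent_queues` is a hypothetical helper name, not part of Foreman, Katello, or qpid.

```python
import re
from collections import Counter

# Matches the qpidd error quoted above; the capture group is the UUID
# that follows "pulp.agent." (assumed to be a 36-character UUID).
NODE_NOT_FOUND = re.compile(
    r"Error on attach: Node not found: pulp\.agent\.([0-9a-f-]{36})"
)


def missing_agent_queues(lines):
    """Count 'Node not found' errors per pulp.agent UUID."""
    counts = Counter()
    for line in lines:
        match = NODE_NOT_FOUND.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the system log (for example an open file object) returns a `Counter` keyed by UUID, which shows how many distinct clients are looking for a queue that the broker no longer has.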

There are constantly 5 or 6 postgres processes each taking up a full core. They do not seem to finish on their own, even when left running for several hours.

33848 | -00:00:01.183384 | foreman   | SELECT  "sources".* FROM "sources" WHERE "sources"."digest" = $1  ORDER BY "sources"."id" ASC LIMIT 1
33963 | -00:00:03.247328 | foreman   | SELECT  "sources".* FROM "sources" WHERE "sources"."digest" = $1  ORDER BY "sources"."id" ASC LIMIT 1
33928 | -00:00:04.361622 | foreman   | SELECT  "sources".* FROM "sources" WHERE "sources"."digest" = $1  ORDER BY "sources"."id" ASC LIMIT 1
33889 | -00:00:04.430823 | foreman   | SELECT  "sources".* FROM "sources" WHERE "sources"."digest" = $1  ORDER BY "sources"."id" ASC LIMIT 1
33780 | -00:00:04.430876 | foreman   | SELECT  "sources".* FROM "sources" WHERE "sources"."digest" = $1  ORDER BY "sources"."id" ASC LIMIT 1
33814 | -00:00:04.431819 | foreman   | SELECT  "sources".* FROM "sources" WHERE "sources"."digest" = $1  ORDER BY "sources"."id" ASC LIMIT 1
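To make it easier to see that every busy backend is running the same statement, rows like these can be grouped by query text. The sketch below assumes the four columns are pid, a signed HH:MM:SS interval, database user, and query, which is a guess from the paste rather than something stated in this report; `parse_activity_row` and `count_by_query` are hypothetical names.

```python
from collections import Counter


def parse_activity_row(row):
    """Split one 'pid | interval | user | query' row into a dict."""
    # maxsplit=3 keeps any '|' inside the query text intact.
    pid, age, user, query = [field.strip() for field in row.split("|", 3)]
    sign = -1 if age.startswith("-") else 1
    hours, minutes, seconds = age.lstrip("+-").split(":")
    age_seconds = sign * (int(hours) * 3600 + int(minutes) * 60 + float(seconds))
    return {"pid": int(pid), "age_seconds": age_seconds, "user": user, "query": query}


def count_by_query(rows):
    """Count how many backends are running each distinct query text."""
    return Counter(parse_activity_row(row)["query"] for row in rows if row.strip())
```

With the six rows above, `count_by_query` would contain a single key, which points at one query shape rather than several unrelated slow statements.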

Assignee changed from Christine Fouant to John Mitsch

Updated by John Mitsch almost 6 years ago

Discussed this off-thread, where I suggested archiving /var/lib/qpidd with tar on the old machine and extracting it on the new machine. I'm not sure why qpidd was missed in the backup; according to the code logic, it should be included.

Updated by Josh Pavel almost 6 years ago

As John stated, I manually copied over the .qpidd data from the old server to the new. That is now in place, but I am still having issues which, from what I can tell, are related to qdrouterd.

Since I see qdrouterd SSL errors, I looked at the config:
ssl-profile {
    name: server
    cert-db: /etc/pki/katello/certs/katello-default-ca.crt
    cert-file: /etc/pki/katello/qpid_router_server.crt
    key-file: /etc/pki/katello/qpid_router_server.key
}

If I look at /etc/pki/katello-certs-tools/certs/katello-default-ca.crt, I see the old hostname in there (and it has a date from 2017, before the server rename/migration).

The errors in qdrouterd.log that I see are:
2018-05-31 15:04:46.277338 +0000 SERVER (info) Connection from <redacted IP>:58952 (to 0.0.0.0:5646) failed: amqp:connection:framing-error SSL Failure: Unknown error
and
2018-05-31 15:04:56.347872 +0000 SERVER (info) Connection from <redacted IP>:41064 (to 0.0.0.0:5646) failed: amqp:connection:framing-error No valid protocol header found

Those seem coupled with:
2018-05-31 15:04:46.277869 +0000 ROUTER_LS (info) Link to Neighbor Router Lost - link_tag=8

#10

Updated by John Mitsch almost 6 years ago

Josh,

Were you able to resolve this issue?

-John

#11

Updated by Josh Pavel almost 6 years ago

Unfortunately no. Qdrouterd is still full of errors, and all hosts report "unknown" status, even with katello-agent running. I believe the certs are mismatched with the hostname. How can I regenerate those?

#12

Updated by Justin Sherrill almost 6 years ago

You could try using katello-change-hostname to change to a new hostname (and then potentially change it back). That should force the regeneration of all the certs.

Do you have multiple smart proxies running qdrouterd? There is currently a bug in proton that causes all qdrouterd instances to start erroring if any one smart proxy has an SSL issue: https://issues.apache.org/jira/browse/PROTON-1587

What made you think there is some cert hostname mismatch? Have you found any evidence of that?

#13

Updated by Josh Pavel almost 6 years ago

I suspect it because the CA certificate's issuer, subject, and X509v3 Authority Key Identifier all contain the original hostname.
In /etc/pki/katello/qpid_router_server.crt, the Issuer is the old hostname, the Subject is the new name, and the X509v3 Authority Key Identifier is the old name.
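For others who want to check the same fields on their own system, here is a minimal sketch that shells out to the `openssl x509` command to print a certificate's issuer, subject, and validity dates. It assumes the `openssl` CLI is installed and on the PATH; `openssl_inspect_cmd` and `inspect_cert` are hypothetical helper names, and the certificate path is the one from the qdrouterd config quoted earlier in this thread.

```python
import subprocess


def openssl_inspect_cmd(cert_path):
    """Build an openssl command that prints issuer, subject, and validity dates."""
    return ["openssl", "x509", "-noout", "-issuer", "-subject", "-dates", "-in", cert_path]


def inspect_cert(cert_path):
    """Run openssl on a PEM certificate and return its text output.

    Raises subprocess.CalledProcessError if openssl cannot read the file.
    """
    result = subprocess.run(
        openssl_inspect_cmd(cert_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `inspect_cert("/etc/pki/katello/qpid_router_server.crt")` on the Foreman server prints the fields needed to compare the old and new hostnames. The Authority Key Identifier is not included in this output; `openssl x509 -noout -text` shows the extensions as well.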

#14

Updated by Jonathon Turel almost 6 years ago

Did you observe that behavior after running katello-change-hostname? You may want to give it a try!

#15

Updated by Ewoud Kohl van Wijngaarden 9 months ago

Status changed from Need more information to Rejected

Closing because of its age.
