Adopting the data plane and recovering a Compute node that failed to adopt

Adopting the {rhos_long} data plane involves the following steps:

  1. Stop any remaining services on the {rhos_prev_long} ({OpenStackShort}) {rhos_prev_ver} control plane.

  2. Deploy the required custom resources.

  3. Perform a fast-forward upgrade on Compute services from {OpenStackShort} {rhos_prev_ver} to {rhos_acro} {rhos_curr_ver}.

  4. If applicable, adopt Networker nodes to the {rhos_acro} data plane.

After the {rhos_acro} control plane manages the newly deployed data plane, you must not re-enable services on the {OpenStackShort} {rhos_prev_ver} control plane and data plane. If you re-enable services, workloads are managed by two control planes or two data planes, resulting in data corruption, loss of control of existing workloads, inability to start new workloads, or other issues.

Stopping infrastructure management and Compute services

You must stop the cloud Controller nodes, database nodes, and messaging nodes on the {rhos_prev_long} {rhos_prev_ver} control plane. Do not stop nodes that are running the Compute, Storage, or Networker roles.

The following procedure applies to a single-node standalone {OpenStackPreviousInstaller} deployment. You must remove conflicting repositories and packages from your Compute hosts so that you can install the libvirt packages when these hosts are adopted as data plane nodes, where modular libvirt daemons no longer run in podman containers. A sketch of this clean-up appears at the end of the prerequisites below.

Prerequisites
  • Define the shell variables. Replace the following example values with values that apply to your environment:

    EDPM_PRIVATEKEY_PATH="~/install_yamls/out/edpm/ansibleee-ssh-key-id_rsa"
    declare -A computes
    computes=(
      ["compute02.localdomain"]="172.22.0.110"
      ["compute03.localdomain"]="172.22.0.112"
    )
  • Stop the remaining Pacemaker-managed services (Galera, HAProxy, RabbitMQ) on the control plane:

    # The CONTROLLER<i>_SSH variables must point at your controllers,
    # for example: CONTROLLER1_SSH="ssh -i <key_path> root@controller-0"
    PacemakerResourcesToStop=(
                    "galera-bundle"
                    "haproxy-bundle"
                    "rabbitmq-bundle")

    echo "Stopping pacemaker services"
    for i in {1..3}; do
        SSH_CMD=CONTROLLER${i}_SSH
        if [ ! -z "${!SSH_CMD}" ]; then
            echo "Using controller $i to run pacemaker commands"
            for resource in ${PacemakerResourcesToStop[*]}; do
                # Disable the resource only if it exists on this deployment
                if ${!SSH_CMD} sudo pcs resource config $resource; then
                    ${!SSH_CMD} sudo pcs resource disable $resource
                fi
            done
            break
        fi
    done
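
  • Remove the conflicting repositories and packages from all Compute hosts. The exact repository IDs and package set are environment-specific; the following is a minimal sketch that assumes root SSH access with the key and computes map defined above, and uses an example package glob:

    # Sketch only: disable the old repositories and remove conflicting
    # packages so that the modular libvirt daemons can be installed later.
    for host in "${!computes[@]}"; do
        echo "Cleaning up ${host}"
        SSH="ssh -i $EDPM_PRIVATEKEY_PATH root@${computes[$host]}"
        # Disable the old repositories; adoption re-registers the host later
        $SSH "sudo subscription-manager repos --disable='*'"
        # Remove conflicting packages (example glob; adjust for your environment)
        $SSH "sudo dnf -y remove '*libvirt*'"
    done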

Preparing compute03 to fail

In this section, you misconfigure DNS on the compute03 node so that the adoption process fails while downloading packages.

Procedure
  • From the bastion, ssh to the compute03 node:

    ssh -i /home/lab-user/.ssh/my-guidkey.pem cloud-user@compute03

  • Back up and truncate /etc/resolv.conf:

    sudo cp /etc/resolv.conf /root/resolv.conf.bck
    sudo truncate -s 0 /etc/resolv.conf

Adopting Compute services to the {rhos_acro} data plane with a failed Compute node

Adopt your Compute (nova) services to the {rhos_long} data plane.

Prerequisites
  • You have stopped the remaining control plane services and removed the conflicting repositories and packages from the {compute_service_first_ref} hosts. For more information, see Stopping infrastructure management and Compute services.

  • From the bastion, create the data plane network (IPAM):

    cd /home/lab-user/labrepo/content/files/
    oc apply -f osp-ng-dataplane-netconfig-adoption.yaml
  • Get the libvirt secret password:

    LIBVIRT_PASSWORD=$(cat ~/tripleo-standalone-passwords.yaml | grep ' LibvirtTLSPassword:' | awk -F ': ' '{ print $2; }')
    LIBVIRT_PASSWORD_BASE64=$(echo -n "$LIBVIRT_PASSWORD" | base64)
  • Create the libvirt secret:

    oc apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: libvirt-secret
      namespace: openstack
    type: Opaque
    data:
      LibvirtPassword: ${LIBVIRT_PASSWORD_BASE64}
    EOF
  • You have defined the shell variables to run the script that runs the fast-forward upgrade:

    PODIFIED_DB_ROOT_PASSWORD=$(oc get -o json secret/osp-secret | jq -r .data.DbRootPassword | base64 -d)
    
    alias openstack="oc exec -t openstackclient -- openstack"
    declare -A computes
    computes=(
      ["compute02.localdomain"]="172.22.0.110"
      ["compute03.localdomain"]="172.22.0.112"
    )
Procedure
  1. Create an SSH authentication secret for the data plane nodes:

    oc create secret generic dataplane-ansible-ssh-private-key-secret \
    --save-config \
    --dry-run=client \
    --from-file=authorized_keys=/home/lab-user/.ssh/my-guidkey.pub \
    --from-file=ssh-privatekey=/home/lab-user/.ssh/my-guidkey.pem \
    --from-file=ssh-publickey=/home/lab-user/.ssh/my-guidkey.pub \
    -n openstack \
    -o yaml | oc apply -f-
  2. Generate an SSH key pair and create the nova-migration-ssh-key secret:

    cd "$(mktemp -d)"
    ssh-keygen -f ./id -t ecdsa-sha2-nistp521 -N ''
    oc get secret nova-migration-ssh-key || oc create secret generic nova-migration-ssh-key \
      -n openstack \
      --from-file=ssh-privatekey=id \
      --from-file=ssh-publickey=id.pub \
      --type kubernetes.io/ssh-auth
    rm -f id*
    cd -
  3. Because this environment uses a local storage back end for libvirt, create a nova-compute-extra-config service to remove the pre-fast-forward upgrade workarounds and configure the Compute services to use the local storage back end:

    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nova-extra-config
      namespace: openstack
    data:
      19-nova-compute-cell1-workarounds.conf: |
        [workarounds]
        disable_compute_service_check_for_ffu=true
    EOF
    The secret nova-cell<X>-compute-config auto-generates for each cell<X>. You must specify values for the nova-cell<X>-compute-config and nova-migration-ssh-key parameters for each custom OpenStackDataPlaneService CR that is related to the {compute_service}.

    The resources in the ConfigMap contain cell-specific configurations.
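
    For illustration, a custom Compute service that consumes these data sources might look like the following sketch; the nova-custom name is hypothetical, and only the dataSources wiring matters here:

    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneService
    metadata:
      name: nova-custom                       # hypothetical custom service
      namespace: openstack
    spec:
      edpmServiceType: nova
      playbook: osp.edpm.nova
      dataSources:
        - secretRef:
            name: nova-cell1-compute-config   # auto-generated for cell1
        - secretRef:
            name: nova-migration-ssh-key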

  4. Create a secret for the subscription manager:

    oc create secret generic subscription-manager \
    --from-literal rhc_auth='{"login": {"username": "<subscription_manager_username>", "password": "<subscription_manager_password>"}}'
    • Replace <subscription_manager_username> with the applicable user name.

    • Replace <subscription_manager_password> with the applicable password.

  5. Create a secret for the Red Hat registry:

    oc create secret generic redhat-registry \
    --from-literal edpm_container_registry_logins='{"registry.redhat.io": {"<registry_username>": "<registry_password>"}}'
    • Replace <registry_username> with the applicable user name.

    • Replace <registry_password> with the applicable password.

  6. Create the OpenStackDataPlaneNodeSet CRs corresponding to compute02 and compute03:

    oc apply -f osp-ng-dataplane-node-set-deploy-adoption-compute-2.yaml
    oc apply -f osp-ng-dataplane-node-set-deploy-adoption-compute-3.yaml
    • Take some time to go through the osp-ng-dataplane-node-set-deploy-adoption-compute-2.yaml and osp-ng-dataplane-node-set-deploy-adoption-compute-3.yaml files. The edpm_ovn_bridge_mappings variable is configured with "datacentre:br-ex", as shown in the excerpt below.
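
      For reference, the relevant fragment of those node sets looks roughly like the following illustrative excerpt (not a complete CR):

      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneNodeSet
      metadata:
        name: compute-2
      spec:
        nodeTemplate:
          ansible:
            ansibleVars:
              # Map the datacentre physical network to the br-ex bridge
              edpm_ovn_bridge_mappings:
                - datacentre:br-ex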

  7. Run the pre-adoption validation:

    1. Create the validation service:

      cat << EOF | oc apply -f -
      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneService
      metadata:
        name: pre-adoption-validation
      spec:
        playbook: osp.edpm.pre_adoption_validation
      EOF
    2. Create an OpenStackDataPlaneDeployment CR that runs only the validation:

      cat << EOF | oc apply -f -
      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneDeployment
      metadata:
        name: openstack-pre-adoption
      spec:
        nodeSets:
        - compute-2
        - compute-3
        servicesOverride:
        - pre-adoption-validation
      EOF
    3. When the validation is finished, confirm that the status of the Ansible EE pods is Completed:

      watch oc get pod -l app=openstackansibleee
      oc logs -l app=openstackansibleee -f --max-log-requests 20
    4. Wait for the deployment to reach the Ready status:

      oc wait --for condition=Ready openstackdataplanedeployment/openstack-pre-adoption --timeout=10m

      If any openstack-pre-adoption validations fail, you must reference the Ansible logs to determine which ones were unsuccessful, and then try the following troubleshooting options:

      • If the hostname validation failed, check that the hostname of the data plane node is correctly listed in the OpenStackDataPlaneNodeSet CR.

      • If the kernel argument check failed, ensure that the kernel argument configuration in the edpm_kernel_args and edpm_kernel_hugepages variables in the OpenStackDataPlaneNodeSet CR is the same as the kernel argument configuration that you used in the {rhos_prev_long} ({OpenStackShort}) {rhos_prev_ver} node.

      • If the tuned profile check failed, ensure that the edpm_tuned_profile variable in the OpenStackDataPlaneNodeSet CR is configured to use the same profile as the one set on the {OpenStackShort} {rhos_prev_ver} node.
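
      For example, to read the validation log for compute03 directly, query the corresponding job. The job name below assumes the <service>-<deployment>-<nodeset> naming pattern visible in the job listings later in this lab; confirm the exact name with oc get jobs -n openstack:

      oc logs job/pre-adoption-validation-openstack-pre-adoption-compute-3 -n openstack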

  8. Remove the remaining {OpenStackPreviousInstaller} services:

    1. Create an OpenStackDataPlaneService CR to clean up the data plane services you are adopting:

      cat << EOF | oc apply -f -
      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneService
      metadata:
        name: tripleo-cleanup
      spec:
        playbook: osp.edpm.tripleo_cleanup
      EOF
    2. Create the OpenStackDataPlaneDeployment CR to run the clean-up:

      cat << EOF | oc apply -f -
      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneDeployment
      metadata:
        name: tripleo-cleanup
      spec:
        nodeSets:
        - compute-2
        - compute-3
        servicesOverride:
        - tripleo-cleanup
      EOF
  9. When the clean-up is finished, deploy the OpenStackDataPlaneDeployment CR:

    cat << EOF | oc apply -f -
    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: compute-adoption
    spec:
      nodeSets:
      - compute-2
      - compute-3
    EOF
  10. You should see that the compute02 jobs progress, but compute03 fails in the redhat service, which handles the subscription:

    oc get jobs -n openstack
    NAME                                   COMPLETIONS   DURATION   AGE
    bootstrap-compute-adoption-compute-2   0/1           29s        29s
    keystone-cron-29050201                 1/1           6s         8m38s
    redhat-compute-adoption-compute-2      1/1           3m14s      3m43s
    redhat-compute-adoption-compute-3      0/1           3m43s      3m43s
  1. If you check the compute03 logs, you can see errors downloading the RPM packages:

    oc logs job/redhat-compute-adoption-compute-3 -n openstack
    TASK [redhat.rhel_system_roles.rhc : Handle system subscription] ***************
    task path: /usr/share/ansible/collections/ansible_collections/redhat/rhel_system_roles/roles/rhc/tasks/main.yml:15
    included: /usr/share/ansible/collections/ansible_collections/redhat/rhel_system_roles/roles/rhc/tasks/subscription-manager.yml for compute03

    TASK [redhat.rhel_system_roles.rhc : Ensure required packages are installed] ***
    task path: /usr/share/ansible/collections/ansible_collections/redhat/rhel_system_roles/roles/rhc/tasks/subscription-manager.yml:3
    fatal: [compute03]: FAILED! => {"changed": false, "msg": "Failed to download metadata for repo 'openstack-17.1-for-rhel-9-x86_64-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried", "rc": 1, "results": []}

    NO MORE HOSTS LEFT *************************************************************

    PLAY RECAP *********************************************************************
    compute03                  : ok=5    changed=0    unreachable=0    failed=1    skipped=3    rescued=0    ignored=0
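
    To confirm the root cause in this lab, you can check the (now empty) resolver configuration on compute03 from the bastion:

    ssh -i /home/lab-user/.ssh/my-guidkey.pem cloud-user@compute03 cat /etc/resolv.conf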

Option 1: Recover compute03

  1. If you want to recover compute03, connect to the node:

    ssh -i /home/lab-user/.ssh/my-guidkey.pem cloud-user@compute03
  2. Revert the DNS configuration:

    sudo cp /root/resolv.conf.bck /etc/resolv.conf
  3. Delete the facts.d folder, which is used to mark the execution of bootstrap_command as completed. Deleting it allows the bootstrap command to run again:

    sudo rm -rf /etc/ansible/facts.d
  4. Create a new OpenStackDataPlaneDeployment CR that includes only the compute-3 node set, which corresponds to compute03:

    cat << EOF | oc apply -f -
    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: openstack-edpm-compute-recover-3
    spec:
      nodeSets:
      - compute-3
    EOF
  5. You should see that the compute03 jobs are now progressing:

    oc get jobs -n openstack
Verification
  1. Confirm that all the Ansible EE pods reach a Completed status:

    watch oc get pod -l app=openstackansibleee
    oc logs -l app=openstackansibleee -f --max-log-requests 30
  2. Wait for the data plane node set to reach the Ready status:

    oc wait --for condition=Ready osdpns/compute-3 --timeout=30m
  3. Verify that the {networking_first_ref} agents are running:

    oc exec openstackclient -- openstack network agent list
    +--------------------------------------+------------------------------+------------------------+-------------------+-------+-------+----------------------------+
    | ID                                   | Agent Type                   | Host                   | Availability Zone | Alive | State | Binary                     |
    +--------------------------------------+------------------------------+------------------------+-------------------+-------+-------+----------------------------+
    | 174fc099-5cc9-4348-b8fc-59ed44fcfb0e | DHCP agent                   | standalone.localdomain | nova              | :-)   | UP    | neutron-dhcp-agent         |
    | 10482583-2130-5b0d-958f-3430da21b929 | OVN Metadata agent           | standalone.localdomain |                   | :-)   | UP    | neutron-ovn-metadata-agent |
    | a4f1b584-16f1-4937-b2b0-28102a3f6eaa | OVN Controller agent         | standalone.localdomain |                   | :-)   | UP    | ovn-controller             |
    +--------------------------------------+------------------------------+------------------------+-------------------+-------+-------+----------------------------+

Option 2: Remove compute03 from your deployment

If you do not want to recover compute03 and instead want to remove it from your cloud, list the Compute services from the bastion:

    alias openstack="oc exec -t openstackclient -- openstack"
    openstack compute service list
    +--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
    | ID                                   | Binary         | Host                   | Zone     | Status  | State | Updated At                 |
    +--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
    | a1419dda-fff2-4da6-8e01-94b20ffe5ecd | nova-conductor | nova-cell0-conductor-0 | internal | enabled | up    | 2025-03-26T20:53:04.000000 |
    | ce2371aa-a0d6-4e17-b7ca-0a2eb54b4aef | nova-scheduler | nova-scheduler-0       | internal | enabled | up    | 2025-03-26T20:53:06.000000 |
    | cdb3c0b1-ccd0-43de-a2b2-754b10e5627b | nova-compute   | compute02.localdomain  | nova     | enabled | up    | 2025-03-26T20:53:06.000000 |
    | 092f5d2d-f4fb-48c2-8f55-fa83ceaa9f7a | nova-compute   | compute03.localdomain  | nova     | enabled | down  | 2025-03-26T16:59:12.000000 |
    | 5db4c90e-3cbb-4ba6-890d-5487e8e0b7fc | nova-conductor | nova-cell1-conductor-0 | internal | enabled | up    | 2025-03-26T20:53:12.000000 |
    +--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
  1. Delete the VM hosted on compute03:

    openstack server delete test-server-compute-03
  2. Delete the compute03 compute service:

    openstack compute service delete <UUID_of_compute03>
    • Replace <UUID_of_compute03> with the ID of the compute03 nova-compute service from the previous output.
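
    If you prefer not to copy the UUID by hand, you can look it up with the openstack CLI (a small sketch that uses the alias defined earlier):

    # Find the ID of the compute03 nova-compute service and delete it
    COMPUTE03_ID=$(openstack compute service list -f value -c ID -c Host | awk '/compute03/ {print $1}')
    openstack compute service delete "$COMPUTE03_ID"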
Next steps

  • Perform a fast-forward upgrade on the Compute services. For more information, see Performing a fast-forward upgrade on Compute services.

Performing a fast-forward upgrade on Compute services

You must upgrade the Compute services from {rhos_prev_long} {rhos_prev_ver} to {rhos_long} {rhos_curr_ver} on the control plane and data plane by completing the following tasks:

  • Update the cell1 Compute data plane services version.

  • Remove pre-fast-forward upgrade workarounds from the Compute control plane services and Compute data plane services.

  • Run Compute database online migrations to update live data.

Procedure
  1. Patch the OpenStackControlPlane CR to remove the pre-fast-forward upgrade workarounds from the Compute control plane services:

    oc patch openstackcontrolplane openstack -n openstack --type=merge --patch '
    spec:
      nova:
        template:
          cellTemplates:
            cell0:
              conductorServiceTemplate:
                customServiceConfig: |
                  [workarounds]
                  disable_compute_service_check_for_ffu=false
            cell1:
              metadataServiceTemplate:
                customServiceConfig: |
                  [workarounds]
                  disable_compute_service_check_for_ffu=false
              conductorServiceTemplate:
                customServiceConfig: |
                  [workarounds]
                  disable_compute_service_check_for_ffu=false
          apiServiceTemplate:
            customServiceConfig: |
              [workarounds]
              disable_compute_service_check_for_ffu=false
          metadataServiceTemplate:
            customServiceConfig: |
              [workarounds]
              disable_compute_service_check_for_ffu=false
          schedulerServiceTemplate:
            customServiceConfig: |
              [workarounds]
              disable_compute_service_check_for_ffu=false
    '
  2. Wait until the Compute control plane services CRs are ready:

    oc wait --for condition=Ready --timeout=300s Nova/nova
  3. Complete the steps in Adopting Compute services to the {rhos_acro} data plane.

Option 1: compute03 was recovered in the cluster

  1. Remove the pre-fast-forward upgrade workarounds from the Compute data plane services:

    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nova-extra-config
      namespace: openstack
    data:
      20-nova-compute-cell1-workarounds.conf: |
        [workarounds]
        disable_compute_service_check_for_ffu=false
    ---
    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: openstack-nova-compute-ffu
      namespace: openstack
    spec:
      nodeSets:
        - compute-2
        - compute-3
      servicesOverride:
        - nova
    EOF

Option 2: compute03 was removed from your cluster

  1. If you removed compute03 from your deployment, apply the following OpenStackDataPlaneDeployment CR instead:

    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nova-extra-config
      namespace: openstack
    data:
      20-nova-compute-cell1-workarounds.conf: |
        [workarounds]
        disable_compute_service_check_for_ffu=false
    ---
    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: openstack-nova-compute-ffu
      namespace: openstack
    spec:
      nodeSets:
        - compute-2
      servicesOverride:
        - nova
    EOF
    The service included in the servicesOverride key must match the name of the service that you included in the OpenStackDataPlaneNodeSet CR. For example, if you use a custom service called nova-custom, ensure that you add it to the servicesOverride key.
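
    For example, with a hypothetical custom service named nova-custom, the two lists must line up (illustrative excerpts, not complete CRs):

    # OpenStackDataPlaneNodeSet
    spec:
      services:
        - nova-custom
    ---
    # OpenStackDataPlaneDeployment
    spec:
      servicesOverride:
        - nova-custom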
  2. Wait for the Compute data plane services to be ready:

    oc wait --for condition=Ready openstackdataplanedeployment/openstack-nova-compute-ffu --timeout=5m
  3. Run Compute database online migrations to complete the fast-forward upgrade:

    oc exec -it nova-cell0-conductor-0 -- nova-manage db online_data_migrations
    oc exec -it nova-cell1-conductor-0 -- nova-manage db online_data_migrations
  4. Discover the Compute hosts in the cell:

    oc rsh nova-cell0-conductor-0 nova-manage cell_v2 discover_hosts --verbose
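
    Optionally, confirm the discovery by listing the hosts that are mapped to each cell (nova-manage cell_v2 list_hosts is a standard nova-manage command):

    oc rsh nova-cell0-conductor-0 nova-manage cell_v2 list_hosts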
Verification
  1. Verify if the existing test VM instance is running:

    openstack server --os-compute-api-version 2.48 show --diagnostics test-server-compute-02 2>&1 || echo FAIL
  2. Verify if the Compute services can stop the existing test VM instance:

    openstack server list -c Name -c Status -f value | grep -qF "test-server-compute-02 ACTIVE" && openstack server stop test-server-compute-02 || echo FAIL
    openstack server list -c Name -c Status -f value | grep -qF "test-server-compute-02 SHUTOFF" || echo FAIL
    openstack server --os-compute-api-version 2.48 show --diagnostics test-server-compute-02 2>&1 || echo PASS
  3. Verify if the Compute services can start the existing test VM instance:

    openstack server list -c Name -c Status -f value | grep -qF "test-server-compute-02 SHUTOFF" && openstack server start test-server-compute-02 || echo FAIL
    openstack server list -c Name -c Status -f value | grep -qF "test-server-compute-02 ACTIVE" && \
    openstack server --os-compute-api-version 2.48 show --diagnostics test-server-compute-02 --fit-width -f json | jq -r '.state' | grep running || echo FAIL
After the data plane adoption, the Compute hosts continue to run Red Hat Enterprise Linux (RHEL) {rhel_prev_ver}. To take advantage of RHEL {rhel_curr_ver}, perform a minor update procedure after finishing the adoption procedure.