Foreman Blog

Foreman Community Newsletter (June 2019)

We're back! 3 months of goodies, including the 1.22 release, my new beginning as community lead, conference roundup, and community survey results!

2019 Foreman Survey Analysis

As per usual, we ran a Foreman Community Survey in February in order to give you all the opportunity to tell us how we’re doing - where it’s good, and where it’s bad. That survey closed a while ago, and I’m here to show you the results.

Foreman Proxy Registration Protocol v2 explained

In Foreman 1.22 the Proxy registration protocol changed. This blog takes a look at how it was, how it is now, and why this is an improvement.

Classic Foreman Proxy registration

Before diving in, there are some things one must know. First of all, there is our main application, Foreman, a Ruby on Rails application. For this blog it doesn’t matter which plugins are installed, including Katello. It’s typically deployed via HTTPS on a URL like https://foreman.example.com. Some features also need HTTP (http://foreman.example.com), but that’s out of scope for this blog. When using the foreman-installer it defaults to using the Puppet CA for certificates, though this is mostly a historical convenience. In the Katello scenario a custom CA implementation is used instead.

Then there is also Foreman Proxy, a Sinatra application, written in Ruby, presenting a REST API. It’s known under multiple names: the git repository is named smart-proxy, and Katello has known it as Capsule. You may also see a reference to Content Proxy. For this blog we can consider these mostly the same; the difference is the typical deployment. Where the Proxy defaults to HTTPS on port 8443 (https://proxy.example.com:8443) and HTTP on port 8000 (http://proxy.example.com:8000), Katello changes the HTTPS port to 9090 because of a conflict with Candlepin. Like Foreman itself, the Proxy is deployed using the same certificate stack. That means the Puppet CA or the Katello-generated certificates. This is important to know because client certificates are used to authenticate.

Classic registration starts with the /features REST endpoint. You can easily query it since there’s no authentication. Using curl -s https://$(hostname -f):8443/features | jq we get the following result:

[
  "logs",
  "puppet",
  "tftp"
]

This is also what Foreman does: it queries the features. These are translated into Foreman’s internal representation (Logs, Puppet and TFTP in this case) and stored in its database. Later, when creating a host, the list of possible Puppet servers is the list of Proxies with the Puppet feature. Similarly, domains use the DNS feature to select potential servers, while subnets have a DHCP selection (for IP delegation) and a DNS selection for reverse DNS. This is the heart of provisioning orchestration in Foreman.
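
As a sketch of that orchestration logic (the names here are illustrative, not Foreman’s actual internals), selecting candidate servers boils down to filtering proxies on their registered feature names:

```ruby
# Illustrative sketch, not Foreman's actual model code: proxies are stored
# with the feature names they reported, and orchestration filters on them.
Proxy = Struct.new(:name, :features)

proxies = [
  Proxy.new('proxy01.example.com', %w[logs puppet tftp]),
  Proxy.new('proxy02.example.com', %w[dns dhcp])
]

# The list of possible Puppet servers is the list of proxies with 'puppet'.
def proxies_with_feature(proxies, feature)
  proxies.select { |p| p.features.include?(feature) }
end

puppet_servers = proxies_with_feature(proxies, 'puppet').map(&:name)
# => ["proxy01.example.com"]
```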

The last (optional) component is the installer, which defaults to registering freshly installed Proxies. For this there are three critical parameters: a Foreman URL and an OAuth consumer key and secret. These are used to authenticate to the Foreman REST API. The installer ensures the proxy exists with the correct features. The Proxy starts up even if some features fail to initialize, but because the installer knows which features should be present, it can detect which features failed. It reports an error if a mismatch is found.
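
That verification step can be sketched as a simple set difference (a hypothetical helper; the real check lives inside the installer’s Ruby modules):

```ruby
# Hypothetical sketch of the installer's verification: compare the
# features the installer expects against what the Proxy registered.
def missing_features(expected, registered)
  expected - registered
end

expected   = %w[logs puppet tftp dhcp]
registered = %w[logs puppet tftp]  # dhcp failed to initialize

missing = missing_features(expected, registered)
warn "Feature mismatch: #{missing.join(', ')} failed to register" unless missing.empty?
```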

Version 2

In Foreman Proxy 1.22 a new /v2/features REST endpoint is introduced. It returns a richer data structure which can relay a lot more internal information. Because of this, authentication using client certificates is required. Using $ curl -s https://$(hostname -f):9090/v2/features --cert /etc/foreman-proxy/foreman_ssl_cert.pem --key /etc/foreman-proxy/foreman_ssl_key.pem | jq we get:

{
  "facts": {
    "http_enabled": false,
    "https_enabled": false,
    "settings": {},
    "state": "disabled",
    "capabilities": []
  },
  "dns": {
    "http_enabled": false,
    "https_enabled": false,
    "settings": {
      "use_provider": null
    },
    "state": "disabled",
    "capabilities": []
  },
  "templates": {
    "http_enabled": true,
    "https_enabled": true,
    "settings": {},
    "state": "running",
    "capabilities": []
  },
  "tftp": {
    "http_enabled": false,
    "https_enabled": true,
    "settings": {
      "tftp_servername": null
    },
    "state": "running",
    "capabilities": []
  },
  "dhcp": {
    "http_enabled": false,
    "https_enabled": false,
    "settings": {
      "use_provider": null
    },
    "state": "disabled",
    "capabilities": []
  },
  "puppetca": {
    "http_enabled": false,
    "https_enabled": false,
    "settings": {
      "use_provider": null
    },
    "state": "failed",
    "capabilities": []
  },
  "puppet": {
    "http_enabled": false,
    "https_enabled": true,
    "settings": {
      "puppet_url": "https://puppet.example.com:8140",
      "use_provider": [
        "puppet_proxy_puppet_api"
      ]
    },
    "state": "running",
    "capabilities": []
  },
  "bmc": {
    "http_enabled": false,
    "https_enabled": false,
    "settings": {},
    "state": "disabled",
    "capabilities": []
  },
  "realm": {
    "http_enabled": false,
    "https_enabled": false,
    "settings": {
      "use_provider": null
    },
    "state": "disabled",
    "capabilities": []
  },
  "logs": {
    "http_enabled": false,
    "https_enabled": true,
    "settings": {},
    "state": "running",
    "capabilities": []
  },
  "httpboot": {
    "http_enabled": false,
    "https_enabled": false,
    "settings": {},
    "state": "disabled",
    "capabilities": []
  }
}

As can be seen, it is now a mapping of the feature name to another map representing the feature’s state.

http_enabled and https_enabled are the result of enabled in the feature’s settings file. Its value can be http, https, true (both HTTP and HTTPS) or false (disabled). The installer configures most features to use HTTPS for increased security, but some features need to listen on HTTP. Templates and HTTPBoot are examples of this, because connecting clients (such as PXE firmware) may be unable to verify HTTPS certificates or may not support HTTPS at all.
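
The mapping from the enabled setting to the two flags can be sketched like this (a hypothetical helper, not the Proxy’s actual code):

```ruby
# Hypothetical helper mapping a feature's `enabled` setting to the
# http_enabled / https_enabled flags exposed by /v2/features.
def listener_flags(enabled)
  case enabled
  when 'http'  then { http_enabled: true,  https_enabled: false }
  when 'https' then { http_enabled: false, https_enabled: true  }
  when true    then { http_enabled: true,  https_enabled: true  }
  when false   then { http_enabled: false, https_enabled: false }
  else raise ArgumentError, "unknown enabled value: #{enabled.inspect}"
  end
end
```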

Where we previously relied on the installer to verify, the new data structure exposes state. The currently known values are uninitialized, loaded, starting, running, disabled and failed. As can be seen in the example, it’s mostly disabled or running. The interesting one is failed: the Foreman application itself can now detect misconfigurations, although this hasn’t been implemented yet. In the above output the puppetca feature has failed, which is also why it wasn’t present in the classic registration. At this moment the state is the internal state of the proxy and hasn’t been formalized.
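
A small sketch of what such a check could look like (a hypothetical script, not an actual tool): parse the /v2/features response and report anything in the failed state.

```ruby
require 'json'

# Hypothetical sketch: given a parsed /v2/features response,
# report every feature whose state is 'failed'.
def failed_features(v2_response)
  v2_response.select { |_name, info| info['state'] == 'failed' }.keys
end

response = JSON.parse('{"puppetca": {"state": "failed"}, "tftp": {"state": "running"}}')
failed_features(response)  # => ["puppetca"]
```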

Exposing settings

Every feature can have settings, each in its own settings file. While working with the proxy, it turned out these are useful to know on the Foreman side. One example is the TFTP server name: Foreman needs it to properly create the DHCP record, so a dedicated REST API call was created. That means that during provisioning we do an API call for a fairly static value. The templates plugin has a similar API to get the external server name, and there were other endpoints doing the same thing. Exposed settings allow us to create a unified API, with the bonus that Foreman no longer needs to call out during provisioning.

This does come at the cost that you have to refresh the registration after changing the configuration. Currently this is mitigated by the installer, since it refreshes the registration after config files change. In the future automatic refreshing can be implemented. With the classic registration that could cause problems, because it was impossible to know whether a feature had failed or was explicitly removed. The new API exposes sufficient information to make intelligent choices here, like not removing an existing but failed feature while not adding an unknown failed feature.
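
That refresh decision can be sketched as follows (illustrative logic, not Foreman’s actual implementation): keep a known feature that merely failed, skip an unknown failed feature, and drop anything the proxy no longer reports.

```ruby
# Illustrative refresh logic, not Foreman's actual implementation.
# existing: feature names currently registered in Foreman.
# report:   parsed /v2/features response (name => {'state' => ...}).
def refreshed_features(existing, report)
  report.each_with_object([]) do |(name, info), result|
    case info['state']
    when 'running'
      result << name                              # healthy: register or keep
    when 'failed'
      result << name if existing.include?(name)   # keep known, skip unknown failed
    end
    # 'disabled' features are neither added nor kept; features absent
    # from the report are dropped implicitly.
  end
end
```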

The observant reader will also notice that features that can have multiple providers (implementations) expose which one is active. This is mostly intended for debugging or informational purposes; Foreman should not make any decision based on a provider.

Providers also have their own config files and exposing those settings also works. It does mean a provider can’t expose a setting from its parent.

Implementing this in a feature is easy by using expose_setting in the Proxy Plugin DSL:

module Proxy::TFTP
  class Plugin < ::Proxy::Plugin
    plugin :tftp, ::Proxy::VERSION

    # ...

    expose_setting :tftp_servername
  end
end

This allows some more advanced use cases as well. Let’s consider the Puppet support. When a host has a Puppet Proxy assigned, it has a Puppet server hostname. Right now this is the same hostname that Foreman uses to connect to the Proxy. In multi-homing setups this breaks. Issue #26194 describes it in more detail, but the short summary is that if there’s a management LAN just for Foreman and its proxies and a WAN where clients reside, it currently means that clients need to connect to the management LAN interface, even if the Proxy connects to the WAN IP. Obviously this is undesired. By exposing the puppet_url setting on the Puppet Proxy plugin to the Foreman, it can provide the correct hostname to clients. Since URLs can contain ports, we can also support non-standard ports. As part of the Pulp 3 effort, smart_proxy_pulp started to expose this information. In the future this can also be used to provide services that are on external servers that are otherwise unreachable to the Foreman server.
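
Deriving the client-facing server and port from the exposed setting is simple URL parsing (a hypothetical helper, not Foreman code):

```ruby
require 'uri'

# Hypothetical sketch: derive the Puppet server name and port that
# clients should use from the exposed puppet_url setting.
def puppet_server_and_port(settings)
  url = URI.parse(settings.fetch('puppet_url'))
  [url.host, url.port]
end

puppet_server_and_port('puppet_url' => 'https://puppet.example.com:8140')
# => ["puppet.example.com", 8140]
```

Since the port comes from the URL, non-standard ports fall out of the same parsing for free.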

While this wasn’t part of my initial design, the new data structure allowed it almost for free.

Showing capabilities

This was the original reason for the whole change, which is why I still think of this as the capabilities effort. Interestingly enough the other fields have seen more use. Exposing failed modules has allowed a colleague to write a script assisting in debugging installations. Exposing settings has seen adoption in a few places. This is because they’re very easy to use. Capabilities require much more support on the Foreman side.

My initial desire was to expand the DNS implementation. Some providers are capable of managing zones themselves; PowerDNS and Infoblox provide full APIs for this that could be used. However, in the classic scenario there was no way of knowing whether the Proxy actually supported this other than trying, which is not a good user experience. If the Proxy could tell us it had the DNS_ZONE_MANAGEMENT capability, we could tell the user the domain was actually created. If it didn’t, we could warn the user that they need to manually ensure the domain is created. Perhaps more fine-grained access is also possible, but the idea is still that simple strings indicate a capability.
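
Such a check could look like this (a hypothetical sketch; DNS_ZONE_MANAGEMENT is the proposed capability string from this design, not a shipped one):

```ruby
# Hypothetical sketch: decide whether a domain can be auto-created
# based on the capabilities the DNS proxy reported via /v2/features.
def can_manage_zones?(dns_feature)
  dns_feature.fetch('capabilities', []).include?('DNS_ZONE_MANAGEMENT')
end

dns = { 'state' => 'running', 'capabilities' => ['DNS_ZONE_MANAGEMENT'] }
if can_manage_zones?(dns)
  puts 'Domain will be created on the DNS server automatically'
else
  puts 'Please ensure the domain exists on the DNS server'
end
```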

Luckily Justin, who actually did the hard work of making my proof of concept code production ready, uses this in smart_proxy_pulp. The use case for Pulp 3 was various file types. In Katello with Pulp 2 there’s a static configuration in katello.yaml, but with Pulp 3 we can query the Pulp API for supported file types and communicate this. Less configuration and more autodetection is a good user experience.

Implementation is straightforward using capability in the Proxy Plugin DSL. The static example is not actually in the code, but it illustrates the initial design.

module PulpProxy
  class Pulp3Plugin < ::Proxy::Plugin
    plugin "pulp3", ::PulpProxy::VERSION

    # ...

    # Static capability
    capability :pulp3

    # Dynamic capability that returns a single symbol or a list of symbols
    capability lambda { Pulp3Client.capabilities }
  end
end

This relies on a definition elsewhere that queries the Pulp API.

module PulpProxy
  class Pulp3Client
    # ...

    def self.capabilities
      body = JSON.parse(get("api/v3/status/").body)
      body['versions'].map{ |item| item['component'] }
    rescue => e
      logger.error("Could not fetch capabilities: #{e.message}")
      []
    end
  end
end

As always, relying on a separate service can cause failures. In this case we assume the Pulp 3 API is reliable; unlucky timing could hide all capabilities. A more advanced solution would regularly poll Pulp 3 and cache the result. When implementing capabilities it is advisable to cache like this, but that is out of scope for this blog.

Updating Foreman

The Foreman registration also needed to be updated. Previously it stored a Smart Proxy entity with Feature entities. To deal with the new data structures, the Feature entity was expanded to store a JSON-serialized array of capabilities and a JSON-serialized hash of settings. When refreshing features, Foreman first tries the /v2/features endpoint. Older proxies return a 404, in which case the older /features endpoint is queried and no settings or capabilities are available.
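
The fallback can be sketched like this (illustrative code, not Foreman’s actual implementation; `fetch` stands in for an HTTP client returning a status and body):

```ruby
require 'json'

# Illustrative sketch of the v2-with-fallback refresh. `fetch` is any
# callable returning [status, body] for a path on the proxy.
def proxy_features(fetch)
  status, body = fetch.call('/v2/features')
  if status == 404
    # Older proxy: fall back to the classic endpoint. Only feature
    # names are available, so settings and capabilities stay empty.
    _status, body = fetch.call('/features')
    JSON.parse(body).map { |name| [name, {}] }.to_h
  else
    JSON.parse(body)
  end
end
```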

Note that plugins can implement dynamic capabilities that call into backend services which can have an impact on the availability. For now it’s best to not call this too often, but time will better indicate requirements in this area.

Conclusion

The new registration framework allows communicating much more information and enables new solutions. I’m looking forward to what the developers come up with.

Merge time vs Change complexity in Foreman Core

In this post I’m going to dig into some of the data we have on GitHub about how our community functions. In particular, I’m looking to investigate some of the issues raised on Discourse to help draw some conclusions about how to structure Foreman’s code-and-review processes.

The Goal

Let’s start with defining what we’re going to look at. I’m going to focus on Tomer’s 8th point:

“Bigger changes are harder to review.”

Review complexity is a hard thing to quantify, so the size of the change does seem like a sensible proxy. Tomer then goes on to say:

“A while ago I saw a talk by Greg Kroah-Hartman, where he mentioned that they usually don’t accept patches longer than about 100 line change into the stable kernel.”

OK, so that gives us one thing to look at - the number of lines of changes. However, I don’t think that’s the whole story. So I’m going to add a second dimension:

“The number of files touched also increases review complexity.”

I think this stands to reason, in most cases - 100 changed lines in one file are easier to review than 20 changed lines in each of 5 files, because spreading a change across files gives you more inputs/outputs to track. That’s not going to be 100% true (e.g. consider refactoring, where you make the same 20-line change 5 times…), but overall, I think it’s worth investigating.

The Output

We need to think about how to investigate this. Getting data isn’t hard - GitHub has a solid API, we can grab all the PRs for a given time period, and ask what the merge time is, the number of files touched, and the lines of changes. We can define this:

  • merge time = date of merge - date of creation
  • lines of change = lines added + lines removed
  • files touched = self-explanatory (I hope!)
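
These metrics fall straight out of the GitHub API’s pull request JSON (the field names below are GitHub’s actual ones from the single-PR endpoint; the helper itself is illustrative):

```ruby
require 'time'

# Illustrative sketch: derive the three metrics from a GitHub API
# pull request object. merged_at is null for still-open PRs, which
# is exactly the censoring information survival analysis needs.
def pr_metrics(pr)
  merge_seconds = pr['merged_at'] && (Time.parse(pr['merged_at']) - Time.parse(pr['created_at']))
  {
    merge_days:    merge_seconds && merge_seconds / 86_400.0,  # nil => still open (censored)
    loc:           pr['additions'] + pr['deletions'],
    changed_files: pr['changed_files']
  }
end

pr = { 'created_at' => '2019-01-01T00:00:00Z', 'merged_at' => '2019-01-03T00:00:00Z',
       'additions' => 80, 'deletions' => 20, 'changed_files' => 3 }
pr_metrics(pr)  # => { merge_days: 2.0, loc: 100, changed_files: 3 }
```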

How will we study this? A naive approach would be to take an average (probably the median or perhaps the geometric mean), and this would be a good first approximation. However, I feel it misses two things. Firstly, not all PRs are merged: at any given time, some will still be open, and they should be accounted for. Secondly, I think we’re not so interested in the “50% line” that averages would give us. I think it’s far better to think about the classic “80/20” rule we apply so often in development - we’d like 80% of our PRs to be merged in good time. So that leads us to ask a question like this:

“For a given amount of lines-of-change and files-touched, after what time do we have an 80% chance of the PR being merged?”

This is a well-specified quantitative question that we can answer. It leads us to a technique known as Survival Analysis which is ideal for any time-to-event dataset, which is what we have here. Let’s get to it!
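
The core of survival analysis is the Kaplan-Meier estimator. As a toy sketch (hypothetical helpers; the real analysis used a proper statistics package), here is how one could estimate the time by which 80% of PRs are merged, treating still-open PRs as censored:

```ruby
# Toy Kaplan-Meier sketch (not the actual analysis code). Each
# observation is [days_open, merged?]; still-open PRs are
# right-censored at their current age. Rational arithmetic keeps
# the survival probabilities exact for this small sketch.
def kaplan_meier(observations)
  at_risk = observations.size
  survival = Rational(1)
  curve = []
  observations.sort_by(&:first).each do |days, merged|
    survival *= 1 - Rational(1, at_risk) if merged
    curve << [days, survival]
    at_risk -= 1
  end
  curve
end

# First time the probability of still being open drops to 20% or
# below, i.e. an 80% chance of having been merged by then.
def days_until_80pct_merged(observations)
  point = kaplan_meier(observations).find { |_days, survival| survival <= Rational(1, 5) }
  point && point.first
end
```

Censored observations still shrink the at-risk pool, which is how open PRs influence the curve without ever counting as merges.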

Exploration

First, a little housecleaning. We only use PRs from after 2016-10-01. This is the date that GitHub added the “Rebase” and “Squash” options to the UI, before which we cannot reliably tell a merged PR from a closed one (as we used to close and then merge on the command line). Obviously, that means we also want to filter out PRs with state == closed rather than merged, as those weren’t merged.

That still leaves 2235 PRs to study, which is plenty.

Let’s take a look at the data, shall we? Since we’re interested in 3 things (LoC, files, and merge time) then we can do a nice scatter plot:

Exploratory Scatter Plot

Hmm, some really big but also very quick PRs there. Possible outliers?

Outlier PRs
number  title                                     loc    changed_files
5016    i18n - extracting new, pulling from tx    21953  33
5431    i18n - extracting new, pulling from tx    18673  33
5291    i18n - extracting new, pulling from tx     8316  33
6056    Fixes #16294 - Remove noVNC from vendors   7694  25
6056 Fixes #16294 - Remove noVNC from vendors 7694 25

Ah, i18n, of course. Definitely an outlier. Let’s drop PRs matching /^i18n/ and look again…

Cleaned Scatter Plot

Better, but wow, some patches really change a lot of files. Still, this seems usable now, and we can sum up the dataset with a quick table:

PR breakdown: Lines of change (rows) vs. files touched (columns)
lines of change  <10 files  10-49 files  >50 files
<100                  1827           33          0
100-2999               170          163         28
>3000                    1            2          8

For Survival Analysis, it’s helpful to pick some “typical values” to plot the curves with. Looking at this table, it seems to me that we can pick 100 & 2000 lines of change to compare, and 1, 10 & 50 files changed. Note that this is just for plotting - the statistical model considers the whole distribution of values in each dimension that we ask it to consider.

Lines of code only

Here’s our first pair of survival curves:

Lines of Code Analysis

How do we interpret this? Well, the y-axis shows us the probability of a PR still being open after X time. Intuitively, all PRs start off with 100% chance to be open when they are created. What we can see is that the red curve (small changes) decreases far more rapidly than the blue curve. That’s what we expect - small changes are more likely to be merged quickly. (The shaded area is the 95% confidence interval, which is larger for 2000 LoC as we have a smaller dataset - the important point is the areas do not overlap).

If we return to the “80/20 rule”, and ask what’s the time at which a PR has an 80% chance of being merged, then we see a real difference - 25.56 days for small changes, 142.07 days for large ones; nearly 6 times higher!

(For the real stats nerds, the actual model is a CoxPH model, which results in a relative risk increase of 0.0335% per line of code added)

Files touched

We can repeat this analysis for files touched:

Files Changes Analysis

The effect here is definitely smaller, so I’ve plotted three values for files, just to illustrate. Overall, though, it’s the same story - the 80% merge chance here is at 20.36 days for 1 file vs 30.85 days for 10 files - a 50% increase.

Putting it all together

Let’s combine these models:

LoC and Files Analysis

So this is really interesting! The lines are very close together for the lines-of-code axis, so I’ve zoomed in a little (reducing both the x and y axis ranges a little).

What we see is that despite the lines of code appearing to have the larger impact when we studied it separately, the combined effect shows that the number of files is far more important (this is known as Simpson’s paradox). Indeed, if we look at the model, we can quantify how much we expect the merge time to increase by, per line-of-change or file-touched:

Hazard increase per formula term (%)
Term           Coefficient  Confidence (upper)  Confidence (lower)
loc            0.0012857    0.0112858           -0.0087154
changed_files  1.7208675    2.2305595            1.2085184

Here we see the confidence interval for the risk-increase per line of code, or per file touched. Note that now, for this combined model, the LoC risk estimate includes 0 - that is, it is no longer statistically significant. For files though, it’s pretty big - an increase of between 1.2% and 2.2% per file.

Conclusions

We asked two questions - do lines-of-change and files-touched affect the merge time? Unquestionably, the answer is yes. But there are really two results here.

Firstly, in terms of the overall dataset, the big win is keeping the number of files changed to a minimum. Only once that is satisfied does the amount of change have an effect. I think that’s a nice result, as it doesn’t suggest we should all go and break our PRs into 20 smaller PRs - rather, that we should strive to make our changes local to the area of the code we’re working in.

What does this mean for Tomer’s proposal? I think Foreman needs a culture shift. It needs to be OK to merge something non-functional, so long as it’s not a breaking change and is part of an accepted RFC. In this way, code can be reviewed quicker and agreed on for a single part of the codebase (the orchestration, for example), keeping the amount of changes and the number of files touched to a minimum. Then a developer moves on to part N+1 of the RFC, building on what was just merged. This should make the task of reviewing PRs a lighter burden.

There’s probably more we could study here. Merge time by Redmine category is an obvious one, as is looking at other GitHub repos (perhaps katello/katello?). Also adding in closed as an end-state and doing a “competing risk” model could be fun. But that’s for a future analysis - suggestions are welcome!

Thanks for reading! As this is my first post in my new role, I’m looking forward to feedback on this :)

Foreman Community Newsletter (February 2019)

We're back! 2 months of goodies, including the 1.21 release, my departure as community lead, conference roundup, and RFCs that need your input!

Foreman 1.22.0 has been released! Follow the quick start to install it.

Foreman 1.21.3 has been released! Follow the quick start to install it.

Foreman 1.20.3 has been released! Follow the quick start to install it.