Bug #17061

open

ability to reduce or otherwise process job data before saving to database

Added by Thomas McKay over 7 years ago. Updated over 7 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Difficulty:
Triaged:
Fixed in Releases:
Found in Releases:

Description

There are some cases where the amount of data stored with a task is very large. This could be the input or output JSON. In other cases there may be a desire to obfuscate data prior to saving (passwords?). The ability to control what and how much data is stored for tasks would allow the developer to avoid these issues.

Actions #1

Updated by Ivan Necas over 7 years ago

For the passwords, there is already #15624 available. Could you give me some (the more the better) examples of actual actions? So far, I was thinking about an archiving phase that would compress/reduce the data after the task is finished. Is there a case where the amount of data during the run is already an issue?

Actions #2

Updated by Thomas McKay over 7 years ago

There are cases where the incoming data to be processed is large. Two examples that come immediately to mind are virt-who reporting hypervisors (sometimes tens of thousands of hosts and guests) and reporting facts from subscription-manager (not as large, but still a large chunk of JSON).

An example of working around this can be found in this PR https://github.com/Katello/katello/pull/6405

Actions #3

Updated by Ivan Necas over 7 years ago

Do you have a specific idea about the mechanism? Foreman-tasks itself doesn't dictate how much data is stored in the task, and the PR you mention found a way to do that inside Dynflow (https://github.com/Katello/katello/pull/6405#issuecomment-257471640), which doesn't seem too complicated.

One downside of the approach in [1] is that we lose data that is a potential input for debugging. So there are two options:

1. Introduce this as a generic mechanism in foreman-tasks, very similar to the one in [1], with a setting that turns the trimming on/off (a rough sketch follows below). Another improvement would be to trim only after we know the finalize phase finished successfully: with the current implementation, we might trim the data and then have something else in finalize fail.

2. Introduce a "compression" mechanism into the cleanup that would not trim the data immediately after finish, but later via some cron job.

I tend to lean towards 1, as it still allows getting to the data if needed, while keeping the actions smaller by default.

[1] - https://github.com/Katello/katello/pull/6405
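
A minimal sketch of what option 1 might look like, purely as an illustration: the module name, trim_keys, and the environment-variable toggle are all assumptions, not the actual foreman-tasks/Dynflow API.

# Hypothetical sketch: trim declared keys from an action's stored output,
# but only when trimming is enabled and the task finished successfully.
module TrimmableOutput
  TRIMMED = '<trimmed>'.freeze

  # Action authors would override this to list the heavy output keys.
  def trim_keys
    []
  end

  # Assumed toggle; the real switch would live in foreman-tasks settings.
  def trimming_enabled?
    ENV.fetch('TASKS_TRIM_DATA', 'true') == 'true'
  end

  # Replace the large values with a marker so it stays visible that data
  # was stored there before trimming.
  def trim_output!(output)
    return output unless trimming_enabled?
    trim_keys.each { |key| output[key] = TRIMMED if output.key?(key) }
    output
  end
end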

Actions #4

Updated by Ivan Necas over 7 years ago

An additional option could be to clean not just the data in actions, but perhaps even whole actions: keep only some metadata about how big the task was, drop all the execution steps/actions, and keep the task itself around for auditing/notification purposes.
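
As a rough illustration of that idea (the task/step shapes here are assumptions, not the real foreman-tasks schema), a cleanup pass might replace a finished task's steps with a small summary:

require 'json'

# Hypothetical: keep the task row for auditing/notifications, but drop the
# individual steps and remember only how big the task was.
def compress_task(task)
  summary = {
    :step_count  => task[:steps].size,
    :total_bytes => task[:steps].sum { |step| step[:output].to_json.bytesize }
  }
  task.merge(:steps => [], :summary => summary)
end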

Actions #5

Updated by Justin Sherrill over 7 years ago

Yeah, I agree. I think formalizing the trimming would be fine. I had a bit of trouble making it generic without making it super complicated. I think ideally I could just define something like:

def output_trimmed
  { :trim_this => [:even, :supports, :nested, { :or_even_doubly => [:nested] }] }
end

and you'd likely want it for both output and input. I would also like some indication that a value was trimmed versus never existing (as I did in my PR).
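
Just to illustrate how a nested spec like that could be applied (apply_trim_spec and the :trimmed marker are made-up names for this sketch, not what the PR actually does), one possible recursive walk:

def apply_trim_spec(data, spec)
  return data unless data.is_a?(Hash)
  spec.each do |entry|
    if entry.is_a?(Hash)
      entry.each { |key, nested| apply_trim_spec(data[key], nested) }
    elsif data.key?(entry)
      # Leave a marker so a trimmed key can be told apart from a missing one.
      data[entry] = :trimmed
    end
  end
  data
end

output = { :even => 'x' * 10_000, :or_even_doubly => { :nested => [1] * 5_000 }, :keep => 'small' }
apply_trim_spec(output, [:even, :supports, :nested, { :or_even_doubly => [:nested] }])
# => { :even => :trimmed, :or_even_doubly => { :nested => :trimmed }, :keep => "small" }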

I could see trimming steps being useful (especially for large ~300-repo capsule syncs). I think with that approach you'd want to trim steps only after some amount of time. Even restricting it to only successful tasks might make support/debugging difficult.
