
Add task metadata to /apps endpoint #118


Merged

philipnrmn merged 16 commits into master from philipnrmn/app-task-ids on Sep 19, 2017
Conversation

philipnrmn
Collaborator

This PR is the successor to #114. It adds the same data (task ID and task name) to the
/metrics/v0/containers/<container-id>/app endpoint that #114 added to the /metrics/v0/containers/<container-id> endpoint.

It had to be separated because the two collectors don't share data. Since task IDs are derived from the agent state, adding them to the agent collector was not especially complicated. Adding them to the framework collector was a bit trickier.

In order to do so, we add a mapping between container IDs and references to task info (ContainerTaskRels), held on the agent collector. From dcos-metrics.go where both are initiated, we then pass a reference to the ContainerTaskRels object into the framework collector. When the framework collector receives a container_id tag, it looks up the associated task metadata and adds it to the metrics message.
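
Here is a minimal, self-contained sketch of that shape. It is illustrative only, not the project's exact code: the TaskInfo fields and the wiring in main stand in for what dcos-metrics.go and the two collectors actually do.

package main

import (
	"fmt"
	"sync"
)

// TaskInfo stands in for the task metadata attached to container metrics;
// the real struct lives in the mesos agent collector.
type TaskInfo struct {
	ID   string
	Name string
}

// ContainerTaskRels maps container IDs to task metadata. The agent collector
// repopulates it on every poll of agent state; the framework collector only
// reads from it when it sees a container_id tag.
type ContainerTaskRels struct {
	sync.Mutex
	rels map[string]*TaskInfo
}

func NewContainerTaskRels() *ContainerTaskRels {
	return &ContainerTaskRels{rels: map[string]*TaskInfo{}}
}

func (ctr *ContainerTaskRels) Set(containerID string, info *TaskInfo) {
	ctr.Lock()
	defer ctr.Unlock()
	ctr.rels[containerID] = info
}

func (ctr *ContainerTaskRels) Get(containerID string) *TaskInfo {
	ctr.Lock()
	defer ctr.Unlock()
	return ctr.rels[containerID]
}

func main() {
	// In dcos-metrics.go a single instance is created and a reference is
	// handed to both collectors.
	rels := NewContainerTaskRels()

	// Agent collector side: populate the mapping from agent state.
	rels.Set("container-1", &TaskInfo{ID: "emitter-1.abc123", Name: "emitter-1"})

	// Framework collector side: look up metadata for an incoming container_id tag.
	if info := rels.Get("container-1"); info != nil {
		fmt.Println(info.ID, info.Name)
	}
}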

We will reference TaskInfo in a map of container IDs to task info in a
later commit.
ContainerTaskRels has a single member, a map of container IDs to
TaskInfo references. It has a getter and a setter for easy, threadsafe
access to the relationship map, and an update method which
populates it from agentState.
The new ContainerTaskRels argument provides access to the task metadata
for each container.
We add a new mock object with appropriate metadata and supply it to
avroRecord.extract(), checking that the task IDs in the
ContainerTaskRels are the same ones that correspond to the container
ID.
This is orthogonal to the intent of this PR, however it's a small change
and seemed like a quick win.
This reference can be provided to the extract method from the framework
collector.
We update the task mapping after every new fetch from agent state.
Now the framework collector has access to the ContainerTaskRels object
which belongs to the mesos agent collector.
Contributor

@collinvandyck left a comment


I get a weird feeling from reading this PR, for a couple of reasons:

First, I feel like ContainerTaskRels should be built/instantiated outside of the agent collector and passed in to the constructors of both the agent collector and the framework collector. I can't really put my finger on it, but it is weird to me to ask for the agent collector's "rels" variable and then give it to the framework collector.

Second, it seems like there's a race condition here. Because the agent collector and framework collector run independently of each other, we can't guarantee that the agent collector runs first before the framework collector does. This suggests that we will get a partitioned time series from time to time. For example:

  • framework collector runs, no data in the rels yet
  • framework collector produces a time series with blank TaskId and a blank TaskName
  • agent collector runs
  • framework collector runs, now there is data in the rels
  • framework collector produces a new time series for what is essentially the same metric, because it now has a non-empty TaskID and a non-empty TaskName.

I'm not sure how big of a deal it is, but it seems like it could potentially double the number of unique time series coming out of the framework collector. I wonder if it makes sense to not emit a time series from the framework collector if no taskID or taskName could be found?

@@ -53,6 +54,53 @@ type Collector struct {
metricsChan chan producers.MetricsMessage
nodeInfo collectors.NodeInfo
timestamp int64

// ContainerTaskRels defines the relationship between containers and tasks.
type ContainerTaskRels struct {
Contributor


I'd probably put the typedef and the associated methods into a new file container_task_rels.go

// Get is a utility method which handles the mutex lock and abstracts the inner
// map in ContainerTaskRels away. If no task info is available for the supplied
// key, returns nil.
func (ctr *ContainerTaskRels) Get(key string) *TaskInfo {
Contributor


Is key a containerID? If so, I'd probably rename it to containerID.

func (ctr *ContainerTaskRels) Get(key string) *TaskInfo {
ctr.Lock()
defer ctr.Unlock()
if ctr.rels != nil {
Contributor


Instead of worrying about whether or not rels had been initialized, I think a constructor method for the type would be a better way of creating instances of this type.

func NewContainerTaskRels() *ContainerTaskRels {
    return &ContainerTaskRels{rels: make(map[string]*TaskInfo)}
}

ctr.rels = map[string]*TaskInfo{}
}
ctr.rels[key] = info
ctr.Unlock()
Contributor


Earlier you used defer ctr.Unlock(). Since the lock/unlock in this method happens at the method boundaries, maybe it makes more sense to do it that way here too, for clarity?

for _, e := range f.Executors {
for _, t := range e.Tasks {
for _, s := range t.Statuses {
rels[s.ContainerStatusInfo.ID.Value] = &t
Contributor


This looks like a bug, since you're taking the address of a range variable. Go's range construct only allocates the memory once to hold the value, and each iteration overwrites the same location in memory.

So let's assume that you have 2 values in e.Tasks...

On the first iteration over e.Tasks, Go writes the Task into t, which has memory address M.
On the second iteration over e.Tasks, Go writes the next Task into t, which is still the same memory address M.

So when you have multiple tasks, you're taking the address of the same memory location.

What you probably want to do here is force a copy of the variable:

copy := t
rels[s.ContainerStatusInfo.ID.Value] = &copy

It should be easy to write a unit test which verifies this behavior (with multiple e.Tasks).

You can see this in action at https://play.golang.org/p/s7zGgU_Nz6

Alternatively, if it is acceptable and would not impact other things, you could have rels instead be a map[string]TaskInfo instead of a map[string]*TaskInfo, which would force the copy of the range variable and avoid this scenario.

This is a sharp edge of Go that I've skinned my knees on more times than I would like to count 😬
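
To make the failure mode concrete, here is a small, self-contained sketch (the TaskInfo stand-in and the IDs are made up for illustration). The buggy output described applies to Go versions available at the time of this PR; since Go 1.22 the loop variable is per-iteration and the aliasing no longer occurs.

package main

import "fmt"

// TaskInfo is a minimal stand-in for the real type; values are made up.
type TaskInfo struct {
	ID string
}

func main() {
	tasks := []TaskInfo{{ID: "task-1"}, {ID: "task-2"}}

	// Buggy: &t is the address of the loop variable, which (pre Go 1.22) is a
	// single location reused on every iteration, so both entries end up
	// pointing at the last task and this prints "task-2 task-2".
	buggy := map[string]*TaskInfo{}
	for _, t := range tasks {
		buggy[t.ID] = &t
	}
	fmt.Println(buggy["task-1"].ID, buggy["task-2"].ID)

	// Fixed: copy the loop variable first and take the address of the copy,
	// so each entry gets its own TaskInfo and this prints "task-1 task-2".
	fixed := map[string]*TaskInfo{}
	for _, t := range tasks {
		c := t
		fixed[t.ID] = &c
	}
	fmt.Println(fixed["task-1"].ID, fixed["task-2"].ID)
}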

Collaborator Author


I absolutely follow the logic of what you're saying, but I'm a little puzzled because I can't force this bug to emerge. I am running four identical applications on the same node, each called emitter-N where N is in the range 1-4.

It seems to me that if this bug exists, the metrics results for all four containers should have the same name and task ID. But their task IDs remain distinct, correct and consistent.

Contributor


Since the loop goes first over executors, and then tasks, this bug will not appear if there is only ever one task per executor, I think. Is it possible that an executor can ever have more than one task?

@@ -288,6 +288,37 @@ func TestGetAgentState(t *testing.T) {
})
}

func TestUpdateContainerRels(t *testing.T) {
Contributor


Probably want to add multiple tasks to the mockAgentState here to help prevent a regression of the range variable address bug (if it is a bug :) )
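
As a sketch of what such a regression test might look like, here is a standalone version with made-up stand-in types; the real test would build the project's mockAgentState with multiple tasks under one executor and call update().

package main

import "testing"

// Minimal stand-ins for the agent state types; illustrative only.
type task struct {
	ID          string
	ContainerID string
}

type executor struct {
	Tasks []task
}

// buildRels mirrors the update() logic: store a pointer to a fresh copy,
// never to the range variable itself.
func buildRels(executors []executor) map[string]*task {
	rels := map[string]*task{}
	for _, e := range executors {
		for _, t := range e.Tasks {
			c := t
			rels[t.ContainerID] = &c
		}
	}
	return rels
}

// TestMultipleTasksPerExecutor guards against the range-variable address bug:
// two tasks under one executor must map to two distinct tasks.
func TestMultipleTasksPerExecutor(t *testing.T) {
	executors := []executor{{Tasks: []task{
		{ID: "task-1", ContainerID: "container-1"},
		{ID: "task-2", ContainerID: "container-2"},
	}}}
	rels := buildRels(executors)
	if got := rels["container-1"].ID; got != "task-1" {
		t.Fatalf("container-1 mapped to %q, want task-1", got)
	}
	if got := rels["container-2"].ID; got != "task-2" {
		t.Fatalf("container-2 mapped to %q, want task-2", got)
	}
}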


// Set adds or updates an entry to ContainerTaskRels and, if necessary,
// initiates the inner map. It is only currently used in tests.
func (ctr *ContainerTaskRels) Set(key string, info *TaskInfo) {
Contributor


Would rename key in this case to containerID for readability.

@philipnrmn
Collaborator Author

Hi @collinvandyck, thanks for all the feedback!

I just wanted to address this bit from your comment above:

I wonder if it makes sense to not emit a time series from the framework collector if no taskID or taskName could be found?

Unfortunately that is not possible, because some containers don't have associated tasks. When dealing with a container ID in isolation, it's not possible to know whether to expect an associated task. You could simply not return metrics at all when agentState is not populated, but that doesn't fix the problem either: if a freshly launched container emits metrics, we will need to offer them in the API before we know the associated task ID, if one even exists.

I don't see any way around this unfortunately.

@collinvandyck
Contributor

@philipnrmn perhaps we should unite the agent and framework polling mechanism so that the agent always polls first and then, once successful, the framework gets polled next, supplying data from the former to the latter?

@philipnrmn
Collaborator Author

@collinvandyck that is unfortunately also not possible; the framework collector does not poll - it receives data from apps emitting statsd messages over UDP (via the mesos module). I don't believe it is possible to force consistency.

In scripts/test.sh we naively run go get to update golint. A new rule
has just been added to golint (golang/lint#319)
and I don't want to complicate my current PR (#118).

I will revert this commit, fix any outstanding nil-returns and consider
pinning golint in a future PR.
Rather than initiating ContainerTaskRels in the framework collector, we
now initiate it in dcos-metrics.go and pass a reference into each
collector instead.
@philipnrmn
Collaborator Author

Hey @collinvandyck, I addressed all your feedback. Would we be OK to merge?

@@ -95,7 +95,10 @@ func (ctr *ContainerTaskRels) update(as agentState) {
for _, e := range f.Executors {
for _, t := range e.Tasks {
for _, s := range t.Statuses {
rels[s.ContainerStatusInfo.ID.Value] = &t
rels[s.ContainerStatusInfo.ID.Value] = &TaskInfo{
Contributor


👍

func (ctr *ContainerTaskRels) Get(containerID string) *TaskInfo {
ctr.Lock()
defer ctr.Unlock()
if ctr.rels != nil {
Contributor


Do you need this anymore if you're already constructing the type with a non-nil map, here and otherwise?

Collaborator Author


No, I guess not. Just disguising my laziness behind a front of belt-and-braces.

@collinvandyck
Contributor

Hi @philipnrmn, yeah, good to go. One or two minor comments here and there.

@philipnrmn
Collaborator Author

aaaaand... fixed ✨

philipnrmn merged commit e4a9c2f into master Sep 19, 2017
philipnrmn deleted the philipnrmn/app-task-ids branch September 19, 2017 03:07
philipnrmn added a commit that referenced this pull request Sep 19, 2017
* Export TaskInfo from the mesos agent collector

We will reference TaskInfo in a map of container IDs to task info in a
later commit.

* Add a struct for container->task relationships

ContainerTaskRels has a single member, a map of container IDs to
TaskInfo references. It has a getter and a setter for easy, threadsafe
access to the relationship map, and an update method which
populates it from agentState.

* Return task_id and task_name in /app output

The new ContainerTaskRels argument provides access to the task metadata
for each container.

* Add tests for extracting a record with task data

We add a new mock object with appropriate metadata and supply it to
avroRecord.extract(), checking that the task IDs in the
ContainerTaskRels are the same ones that correspond to the container
ID.

* Add tests for executor ID and framework ID

This is orthogonal to the intent of this PR, however it's a small change
and seemed like a quick win.

* Refer to ContainerTaskRels in Framework collector

This reference can be provided to the extract method from the framework
collector.

* Pass ref to ContainerTaskRels into transform

* Update task mapping from agent collector

We update the task mapping after every new fetch from agent state.

* Pass reference to mapping into framework collector

Now the framework collector has access to the ContainerTaskRels object
which belongs to the mesos agent collector.

* Clarify arg name

* Defer unlock for consistency

* Temporarily skip golint in build

In scripts/test.sh we naively run go get to update golint. A new rule
has just been added to golint (golang/lint#319)
and I don't want to complicate my current PR (#118).

I will revert this commit, fix any outstanding nil-returns and consider
pinning golint in a future PR.

* Create new TaskInfo object to avoid memory issues

* Create ContainerTaskRels in dcos-metrics.go

Rather than initiating ContainerTaskRels in the framework collector, we
now initiate it in dcos-metrics.go and pass a reference into each
collector instead.

* Initiate ContainerTaskRels with a utility method

* Remove unnecessary nil check