
Add task metadata to /apps endpoint #118


Merged

philipnrmn merged 16 commits into master from philipnrmn/app-task-ids on Sep 19, 2017
Conversation

philipnrmn
Collaborator

This PR is the successor to #114. It adds the same data (task ID and task name) to the
/metrics/v0/containers/<container-id>/app endpoint that #114 added to the /metrics/v0/containers/<container-id> endpoint.

It had to be separated because the two collectors don't share data. Since task IDs are derived from the agent state, adding them to the agent collector was not especially complicated. Adding them to the framework collector was a bit trickier.

In order to do so, we add a mapping between container IDs and references to task info (ContainerTaskRels), held on the agent collector. From dcos-metrics.go where both are initiated, we then pass a reference to the ContainerTaskRels object into the framework collector. When the framework collector receives a container_id tag, it looks up the associated task metadata and adds it to the metrics message.
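
Here is a minimal, self-contained sketch of that shape. It is illustrative only, not the project's exact code: the TaskInfo fields and the wiring in main stand in for what dcos-metrics.go and the two collectors actually do.

package main

import (
	"fmt"
	"sync"
)

// TaskInfo stands in for the task metadata attached to container metrics;
// the real struct lives in the mesos agent collector.
type TaskInfo struct {
	ID   string
	Name string
}

// ContainerTaskRels maps container IDs to task metadata. The agent collector
// repopulates it on every poll of agent state; the framework collector only
// reads from it when it sees a container_id tag.
type ContainerTaskRels struct {
	sync.Mutex
	rels map[string]*TaskInfo
}

func NewContainerTaskRels() *ContainerTaskRels {
	return &ContainerTaskRels{rels: map[string]*TaskInfo{}}
}

func (ctr *ContainerTaskRels) Set(containerID string, info *TaskInfo) {
	ctr.Lock()
	defer ctr.Unlock()
	ctr.rels[containerID] = info
}

func (ctr *ContainerTaskRels) Get(containerID string) *TaskInfo {
	ctr.Lock()
	defer ctr.Unlock()
	return ctr.rels[containerID]
}

func main() {
	// In dcos-metrics.go a single instance is created and a reference is
	// handed to both collectors.
	rels := NewContainerTaskRels()

	// Agent collector side: populate the mapping from agent state.
	rels.Set("container-1", &TaskInfo{ID: "emitter-1.abc123", Name: "emitter-1"})

	// Framework collector side: look up metadata for an incoming container_id tag.
	if info := rels.Get("container-1"); info != nil {
		fmt.Println(info.ID, info.Name)
	}
}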

We will reference TaskInfo in a map of container IDs to task info in a
later commit.
ContainerTaskRels has a single member, a map of container IDs to
TaskInfo references. It has a getter and a setter for easy, threadsafe
access to the relationship map, and an update method which
populates it from agentState.
The new ContainerTaskRels argument provides access to the task metadata
for each container.
We add a new mock object with appropriate metadata and supply it to
avroRecord.extract(), checking that the task IDs in the
ContainerTaskRels are the same ones that correspond to the container
ID.
This is orthogonal to the intent of this PR, however it's a small change
and seemed like a quick win.
This reference can be provided to the extract method from the framework
collector.
We update the task mapping after every new fetch from agent state.
Now the framework collector has access to the ContainerTaskRels object
which belongs to the mesos agent collector.
Contributor

@collinvandyck left a comment


I get a weird feeling from reading this PR, for a couple of reasons:

First, I feel like ContainerTaskRels should be built/instantiated outside of the agent collector and passed in to the constructors of both the agent collector and the framework collector. I can't really put my finger on it, but it is weird to me to ask for the agent collector's "rels" variable and then give it to the framework collector.

Second, it seems like there's a race condition here. Because the agent collector and framework collector run independently of each other, we can't guarantee that the agent collector runs first before the framework collector does. This suggests that we will get a partitioned time series from time to time. For example:

  • framework collector runs, no data in the rels yet
  • framework collector produces a time series with blank TaskId and a blank TaskName
  • agent collector runs
  • framework collector runs, now there is data in the rels
  • framework collector produces a new time series for what is essentially the same metric, because it now has a non-empty TaskID and a non-empty TaskName.

I'm not sure how big of a deal it is, but it seems like it could potentially double the number of unique time series coming out of the framework collector. I wonder if it makes sense to not emit a time series from the framework collector if no taskID or taskName could be found?

@@ -53,6 +54,53 @@ type Collector struct {
metricsChan chan producers.MetricsMessage
nodeInfo collectors.NodeInfo
timestamp int64

// ContainerTaskRels defines the relationship between containers and tasks.
type ContainerTaskRels struct {
Contributor


I'd probably put the typedef and the associated methods into a new file container_task_rels.go

// Get is a utility method which handles the mutex lock and abstracts the inner
// map in ContainerTaskRels away. If no task info is available for the supplied
// key, returns nil.
func (ctr *ContainerTaskRels) Get(key string) *TaskInfo {
Contributor


Is key a containerID? If so, I'd probably rename it to containerID.

func (ctr *ContainerTaskRels) Get(key string) *TaskInfo {
ctr.Lock()
defer ctr.Unlock()
if ctr.rels != nil {
Contributor


Instead of worrying about whether or not rels had been initialized, I think a constructor method for the type would be a better way of creating instances of this type.

func NewContainerTaskRels() *ContainerTaskRels {
    return &ContainerTaskRels{rels: make(map[string]*TaskInfo)}
}

ctr.rels = map[string]*TaskInfo{}
}
ctr.rels[key] = info
ctr.Unlock()
Contributor


Earlier you used defer ctr.Unlock(). Since the lock/unlock in this method happens at the method boundaries, maybe it makes more sense to do it that way here too, for clarity?

for _, e := range f.Executors {
for _, t := range e.Tasks {
for _, s := range t.Statuses {
rels[s.ContainerStatusInfo.ID.Value] = &t
Contributor


This looks like a bug, since you're taking the address of a range variable. Go's range construct only allocates the memory once to hold the value, and each iteration overwrites the same location in memory.

So let's assume that you have 2 values in e.Tasks...

On the first iteration over e.Tasks, Go writes the Task into t, which has memory address M.
On the second iteration over e.Tasks, Go writes the next Task into t, which is still the same memory address M.

So when you have multiple tasks, you're taking the address of the same memory location.

What you probably want to do here is force a copy of the variable:

copy := t
rels[s.ContainerStatusInfo.ID.Value] = &copy

It should be easy to write a unit test which verifies this behavior (with multiple e.Tasks).

You can see this in action at https://play.golang.org/p/s7zGgU_Nz6

Alternatively, if it is acceptable and would not impact other things, you could have rels instead be a map[string]TaskInfo instead of a map[string]*TaskInfo, which would force the copy of the range variable and avoid this scenario.

This is a sharp edge of Go that I've skinned my knees on more times than I would like to count 😬
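
To make the failure mode concrete, here is a small, self-contained sketch (the TaskInfo stand-in and the IDs are made up for illustration). The buggy output described applies to Go versions available at the time of this PR; since Go 1.22 the loop variable is per-iteration and the aliasing no longer occurs.

package main

import "fmt"

// TaskInfo is a minimal stand-in for the real type; values are made up.
type TaskInfo struct {
	ID string
}

func main() {
	tasks := []TaskInfo{{ID: "task-1"}, {ID: "task-2"}}

	// Buggy: &t is the address of the loop variable, which (pre Go 1.22) is a
	// single location reused on every iteration, so both entries end up
	// pointing at the last task and this prints "task-2 task-2".
	buggy := map[string]*TaskInfo{}
	for _, t := range tasks {
		buggy[t.ID] = &t
	}
	fmt.Println(buggy["task-1"].ID, buggy["task-2"].ID)

	// Fixed: copy the loop variable first and take the address of the copy,
	// so each entry gets its own TaskInfo and this prints "task-1 task-2".
	fixed := map[string]*TaskInfo{}
	for _, t := range tasks {
		c := t
		fixed[t.ID] = &c
	}
	fmt.Println(fixed["task-1"].ID, fixed["task-2"].ID)
}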

Collaborator Author


I absolutely follow the logic of what you're saying, but I'm a little puzzled because I can't force this bug to emerge. I am running four identical applications on the same node, each called emitter-N where N is in the range 1-4.

It seems to me that if this bug exists, the metrics results for all four containers should have the same name and task ID. But their task IDs remain distinct, correct and consistent.

Contributor


Since the loop goes first over executors, and then tasks, this bug will not appear if there is only ever one task per executor, I think. Is it possible that an executor can ever have more than one task?

@@ -288,6 +288,37 @@ func TestGetAgentState(t *testing.T) {
})
}

func TestUpdateContainerRels(t *testing.T) {
Contributor


Probably want to add multiple tasks to the mockAgentState here to help prevent a regression of the range variable address bug (if it is a bug :) )
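
As a sketch of what such a regression test might look like, here is a standalone version with made-up stand-in types; the real test would build the project's mockAgentState with multiple tasks under one executor and call update().

package main

import "testing"

// Minimal stand-ins for the agent state types; illustrative only.
type task struct {
	ID          string
	ContainerID string
}

type executor struct {
	Tasks []task
}

// buildRels mirrors the update() logic: store a pointer to a fresh copy,
// never to the range variable itself.
func buildRels(executors []executor) map[string]*task {
	rels := map[string]*task{}
	for _, e := range executors {
		for _, t := range e.Tasks {
			c := t
			rels[t.ContainerID] = &c
		}
	}
	return rels
}

// TestMultipleTasksPerExecutor guards against the range-variable address bug:
// two tasks under one executor must map to two distinct tasks.
func TestMultipleTasksPerExecutor(t *testing.T) {
	executors := []executor{{Tasks: []task{
		{ID: "task-1", ContainerID: "container-1"},
		{ID: "task-2", ContainerID: "container-2"},
	}}}
	rels := buildRels(executors)
	if got := rels["container-1"].ID; got != "task-1" {
		t.Fatalf("container-1 mapped to %q, want task-1", got)
	}
	if got := rels["container-2"].ID; got != "task-2" {
		t.Fatalf("container-2 mapped to %q, want task-2", got)
	}
}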


// Set adds or updates an entry to ContainerTaskRels and, if necessary,
// initiates the inner map. It is only currently used in tests.
func (ctr *ContainerTaskRels) Set(key string, info *TaskInfo) {
Contributor


Would rename key in this case to containerID for readability.

@philipnrmn
Collaborator Author

Hi @collinvandyck, thanks for all the feedback!

I just wanted to address this bit from your comment above:

I wonder if it makes sense to not emit a time series from the framework collector if no taskID or taskName could be found?

Unfortunately that is not possible, because some containers don't have associated tasks. When dealing with a container ID in isolation, it's not possible to know whether to expect an associated task. You could simply not return metrics at all when agentState is not populated, but that doesn't fix the problem either: if a freshly launched container emits metrics, we will need to offer them in the API before we know the associated task ID, if one even exists.

I don't see any way around this unfortunately.

@collinvandyck
Contributor

@philipnrmn perhaps we should unite the agent and framework polling mechanism so that the agent always polls first and then, once successful, the framework gets polled next, supplying data from the former to the latter?

@philipnrmn
Collaborator Author

@collinvandyck that is unfortunately also not possible; the framework collector does not poll - it receives data from apps emitting statsd messages over UDP (via the mesos module). I don't believe it is possible to force consistency.

In scripts/test.sh we naively run go get to update golint. A new rule
has just been added to golint (golang/lint#319)
and I don't want to complicate my current PR (#118).

I will revert this commit, fix any outstanding nil-returns and consider
pinning golint in a future PR.
Rather than initiating ContainerTaskRels in the framework collector, we
now initiate it in dcos-metrics.go and pass a reference into each
collector instead.
@philipnrmn
Collaborator Author

Hey @collinvandyck, I addressed all your feedback. Would we be OK to merge?

@@ -95,7 +95,10 @@ func (ctr *ContainerTaskRels) update(as agentState) {
for _, e := range f.Executors {
for _, t := range e.Tasks {
for _, s := range t.Statuses {
rels[s.ContainerStatusInfo.ID.Value] = &t
rels[s.ContainerStatusInfo.ID.Value] = &TaskInfo{
Contributor


👍

func (ctr *ContainerTaskRels) Get(containerID string) *TaskInfo {
ctr.Lock()
defer ctr.Unlock()
if ctr.rels != nil {
Contributor


Do you need this anymore if you're already constructing the type with a non-nil map, here and otherwise?

Collaborator Author


No, I guess not. Just disguising my laziness behind a front of belt-and-braces.

@collinvandyck
Contributor

Hi @philipnrmn, yeah, good to go. One or two minor comments here and there.

@philipnrmn
Collaborator Author

aaaaand... fixed ✨

philipnrmn merged commit e4a9c2f into master Sep 19, 2017
philipnrmn deleted the philipnrmn/app-task-ids branch September 19, 2017 03:07
philipnrmn added a commit that referenced this pull request Sep 19, 2017
* Export TaskInfo from the mesos agent collector

We will reference TaskInfo in a map of container IDs to task info in a
later commit.

* Add a struct for container->task relationships

ContainerTaskRels has a single member, a map of container IDs to
TaskInfo references. It has a getter and a setter for easy, threadsafe
access to the relationship map, and an update method which
populates it from agentState.

* Return task_id and task_name in /app output

The new ContainerTaskRels argument provides access to the task metadata
for each container.

* Add tests for extracting a record with task data

We add a new mock object with appropriate metadata and supply it to
avroRecord.extract(), checking that the task IDs in the
ContainerTaskRels are the same ones that correspond to the container
ID.

* Add tests for executor ID and framework ID

This is orthogonal to the intent of this PR, however it's a small change
and seemed like a quick win.

* Refer to ContainerTaskRels in Framework collector

This reference can be provided to the extract method from the framework
collector.

* Pass ref to ContainerTaskRels into transform

* Update task mapping from agent collector

We update the task mapping after every new fetch from agent state.

* Pass reference to mapping into framework collector

Now the framework collector has access to the ContainerTaskRels object
which belongs to the mesos agent collector.

* Clarify arg name

* Defer unlock for consistency

* Temporarily skip golint in build

In scripts/test.sh we naively run go get to update golint. A new rule
has just been added to golint (golang/lint#319)
and I don't want to complicate my current PR (#118).

I will revert this commit, fix any outstanding nil-returns and consider
pinning golint in a future PR.

* Create new TaskInfo object to avoid memory issues

* Create ContainerTaskRels in dcos-metrics.go

Rather than initiating ContainerTaskRels in the framework collector, we
now initiate it in dcos-metrics.go and pass a reference into each
collector instead.

* Initiate ContainerTaskRels with a utility method

* Remove unnecessary nil check