New design for logs involving tool calls (and maybe tool classes) #1080

simonw · 2025-05-25T00:09:04Z

I am going to design the log output next, since that can inform the database schema.

Originally posted by @simonw in #1059

simonw · 2025-05-25T00:09:15Z

The list of all tools made available to a model is important context, because even the fact that a model did not choose to execute a tool is important information.

Log output should emphasize actual tool executions more than it does at the moment though.

Also: right now a sequence of prompt->tool->tool-response->tool->tool-response-and-reply is represented as 3 full prompt-and-responses in the logs, even though the user only said one thing and got one response at the end. Maybe the header levels should adjust for that to help represent some responses as automated parts of that flow?
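One way to picture that grouping is a minimal sketch, assuming a hypothetical `LoggedResponse` record in which an automated follow-up has no prompt. Neither the record shape nor the function names come from the actual llm code; they only illustrate how header levels could be chosen.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LoggedResponse:
    # Hypothetical record shape for illustration, not the real llm schema.
    prompt: Optional[str]  # None for automated follow-ups ("-- none --")
    tool_calls: List[str] = field(default_factory=list)
    tool_results: List[str] = field(default_factory=list)
    text: str = ""


def group_into_turns(responses):
    """Group consecutive responses so that each turn starts with a user prompt."""
    turns = []
    for response in responses:
        if response.prompt is not None or not turns:
            # A real user prompt (or the very first entry) opens a new turn.
            turns.append([response])
        else:
            # No prompt: an automated step that carries tool results.
            turns[-1].append(response)
    return turns


def heading_level(position_in_turn):
    """Top-level heading for the user's prompt, one level deeper for automated steps."""
    return 1 if position_in_turn == 0 else 2
```

For a single prompt that triggers two tool rounds, this would yield one turn containing three responses, with only the first rendered at the top heading level.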

simonw · 2025-05-25T00:10:59Z

Just running these two commands:

llm -T simple_eval '1234 * 123415'
llm -c '* 33'

Produced the following sequence of four log entries, grouped under one conversation. I added <hr> just to help clarify where each one starts and ends:

2025-05-24T23:50:38 conversation: 01jw2b9acve2hrhy4nsxn30b8a id: 01jw2b9aczcwxhdjk4jha4757m

Model: gpt-4.1-mini

Prompt

1234 * 123415

Tools

simple_eval: 64713bc9bfec5e5a6239b9145943c577f7781b81e2e7e129b5c1e665074926f9

Evaluate a simple expression using the simpleeval library.

Arguments: {"expression": {"type": "string"}}

Response

Tool calls

simple_eval: call_PxGKU2Z1csgz4ZRSU9xldYmB

Arguments: {"expression": "1234 * 123415"}

2025-05-24T23:50:39

Prompt

-- none --

Tools

simple_eval: 64713bc9bfec5e5a6239b9145943c577f7781b81e2e7e129b5c1e665074926f9

Evaluate a simple expression using the simpleeval library.

Arguments: {"expression": {"type": "string"}}

Tool results

simple_eval: call_PxGKU2Z1csgz4ZRSU9xldYmB

152294110

Response

1234 multiplied by 123415 is 152,294,110.

2025-05-24T23:50:46

Prompt

* 33

Tools

simple_eval: 64713bc9bfec5e5a6239b9145943c577f7781b81e2e7e129b5c1e665074926f9

Evaluate a simple expression using the simpleeval library.

Arguments: {"expression": {"type": "string"}}

Response

Tool calls

simple_eval: call_67KH73vmC577yInLq4eZLATc

Arguments: {"expression": "152294110 * 33"}

2025-05-24T23:50:47

Prompt

-- none --

Tools

simple_eval: 64713bc9bfec5e5a6239b9145943c577f7781b81e2e7e129b5c1e665074926f9

Evaluate a simple expression using the simpleeval library.

Arguments: {"expression": {"type": "string"}}

Tool results

simple_eval: call_67KH73vmC577yInLq4eZLATc

5025705630

Response

152,294,110 multiplied by 33 is 5,025,705,630.

simonw · 2025-05-25T00:15:00Z

Here's a demo with multiple tool rounds from a single prompt:

llm --functions '
def lookup_population(country: str) -> int:
    "Returns the current population of the specified fictional country"
    return 123124

def can_have_dragons(population: int) -> bool:
    "Returns True if the specified population can have dragons, False otherwise"
    return population > 10000

' 'Can the country of Manganolia have dragons?' --td

Output:

Tool call: lookup_population({'country': 'Manganolia'})
  123124
Tool call: can_have_dragons({'population': 123124})
  true
Yes, the country of Manganolia can have dragons.

llm logs -c

With added <hr> - we get three logged responses from a single input prompt:

2025-05-25T00:13:35 conversation: 01jw2ckbm1s2wrm9p6a08fgk1g id: 01jw2ckbm5y50fbvc0szhfkb2r

Model: gpt-4.1-mini

Prompt

Can the country of Manganolia have dragons?

Tools

lookup_population: 47a2be714160f98686797a434ea1d3551ca55d71848b2aabb85ca866252b9bfc

Returns the current population of the specified fictional country

Arguments: {"country": {"type": "string"}}

can_have_dragons: 41f175bd12e2497880157bd68c3e95597ae333e30fdda61a1c1837ed5436c9a3

Returns True if the specified population can have dragons, False otherwise

Arguments: {"population": {"type": "integer"}}

Response

Tool calls

lookup_population: call_ZG9XihHfxeAi1iwaFpqhi4dq

Arguments: {"country": "Manganolia"}

2025-05-25T00:13:36

Prompt

-- none --

Tools

lookup_population: 47a2be714160f98686797a434ea1d3551ca55d71848b2aabb85ca866252b9bfc

Returns the current population of the specified fictional country

Arguments: {"country": {"type": "string"}}

can_have_dragons: 41f175bd12e2497880157bd68c3e95597ae333e30fdda61a1c1837ed5436c9a3

Returns True if the specified population can have dragons, False otherwise

Arguments: {"population": {"type": "integer"}}

Tool results

lookup_population: call_ZG9XihHfxeAi1iwaFpqhi4dq

123124

Response

Tool calls

can_have_dragons: call_kCSmi2PbQTuaQdUYoDrDtY6W

Arguments: {"population": 123124}

2025-05-25T00:13:37

Prompt

-- none --

Tools

lookup_population: 47a2be714160f98686797a434ea1d3551ca55d71848b2aabb85ca866252b9bfc

Returns the current population of the specified fictional country

Arguments: {"country": {"type": "string"}}

can_have_dragons: 41f175bd12e2497880157bd68c3e95597ae333e30fdda61a1c1837ed5436c9a3

Returns True if the specified population can have dragons, False otherwise

Arguments: {"population": {"type": "integer"}}

Tool results

can_have_dragons: call_kCSmi2PbQTuaQdUYoDrDtY6W

true

Response

Yes, the country of Manganolia can have dragons.

simonw · 2025-05-25T00:21:50Z

Here's a neater design for the logs without getting into the tool classes thing:

conversation: 01jw2ckbm1s2wrm9p6a08fgk1g

Model: gpt-4.1-mini

Tools:

lookup_population: 47a2be714160f98686797a434ea1d3551ca55d71848b2aabb85ca866252b9bfc

Returns the current population of the specified fictional country

Arguments: {"country": {"type": "string"}}

can_have_dragons: 41f175bd12e2497880157bd68c3e95597ae333e30fdda61a1c1837ed5436c9a3

Returns True if the specified population can have dragons, False otherwise

Arguments: {"population": {"type": "integer"}}

2025-05-25T00:13:35 01jw2ckbm5y50fbvc0szhfkb2r

Prompt

Can the country of Manganolia have dragons?

Tool calls

lookup_population: call_ZG9XihHfxeAi1iwaFpqhi4dq

Arguments: {"country": "Manganolia"}

Token usage

86 input, 16 output

2025-05-25T00:13:36 01jw2ckce4ecq4zvhssyrvtgz7

Tool results

lookup_population: call_ZG9XihHfxeAi1iwaFpqhi4dq

123124

Tool calls

can_have_dragons: call_kCSmi2PbQTuaQdUYoDrDtY6W

Arguments: {"population": 123124}

Token usage

112 input, 17 output

2025-05-25T00:13:37 01jw2ckbm5y50fbvc0szhfkb2r

Tool results

can_have_dragons: call_kCSmi2PbQTuaQdUYoDrDtY6W

true

Response

Yes, the country of Manganolia can have dragons.

Token usage

140 input, 13 output
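A rough sketch of how the renderer could decide to list tools once at the top: hoist the tool list only when every response in the conversation was offered the same tools. The dictionary shape with a `"tools"` key is an assumption made for this example, not the real log schema.

```python
def hoist_common_tools(responses):
    """Return (shared_tools, per_response_tools).

    `responses` is a list of dicts with a "tools" list (hypothetical shape).
    If every response saw identical tools, they are returned once as
    shared_tools and each per-response list is empty. Otherwise nothing is
    hoisted and every response keeps its own list.
    """
    tool_lists = [tuple(r["tools"]) for r in responses]
    if tool_lists and all(t == tool_lists[0] for t in tool_lists):
        return list(tool_lists[0]), [[] for _ in responses]
    return [], [list(t) for t in tool_lists]
```

When the tools differ between responses, the fallback keeps them attached to each response so no information is lost.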

simonw · 2025-05-25T00:30:37Z

That example assumes the tools stay constant throughout the conversation. I want to let users pass --tool X to change the available tools halfway through. I'm not entirely sure whether this should add to the original tools or replace them.

I think replacing makes more sense: it keeps people in full control of which tools are available. That's better if you want to avoid accidental data exfiltration attacks, although it's not safe to encounter potentially malicious tokens either before OR after obtaining private data, so maybe that's completely irrelevant here.
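To make the two options concrete, here is a small sketch of both behaviours. The function name, the `mode` parameter, and the rule that no new --tool means the previous tools carry over are all assumptions for illustration, not how llm is implemented.

```python
def resolve_tools(previous, requested, mode="replace"):
    """Work out which tools apply to the next prompt in a conversation.

    previous:  tool names available on the prior response
    requested: tool names passed via --tool on this prompt (may be empty)
    mode:      "replace" or "additive" (hypothetical switch for comparison)
    """
    if not requested:
        # Assumption: with no new --tool options, the earlier tools carry over.
        return list(previous)
    if mode == "replace":
        return list(requested)
    if mode == "additive":
        merged = list(previous)
        for name in requested:
            if name not in merged:
                merged.append(name)
        return merged
    raise ValueError(f"unknown mode: {mode!r}")
```

Under "replace", passing --tool X always leaves exactly X available, which is the "full control" property argued for above.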

simonw · 2025-05-25T00:40:08Z

... but OK, if I assume that new design, what could it look like with class-based tools like Playwright?

Maybe this - with a special note for when the Playwright() class is first instantiated:

conversation: 01jw2ckbm1s2wrm9p6a08fgk1g

Model: gpt-4.1-mini

Tools:

Playwright({"show": true}): 47a2be714160f98686797a434ea1d3551ca55d71848b2aabb85ca866252b9bfc

Tools for interacting with web pages using Playwright. Start by calling open_browser(url), then interact with the page using the other tools.

Methods:

open_browser(url: str): Open a browser to the specified URL

accessibility_tree(): Get the accessibility tree of the current page

screenshot(): Take a screenshot of the current page

execute_javascript(code: str): Execute JavaScript code in the context of the current page and return the result

simple_eval: 1231232131

Evaluate a simple expression, e.g. 123312 * 123123 / 2313.4

Arguments: {"expression": "string"}

2025-05-25T00:13:35 01jw2ckbm5y50fbvc0szhfkb2r

Prompt

Check what's new on simonwillison.net - times the number of h2s by 2314.

Tool calls

Playwright({"show": true}) - instance 01JW2DTBJ9JTR9S0AMPY9YQNZA

Playwright.open_browser: call_ZG9XihHfxeAi1iwaFpqhi4dq

Arguments: {"url": "https://simonwillison.net/"}

2025-05-25T00:13:36 01jw2ckce4ecq4zvhssyrvtgz7

Tool results

Playwright.open_browser: call_ZG9XihHfxeAi1iwaFpqhi4dq

ready

Tool calls

Playwright.execute_javascript: call_kCSmi2PbQTuaQdUYoDrDtY6W

Arguments: {"code": "document.querySelectorAll('h2').length"}

2025-05-25T00:13:37 01jw2ckbm5y50fbvc0szhfkb2r

Tool results

Playwright.execute_javascript: call_kCSmi2PbQTuaQdUYoDrDtY6W

16

Tool calls

simple_eval: call_kCSmi2PbQTuaQdUYoDrDtY6W

Arguments: {"expression": "16 * 2314"}

2025-05-25T00:13:37 01jw2ckbm5y50fbvc0szhfkb2r

Tool results

simple_eval: call_kCSmi2PbQTuaQdUYoDrDtY6W

37024

Response

The number of <h2> elements on simonwillison.net is 16. Multiplying this by 2314 gives a total of 37024.

simonw · 2025-05-25T00:40:37Z

I haven't thought very hard about what happens if the arguments or output from a tool call are really long.

Or if a tool call returns an attachment, see:

Ability for tools to return attachments #1014
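For the long-output case, one possible approach is to truncate values in the rendered log while noting how much was cut. This is only a sketch: the function name and the 200-character default are made up for the example, and it says nothing about what should be stored in the database.

```python
def truncate_for_log(value, max_chars=200):
    """Shorten a long argument or result string for display in the logs."""
    if len(value) <= max_chars:
        return value
    omitted = len(value) - max_chars
    # Keep the start of the value and say how many characters were dropped.
    return f"{value[:max_chars]}... ({omitted} more characters)"
```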

simonw · 2025-05-27T04:03:54Z

I'm bumping this from the stable tools release milestone for lack of time. Currently it does this, which isn't perfect (no instance information) but is usable:

# 2025-05-27T04:03:06    conversation: 01jw7yh1twd32fyfnndwag9bgh id: 01jw7yh1txvrqga7psej6q3czh

Model: **gpt-4.1-mini**

## Prompt

show tables

### Tools

- **Datasette_query**: `1aa6d35df9b225d03daee33201eab3ae539b67157f3a8ec544b0b5d8eaeeeeb9`<br>
    Execute provided SQLite SQL query - read-only, only use SELECT<br>
    Arguments: {"sqlite_sql": {"type": "string"}}
- **Datasette_schema**: `24ec61558e67e01fe6927f059ab376bad38b8dfd815851e9f3846371ef35d246`<br>
    View the SQLite schema of the attached database<br>
    Arguments: {}

## Response

### Tool calls

- **Datasette_schema**: `call_KaODghD96R8F975bYaTLTzhf`<br>
    Arguments: {}

# 2025-05-27T04:03:07

## Prompt

-- none --

### Tools

- **Datasette_query**: `1aa6d35df9b225d03daee33201eab3ae539b67157f3a8ec544b0b5d8eaeeeeb9`<br>
    Execute provided SQLite SQL query - read-only, only use SELECT<br>
    Arguments: {"sqlite_sql": {"type": "string"}}
- **Datasette_schema**: `24ec61558e67e01fe6927f059ab376bad38b8dfd815851e9f3846371ef35d246`<br>
    View the SQLite schema of the attached database<br>
    Arguments: {}

### Tool results

- **Datasette_schema**: `call_KaODghD96R8F975bYaTLTzhf`<br>
    [{"group_concat(sql, ';')": "CREATE TABLE counters (name text primary key, value integer)"}]

## Response

There is one table in the database named "counters".

simonw added this to the LLM tools non-alpha release (0.27) milestone May 25, 2025

simonw added design tools logging labels May 25, 2025

simonw modified the milestones: LLM tools non-alpha release (0.27), LLM tools v2 May 27, 2025
