# Libgrid Dogfood Plan
## Why Libgrid
`libgrid` is the right first serious dogfood target because it has exactly the
failure mode Fidget Spinner is designed to kill:
- long autonomous optimization loops
- heavy benchmark slicing
- worktree churn
- huge markdown logs that blur intervention, result, and verdict
That is the proving ground.
## Immediate Goal
The goal is not “ingest every scrap of prose.”
The goal is to replace the giant freeform experiment log with a machine in
which the active frontier, live hypotheses, current experiments, verdicts, and
best benchmark lines are explicit and queryable.
## Mapping Libgrid Work Into The Model
### Frontier
One optimization objective becomes one frontier:
- root cash-out
- LP spend reduction
- primal improvement
- search throughput
- cut pipeline quality
The frontier brief should answer where the campaign stands right now, not dump
historical narrative.
### Hypothesis
A hypothesis should capture one concrete intervention claim:
- terse title
- one-line summary
- one-paragraph body
If the body wants to become a design memo, it is too large.
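The size constraint above can be made mechanical. A minimal sketch, assuming nothing about Fidget Spinner's real schema — the class name, field names, and the 1200-character cap are all illustrative:

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Illustrative shape for a hypothesis record; not the real schema."""
    title: str    # terse title
    summary: str  # one-line summary
    body: str     # one-paragraph body

    def validate(self) -> None:
        # Enforce the one-paragraph rule: no blank-line paragraph breaks,
        # and a hard cap so the body cannot grow into a design memo.
        if "\n\n" in self.body:
            raise ValueError("hypothesis body must be a single paragraph")
        if len(self.body) > 1200:  # assumed cap, tune to taste
            raise ValueError("body too large; attach a design memo artifact instead")
```

A body that fails validation is a signal to split the claim or move the prose into an artifact.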
### Experiment
Each measured slice becomes one experiment under exactly one hypothesis.
The experiment closes with:
- dimensions such as `instance`, `profile`, `duration_s`
- primary metric
- supporting metrics
- verdict: `accepted | kept | parked | rejected`
- rationale
- optional analysis
If a tranche doc reports multiple benchmark slices, it should become multiple
experiments, not one prose blob.
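The closing record above can be sketched as a typed structure. Field names and types are assumptions for illustration, not Fidget Spinner's actual schema; the verdict vocabulary is taken from the list above:

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# The four verdicts from the model; Literal rejects anything else at type-check time.
Verdict = Literal["accepted", "kept", "parked", "rejected"]


@dataclass
class ExperimentClose:
    """Illustrative closing record for one measured slice."""
    dimensions: dict            # e.g. {"instance": ..., "profile": ..., "duration_s": ...}
    primary_metric: tuple       # (name, value)
    verdict: Verdict
    rationale: str
    supporting_metrics: dict = field(default_factory=dict)
    analysis: Optional[str] = None  # optional longer analysis
```

A tranche doc with three benchmark slices would produce three of these records, each under its own hypothesis.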
### Artifact
Historical markdown, logs, tables, and other large dumps should be attached as
artifacts by reference when they matter. They should not live in the ledger as
default-enumerated prose.
## Libgrid Workflow
### 1. Ground
1. Bind the MCP to the libgrid worktree.
2. Read `frontier.open`.
3. Decide whether the next move is a new hypothesis, a new experiment on an
existing hypothesis, or a frontier brief update.
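The decision in step 3 can be sketched as a small dispatch over the brief. The client object, the `frontier.open` payload shape, and the returned action names are all assumptions; only the tool name `frontier.open` comes from this document:

```python
def next_move(client) -> str:
    """Pick the next action from the frontier brief (client API is hypothetical)."""
    brief = client.call("frontier.open", {})
    hypotheses = brief.get("hypotheses", [])
    if not hypotheses:
        # Nothing live: the next move is a new line of attack.
        return "record_hypothesis"
    if any(h.get("open_experiments") for h in hypotheses):
        # A slice is mid-flight: finish measuring before opening anything new.
        return "continue_experiment"
    # Live hypotheses but no open slices: open the next experiment.
    return "new_experiment"
```

The point is that orientation is a read of the frontier, not a re-read of historical prose.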
### 2. Start a line of attack
1. Record a hypothesis.
2. Attach any necessary artifacts by reference.
3. Open one experiment for the concrete slice being tested.
### 3. Execute
1. Modify the worktree.
2. Run the benchmark protocol.
3. Close the experiment atomically with parsed metrics and an explicit verdict.
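Step 3 hinges on parsing metrics before closing, so that metrics and verdict land in one call and the experiment never sits half-closed. A sketch, assuming a `key=value` benchmark summary line and a hypothetical `experiment.close` client method:

```python
import re


def parse_bench_line(line: str) -> dict:
    """Extract key=value pairs from one benchmark summary line.
    The key=value output format is an assumption about the harness."""
    return {k: float(v) for k, v in re.findall(r"(\w+)=([-\d.eE+]+)", line)}


def close_experiment(client, experiment_id: str, bench_line: str,
                     verdict: str, rationale: str) -> None:
    """Close in a single call so parsed metrics and verdict commit together
    (client and method name are hypothetical)."""
    client.call("experiment.close", {
        "id": experiment_id,
        "metrics": parse_bench_line(bench_line),
        "verdict": verdict,
        "rationale": rationale,
    })
```

Parsing before the close call means a garbled benchmark line fails loudly instead of producing a prose-only result.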
### 4. Judge and continue
1. Use `accepted`, `kept`, `parked`, and `rejected` honestly.
2. Let the frontier brief summarize the current strategic state.
3. Let historical tranche markdown live as artifacts when preservation matters.
## Benchmark Discipline
For `libgrid`, the minimum trustworthy record is:
- run dimensions
- primary metric
- supporting metrics that materially explain the verdict
- rationale
This is the minimum needed to prevent “I think this was faster” folklore.
## Active Metric Discipline
`libgrid` will accumulate many niche metrics.
The hot path should care about live metrics only: the metrics touched by the
active experimental frontier and its immediate comparison set. Old, situational
metrics may remain in the registry without dominating `frontier.open`.
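The live-metric rule above amounts to a filter over the registry. A minimal sketch, with both data shapes assumed: the registry maps metric name to definition, and each open experiment lists the metric names it touches:

```python
def live_metrics(registry: dict, open_experiments: list) -> dict:
    """Keep only metrics touched by open experiments and their comparison set.

    Shapes are assumptions: registry is {name: definition}, each experiment
    is a dict with optional "metrics" and "comparison_metrics" name lists.
    """
    touched = set()
    for exp in open_experiments:
        touched.update(exp.get("metrics", []))
        touched.update(exp.get("comparison_metrics", []))
    # Stale metrics stay in the registry; they just never reach the hot path.
    return {name: spec for name, spec in registry.items() if name in touched}
```

Retired metrics are never deleted, only excluded from the orientation surface.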
## Acceptance Bar
Fidget Spinner is ready for serious `libgrid` use when:
- an agent can run for hours without generating a markdown graveyard
- `frontier.open` gives a truthful, bounded orientation surface
- active hypotheses and open experiments are obvious
- closed experiments carry parsed metrics rather than prose-only results
- artifacts preserve source texture without flooding the hot path
- the system feels like a machine for evidence rather than a diary with better
typography