Overview

Timepoint Pro generates high-quality training data for fine-tuning language models. Unlike naive prompt/completion pairs, SNAG-generated data includes:
  • Full causal ancestry - Every knowledge item has provenance
  • Quantitative state tensors - Emotional valence, arousal, energy at each turn
  • Temporal consistency - Portal mode strips anachronistic knowledge
  • Counterfactual reasoning - Branching mode shows “what if” alternatives
  • Rich context - M3 knowledge provenance, M6 entity state, M7 causal history, M10 atmosphere, M11 dialog context, M13 relationships
This makes SNAG data uniquely valuable for training:
  • Causal reasoning models
  • Multi-agent roleplay models
  • Temporal reasoning systems
  • Social simulation models

Export Formats

TDF (Timepoint Data Format)

TDF is the canonical interchange format for the Timepoint Suite (Flash, Pro, Clockchain, SNAG-Bench, Proteus). Export via API:
GET /api/data-export/{run_id}
Returns:
{
  "run_id": "run_abc123",
  "entities": [...],
  "dialogs": [...],
  "causal_edges": [...],
  "metadata": {
    "mechanisms": ["M3", "M7", "M11", "M13"],
    "temporal_mode": "portal",
    "cost_usd": 0.42,
    "token_count": 12847
  }
}
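The payload can be consumed with nothing beyond the standard library. A minimal sketch (the base URL and a local deployment are assumptions; the field names follow the example response above):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed host; adjust to your deployment

def fetch_export(run_id: str) -> dict:
    """Fetch a run's TDF export from the data-export endpoint."""
    with urllib.request.urlopen(f"{BASE_URL}/api/data-export/{run_id}") as resp:
        return json.load(resp)

def summarize(export: dict) -> str:
    """One-line summary of an export payload (fields from the example above)."""
    meta = export["metadata"]
    return (f"{export['run_id']}: {meta['temporal_mode']} mode, "
            f"{len(export['dialogs'])} dialogs, ${meta['cost_usd']:.2f}")
```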
TDF package integration: Timepoint Pro uses the canonical timepoint-tdf package:
from timepoint_tdf import from_pro, write_tdf_jsonl

# Convert Pro run to TDF
tdf_data = from_pro(run_id, store)

# Write to JSONL
write_tdf_jsonl(tdf_data, "output.jsonl")
TDF schema:
{
  "@context": "https://timepoint.ai/tdf/v1",
  "@type": "RenderedFuture",
  "run_id": "run_abc123",
  "scenario": "mars_mission_portal",
  "temporal_mode": "portal",
  "entities": [
    {
      "entity_id": "Webb",
      "entity_type": "human",
      "tensor": {...},
      "metadata": {...}
    }
  ],
  "timepoints": [
    {
      "timepoint_id": "T1",
      "timestamp": "2026-03-01T14:00:00Z",
      "entities_present": ["Webb", "Chen"],
      "causal_antecedents": ["T0"]
    }
  ],
  "dialogs": [...],
  "causal_graph": {
    "nodes": [...],
    "edges": [...]
  },
  "provenance": {
    "generated_at": "2026-03-06T12:00:00Z",
    "model": "meta-llama/llama-3.1-70b-instruct",
    "cost_usd": 0.42
  }
}
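Before feeding TDF records into downstream tooling, a lightweight structural check helps catch truncated exports. A sketch based only on the top-level fields shown above (the timepoint-tdf package may provide its own validator):

```python
# Required top-level fields, taken from the TDF schema example above.
REQUIRED_KEYS = {
    "@context", "@type", "run_id", "scenario", "temporal_mode",
    "entities", "timepoints", "dialogs", "causal_graph", "provenance",
}

def missing_tdf_keys(record: dict) -> set:
    """Return the required top-level keys absent from a TDF record."""
    return REQUIRED_KEYS - record.keys()
```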

JSONL (Training Format)

The JSONL export produces prompt/completion pairs for fine-tuning. Enable it in the template:
"outputs": {
  "export_ml_dataset": true
}
Example JSONL record:
{
  "prompt": "[INST] You are Webb, mission commander. Current state: emotional_valence=-0.2, emotional_arousal=0.6, energy_budget=72. You know: ['Mission timeline', 'O2 scrubber threshold 800 ppm', 'Current reading 847 ppm']. Relationships: Chen (colleague, trusted). Recent: Sensor alert 2 hours ago. Generate your next dialog turn. [/INST]",
  "completion": "The reading's at 847. That's 6% over spec. Chen, run a calibration check. If it's still high in 30 minutes, we scrub the EVA.",
  "metadata": {
    "timepoint_id": "T2",
    "speaker": "Webb",
    "archetype": "military_commander",
    "mechanism": "M11",
    "emotional_valence": -0.2,
    "emotional_arousal": 0.6,
    "training_safe": true
  }
}
Sample file: See examples/sample_training_data.jsonl for complete examples from Portal mode simulations.
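Consumers of the JSONL export can filter to training-safe records before fine-tuning. A minimal sketch, assuming each line matches the record shape above:

```python
import json

def load_training_safe(lines):
    """Yield parsed JSONL records whose metadata marks them training-safe."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("metadata", {}).get("training_safe", False):
            yield record

# Usage:
# with open("examples/sample_training_data.jsonl") as f:
#     records = list(load_training_safe(f))
```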

SQLite Export

Full simulation state in relational format:
# Simulation runs stored in metadata/runs.db
sqlite3 metadata/runs.db

sqlite> SELECT run_id, status, cost_usd, created_at FROM runs;
Tables:
  • runs - Run metadata
  • entities - Entity tensors and metadata
  • timepoints - Temporal structure
  • dialogs - Conversation turns
  • causal_edges - Causal graph structure
  • exposure_events - Knowledge propagation (M3)
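These tables can also be queried from Python. A sketch assuming the column names shown in the query above (verify against your local schema with `.schema runs` if it fails):

```python
import sqlite3

def completed_runs(db_path="metadata/runs.db"):
    """List completed runs with their cost, newest first.

    Column names follow the example query above; the exact schema of
    metadata/runs.db may differ between versions.
    """
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT run_id, cost_usd, created_at FROM runs "
            "WHERE status = 'completed' ORDER BY created_at DESC"
        ).fetchall()
    finally:
        con.close()
```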

Oxen.ai Auto-Upload

Automatic versioned dataset upload:
export OXEN_API_KEY=your_key
./run.sh run mars_mission_portal
# Automatically uploads to Oxen.ai with run metadata
Upload triggers:
  • export_ml_dataset=true in template
  • OXEN_API_KEY environment variable set
  • Run completes successfully
Oxen dataset structure:
timepoint-pro-training-data/
├── runs/
│   ├── run_abc123/
│   │   ├── training_data.jsonl
│   │   ├── tdf_export.json
│   │   └── metadata.json

Model Licensing

CRITICAL: Not all open-source models allow unrestricted use of outputs for training data.

License Matrix

| License | Models | Training Data Status |
| --- | --- | --- |
| MIT | DeepSeek Chat, DeepSeek R1 | ✅ Fully unrestricted: outputs can train any model |
| Apache 2.0 | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B | ✅ Fully unrestricted: outputs can train any model |
| Llama | Llama 3.1 8B/70B/405B, Llama 4 Scout | ⚠️ Restricted: Meta's license prohibits using Llama outputs to train non-Llama models |
| Qwen | Qwen 2.5 7B/72B, QwQ 32B | ✅ Permissive for most uses |

Default Behavior: M18 Filtering

The model selector (M18) automatically filters to training-safe models:
from llm_service.model_selector import ModelSelector, ActionType

selector = ModelSelector()

# Automatically filters to MIT/Apache-2.0 models
model = selector.select_model(
    ActionType.DIALOG_SYNTHESIS,
    for_training_data=True  # Only unrestricted licenses
)
# Returns: "deepseek/deepseek-chat" or "mistralai/mixtral-8x7b-instruct"
When for_training_data=True:
  • Llama models excluded (license restricts training non-Llama models)
  • Only MIT and Apache-2.0 licensed models used
  • Oxen.ai upload uses this filter automatically

Check Training-Safe Models

selector = ModelSelector()
training_safe = selector.get_training_safe_models()

print(training_safe)
# ['deepseek/deepseek-chat', 'deepseek/deepseek-r1', 
#  'mistralai/mixtral-8x7b-instruct', 'mistralai/mixtral-8x22b-instruct']

Explicitly Use Training-Safe Models

In CLI:
# Force training-safe model
./run.sh run --model deepseek/deepseek-r1 mars_mission_portal
In template:
{
  "temporal": {
    "mode": "forward"
  },
  "llm_config": {
    "default_model": "deepseek/deepseek-chat",
    "for_training_data": true
  }
}

License Implications

If using Llama outputs:
  • ✅ Can fine-tune Llama models (same family)
  • ❌ Cannot fine-tune Qwen, Mistral, DeepSeek, or custom models
  • ❌ Cannot upload to public datasets (e.g., Hugging Face)
If using MIT/Apache-2.0 outputs:
  • ✅ Can fine-tune any model
  • ✅ Can upload to public datasets
  • ✅ Can use commercially without restrictions
Recommendation: If you plan to fine-tune non-Llama models or create public datasets, always use:
./run.sh run --model deepseek/deepseek-r1 your_template

Training Data Quality

Why SNAG Data is Superior

Standard training data:
{
  "prompt": "You are a commander. Generate a dialog turn.",
  "completion": "We need to check the systems."
}
SNAG training data:
{
  "prompt": "[INST] You are Webb (military_commander archetype). State: emotional_valence=-0.2, arousal=0.6, energy=72. Knowledge provenance: 'O2 reading 847 ppm' (learned from Chen at T1, confidence 0.9), 'Threshold 800 ppm' (mission briefing T0). Relationships: Chen +0.3 trust. Causal history: Sensor alert → Disagreement with Chen → Current timepoint. Portal mode: T3 of 5, working backward from mission failure. Context: Late afternoon (circadian penalty 1.0), confined space (atmosphere: tension 0.7). Character arc: 2 prior data_arguments dismissed by crew. Generate your next dialog turn responding to Chen's concern about the O2 reading. [/INST]",
  "completion": "The reading's at 847. That's 6% over spec. Chen, run a calibration check. If it's still high in 30 minutes, we scrub the EVA.",
  "metadata": {
    "mechanisms": ["M3", "M6", "M7", "M8", "M10", "M11", "M13"],
    "temporal_mode": "portal",
    "archetype": "military_commander",
    "training_safe": true
  }
}
The SNAG version includes:
  • ✅ Quantitative emotional state
  • ✅ Knowledge provenance (who told them, when, confidence)
  • ✅ Causal history leading to this moment
  • ✅ Relationship dynamics
  • ✅ Character arc (past failures influencing tactics)
  • ✅ Circadian and atmospheric context
  • ✅ Temporal mode constraints (Portal backward reasoning)
This trains models on how social state influences language, not just language patterns.

Data Diversity

Generate diverse training sets using branching mode or variations. Branching mode:
"temporal": {
  "mode": "branching",
  "enable_counterfactuals": true,
  "path_count": 5
}
Produces 5 timeline variants from the same initial conditions, yielding diverse outputs. Variations:
"variations": {
  "enabled": true,
  "count": 10,
  "strategies": ["vary_personalities", "vary_outcomes"],
  "deduplication_threshold": 0.9
}
Runs the same scenario 10 times with varied entity personalities → diverse character voices.
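One plausible reading of deduplication_threshold is a pairwise similarity cutoff. The sketch below illustrates that interpretation using difflib; it is not the documented internal strategy:

```python
from difflib import SequenceMatcher

def dedupe(completions, threshold=0.9):
    """Keep a completion only if no already-kept one is >= threshold similar."""
    kept = []
    for text in completions:
        if all(SequenceMatcher(None, text, k).ratio() < threshold for k in kept):
            kept.append(text)
    return kept
```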

Use Cases

Fine-Tuning Causal Reasoning Models

Portal mode data trains models to reason backward from outcomes:
./run.sh run --mode portal mars_mission_portal
Training objective:
Given outcome: "Mission fails due to life support failure"
Generate: Plausible causal chain of antecedent states

Fine-Tuning Roleplay Models

Dialog with archetype profiles trains character consistency:
./run.sh run board_meeting
Training objective:
Given personality traits + archetype + emotional state:
Generate: Contextually appropriate dialog in character voice

Fine-Tuning Multi-Agent Models

Branching mode trains models to predict divergent outcomes:
./run.sh run castaway_colony_branching
Training objective:
Given initial state + intervention:
Generate: Divergent timeline showing causal consequences

Diffusion Model Conditioning

Future use case: Train diffusion models conditioned on temporal causal graphs:
# Hypothetical future API
model.train(
    condition=causal_graph,
    target=entity_states,
    objective="predict_future_state"
)

Best Practices

Balance Quality and Quantity

High-quality (expensive):
./run.sh run --model meta-llama/llama-3.1-405b-instruct mars_mission_portal
# Cost: ~$2.00 per run
# Quality: Excellent causal reasoning
Medium-quality (balanced):
./run.sh run --model meta-llama/llama-3.1-70b-instruct mars_mission_portal
# Cost: ~$0.40 per run
# Quality: Good for most use cases
High-volume (cheap):
./run.sh run --model deepseek/deepseek-chat convergence/simple
# Cost: ~$0.02 per run
# Quality: Acceptable for bulk data
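To size a dataset against a budget, the quoted per-run figures can be plugged into a quick estimate (treat the prices as rough approximations, not guarantees):

```python
# Approximate per-run costs quoted above (USD); estimates, not guarantees.
COST_PER_RUN = {
    "meta-llama/llama-3.1-405b-instruct": 2.00,
    "meta-llama/llama-3.1-70b-instruct": 0.40,
    "deepseek/deepseek-chat": 0.02,
}

def runs_within_budget(model: str, budget_usd: float) -> int:
    """How many runs of `model` fit in the budget (integer cents to avoid
    floating-point rounding surprises)."""
    cents = round(budget_usd * 100)
    per_run = round(COST_PER_RUN[model] * 100)
    return cents // per_run
```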

Filter by Mechanism

Generate data targeting specific capabilities:
# Causal reasoning: M3 + M7
templates = ["mars_mission_portal", "agent3_litigation_portal"]

# Multi-agent negotiation: M11 + M13
templates = ["board_meeting", "vc_pitch_branching"]

# Embodied cognition: M8 + M14
templates = ["hospital_crisis"]

# Counterfactual reasoning: M12
templates = ["castaway_colony_branching", "agent2_mission_failure"]

Validate Data Quality

Run convergence tests to verify data stability:
./run.sh run convergence/simple --repeat 5
# Check Jaccard similarity > 0.7 across runs
High convergence = reliable training data.
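The Jaccard check can be reproduced offline. A sketch assuming you extract a comparable set of items (e.g. knowledge strings) from each run:

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: intersection over union (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def converged(run_outputs, threshold=0.7):
    """True if every pair of runs is at least `threshold` similar."""
    return all(jaccard(x, y) >= threshold
               for x, y in combinations(run_outputs, 2))
```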

Version Control with Oxen

Use Oxen.ai to track dataset lineage:
export OXEN_API_KEY=your_key

# Each run automatically tagged with:
# - Template name
# - Temporal mode
# - Mechanism set
# - Model used
# - Cost and token count
Query historical runs:
oxen log --filter "temporal_mode=portal" --filter "cost_usd<0.50"

Data Privacy

Local-only by default:
  • All data stays in metadata/runs.db
  • No external services called unless explicitly configured
Cloud upload (opt-in):
  • Requires OXEN_API_KEY set
  • Only uploads when export_ml_dataset=true
Sensitive scenarios: to keep proprietary or confidential data local, disable dataset export:
"outputs": {
  "export_ml_dataset": false  // Disable dataset export
}
Data remains local in SQLite.

Next Steps

  • Learn about Cost Optimization to balance training data quality and cost
  • Read Validation to understand data quality checks
  • Explore Templates to configure training data export settings