Knowledge Graph Architecture
The graph system PGS queries is not a vector store and not a property graph in the Neo4j sense. It is a hybrid that combines dense embeddings, typed edges, spreading activation, Hebbian learning, and state-dependent topology maintenance into a living, continuously-evolving knowledge structure.
A Hybrid Knowledge Structure
The COSMO knowledge graph is a live, continuously-mutating in-memory structure that persists to disk between runs. It combines five architectural layers that no single existing system provides together:
- Dense embeddings on every node (512-dimensional Float32 vectors) for semantic similarity lookups
- Typed edges (13 types across 4 families) for explicit causal and structural reasoning
- Spreading activation for associative recall -- the primary retrieval mechanism, not just nearest-neighbor search
- Hebbian learning -- edges reinforce on co-activation and weaken over time if not reinforced
- Watts-Strogatz topology maintenance -- controllable rewiring probability to maintain small-world properties, with state-dependent parameters
On top of these: temporal decay on both nodes and edges to suppress stale knowledge, quality gating at insertion time so garbage never enters the graph, and LLM-driven consolidation to compress redundant clusters into higher-level abstractions.
The graph is the single source of truth for what the system knows. It is also the substrate that accumulates across runs (via fork/merge) and across the lifetime of a brain. When PGS queries this graph, it partitions by topology, sweeps at full fidelity, and synthesizes across partitions -- but the graph itself is doing continuous work between queries: decaying stale nodes, reinforcing co-activated edges, rewiring topology, and consolidating redundant memories.
Node Schema
Each node is a plain JavaScript object stored in a Map<id, node>. The complete field specification:
| Field | Type | Description |
|---|---|---|
| `id` | number \| string | Auto-assigned. Numeric by default; string format when loaded from a merged brain. |
| `concept` | string | The full text content of this memory. May be hundreds or thousands of characters. |
| `summary` | string \| null | Extractive compression for prompt injection. Only populated when quality score >= 0.6. |
| `keyPhrase` | string \| null | Ultra-compressed single phrase for quick-reference display. |
| `tag` | string | Categorical label. Drives quality gate routing, edge type inference, GC protection. |
| `embedding` | Float32Array[512] | Dense semantic vector from text-embedding-3-small. Null only if embedding failed. |
| `activation` | number (0-1) | Current spreading activation level. Updated by `spreadActivation()` per query. |
| `cluster` | number \| null | Cluster membership ID assigned by neighbor voting. |
| `weight` | number (0-1) | Relevance weight. Starts at 1.0, decays over time, boosted +0.1 on access. |
| `created` | Date | Insertion timestamp. Used for temporal weighting in queries. |
| `accessed` | Date | Last retrieval timestamp. Used for decay age calculation. |
| `accessCount` | number | Number of times this node has appeared in query results. |
| `consolidatedAt` | string \| null | ISO timestamp set by consolidation. Prevents re-clustering and grants GC exemption. |
| `sourceRuns` | any \| null | Set when inherited from a merged brain. Absolute GC exemption. |
| `mergedAt` | any \| null | Set when imported via merge. Absolute GC exemption. |
| `domain` | string \| null | Set on merged/inherited nodes. Non-'unknown' value is an absolute GC exemption. |
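For concreteness, a freshly inserted node might look like the sketch below. All field values are illustrative; in practice `embedding` is filled by the text-embedding-3-small API rather than left as zeros.

```javascript
// Hypothetical node object matching the schema above (values illustrative).
const node = {
  id: 1042,                          // auto-assigned, numeric by default
  concept: "Retry with exponential backoff resolved the intermittent 429 errors.",
  summary: "Exponential backoff fixed intermittent 429s.", // quality score >= 0.6
  keyPhrase: "backoff fixes 429s",
  tag: "agent_success",
  embedding: new Float32Array(512),  // zeros here; real vector comes from the embedding API
  activation: 0,
  cluster: null,
  weight: 1.0,                       // starts at 1.0, decays over time
  created: new Date(),
  accessed: new Date(),
  accessCount: 0,
  consolidatedAt: null,
  sourceRuns: null,
  mergedAt: null,
  domain: null,
};
```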
Edge Types: 13 Types, 4 Families
Edges are stored in a Map<string, edge> where the key is "{lowerNodeId}->{higherNodeId}" (lexicographic sort). Each edge carries source, target, weight, type, created, and accessed timestamps.
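The key scheme can be sketched as a small helper. The name `edgeKey` is an assumption, not taken from the source; note that the sort is lexicographic on the stringified IDs, matching the description above, so both argument orders resolve to the same Map entry.

```javascript
// Canonical edge key: both ID orderings map to the same Map entry.
function edgeKey(a, b) {
  const [lo, hi] = [String(a), String(b)].sort(); // default sort is lexicographic
  return `${lo}->${hi}`;
}

const edges = new Map();
edges.set(edgeKey(7, 3), {
  source: 3, target: 7, weight: 0.5, type: "associative",
  created: new Date(), accessed: new Date(),
});
```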
Generic Family
- ASSOCIATIVE -- generic semantic association. Default when no more specific type applies.
- BRIDGE -- cross-cluster connection. Created by `addRandomBridge()` and `rewireSmallWorld()`. Weight 0.3. Never rewired by Watts-Strogatz.
Causal Family
- TRIGGERED_BY -- user prompt triggered tool call; goal triggered agent spawn.
- CAUSED_BY -- effect to root cause. Set when a failure node links to a root cause node.
- RESOLVED_BY -- failure to resolution. Set when a success node links to a failure node.
Semantic Family
- CONTRADICTS -- failed hypothesis to counter-evidence.
- VALIDATES -- evidence to claim.
- REFINES -- later attempt to earlier attempt. Requires temporal refinement context.
- SYNTHESIZES -- set when a node carries a synthesis or consolidated tag.
Structural Family
- SUPERSEDES -- new version to old version.
- DEPENDS_ON -- task to prerequisite.
- EXECUTED_BY -- task to agent.
- PRODUCED -- agent to deliverable.
Edge Type Inference
The inferEdgeType() function uses a decision tree based on node tags and caller-provided context. If the caller specifies context.relationship, that value is used directly. Otherwise, the function examines tag patterns: failure + root cause yields CAUSED_BY, task + agent yields EXECUTED_BY, synthesis tags yield SYNTHESIZES, and so on. The default fallback is ASSOCIATIVE.
inferEdgeType(nodeA, nodeB, context)
IF context.relationship -> return context.relationship
IF tagA == 'agent_failure'
AND tagB == 'root_cause' -> CAUSED_BY
IF tagA == 'agent_success'
AND tagB includes 'failure' -> RESOLVED_BY
IF tagA includes 'task'
AND tagB includes 'agent' -> EXECUTED_BY
IF tagA includes 'agent'
AND tagB includes 'deliverable' -> PRODUCED
IF tagA == 'synthesis'
OR tagA == 'consolidated' -> SYNTHESIZES
IF context.temporal == 'refinement' -> REFINES
IF context.temporal == 'supersedes' -> SUPERSEDES
IF context.dependency -> DEPENDS_ON
DEFAULT -> ASSOCIATIVE

Quality Gate
Every node insertion passes through a two-tier classifier before the embedding API is called. A classification of `category === 'operational'` or `category === 'garbage'` rejects the node. Pre-computed embeddings bypass the gate -- external callers that pass a pre-embedded node assert quality themselves.
Tier 1: Pattern Classification
Hard validity checks reject null content, content under 10 characters, content over 50,000 characters, and excessive word repetition (words/unique words ratio > 5). Then a regex-based classifier runs:
- Structural tags (14 tags like `mission_plan`, `cross_agent_pattern`, `source_code_analysis`) always pass. These carry agent-to-agent JSON communication.
- High-value tags (26 tags like `research`, `analysis`, `synthesis_report`, `novel_connection`) pass if stripped content length >= 30 characters.
- Operational patterns (status updates, batch reports, process notifications) are rejected.
- Error patterns (API failures, connection errors) are rejected unless tagged as `error_report`.
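The hard validity checks that run before this pattern classifier can be sketched as follows (function name assumed; thresholds taken from the description above):

```javascript
// Tier 1 hard validity checks: null, too-short, too-long, or repetitive content is rejected.
function passesValidity(content) {
  if (!content) return false;
  if (content.length < 10 || content.length > 50000) return false;
  const words = content.toLowerCase().split(/\s+/).filter(Boolean);
  const unique = new Set(words).size;
  if (words.length / unique > 5) return false; // excessive word repetition
  return true;
}
```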
Tier 2: Content Density Heuristic
If all pattern gates pass, the content scores on a density formula. Score below 0.4 is rejected.
meaningfulRatio = meaningfulWords / totalWords
// meaningful = length > 3, not a stop word
knowledgeBoost = 0
+ 0.15 if content has 2+ numbers/statistics
+ 0.10 if mid-sentence capitalized words (proper nouns)
+ 0.10 if 3+ words with 8+ characters (technical terms)
+ 0.05 if content includes quotes or URLs
noisePenalty = 0
+ 0.30 if "X of Y succeeded/completed" pattern
+ 0.20 if bare file path lines
score = clamp(0, 1, meaningfulRatio + knowledgeBoost - noisePenalty)
threshold = 0.4

The gate ensures that embedding cost is never incurred on garbage content. Only knowledge-bearing content enters the graph.
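A runnable sketch of the density heuristic follows. The stop-word list is abbreviated and every regex is an assumption standing in for the real patterns; only the weights and the structure of the formula come from the description above.

```javascript
// Tier 2 content density heuristic (stop-word list and regexes illustrative).
const STOP_WORDS = new Set(["the", "and", "that", "with", "this", "from", "have"]);

function densityScore(content) {
  const words = content.split(/\s+/).filter(Boolean);
  const meaningful = words.filter(w => w.length > 3 && !STOP_WORDS.has(w.toLowerCase()));
  const meaningfulRatio = words.length ? meaningful.length / words.length : 0;

  let knowledgeBoost = 0;
  if ((content.match(/\d+(\.\d+)?%?/g) || []).length >= 2) knowledgeBoost += 0.15; // numbers/statistics
  if (/\s[A-Z][a-z]+/.test(content)) knowledgeBoost += 0.10;                       // mid-sentence proper nouns
  if (words.filter(w => w.length >= 8).length >= 3) knowledgeBoost += 0.10;        // technical terms
  if (/["']|https?:\/\//.test(content)) knowledgeBoost += 0.05;                    // quotes or URLs

  let noisePenalty = 0;
  if (/\d+ of \d+ (succeeded|completed)/i.test(content)) noisePenalty += 0.30;     // batch-report pattern
  if (/^\s*[\w./-]+\.\w+\s*$/m.test(content)) noisePenalty += 0.20;                // bare file path lines

  return Math.max(0, Math.min(1, meaningfulRatio + knowledgeBoost - noisePenalty));
}
```

A batch report like "3 of 5 succeeded" lands well under the 0.4 threshold; a dense technical sentence with statistics and proper nouns clears it easily.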
Spreading Activation
Spreading activation is the primary retrieval mechanism -- not pure nearest-neighbor search, but a BFS traversal that propagates context through the graph's relational structure.
Starting from a seed node (the node with highest cosine similarity to the query embedding), activation propagates outward via edges. At each hop, activation decays by the product of edge weight and a decay factor:
queue = [{ nodeId: seedNode, activation: 1.0, depth: 0 }]
while queue not empty:
{ nodeId, activation, depth } = queue.shift()
if depth > maxDepth OR activation < threshold: continue
for each neighbor of nodeId:
edge = getEdge(nodeId, neighbor)
newActivation = activation * edge.weight * decayFactor
if newActivation >= threshold:
queue.push({ nodeId: neighbor, activation: newActivation, depth: depth + 1 })

Temporal Blending
After spreading activation, effective activation is blended with a temporal weight based on node age. This happens at query time, not at write time, so archived brains retain full retrieval fidelity on re-inspection.
temporalWeight = 0.5 ^ (age / halfLife) // half-life: 7 days
effectiveActivation = activation * (0.7 + 0.3 * temporalWeight)
A node created today has temporalWeight ~ 1.0, so effectiveActivation ~ activation. A node 7 days old has temporalWeight = 0.5, so effectiveActivation = activation * 0.85. A node 30 days old has temporalWeight ~ 0.05, so effectiveActivation ~ activation * 0.715. Recent knowledge is preferred but old knowledge is never fully suppressed.
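The blending arithmetic is easy to verify directly; the sketch below uses exactly the half-life and coefficients from the formula above (the function name is assumed):

```javascript
// Query-time temporal blending: recent nodes keep full activation,
// old nodes bottom out at 70% rather than being suppressed entirely.
const HALF_LIFE_DAYS = 7;

function effectiveActivation(activation, ageDays) {
  const temporalWeight = Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
  return activation * (0.7 + 0.3 * temporalWeight);
}
```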
Hebbian Learning
Three reinforcement pathways implement the "fire together, wire together" principle:
Edge Reinforcement on Co-activation
When an edge already exists between two nodes, adding a new connection reinforces it rather than creating a duplicate:
addEdge(nodeA, nodeB, weight, type)
if edge exists:
existing.weight = min(1.0, existing.weight + weight) // Hebbian reinforcement
existing.accessed = new Date()
else:
create new edge

Co-occurrence Strengthening
When multiple nodes appear together in query results, all pairwise connections between them are reinforced. Nodes that are repeatedly retrieved together develop stronger direct connections, creating retrieval pathways that mirror usage patterns.
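A sketch of pairwise reinforcement over one query's result set. The function name, the key format, and the 0.05 increment are assumptions; the source does not specify the increment.

```javascript
// Reinforce every existing edge between nodes that were retrieved together.
function reinforceCooccurrence(resultIds, edges, increment = 0.05) {
  for (let i = 0; i < resultIds.length; i++) {
    for (let j = i + 1; j < resultIds.length; j++) {
      const key = [String(resultIds[i]), String(resultIds[j])].sort().join("->");
      const edge = edges.get(key);
      if (edge) {
        edge.weight = Math.min(1.0, edge.weight + increment); // capped Hebbian update
        edge.accessed = new Date();
      }
    }
  }
}
```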
Access Reinforcement
Each query hit boosts a node's weight by 0.1 (capped at 1.0). Frequently-retrieved nodes resist decay and rise in future retrieval rankings. This creates a positive feedback loop: useful knowledge becomes easier to find.
// On every query hit:
node.accessed = new Date()
node.accessCount++
node.weight = min(1.0, node.weight + 0.1)

Small-World Topology
This is the most novel aspect of the architecture. The graph maintains small-world properties through Watts-Strogatz rewiring with state-dependent probabilities, simulating the difference between waking cognition (conservative, maintaining structure) and sleep consolidation (chaotic, creating cross-domain bridges).
Watts-Strogatz Rewiring
For each intra-cluster edge, with probability p: remove the edge, find a random node in a different cluster, and create a new bridge edge to it. Bridge edges (type 'bridge', weight 0.3) are exempt from future rewiring. If no valid target is found after repeated attempts, the original edge is restored.
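The rewiring pass can be sketched as follows. Function and field names are assumed, and the source's retry loop for target selection is simplified to a single candidate draw; keeping the edge when no cross-cluster target exists plays the role of "restoring" it.

```javascript
// One Watts-Strogatz rewiring pass. Bridge edges are never rewired;
// if no cross-cluster target exists, the original edge is kept.
function rewireSmallWorld(nodes, edges, p) {
  for (const [key, edge] of [...edges]) {           // snapshot: we mutate the Map below
    if (edge.type === "bridge") continue;
    const src = nodes.get(edge.source), dst = nodes.get(edge.target);
    if (!src || !dst || src.cluster !== dst.cluster) continue; // intra-cluster edges only
    if (Math.random() >= p) continue;               // rewire with probability p
    const candidates = [...nodes.values()].filter(n => n.cluster !== src.cluster);
    if (candidates.length === 0) continue;
    const target = candidates[Math.floor(Math.random() * candidates.length)];
    edges.delete(key);
    const newKey = [String(edge.source), String(target.id)].sort().join("->");
    edges.set(newKey, { source: edge.source, target: target.id, weight: 0.3, type: "bridge" });
  }
}
```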
State-Dependent Parameters
| State | Rewiring Probability | Frequency | Effect |
|---|---|---|---|
| Wake | p = 0.01 | Every 30 cycles | Conservative: 1% of intra-cluster edges get rewired. Maintains existing structure while slowly adding cross-cluster paths. |
| Dream | p = 0.5 | During dream state | Chaotic: half of all intra-cluster edges are redirected to cross-cluster bridges. Substantial topological restructuring. |
Dream-state rewiring creates the creative cross-domain associations that characterize the system's consolidation behavior. The high p=0.5 means the graph topology is substantially restructured each dream cycle, creating new retrieval pathways between previously disconnected knowledge clusters. Over multiple dream cycles, this produces a single navigable component with short average path length despite high clustering -- the small-world property.
The closest analog in cognitive science is the distinction between consolidation during wakefulness (incremental, stability-preserving) and consolidation during sleep (radical, creating novel associations through hippocampal replay). The graph instantiates this distinction through a single parameter.
Decay and Garbage Collection
Weight Decay
applyDecay() runs every 30 cycles during wake state. For each non-exempt node, if the time since last access exceeds the decay interval, the node's weight is multiplied by a base factor (< 1.0). Edges decay at half the node rate -- their interval is doubled. This means edge weights are more stable than node weights, preserving relational structure even as individual nodes weaken.
Decay does not delete. The decay system only reduces weights. Decayed nodes with weight above 0.01 remain retrievable indefinitely; they simply rank lower in activation-sorted results. Exempt tags (specified in configuration) skip decay entirely.
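A sketch of the decay pass over nodes. The interval, base factor, and exempt tag list are stand-ins; the source specifies only that the factor is below 1.0 and that edges use a doubled interval.

```javascript
// Weight decay: never deletes, only shrinks weights of stale nodes.
const DECAY_INTERVAL_MS = 24 * 60 * 60 * 1000; // assumed: 1 day
const DECAY_FACTOR = 0.95;                     // assumed base factor (< 1.0)
const EXEMPT_TAGS = new Set(["mission_plan"]); // from configuration

function applyDecay(nodes, now = Date.now()) {
  for (const node of nodes.values()) {
    if (EXEMPT_TAGS.has(node.tag)) continue;   // exempt tags skip decay entirely
    if (now - node.accessed.getTime() > DECAY_INTERVAL_MS) {
      node.weight *= DECAY_FACTOR;
    }
  }
}
```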
Ultra-Conservative Garbage Collection
A node reaches deletion only when all of the following are simultaneously true:
- Not consolidated (`consolidatedAt` is null)
- Not inherited or merged (`sourceRuns` and `mergedAt` are null)
- Not a protected tag (18 protected tags including `research`, `analysis`, `synthesis`, `breakthrough`, `execution_result`)
- Weight below 0.01
- Not accessed for 730+ days (2 years)
In practice, the weight decay system means most nodes reach weight < 0.01 well before 2 years -- but the access time guard means any node touched in the last 2 years is permanently safe regardless of weight. The system is biased toward retention over deletion.
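The conjunction of guards can be expressed as a single predicate (function name assumed; protected tag list abbreviated to 5 of the 18; the `domain` exemption from the node schema is included):

```javascript
// A node is deletable only when every guard fails simultaneously.
const PROTECTED_TAGS = new Set(["research", "analysis", "synthesis", "breakthrough", "execution_result"]);
const TWO_YEARS_MS = 730 * 24 * 60 * 60 * 1000;

function isDeletable(node, now = Date.now()) {
  return node.consolidatedAt === null
    && node.sourceRuns === null
    && node.mergedAt === null
    && (node.domain === null || node.domain === "unknown")
    && !PROTECTED_TAGS.has(node.tag)
    && node.weight < 0.01
    && (now - node.accessed.getTime()) >= TWO_YEARS_MS;
}
```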
LLM-Driven Memory Consolidation
When the graph accumulates enough similar nodes, LLM-driven consolidation compresses redundant clusters into higher-level abstractions.
Clustering
Greedy single-pass clustering at cosine similarity threshold >= 0.75. Only unconsolidated nodes are considered. Clusters of 3 or more similar nodes are candidates for consolidation.
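A sketch of the greedy single-pass clustering (names assumed; nodes given as an array for simplicity, whereas the real store is a Map):

```javascript
// Cosine similarity between two embedding vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Greedy single pass: each unassigned node seeds a cluster and absorbs
// every remaining node within the similarity threshold.
function greedyClusters(nodes, threshold = 0.75) {
  const unconsolidated = nodes.filter(n => n.consolidatedAt === null && n.embedding);
  const clusters = [];
  const assigned = new Set();
  for (const seed of unconsolidated) {
    if (assigned.has(seed.id)) continue;
    const cluster = [seed];
    assigned.add(seed.id);
    for (const other of unconsolidated) {
      if (!assigned.has(other.id) && cosine(seed.embedding, other.embedding) >= threshold) {
        cluster.push(other);
        assigned.add(other.id);
      }
    }
    if (cluster.length >= 3) clusters.push(cluster); // only 3+ nodes qualify for consolidation
  }
  return clusters;
}
```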
Synthesis
Each cluster is sent to an LLM with instructions to synthesize related memory entries into a single, higher-level insight. The prompt focuses on what the entries collectively reveal that no single entry shows alone -- the emerging pattern, trend, or conclusion. Specific evidence, numbers, and named entities are preserved.
After successful consolidation, every source node receives a consolidatedAt timestamp. This prevents re-clustering in future passes and grants absolute GC exemption.
Recursive Hierarchical Summarization
Up to 3 levels of summarization. Each level reduces the entry count by a factor of roughly 20. A 400-entry journal produces ~20 level-0 summaries, then ~1 level-1 summary, then stops. This creates a natural hierarchy: raw observations compress into findings, findings compress into insights, insights compress into conclusions.
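The hierarchy arithmetic can be sketched with a stubbed summarizer; the batch size of 20 is inferred from the ~1/20 reduction described above, and `summarize` stands in for the LLM call.

```javascript
// Recursive hierarchical summarization: each level batches ~20 entries
// into one summary, up to 3 levels or until a single summary remains.
function summarizeHierarchy(entries, summarize, maxLevels = 3, batchSize = 20) {
  let level = entries;
  const levels = [];
  for (let i = 0; i < maxLevels && level.length > 1; i++) {
    const next = [];
    for (let j = 0; j < level.length; j += batchSize) {
      next.push(summarize(level.slice(j, j + batchSize))); // stand-in for the LLM call
    }
    levels.push(next);
    level = next;
  }
  return levels;
}
```

Running this over 400 entries yields 20 level-0 summaries and 1 level-1 summary, after which the loop stops, matching the journal example above.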
Comparison to Existing Systems
| Capability | Neo4j | HNSW Stores (Pinecone, Weaviate) | Microsoft GraphRAG | COSMO Knowledge Graph |
|---|---|---|---|---|
| Dense embeddings per node | No (external plugin) | Yes | During indexing | Yes (512-dim, all nodes) |
| Typed edges | Yes | No | Co-occurrence only | 13 types, 4 families |
| Spreading activation | No | No | No | BFS with temporal blending |
| Hebbian learning | No | No | No | Edge reinforcement + co-occurrence |
| Topology rewiring | No | No | No | Watts-Strogatz, state-dependent |
| Temporal decay | Manual TTL | Manual TTL | No | Continuous, query-time blending |
| Quality gating | No | No | No | Two-tier classifier at insertion |
| Online ingestion | Yes | Yes | Batch pipeline | Continuous, per-node |
| LLM consolidation | No | No | Community reports (offline) | Live, hierarchical (3 levels) |
| Causal traversal | Cypher queries | No | No | traceCausalChain() |
What Is Genuinely New
No single technique here is novel in isolation. Spreading activation dates to Collins and Loftus (1975). Hebbian learning is fundamental neuroscience. Watts-Strogatz small-world networks were characterized in 1998. Quality gating is standard data engineering. What is new is the combination -- and specifically, the state-dependent topology maintenance.
The combination that is not present in any single existing system:
- Continuous online ingestion with quality gating -- knowledge enters the graph incrementally as it is produced, and garbage is rejected before embedding cost is incurred.
- Spreading activation over a Hebbian-weighted graph -- retrieval is not pure nearest-neighbor but propagates context through reinforced relational paths. Usage shapes future retrieval.
- State-dependent topology -- wake state (p = 0.01, conservative maintenance) and dream state (p = 0.5, chaotic cross-cluster bridging) give the graph a neurodynamic quality absent from static indexes. The graph topology itself is a function of cognitive state.
- 13 semantic edge types -- causal traversal, dependency tracking, validation chains, and synthesis relationships are first-class. These enable graph-RAG queries that pure embedding stores cannot answer.
- Temporal decay at query time, not write time -- archived brains are not degraded. Recency is applied as a scoring adjustment at retrieval, so the same graph can be queried with different temporal perspectives.
The closest analog is a hybrid of Ratcliff's global matching models (spreading activation with weighted similarity), the Watts-Strogatz small-world network model (rewiring probability controlling topology), and HNSW anchoring (dense embeddings for seed selection). But no existing system combines these with state-dependent rewiring, Hebbian reinforcement, and LLM-driven consolidation in a single live knowledge structure.