LabChain
Build reproducible ML experiments with automatic inter-team result reuse.
The Problem¶
Python's flexibility accelerates research prototyping, but it frequently results in unmaintainable code and duplicated computational effort. When evaluating multiple classifiers on the same embeddings, researchers typically recompute those embeddings for each classifier, wasting hours of computation and generating avoidable CO₂ emissions.
Traditional workflow frameworks (scikit-learn, Kedro, Snakemake) don't solve the fundamental inefficiency: identical transformations computed by different team members are never automatically reused.
The Solution¶
LabChain uses hash-based caching with content-addressable storage to automatically identify and reuse intermediate results. When your colleague applies different models to the same preprocessed data, LabChain detects existing results and eliminates redundant computation—without manual coordination.
from labchain import F3Pipeline, XYData
from labchain.plugins.filters import Cached, KnnFilter

# DeepLearningEmbeddings stands in for any expensive user-defined filter
# (see the Extension Example below for how to write one).

# Wrap expensive operations with automatic caching
pipeline = F3Pipeline(
    filters=[
        Cached(
            filter=DeepLearningEmbeddings(),  # Computed once
            cache_data=True,
            cache_filter=True
        ),
        KnnFilter()  # Swap classifiers freely
    ]
)

# First run: computes and caches embeddings
pipeline.fit(x_train, y_train)

# Subsequent runs or other team members: instant cache hit
# Even with different classifiers!
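For example, a colleague who wraps the same embeddings step with identical parameters in front of a differently configured classifier gets a cache hit on the embeddings and only pays for the classifier. A minimal sketch, reusing the same hypothetical DeepLearningEmbeddings filter as above:

# A teammate's pipeline: same cached embeddings step, different classifier config.
# The Cached(...) hash matches, so the embeddings are loaded instead of recomputed.
colleague_pipeline = F3Pipeline(
    filters=[
        Cached(
            filter=DeepLearningEmbeddings(),  # identical class + parameters -> same hash
            cache_data=True,
            cache_filter=True
        ),
        KnnFilter(n_neighbors=10)  # only this step is actually computed
    ]
)
colleague_pipeline.fit(x_train, y_train)  # embeddings come from the shared cache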
- 🔄 Automatic Caching: Cryptographic hashing identifies identical computations. Share results across your team with zero configuration.
- 📦 Executable Configs: Pipelines serialize to JSON. Each dump is a complete, reproducible experiment.
- 🎯 Built-in Optimization: Grid search, Bayesian (Optuna), or Weights & Biases integration. Define search space once.
- ✅ Cross-Validation: K-Fold and Stratified K-Fold with automatic metric aggregation and std reporting.
- ☁️ Cloud Storage: Native S3 support. Share cached results across geographical locations.
- 🧩 Modular Design: Filters, metrics, optimizers—all pluggable. Extend by inheriting base classes.

Quick Start¶
Install via pip:
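pip install labchain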
Build your first pipeline:
from labchain import F3Pipeline, XYData
from labchain.plugins.filters import StandardScalerPlugin, KnnFilter
from labchain.plugins.metrics import F1, Precision, Recall

pipeline = F3Pipeline(
    filters=[
        StandardScalerPlugin(),
        KnnFilter(n_neighbors=5)
    ],
    metrics=[F1(), Precision(), Recall()]
)

pipeline.fit(x_train, y_train)
predictions = pipeline.predict(x_test)
results = pipeline.evaluate(x_test, y_test, predictions)
# {'F1': 0.96, 'Precision': 0.97, 'Recall': 0.95}
Want optimization? Just add it:
from labchain.plugins.optimizer import OptunaOptimizer
pipeline.optimizer(
    OptunaOptimizer(
        direction="maximize",
        n_trials=50,
        scorer=F1()
    )
)
Want cross-validation? Stack it:
from labchain.plugins.splitter import KFoldSplitter
pipeline.splitter(
    KFoldSplitter(n_splits=5, shuffle=True)
)
# Now returns {'F1': 0.85, 'F1_std': 0.03, 'F1_scores': [...]}
Architecture¶
LabChain follows a pipeline-and-filter architecture where filters are composable transformations with fit() and predict() methods:
%%{init: {
'theme': 'dark',
'themeVariables': {
'background': 'transparent',
'mainBkg': 'transparent',
'secondaryBkg': 'transparent',
'tertiaryBkg': 'transparent'
}}}%%
graph TB
subgraph data["Data Layer"]
XY["XYData<br/>Content-Addressable"]
Cache["Cached<br/>Hash-Based"]
end
subgraph core["Core Components"]
BF["BaseFilter<br/>Transforms"]
BP["BasePipeline<br/>Orchestration"]
BM["BaseMetric<br/>Evaluation"]
BS["BaseSplitter<br/>Cross-Validation"]
BO["BaseOptimizer<br/>Tuning"]
BST["BaseStorage<br/>Persistence"]
end
subgraph plugins["Plugin Ecosystem"]
F["Filters<br/>Scaler, PCA, KNN"]
M["Metrics<br/>F1, Precision"]
O["Optimizers<br/>Grid, Optuna"]
end
Container["Container<br/>Dependency Injection"]
XY --> BF
BF --> BP
BP --> BM
BP --> BS
BP --> BO
Cache --> BST
Container -.-> F
Container -.-> M
Container -.-> O
F -.-> BF
M -.-> BM
O -.-> BO
classDef coreStyle fill:#2E3B42,stroke:#546E7A,stroke-width:3px,color:#ECEFF1
classDef dataStyle fill:#263238,stroke:#78909C,stroke-width:3px,color:#ECEFF1
classDef pluginStyle fill:#1E272C,stroke:#4DB6AC,stroke-width:3px,color:#ECEFF1
classDef containerStyle fill:#37474F,stroke:#90A4AE,stroke-width:4px,color:#FFFFFF
class XY,Cache dataStyle
class BF,BP,BM,BS,BO,BST coreStyle
class F,M,O pluginStyle
class Container containerStyle
Key abstraction: Each filter has a unique hash computed from its class name, public parameters, and input data hash. This forms a provenance chain that enables automatic cache hits when configurations match.
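As a rough illustration of that idea (not LabChain's actual implementation; the function name, the JSON encoding, and the use of SHA-256 are assumptions made here for clarity), a hash chain over class name, public parameters, and input hash could look like this:

import hashlib
import json

def step_hash(class_name: str, public_params: dict, input_hash: str) -> str:
    """Derive a step's identity from its class name, public params, and input hash."""
    payload = json.dumps(
        {"class": class_name, "params": public_params, "input": input_hash},
        sort_keys=True,  # stable key order so identical configs hash identically
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# A toy provenance chain: embeddings feed a KNN classifier.
data_hash = hashlib.sha256(b"raw training data").hexdigest()
emb_hash = step_hash("DeepLearningEmbeddings", {"dim": 768}, data_hash)
knn_hash = step_hash("KnnFilter", {"n_neighbors": 5}, emb_hash)

# Two researchers using the same class, params, and input data derive the same
# emb_hash, so the second run can look the embeddings up instead of recomputing.
print(emb_hash == step_hash("DeepLearningEmbeddings", {"dim": 768}, data_hash))  # True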
Real-World Impact¶
A published case study on mental health detection using temporal language analysis demonstrated:
- 12+ hours of computation saved per task through caching
- 2.5–11 kg CO₂ emissions avoided (conservative estimates)
- Up to 192% performance improvement in some tasks compared to the original monolithic implementation
The performance gains emerged because modular design exposed a critical preprocessing bug that remained hidden in unstructured code. When embeddings were cached as an explicit filter, the bug became immediately visible during component validation.
Why LabChain Over Alternatives?¶
| Feature | LabChain | scikit-learn | Kedro | Snakemake |
|---|---|---|---|---|
| Automatic inter-team caching | ✅ Hash-based | ❌ None | ❌ Manual files | ❌ Timestamps |
| Executable configuration | ✅ JSON → code | ❌ Pickle only | ❌ YAML + code | ❌ Rules only |
| State management | ✅ Filter-internal | Pipeline objects | Catalog flow | File artifacts |
| Cloud storage | ✅ Native S3 | ❌ | Plugin | Limited |
| Setup overhead | Minimal | None | High | Medium |
| Target use case | Iterative research | Model building | Production ETL | Bioinformatics |
LabChain's niche: Collaborative research where multiple people explore variations on expensive preprocessing pipelines.
Core Concepts¶
BaseFilter — Any data transformation. Implement fit(x, y) and predict(x).
BasePipeline — Chains filters. Supports sequential, parallel, or MapReduce execution.
BaseMetric — Evaluation function. Knows if higher/lower is better for optimization.
BaseStorage — Persistence backend. Swap local/S3/custom without changing code.
Container — Dependency injection. Registers components via @Container.bind().
XYData — Data wrapper with content-addressable hash for cache lookups.
Extension Example¶
Creating a custom filter is straightforward:
from labchain import BaseFilter, XYData, Container

@Container.bind()
class MyTransformer(BaseFilter):
    def __init__(self, scale: float = 1.0):
        super().__init__(scale=scale)
        self.scale = scale
        self._mean = None  # Private state

    def fit(self, x: XYData, y: XYData | None):
        self._mean = x.value.mean()

    def predict(self, x: XYData) -> XYData:
        transformed = (x.value - self._mean) * self.scale
        return XYData.mock(transformed)
Key insight: Public attributes (like scale) define the filter's identity and appear in the hash. Private attributes (like _mean) are internal state excluded from serialization.
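Once registered, the custom filter composes with the built-in plugins shown earlier. A small usage sketch (the surrounding pipeline and metric choices are illustrative, not prescribed):

from labchain import F3Pipeline
from labchain.plugins.filters import Cached, KnnFilter
from labchain.plugins.metrics import F1

# Two runs of MyTransformer(scale=2.0) on the same input share a hash and hit
# the cache; changing scale changes the filter's identity and forces a recompute.
pipeline = F3Pipeline(
    filters=[
        Cached(filter=MyTransformer(scale=2.0), cache_data=True, cache_filter=True),
        KnnFilter(n_neighbors=5)
    ],
    metrics=[F1()]
)
pipeline.fit(x_train, y_train)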
Production Readiness¶
LabChain is actively maintained and used daily in production research at CiTIUS.
Stable components: Core filters, pipelines, Container, caching, data splitting, WandB optimizer.
Experimental components: Sklearn/Optuna optimizers, distributed execution.
Roadmap:
- Graphical interfaces for non-programmers
- Federated execution across institutions
- Deep PyTorch integration (cacheable layer-wise pretraining)
Learn More¶
- Quick Start Guide — Step-by-step tutorial
- Start Caching — Hands-on caching tutorial
- Architecture Deep Dive — Technical details
- Examples — Real-world use cases
- API Reference — Complete documentation
Community¶
- GitHub: manucouto1/LabChain
- Documentation: manucouto1.github.io/LabChain
- Paper: SoftwareX submission (preprint)
- Case Study: Mental health detection
LabChain is licensed under AGPL-3.0. Contributions welcome!
Ready to stop recomputing? Install labchain and start caching. Get Started →