LabChain
Build reproducible ML experiments with automatic inter-team result reuse.
The Problem¶
Python's flexibility accelerates research prototyping, but it frequently results in unmaintainable code and duplicated computational effort. When evaluating multiple classifiers on the same embeddings, researchers typically recompute those embeddings for each classifier, wasting hours of computation and generating avoidable CO₂ emissions.
Traditional workflow frameworks (scikit-learn, Kedro, Snakemake) don't solve the fundamental inefficiency: identical transformations computed by different team members are never automatically reused.
The Solution¶
LabChain uses hash-based caching with content-addressable storage to automatically identify and reuse intermediate results. When your colleague applies different models to the same preprocessed data, LabChain detects existing results and eliminates redundant computation—without manual coordination.
from labchain import F3Pipeline, XYData
from labchain.plugins.filters import Cached, KnnFilter

# DeepLearningEmbeddings stands in for any expensive user-defined filter
# (see the Extension Example below for how to write one).

# Wrap expensive operations with automatic caching
pipeline = F3Pipeline(
    filters=[
        Cached(
            filter=DeepLearningEmbeddings(),  # Computed once
            cache_data=True,
            cache_filter=True
        ),
        KnnFilter()  # Swap classifiers freely
    ]
)

# First run: computes and caches embeddings
pipeline.fit(x_train, y_train)

# Subsequent runs or other team members: instant cache hit
# Even with different classifiers!
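For example, a colleague who wraps the same embeddings step with identical parameters in front of a differently configured classifier gets a cache hit on the embeddings and only pays for the classifier. A minimal sketch, reusing the same hypothetical DeepLearningEmbeddings filter as above:

# A teammate's pipeline: same cached embeddings step, different classifier config.
# The Cached(...) hash matches, so the embeddings are loaded instead of recomputed.
colleague_pipeline = F3Pipeline(
    filters=[
        Cached(
            filter=DeepLearningEmbeddings(),  # identical class + parameters -> same hash
            cache_data=True,
            cache_filter=True
        ),
        KnnFilter(n_neighbors=10)  # only this step is actually computed
    ]
)
colleague_pipeline.fit(x_train, y_train)  # embeddings come from the shared cache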
- 🔄 Automatic Caching: Cryptographic hashing identifies identical computations. Share results across your team with zero configuration.
- 📦 Executable Configs: Pipelines serialize to JSON. Each dump is a complete, reproducible experiment.
- 🎯 Built-in Optimization: Grid search, Bayesian (Optuna), or Weights & Biases integration. Define search space once.
- ✅ Cross-Validation: K-Fold and Stratified K-Fold with automatic metric aggregation and std reporting.
- ☁️ Cloud Storage: Native S3 support. Share cached results across geographical locations.
- 🧩 Modular Design: Filters, metrics, optimizers—all pluggable. Extend by inheriting base classes.

Quick Start¶
Install via pip:
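pip install labchain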
Build your first pipeline:
from labchain import F3Pipeline, XYData
from labchain.plugins.filters import StandardScalerPlugin, KnnFilter
from labchain.plugins.metrics import F1, Precision, Recall

pipeline = F3Pipeline(
    filters=[
        StandardScalerPlugin(),
        KnnFilter(n_neighbors=5)
    ],
    metrics=[F1(), Precision(), Recall()]
)

pipeline.fit(x_train, y_train)
predictions = pipeline.predict(x_test)
results = pipeline.evaluate(x_test, y_test, predictions)
# {'F1': 0.96, 'Precision': 0.97, 'Recall': 0.95}
Want optimization? Just add it:
from labchain.plugins.optimizer import OptunaOptimizer
pipeline.optimizer(
    OptunaOptimizer(
        direction="maximize",
        n_trials=50,
        scorer=F1()
    )
)
Want cross-validation? Stack it:
from labchain.plugins.splitter import KFoldSplitter
pipeline.splitter(
    KFoldSplitter(n_splits=5, shuffle=True)
)
# Now returns {'F1': 0.85, 'F1_std': 0.03, 'F1_scores': [...]}
Architecture¶
LabChain follows a pipeline-and-filter architecture where filters are composable transformations with fit() and predict() methods:
%%{init: {
'theme': 'dark',
'themeVariables': {
'background': 'transparent',
'mainBkg': 'transparent',
'secondaryBkg': 'transparent',
'tertiaryBkg': 'transparent'
}}}%%
graph TB
subgraph data["Data Layer"]
XY["XYData<br/>Content-Addressable"]
Cache["Cached<br/>Hash-Based"]
end
subgraph core["Core Components"]
BF["BaseFilter<br/>Transforms"]
BP["BasePipeline<br/>Orchestration"]
BM["BaseMetric<br/>Evaluation"]
BS["BaseSplitter<br/>Cross-Validation"]
BO["BaseOptimizer<br/>Tuning"]
BST["BaseStorage<br/>Persistence"]
end
subgraph plugins["Plugin Ecosystem"]
F["Filters<br/>Scaler, PCA, KNN"]
M["Metrics<br/>F1, Precision"]
O["Optimizers<br/>Grid, Optuna"]
end
Container["Container<br/>Dependency Injection"]
XY --> BF
BF --> BP
BP --> BM
BP --> BS
BP --> BO
Cache --> BST
Container -.-> F
Container -.-> M
Container -.-> O
F -.-> BF
M -.-> BM
O -.-> BO
classDef coreStyle fill:#2E3B42,stroke:#546E7A,stroke-width:3px,color:#ECEFF1
classDef dataStyle fill:#263238,stroke:#78909C,stroke-width:3px,color:#ECEFF1
classDef pluginStyle fill:#1E272C,stroke:#4DB6AC,stroke-width:3px,color:#ECEFF1
classDef containerStyle fill:#37474F,stroke:#90A4AE,stroke-width:4px,color:#FFFFFF
class XY,Cache dataStyle
class BF,BP,BM,BS,BO,BST coreStyle
class F,M,O pluginStyle
class Container containerStyle
Key abstraction: Each filter has a unique hash computed from its class name, public parameters, and input data hash. This forms a provenance chain that enables automatic cache hits when configurations match.
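As a rough illustration of that idea (not LabChain's actual implementation; the function name, the JSON encoding, and the use of SHA-256 are assumptions made here for clarity), a hash chain over class name, public parameters, and input hash could look like this:

import hashlib
import json

def step_hash(class_name: str, public_params: dict, input_hash: str) -> str:
    """Derive a step's identity from its class name, public params, and input hash."""
    payload = json.dumps(
        {"class": class_name, "params": public_params, "input": input_hash},
        sort_keys=True,  # stable key order so identical configs hash identically
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# A toy provenance chain: embeddings feed a KNN classifier.
data_hash = hashlib.sha256(b"raw training data").hexdigest()
emb_hash = step_hash("DeepLearningEmbeddings", {"dim": 768}, data_hash)
knn_hash = step_hash("KnnFilter", {"n_neighbors": 5}, emb_hash)

# Two researchers using the same class, params, and input data derive the same
# emb_hash, so the second run can look the embeddings up instead of recomputing.
print(emb_hash == step_hash("DeepLearningEmbeddings", {"dim": 768}, data_hash))  # True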
Real-World Impact¶
A published case study on mental health detection using temporal language analysis demonstrated:
- 12+ hours of computation saved per task through caching
- 2.5–11 kg CO₂ emissions avoided (conservative estimates)
- Up to 192% performance improvement in some tasks compared to the original monolithic implementation
The performance gains emerged because modular design exposed a critical preprocessing bug that remained hidden in unstructured code. When embeddings were cached as an explicit filter, the bug became immediately visible during component validation.
Why LabChain Over Alternatives?¶
| Feature | LabChain | scikit-learn | Kedro | Snakemake |
|---|---|---|---|---|
| Automatic inter-team caching | ✅ Hash-based | ❌ None | ❌ Manual files | ❌ Timestamps |
| Executable configuration | ✅ JSON → code | ❌ Pickle only | ❌ YAML + code | ❌ Rules only |
| State management | ✅ Filter-internal | Pipeline objects | Catalog flow | File artifacts |
| Cloud storage | ✅ Native S3 | ❌ | Plugin | Limited |
| Setup overhead | Minimal | None | High | Medium |
| Target use case | Iterative research | Model building | Production ETL | Bioinformatics |
LabChain's niche: Collaborative research where multiple people explore variations on expensive preprocessing pipelines.
Core Concepts¶
BaseFilter — Any data transformation. Implement fit(x, y) and predict(x).
BasePipeline — Chains filters. Supports sequential, parallel, or MapReduce execution.
BaseMetric — Evaluation function. Knows if higher/lower is better for optimization.
BaseStorage — Persistence backend. Swap local/S3/custom without changing code.
Container — Dependency injection. Registers components via @Container.bind().
XYData — Data wrapper with content-addressable hash for cache lookups.
Extension Example¶
Creating a custom filter is straightforward:
from labchain import BaseFilter, XYData, Container

@Container.bind()
class MyTransformer(BaseFilter):
    def __init__(self, scale: float = 1.0):
        super().__init__(scale=scale)
        self.scale = scale
        self._mean = None  # Private state

    def fit(self, x: XYData, y: XYData | None):
        self._mean = x.value.mean()

    def predict(self, x: XYData) -> XYData:
        transformed = (x.value - self._mean) * self.scale
        return XYData.mock(transformed)
Key insight: Public attributes (like scale) define the filter's identity and appear in the hash. Private attributes (like _mean) are internal state excluded from serialization.
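Once registered, the custom filter composes with the built-in plugins shown earlier. A small usage sketch (the surrounding pipeline and metric choices are illustrative, not prescribed):

from labchain import F3Pipeline
from labchain.plugins.filters import Cached, KnnFilter
from labchain.plugins.metrics import F1

# Two runs of MyTransformer(scale=2.0) on the same input share a hash and hit
# the cache; changing scale changes the filter's identity and forces a recompute.
pipeline = F3Pipeline(
    filters=[
        Cached(filter=MyTransformer(scale=2.0), cache_data=True, cache_filter=True),
        KnnFilter(n_neighbors=5)
    ],
    metrics=[F1()]
)
pipeline.fit(x_train, y_train)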
Production Readiness¶
LabChain is actively maintained and used daily in production research at CiTIUS.
Stable components: Core filters, pipelines, Container, caching, data splitting, WandB optimizer.
Experimental components: Sklearn/Optuna optimizers, distributed execution.
Roadmap:
- Graphical interfaces for non-programmers
- Federated execution across institutions
- Deep PyTorch integration (cacheable layer-wise pretraining)
Learn More¶
- Quick Start Guide — Step-by-step tutorial
- Start Caching — Hands-on caching tutorial
- Architecture Deep Dive — Technical details
- Examples — Real-world use cases
- API Reference — Complete documentation
Community¶
- GitHub: manucouto1/LabChain
- Documentation: manucouto1.github.io/LabChain
- Paper: SoftwareX submission (preprint)
- Case Study: Mental health detection
LabChain is licensed under AGPL-3.0. Contributions welcome!
Ready to stop recomputing? Install labchain and start caching. Get Started →