Types

framework3.base.base_types

Float = float | np.float16 | np.float32 | np.float64 module-attribute

Type alias for float values, including numpy float types.

IncEx = 'set[int] | set[str] | dict[int, Any] | dict[str, Any] | None' module-attribute

Type alias for inclusion/exclusion specifications in data processing.

SkVData = np.ndarray | pd.DataFrame | spmatrix | csr_matrix module-attribute

Type alias for scikit-learn compatible data structures.

TxyData = TypeVar('TxyData', SkVData, VData) module-attribute

Type variable constrained to SkVData or VData for use in XYData.

TypePlugable = TypeVar('TypePlugable') module-attribute

Generic type variable for pluggable types in the framework.

VData = np.ndarray | pd.DataFrame | spmatrix | list | torch.Tensor module-attribute

Type alias for various data structures used in the framework.

__all__ = ['XYData', 'VData', 'SkVData', 'IncEx', 'TypePlugable'] module-attribute

JsonEncoderkwargs

Bases: TypedDict

Source code in framework3/base/base_types.py
class JsonEncoderkwargs(TypedDict, total=False):
    exclude: IncEx | None
    by_alias: bool
    exclude_unset: bool
    exclude_defaults: bool
    exclude_none: bool
    sqlalchemy_safe: bool
by_alias instance-attribute
exclude instance-attribute
exclude_defaults instance-attribute
exclude_none instance-attribute
exclude_unset instance-attribute
sqlalchemy_safe instance-attribute
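
Because the class is declared with total=False, every key is optional. The following is a minimal sketch of how such a dictionary might be built and unpacked. The TypedDict is re-declared locally so the snippet runs without framework3, IncEx is simplified to Any, and `dump` is a hypothetical consumer that is not part of the library.

```python
from typing import Any, TypedDict


class JsonEncoderkwargs(TypedDict, total=False):
    # Local copy of the documented fields; IncEx is simplified to Any here.
    exclude: Any
    by_alias: bool
    exclude_unset: bool
    exclude_defaults: bool
    exclude_none: bool
    sqlalchemy_safe: bool


def dump(obj: dict, **kwargs: Any) -> dict:
    """Hypothetical consumer: drops None values when exclude_none is set."""
    if kwargs.get("exclude_none", False):
        return {k: v for k, v in obj.items() if v is not None}
    return dict(obj)


# total=False lets callers supply only the keys they care about.
options: JsonEncoderkwargs = {"exclude_none": True, "by_alias": False}
result = dump({"a": 1, "b": None}, **options)
print(result)
```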

XYData dataclass

Bases: Generic[TxyData]

A dataclass representing data for machine learning tasks, typically features (X) or targets (Y).

This class uses slots for memory efficiency. It is not frozen: reading value replaces a callable _value with the data it returns. It provides a standardized way to handle various types of data used in machine learning pipelines.

Attributes:

Name Type Description
_hash str

A unique identifier or hash for the data.

_path str

The path where the data is stored or retrieved from.

_value TxyData | Callable[..., TxyData]

The actual data or a callable that returns the data.

Methods:

Name Description
train_test_split

Split the data into training and testing sets.

split

Create a new XYData instance with specified indices.

mock

Create a mock XYData instance for testing or placeholder purposes.

concat

Concatenate a list of data along the specified axis.

ensure_dim

Ensure the input data has at least two dimensions.

as_iterable

Convert the data to an iterable form.

Example
import numpy as np
from framework3.base.base_types import XYData

# Create a mock XYData instance with random data
features = np.random.rand(100, 5)
labels = np.random.randint(0, 2, 100)

x_data = XYData.mock(features, hash="feature_data", path="/data/features")
y_data = XYData.mock(labels, hash="label_data", path="/data/labels")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = x_data.train_test_split(x_data.value, y_data.value, test_size=0.2)

# Access the data
print(f"Training features shape: {X_train.value.shape}")
print(f"Training labels shape: {y_train.value.shape}")

# Create a subset of the data
subset = x_data.split(range(50))
print(f"Subset shape: {subset.value.shape}")
Note

This class is designed to work with various data types including numpy arrays, pandas DataFrames, scipy sparse matrices, and PyTorch tensors.

Source code in framework3/base/base_types.py
@dataclass(slots=True)
class XYData(Generic[TxyData]):
    """
    A dataclass representing data for machine learning tasks, typically features (X) or targets (Y).

    This class uses slots for memory efficiency. It is not frozen: reading `value` replaces
    a callable `_value` with the data it returns. It provides a standardized way to handle
    various types of data used in machine learning pipelines.

    Attributes:
        _hash (str): A unique identifier or hash for the data.
        _path (str): The path where the data is stored or retrieved from.
        _value (TxyData | Callable[..., TxyData]): The actual data or a callable that returns the data.

    Methods:
        train_test_split: Split the data into training and testing sets.
        split: Create a new XYData instance with specified indices.
        mock: Create a mock XYData instance for testing or placeholder purposes.
        concat: Concatenate a list of data along the specified axis.
        ensure_dim: Ensure the input data has at least two dimensions.
        as_iterable: Convert the data to an iterable form.

    Example:
        ```python
        import numpy as np
        from framework3.base.base_types import XYData

        # Create a mock XYData instance with random data
        features = np.random.rand(100, 5)
        labels = np.random.randint(0, 2, 100)

        x_data = XYData.mock(features, hash="feature_data", path="/data/features")
        y_data = XYData.mock(labels, hash="label_data", path="/data/labels")

        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = x_data.train_test_split(x_data.value, y_data.value, test_size=0.2)

        # Access the data
        print(f"Training features shape: {X_train.value.shape}")
        print(f"Training labels shape: {y_train.value.shape}")

        # Create a subset of the data
        subset = x_data.split(range(50))
        print(f"Subset shape: {subset.value.shape}")
        ```

    Note:
        This class is designed to work with various data types including numpy arrays,
        pandas DataFrames, scipy sparse matrices, and PyTorch tensors.
    """

    _hash: str = field(init=True)
    _path: str = field(init=True)
    _value: TxyData | Callable[..., TxyData] = field(init=True, repr=False)

    def train_test_split(
        self, x: TxyData, y: TxyData | None, test_size: float, random_state: int = 42
    ) -> Tuple[XYData, XYData, XYData, XYData]:
        """
        Split the data into training and testing sets.

        This method uses sklearn's train_test_split function to divide the data
        into training and testing sets for both features (X) and targets (Y).

        Args:
            x (TxyData): The feature data to split.
            y (TxyData | None): The target data to split. Can be None for unsupervised learning.
            test_size (float): The proportion of the data to include in the test split (0.0 to 1.0).
            random_state (int, optional): Seed for the random number generator. Defaults to 42.

        Returns:
            Tuple[XYData, XYData, XYData, XYData]: A tuple containing (X_train, X_test, y_train, y_test),
            each wrapped in an XYData instance.

        Example:
            ```python
            data = XYData.mock(np.random.rand(100, 5))
            labels = XYData.mock(np.random.randint(0, 2, 100))
            X_train, X_test, y_train, y_test = data.train_test_split(data.value, labels.value, test_size=0.2)
            ```
        """
        X_train, X_test, y_train, y_test = train_test_split(
            x, y, test_size=test_size, random_state=random_state
        )

        return (
            XYData.mock(X_train, hash=f"{self._hash} X train", path="/dataset"),
            XYData.mock(X_test, hash=f"{self._hash} X test", path="/dataset"),
            XYData.mock(y_train, hash=f"{self._hash} y train", path="/dataset"),
            XYData.mock(y_test, hash=f"{self._hash} y test", path="/dataset"),
        )

    def split(self, indices: Iterable[int]) -> XYData:
        """
        Split the data into a new XYData instance with the specified indices.

        This method creates a new XYData instance containing only the data
        corresponding to the provided indices.

        Args:
            indices (Iterable[int]): The indices to select from the data.

        Returns:
            XYData: A new XYData instance containing the selected data.

        Example:
            ```python
            data = XYData.mock(np.random.rand(100, 5))
            subset = data.split(range(50, 100))  # Select second half of the data
            ```
        """

        def split_data(self, indices: Iterable[int]) -> Any:
            value = self.value
            if isinstance(value, spmatrix):
                value = csr_matrix(value)

            return cast(spmatrix, value[indices])

        indices_hash = hashlib.sha1(str(indices).encode()).hexdigest()
        return XYData(
            _hash=f"{self._hash}[{indices_hash}]",
            _path=self._path,
            _value=lambda: split_data(self, indices),
        )

    @staticmethod
    def mock(
        value: TxyData | Callable[..., TxyData],
        hash: str | None = None,
        path: str | None = None,
    ) -> XYData:
        """
        Create a mock XYData instance for testing or placeholder purposes.

        This static method allows for easy creation of XYData instances,
        particularly useful in testing scenarios or when placeholder data is needed.

        Args:
            value (TxyData | Callable[..., TxyData]): The data or a callable that returns the data.
            hash (str | None, optional): A hash string for the data. Defaults to "Mock" if None.
            path (str | None, optional): A path string for the data. Defaults to "/tmp" if None.

        Returns:
            XYData: A new XYData instance with the provided or default values.

        Example:
            ```python
            mock_data = XYData.mock(np.random.rand(10, 5), hash="test_data", path="/data/test")
            ```
        """
        if hash is None:
            hash = "Mock"

        if path is None:
            path = "/tmp"

        return XYData(_hash=hash, _path=path, _value=value)

    @property
    def value(self) -> TxyData:
        """
        Property to access the actual data.

        This property ensures that if _value is a callable, it is called to retrieve the data.
        Otherwise, it returns the data directly.

        Returns:
            TxyData: The actual data (numpy array, pandas DataFrame, scipy sparse matrix, etc.).

        Note:
            This property may modify the _value attribute if it's initially a callable.
        """
        self._value = self._value() if callable(self._value) else self._value
        return self._value

    @staticmethod
    def concat(x: list[TxyData], axis: int = -1) -> XYData:
        """
        Concatenate a list of data along the specified axis.

        This static method handles concatenation for various data types,
        including sparse matrices and other array-like structures.

        Args:
            x (list[TxyData]): List of data to concatenate.
            axis (int, optional): Axis along which to concatenate. Defaults to -1.

        Returns:
            XYData: A new XYData instance with the concatenated data.

        Raises:
            ValueError: If an invalid axis is specified for sparse matrix concatenation.

        Example:
            ```python
            data1 = np.random.rand(10, 5)
            data2 = np.random.rand(10, 5)
            combined = XYData.concat([data1, data2], axis=1)
            ```
        """
        if all(isinstance(item, spmatrix) for item in x):
            if axis == 1:
                return XYData.mock(value=cast(spmatrix, hstack(x)))
            elif axis == 0:
                return XYData.mock(value=cast(spmatrix, vstack(x)))
            raise ValueError("Invalid axis for concatenating sparse matrices")
        return concat(x, axis=axis)

    @staticmethod
    def ensure_dim(x: list | np.ndarray) -> list | np.ndarray:
        """
        Ensure the input data has at least two dimensions.

        This static method is a wrapper around the ensure_dim function,
        which adds a new axis to 1D arrays or lists.

        Args:
            x (list | np.ndarray): Input data to ensure dimensions.

        Returns:
            list | np.ndarray: Data with at least two dimensions.

        Example:
            ```python
            data = [1, 2, 3, 4, 5]
            two_dim_data = XYData.ensure_dim(data)
            ```
        """
        return ensure_dim(x)

    def as_iterable(self) -> Iterable:
        """
        Convert the `_value` attribute to an iterable, regardless of its underlying type.

        This method provides a consistent way to iterate over the data,
        handling different data types appropriately.

        Returns:
            Iterable: An iterable version of `_value`.

        Raises:
            TypeError: If the value type is not compatible with iteration.

        Example:
            ```python
            data = XYData.mock(np.random.rand(10, 5))
            for item in data.as_iterable():
                print(item)
            ```
        """
        value = self.value

        # Handle the different data types
        if isinstance(value, np.ndarray):
            return value  # numpy arrays are already iterable
        elif isinstance(value, pd.DataFrame):
            return value.iterrows()  # Returns an iterator over the rows
        elif isinstance(value, spmatrix):
            return value.toarray()  # type: ignore # Converts the sparse matrix to a dense array
        elif isinstance(value, torch.Tensor):
            return value
        else:
            raise TypeError(f"Type {type(value)} is not compatible with iteration.")
value property

Property to access the actual data.

This property ensures that if _value is a callable, it is called to retrieve the data. Otherwise, it returns the data directly.

Returns:

Name Type Description
TxyData TxyData

The actual data (numpy array, pandas DataFrame, scipy sparse matrix, etc.).

Note

This property may modify the _value attribute if it's initially a callable.

__init__(_hash, _path, _value)
as_iterable()

Convert the _value attribute to an iterable, regardless of its underlying type.

This method provides a consistent way to iterate over the data, handling different data types appropriately.

Returns:

Name Type Description
Iterable Iterable

An iterable version of _value.

Raises:

Type Description
TypeError

If the value type is not compatible with iteration.

Example
data = XYData.mock(np.random.rand(10, 5))
for item in data.as_iterable():
    print(item)
Source code in framework3/base/base_types.py
def as_iterable(self) -> Iterable:
    """
    Convert the `_value` attribute to an iterable, regardless of its underlying type.

    This method provides a consistent way to iterate over the data,
    handling different data types appropriately.

    Returns:
        Iterable: An iterable version of `_value`.

    Raises:
        TypeError: If the value type is not compatible with iteration.

    Example:
        ```python
        data = XYData.mock(np.random.rand(10, 5))
        for item in data.as_iterable():
            print(item)
        ```
    """
    value = self.value

    # Handle the different data types
    if isinstance(value, np.ndarray):
        return value  # numpy arrays are already iterable
    elif isinstance(value, pd.DataFrame):
        return value.iterrows()  # Returns an iterator over the rows
    elif isinstance(value, spmatrix):
        return value.toarray()  # type: ignore # Converts the sparse matrix to a dense array
    elif isinstance(value, torch.Tensor):
        return value
    else:
        raise TypeError(f"Type {type(value)} is not compatible with iteration.")
concat(x, axis=-1) staticmethod

Concatenate a list of data along the specified axis.

This static method handles concatenation for various data types, including sparse matrices and other array-like structures.

Parameters:

Name Type Description Default
x list[TxyData]

List of data to concatenate.

required
axis int

Axis along which to concatenate. Defaults to -1.

-1

Returns:

Name Type Description
XYData XYData

A new XYData instance with the concatenated data.

Raises:

Type Description
ValueError

If an invalid axis is specified for sparse matrix concatenation.

Example
data1 = np.random.rand(10, 5)
data2 = np.random.rand(10, 5)
combined = XYData.concat([data1, data2], axis=1)
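
For sparse inputs, the source listing accepts only axis 0 or 1, so the default axis of -1 raises ValueError. The sketch below mirrors that branch with a hypothetical helper, `concat_sparse`, which returns the raw matrix instead of an XYData instance so that it runs without framework3.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, spmatrix, vstack


def concat_sparse(items: list, axis: int):
    """Stand-in for the sparse branch of XYData.concat (returns the raw matrix)."""
    if not all(isinstance(item, spmatrix) for item in items):
        raise TypeError("this sketch only covers scipy sparse matrices")
    if axis == 1:
        return hstack(items)  # side by side: column counts add up
    if axis == 0:
        return vstack(items)  # stacked: row counts add up
    raise ValueError("Invalid axis for concatenating sparse matrices")


a = csr_matrix(np.eye(10, 5))
b = csr_matrix(np.ones((10, 5)))

wide = concat_sparse([a, b], axis=1)
tall = concat_sparse([a, b], axis=0)
print(wide.shape, tall.shape)

try:
    concat_sparse([a, b], axis=-1)  # the documented default axis
    rejected = False
except ValueError:
    rejected = True
print("axis=-1 rejected:", rejected)
```

When concatenating sparse matrices it is therefore safer to pass axis explicitly.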
Source code in framework3/base/base_types.py
@staticmethod
def concat(x: list[TxyData], axis: int = -1) -> XYData:
    """
    Concatenate a list of data along the specified axis.

    This static method handles concatenation for various data types,
    including sparse matrices and other array-like structures.

    Args:
        x (list[TxyData]): List of data to concatenate.
        axis (int, optional): Axis along which to concatenate. Defaults to -1.

    Returns:
        XYData: A new XYData instance with the concatenated data.

    Raises:
        ValueError: If an invalid axis is specified for sparse matrix concatenation.

    Example:
        ```python
        data1 = np.random.rand(10, 5)
        data2 = np.random.rand(10, 5)
        combined = XYData.concat([data1, data2], axis=1)
        ```
    """
    if all(isinstance(item, spmatrix) for item in x):
        if axis == 1:
            return XYData.mock(value=cast(spmatrix, hstack(x)))
        elif axis == 0:
            return XYData.mock(value=cast(spmatrix, vstack(x)))
        raise ValueError("Invalid axis for concatenating sparse matrices")
    return concat(x, axis=axis)
ensure_dim(x) staticmethod

Ensure the input data has at least two dimensions.

This static method is a wrapper around the ensure_dim function, which adds a new axis to 1D arrays or lists.

Parameters:

Name Type Description Default
x list | ndarray

Input data to ensure dimensions.

required

Returns:

Type Description
list | ndarray

Data with at least two dimensions.

Example
data = [1, 2, 3, 4, 5]
two_dim_data = XYData.ensure_dim(data)
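
The dispatch-by-type idea behind ensure_dim can be approximated with the standard library. This sketch uses functools.singledispatch in place of the multimethod decorator, and the name `ensure_2d` is hypothetical. The ndarray overload is not shown on this page, so the choice to turn a 1D array of n values into an (n, 1) column is an assumption.

```python
from functools import singledispatch

import numpy as np


@singledispatch
def ensure_2d(x):
    """Fallback for unsupported types, like the documented base multimethod."""
    raise TypeError(f"Cannot ensure dimensions for {type(x).__name__}")


@ensure_2d.register
def _(x: np.ndarray) -> np.ndarray:
    # Assumption: a 1D array of n values becomes an (n, 1) column.
    return x[:, None] if x.ndim == 1 else x


@ensure_2d.register
def _(x: list) -> np.ndarray:
    # Lists are converted to arrays first, as in the documented list overload.
    return ensure_2d(np.array(x))


column = ensure_2d([1, 2, 3, 4, 5])
matrix = ensure_2d(np.zeros((4, 3)))
print(column.shape, matrix.shape)
```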
Source code in framework3/base/base_types.py
@staticmethod
def ensure_dim(x: list | np.ndarray) -> list | np.ndarray:
    """
    Ensure the input data has at least two dimensions.

    This static method is a wrapper around the ensure_dim function,
    which adds a new axis to 1D arrays or lists.

    Args:
        x (list | np.ndarray): Input data to ensure dimensions.

    Returns:
        list | np.ndarray: Data with at least two dimensions.

    Example:
        ```python
        data = [1, 2, 3, 4, 5]
        two_dim_data = XYData.ensure_dim(data)
        ```
    """
    return ensure_dim(x)
mock(value, hash=None, path=None) staticmethod

Create a mock XYData instance for testing or placeholder purposes.

This static method allows for easy creation of XYData instances, particularly useful in testing scenarios or when placeholder data is needed.

Parameters:

Name Type Description Default
value TxyData | Callable[..., TxyData]

The data or a callable that returns the data.

required
hash str | None

A hash string for the data. Defaults to "Mock" if None.

None
path str | None

A path string for the data. Defaults to "/tmp" if None.

None

Returns:

Name Type Description
XYData XYData

A new XYData instance with the provided or default values.

Example
mock_data = XYData.mock(np.random.rand(10, 5), hash="test_data", path="/data/test")
Source code in framework3/base/base_types.py
@staticmethod
def mock(
    value: TxyData | Callable[..., TxyData],
    hash: str | None = None,
    path: str | None = None,
) -> XYData:
    """
    Create a mock XYData instance for testing or placeholder purposes.

    This static method allows for easy creation of XYData instances,
    particularly useful in testing scenarios or when placeholder data is needed.

    Args:
        value (TxyData | Callable[..., TxyData]): The data or a callable that returns the data.
        hash (str | None, optional): A hash string for the data. Defaults to "Mock" if None.
        path (str | None, optional): A path string for the data. Defaults to "/tmp" if None.

    Returns:
        XYData: A new XYData instance with the provided or default values.

    Example:
        ```python
        mock_data = XYData.mock(np.random.rand(10, 5), hash="test_data", path="/data/test")
        ```
    """
    if hash is None:
        hash = "Mock"

    if path is None:
        path = "/tmp"

    return XYData(_hash=hash, _path=path, _value=value)
split(indices)

Split the data into a new XYData instance with the specified indices.

This method creates a new XYData instance containing only the data corresponding to the provided indices.

Parameters:

Name Type Description Default
indices Iterable[int]

The indices to select from the data.

required

Returns:

Name Type Description
XYData XYData

A new XYData instance containing the selected data.

Example
data = XYData.mock(np.random.rand(100, 5))
subset = data.split(range(50, 100))  # Select second half of the data
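
The new instance's hash is derived from str(indices), as the source listing shows. A small sketch of that step follows, using only hashlib. The helper name `indices_hash` is hypothetical. It illustrates that two index collections selecting the same rows can still produce different hashes when their string forms differ.

```python
import hashlib


def indices_hash(indices) -> str:
    """Mirror of the hashing step in split: SHA-1 of str(indices)."""
    return hashlib.sha1(str(indices).encode()).hexdigest()


as_range = indices_hash(range(50, 100))
as_list = indices_hash(list(range(50, 100)))

# Same rows selected, but str() differs ("range(50, 100)" vs "[50, 51, ...]"),
# so the derived hashes differ as well.
print(as_range == as_list)

# The hash is stable for equal representations.
print(as_range == indices_hash(range(50, 100)))
```

If hashes are used as cache keys, passing indices in one consistent form avoids treating identical subsets as distinct.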
Source code in framework3/base/base_types.py
def split(self, indices: Iterable[int]) -> XYData:
    """
    Split the data into a new XYData instance with the specified indices.

    This method creates a new XYData instance containing only the data
    corresponding to the provided indices.

    Args:
        indices (Iterable[int]): The indices to select from the data.

    Returns:
        XYData: A new XYData instance containing the selected data.

    Example:
        ```python
        data = XYData.mock(np.random.rand(100, 5))
        subset = data.split(range(50, 100))  # Select second half of the data
        ```
    """

    def split_data(self, indices: Iterable[int]) -> Any:
        value = self.value
        if isinstance(value, spmatrix):
            value = csr_matrix(value)

        return cast(spmatrix, value[indices])

    indices_hash = hashlib.sha1(str(indices).encode()).hexdigest()
    return XYData(
        _hash=f"{self._hash}[{indices_hash}]",
        _path=self._path,
        _value=lambda: split_data(self, indices),
    )
train_test_split(x, y, test_size, random_state=42)

Split the data into training and testing sets.

This method uses sklearn's train_test_split function to divide the data into training and testing sets for both features (X) and targets (Y).

Parameters:

Name Type Description Default
x TxyData

The feature data to split.

required
y TxyData | None

The target data to split. Can be None for unsupervised learning.

required
test_size float

The proportion of the data to include in the test split (0.0 to 1.0).

required
random_state int

Seed for the random number generator. Defaults to 42.

42

Returns:

Type Description
Tuple[XYData, XYData, XYData, XYData]

A tuple containing (X_train, X_test, y_train, y_test), each wrapped in an XYData instance.

Example
data = XYData.mock(np.random.rand(100, 5))
labels = XYData.mock(np.random.randint(0, 2, 100))
X_train, X_test, y_train, y_test = data.train_test_split(data.value, labels.value, test_size=0.2)
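
Since the method forwards its arguments to sklearn's train_test_split, the resulting sizes and reproducibility can be checked against sklearn directly. This is a sketch on plain numpy arrays without the XYData wrapper. According to the source listing, the wrapper only adds hashes such as "<hash> X train" and a fixed "/dataset" path.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.random.rand(100, 5)
labels = np.random.randint(0, 2, 100)

# Same arguments the method forwards: test_size plus a fixed random_state.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Repeating the call with the same seed reproduces the same partition.
X_train_again, _, _, _ = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
print(np.array_equal(X_train, X_train_again))
```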
Source code in framework3/base/base_types.py
def train_test_split(
    self, x: TxyData, y: TxyData | None, test_size: float, random_state: int = 42
) -> Tuple[XYData, XYData, XYData, XYData]:
    """
    Split the data into training and testing sets.

    This method uses sklearn's train_test_split function to divide the data
    into training and testing sets for both features (X) and targets (Y).

    Args:
        x (TxyData): The feature data to split.
        y (TxyData | None): The target data to split. Can be None for unsupervised learning.
        test_size (float): The proportion of the data to include in the test split (0.0 to 1.0).
        random_state (int, optional): Seed for the random number generator. Defaults to 42.

    Returns:
        Tuple[XYData, XYData, XYData, XYData]: A tuple containing (X_train, X_test, y_train, y_test),
        each wrapped in an XYData instance.

    Example:
        ```python
        data = XYData.mock(np.random.rand(100, 5))
        labels = XYData.mock(np.random.randint(0, 2, 100))
        X_train, X_test, y_train, y_test = data.train_test_split(data.value, labels.value, test_size=0.2)
        ```
    """
    X_train, X_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=random_state
    )

    return (
        XYData.mock(X_train, hash=f"{self._hash} X train", path="/dataset"),
        XYData.mock(X_test, hash=f"{self._hash} X test", path="/dataset"),
        XYData.mock(y_train, hash=f"{self._hash} y train", path="/dataset"),
        XYData.mock(y_test, hash=f"{self._hash} y test", path="/dataset"),
    )

_(x)

Ensure that a list has at least two dimensions by converting it to a numpy array.

Parameters:

Name Type Description Default
x list

Input list.

required

Returns:

Name Type Description
SkVData SkVData

A numpy array with at least two dimensions.

Source code in framework3/base/base_types.py
@ensure_dim.register  # type: ignore
def _(x: list) -> SkVData:
    """
    Ensure that a list has at least two dimensions by converting it to a numpy array.

    Args:
        x (list): Input list.

    Returns:
        SkVData: A numpy array with at least two dimensions.
    """
    return ensure_dim(np.array(x))

concat(x, axis)

Base multimethod for concatenation. Raises an error for unsupported types.

Parameters:

Name Type Description Default
x Any

Data to concatenate.

required
axis int

Axis along which to concatenate.

required

Raises:

Type Description
TypeError

Always raised as this is the base method for unsupported types.

Source code in framework3/base/base_types.py
@multimethod
def concat(x: Any, axis: int) -> "XYData":
    """
    Base multimethod for concatenation. Raises an error for unsupported types.

    Args:
        x (Any): Data to concatenate.
        axis (int): Axis along which to concatenate.

    Raises:
        TypeError: Always raised as this is the base method for unsupported types.
    """
    raise TypeError(f"Cannot concatenate this type of data, only {VData} compatible")

ensure_dim(x)

Base multimethod for ensuring dimensions. Raises an error for unsupported types.

Parameters:

Name Type Description Default
x Any

Data to ensure dimensions for.

required

Raises:

Type Description
TypeError

Always raised as this is the base method for unsupported types.

Source code in framework3/base/base_types.py
@multimethod
def ensure_dim(x: Any) -> SkVData | VData:
    """
    Base multimethod for ensuring dimensions. Raises an error for unsupported types.

    Args:
        x (Any): Data to ensure dimensions for.

    Raises:
        TypeError: Always raised as this is the base method for unsupported types.
    """
    raise TypeError(
        f"Cannot ensure dimensions for this type of data, only {VData} or {SkVData} compatible"
    )