lamindb.Collection¶

Bases: SQLRecord, IsVersioned, TracksRun, TracksUpdates

Versioned collections of artifacts.

Parameters:

artifacts – Artifact | list[Artifact] One or several artifacts.
key – str A file-path like key, analogous to the key parameter of Artifact and Transform.
description – str | None = None A description.
meta – Artifact | None = None An artifact that defines metadata for the collection.
reference – str | None = None A simple reference, e.g. an external ID or a URL.
reference_type – str | None = None A way to indicate to indicate the type of the simple reference "url".
run – Run | None = None The run that creates the collection.
revises – Collection | None = None An old version of the collection.
skip_hash_lookup – bool = False Skip the hash lookup so that a new collection is created even if a collection with the same hash already exists.

See also

Artifact

Examples

Create a collection from a list of Artifact objects:

collection = ln.Collection([artifact1, artifact2], key="my_project/my_collection")

Create a collection that groups a data & a metadata artifact (e.g., here RxRx: cell imaging):

collection = ln.Collection(data_artifact, key="my_project/my_collection", meta=metadata_artifact)

Attributes¶

property data_artifact: Artifact | None¶

Access to a single data artifact.

If the collection has a single data & metadata artifact, this allows access via:

collection.data_artifact  # first & only element of collection.artifacts
collection.meta_artifact  # metadata

property name: str¶

Name of the collection.

Splits key on / and returns the last element.

property ordered_artifacts: QuerySet¶

Ordered QuerySet of .artifacts.

Accessing the many-to-many field collection.artifacts directly gives you non-deterministic order.

Using the property .ordered_artifacts allows to iterate through a set that’s ordered by the order of the list that created the collection.

property stem_uid: str¶

Universal id characterizing the version family.

The full uid of a record is obtained via concatenating the stem uid and version information:

stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"  # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid

property transform: Transform | None¶: Transform whose run created the collection.

property versions: QuerySet¶

Lists all records of the same version family.

>>> new_artifact = ln.Artifact(df2, revises=artifact).save()
>>> new_artifact.versions()

Simple fields¶

uid: str¶: Universal id, valid across DB instances.

key: str¶: Name or path-like key.

description: str | None¶: A description or title.

hash: str | None¶: Hash of collection content.

reference: str | None¶: A reference like URL or external ID.

reference_type: str | None¶: Type of reference, e.g., cellxgene Census collection_id.

meta_artifact: Artifact | None¶

An artifact that stores metadata that indexes a collection.

It has a 1:1 correspondence with an artifact. If needed, you can access the collection from the artifact via a private field: artifact._meta_of_collection.

version: str | None¶

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

is_latest: bool¶: Boolean flag that indicates whether a record is the latest in its version family.

is_locked: bool¶: Whether the record is locked for edits.

created_at: datetime¶: Time of creation of record.

updated_at: datetime¶: Time of last update to record.

Relational fields¶

branch: Branch¶

Life cycle state of record.

branch.name can be “main” (default branch), “trash” (trash), branch.name = "archive" (archived), or any other user-created branch typically planned for merging onto main after review.

space: Space¶: The space in which the record lives.

created_by: User¶: Creator of record.

run: Run | None¶: Run that created the collection.

ulabels: ULabel¶: ULabels annotating the collection (see Feature).

input_of_runs: Run¶: Runs that use this collection as an input.

artifacts: Artifact¶: Artifacts in collection.

linked_in_records: Record¶: This collection is linked in these records as a value.

records: Record¶: Linked records.

references: Reference¶: Linked references.

projects: Project¶: Linked projects.

blocks: CollectionBlock¶: Blocks that annotate this collection.

Class methods¶

classmethod get(idlike=None, *, is_run_input=False, **expressions)¶

Get a single collection.

Parameters:

idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
is_run_input (bool | Run, default: False) – Whether to track this collection as run input.
expressions – Fields and values passed as Django query expressions.

Raises:

lamindb.errors.DoesNotExist – In case no matching record is found.

Return type:

Artifact

See also

Method in SQLRecord base class: get()

Examples

collection = ln.Collection.get("okxPW6GIKBfRBE3B0000")
collection = ln.Collection.get(key="scrna/collection1")

classmethod filter(*queries, **expressions)¶

Query records.

Parameters:

queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

See also

Guide: Query & search registries
Django documentation: Queries

Examples

>>> ln.Project(name="my label").save()
>>> ln.Project.filter(name__startswith="my").to_dataframe()

classmethod to_dataframe(include=None, features=False, limit=100)¶

Evaluate and convert to pd.DataFrame.

By default, maps simple fields and foreign keys onto DataFrame columns.

Guide: Query & search registries

Parameters:

include (str | list[str] | None, default: None) – Related data to include as columns. Takes strings of form "records__name", "cell_types__name", etc. or a list of such strings. For Artifact, Record, and Run, can also pass "features" to include features with data types pointing to entities in the core schema. If "privates", includes private fields (fields starting with _).
features (bool | list[str], default: False) – Configure the features to include. Can be a feature name or a list of such names. If "queryset", infers the features used within the current queryset. Only available for Artifact, Record, and Run.
limit (int, default: 100) – Maximum number of rows to display. If None, includes all results.
order_by – Field name to order the records by. Prefix with ‘-’ for descending order. Defaults to ‘-id’ to get the most recent records. This argument is ignored if the queryset is already ordered or if the specified field does not exist.

Return type:

DataFrame

Examples

Include the name of the creator:

ln.Record.to_dataframe(include="created_by__name"])

Include features:

ln.Artifact.to_dataframe(include="features")

Include selected features:

ln.Artifact.to_dataframe(features=["cell_type_by_expert", "cell_type_by_model"])

classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶

Search.

Parameters:

string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A sorted DataFrame of search results with a score in column score. If return_queryset is True. QuerySet.

See also

filter() lookup()

Examples

records = ln.Record.from_values(["Label1", "Label2", "Label3"], field="name").save()
ln.Record.search("Label2")

classmethod lookup(field=None, return_field=None)¶

Return an auto-complete object for a field.

Parameters:

field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
keep – When multiple records are found for a lookup, how to return the records. - "first": return the first record. - "last": return the last record. - False: return all records.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

Lookup via auto-complete on .:

import bionty as bt
bt.Gene.from_source(symbol="ADGB-DT").save()
lookup = bt.Gene.lookup()
lookup.adgb_dt

Look up via auto-complete in dictionary:

lookup_dict = lookup.dict()
lookup_dict['ADGB-DT']

Look up via a specific field:

lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
genes.ensg00000002745

Return a specific field value instead of the full record:

lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")

classmethod connect(instance)¶

Query a non-default LaminDB instance.

Parameters:: instance (str | None) – An instance identifier of form “account_handle/instance_name”.
Return type:: QuerySet

Examples

ln.Record.connect("account_handle/instance_name").search("label7", field="name")

Methods¶

append(artifact, run=None)¶

Append an artifact to the collection.

This does not modify the original collection in-place, but returns a new version of the original collection with the appended artifact.

Parameters:

artifact (Artifact) – An artifact to add to the collection.
run (Run | None, default: None) – The run that creates the new version of the collection.

Return type:

Collection

Examples

collection_v1 = ln.Collection(artifact, key="My collection").save()
collection_v2 = collection.append(another_artifact)  # returns a new version of the collection
collection_v2.save()  # save the new version

open(engine='pyarrow', is_run_input=None, **kwargs)¶

Open a dataset for streaming.

Works for pyarrow and polars compatible formats (.parquet, .csv, .ipc etc. files or directories with such files).

Parameters:

engine (Literal['pyarrow', 'polars'], default: 'pyarrow') – Which module to use for lazy loading of a dataframe from pyarrow or polars compatible formats.
is_run_input (bool | None, default: None) – Whether to track this artifact as run input.
**kwargs – Keyword arguments for pyarrow.dataset.dataset or polars.scan_* functions.

Return type:

Dataset | Iterator[LazyFrame]

Notes

For more info, see guide: Slice & stream arrays.

mapped(layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)¶

Return a map-style dataset.

Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.

By default (stream=False) AnnData arrays are moved into a local cache first.

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.

Note

For a guide, see Train a machine learning model on a collection.

This method currently only works for collections or query sets of AnnData artifacts.

Parameters:

layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.
obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.
obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.
obs_filter (dict[str, str | list[str]] | None, default: None) – Select only observations with these values for the given obs columns. Should be a dictionary with obs column names as keys and filtering values (a string or a list of strings) as values.
join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.
encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.
unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.
cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.
parallel (bool, default: False) – Enable sampling with multiple processes.
dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm
stream (bool, default: False) – Whether to stream data from the array backend.
is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

MappedCollection

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> ds = ln.Collection.get(description="my collection")
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
>>> # also works for query sets of artifacts, '...' represents some filtering condition
>>> # additional filtering on artifacts of the collection
>>> mapped = collection.artifacts.all().filter(...).order_by("-created_at").mapped()
>>> # or directly from a query set of artifacts
>>> mapped = ln.Artifact.filter(..., otype="AnnData").order_by("-created_at").mapped()

cache(is_run_input=None)¶

Download cloud artifacts in collection to local cache.

Follows syncing logic: only downloads outdated artifacts.

Returns ordered paths to locally cached on-disk artifacts via .ordered_artifacts.all():

Parameters:: is_run_input (bool | None, default: None) – Whether to track this collection as run input.
Return type:: list[UPath]

load(join='outer', is_run_input=None, **kwargs)¶

Cache and load to memory.

Returns an in-memory concatenated DataFrame or AnnData object.

Return type:: DataFrame | AnnData

save(using=None)¶

Save the collection and underlying artifacts to database & storage.

Parameters:: using (str | None, default: None) – The database to which you want to save.
Return type:: Collection

Examples

>>> collection = ln.Collection("./myfile.csv", name="myfile")

restore()¶

Restore collection record from trash.

Return type:: None

Examples

For any Collection object collection, call:

>>> collection.restore()

describe(return_str=False)¶

Describe record including relations.

Parameters:: return_str (bool, default: False) – Return a string instead of printing.
Return type:: None | str

view_lineage(with_children=True, return_graph=False)¶

View data lineage graph.

Return type:: Digraph | None

delete(permanent=None, **kwargs)¶

Delete record.

If record is HasType with is_type = True, deletes all descendant records, too.

Parameters:: permanent (bool | None, default: None) – Whether to permanently delete the record (skips trash). If None, performs soft delete if the record is not already in the trash.
Return type:: None

Examples

For any SQLRecord object record, call:

>>> record.delete()

refresh_from_db(using=None, fields=None, from_queryset=None)¶

Reload field values from the database.

By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.

Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.

When accessing deferred fields of an instance, the deferred loading of the field will call this method.

async arefresh_from_db(using=None, fields=None, from_queryset=None)¶