
Validate & register flow cytometry data#

Flow cytometry is a technique used to analyze and sort cells or particles based on their physical and chemical characteristics as they flow in a fluid stream through a laser beam.

Here, we’ll transform, validate and register two flow cytometry datasets (Alpert19 and FlowIO sample) to demonstrate how to create and query a custom flow cytometry registry.

!lamin init --storage ./test-flow --schema bionty
💡 creating schemas: core==0.47.3 bionty==0.30.3
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-04 09:35:10)
✅ saved: Storage(id='Gl194NAt', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-flow', type='local', updated_at=2023-09-04 09:35:10, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-flow
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb
import readfcs

lb.settings.species = "human"
✅ loaded instance: testuser1/test-flow (lamindb 0.52.1)
ln.track()
💡 notebook imports: lamindb==0.52.1 lnschema_bionty==0.30.3 readfcs==1.1.6
✅ saved: Transform(id='OWuTtS4SAponz8', name='Validate & register flow cytometry data', short_name='facs', version='0', type=notebook, updated_at=2023-09-04 09:35:12, created_by_id='DzTjkKse')
✅ saved: Run(id='FmS4pujHMyXa11urgfmA', run_at=2023-09-04 09:35:12, transform_id='OWuTtS4SAponz8', created_by_id='DzTjkKse')

Alpert19#

Transform #

(Here we skip the data transformation steps, which often include filtering, normalizing, or formatting data.)

We start with a flow cytometry file from Alpert19:

ln.dev.datasets.file_fcs_alpert19(
    populate_registries=True,  # pre-populate registries to simulate a used instance
)


PosixPath('Alpert19.fcs')

Use readfcs to read the fcs file into memory:

adata = readfcs.read("Alpert19.fcs")
adata
AnnData object with n_obs × n_vars = 166537 × 40
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnR'
    uns: 'meta'

Validate #

First, let’s validate the features in .var.

We’ll use the CellMarker reference to link features:

lb.CellMarker.validate(adata.var.index, "name");
✅ 27 terms (67.50%) are validated for name
❗ 13 terms (32.50%) are not validated for name: Time, Cell_length, Dead, (Ba138)Dd, Bead, CD19, CD4, IgD, CD11b, CD14, CCR6, CCR7, PD-1

We see that many features aren’t validated. Let’s standardize the identifiers first to get rid of synonyms:

adata.var.index = lb.CellMarker.standardize(adata.var.index)
💡 standardized 35/40 terms

Great, now we can validate our markers once more:

validated = lb.CellMarker.validate(adata.var.index, "name")
✅ 35 terms (87.50%) are validated for name
❗ 5 terms (12.50%) are not validated for name: Time, Cell_length, Dead, (Ba138)Dd, Bead

Things look much better, but 5 entries still fail validation against CellMarker, and they look more like metadata than cell markers. Hence, let’s curate the AnnData object a bit more.

Let’s move the metadata (non-validated cell markers) into adata.obs:

adata.obs = adata[:, ~validated].to_df()
adata = adata[:, validated].copy()
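
The three curation steps above (standardize synonyms, validate against a reference, split off the non-validated columns) can be sketched in plain Python. This is only an illustration of the pattern, not how lamindb implements it: `REFERENCE`, `SYNONYMS`, and the three helper functions are hypothetical, and the tiny reference is an illustrative subset of the names seen in this notebook.

```python
# Hypothetical reference data (illustrative subset, not the real registry).
REFERENCE = {"CD3", "PD1", "Cd19", "Cd4"}           # canonical marker names
SYNONYMS = {"PD-1": "PD1", "CD19": "Cd19"}          # synonym -> canonical name


def standardize(names):
    """Replace known synonyms by their canonical name; leave other names untouched."""
    return [SYNONYMS.get(name, name) for name in names]


def validate(names):
    """Return one boolean per name: True if the name is in the reference."""
    return [name in REFERENCE for name in names]


def split_columns(names, mask):
    """Split names into (validated, non_validated) according to a boolean mask."""
    validated = [n for n, ok in zip(names, mask) if ok]
    non_validated = [n for n, ok in zip(names, mask) if not ok]
    return validated, non_validated


panel = ["CD3", "PD-1", "Time", "CD19", "Bead"]
standardized = standardize(panel)
mask = validate(standardized)
markers, metadata = split_columns(standardized, mask)

print(markers)   # ['CD3', 'PD1', 'Cd19']
print(metadata)  # ['Time', 'Bead']
print(f"{sum(mask)}/{len(mask)} validated ({100 * sum(mask) / len(mask):.2f}%)")
```

The boolean mask plays the same role as `validated` in the notebook: it keeps the marker columns in the panel and routes everything else to the metadata side.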

Now we have a clean panel of 35 cell markers:

lb.CellMarker.validate(adata.var.index, "name");
✅ 35 terms (100.00%) are validated for name

Next, let’s register the metadata features we moved to .obs:

# Feature.from_df creates feature records with type auto-populated
features = ln.Feature.from_df(adata.obs)
ln.add(features)

In addition, we’d like to link this file with external features:

ln.Feature.validate("assay", "name")
lb.ExperimentalFactor.validate("FACS", "name");
✅ 1 term (100.00%) is validated for name
❗ 1 term (100.00%) is not validated for name: FACS

Since the term “FACS” did not validate, let’s search the ontology for it and register the matching record:

lb.ExperimentalFactor.bionty().search("FACS").head(2)
name                                 ontology_id  definition                                         synonyms                                           parents        molecule  instrument  measurement  __ratio__
fluorescence-activated cell sorting  EFO:0009108  A Flow Cytometry Assay That Provides A Method ...  FACS|FAC sorting                                   []             None      None        None         100.000000
acute chest syndrome                 EFO:0007129  A Vaso-Occlusive Crisis Of The Pulmonary Vascu...  ACS|Acute Chest Syndrome|acute chest syndrome|...  [EFO:0003818]  None      None        None         85.714286
facs = lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0009108")
facs.save()
✅ created 1 ExperimentalFactor record from Bionty matching ontology_id: 'EFO:0009108'
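
The __ratio__ column in the search result above ranks candidates by string similarity to the query. The two values shown are consistent with a similarity ratio of 2·M/T (M matched characters, T total characters of both strings) taken over a record's synonyms. Which matcher the library actually uses is an assumption here, so the following is only a sketch with Python's `difflib`. The helper names are hypothetical, and only the synonyms visible in the truncated output are used.

```python
from difflib import SequenceMatcher


def similarity(query, candidate):
    """Case-insensitive similarity in percent: 100 * 2 * M / T."""
    return 100 * SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def best_score(query, synonyms):
    """Score a pipe-separated synonym string by its best-matching entry."""
    return max(similarity(query, s) for s in synonyms.split("|"))


facs_score = best_score("FACS", "FACS|FAC sorting")
acs_score = best_score("FACS", "ACS|Acute Chest Syndrome|acute chest syndrome")

print(round(facs_score, 6))  # 100.0
print(round(acs_score, 6))   # 85.714286
```

"FACS" against "ACS" shares 3 characters out of 4 + 3 = 7, giving 2·3/7 ≈ 85.71. This is why an unrelated term can rank second, and why it is worth checking the ontology ID before registering a search hit.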

Register #

file = ln.File.from_anndata(adata, description="Alpert19", field=lb.CellMarker.name)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/o1VABJjJIPLBkYyOTWF3.h5ad')
💡 parsing feature names of X stored in slot 'var'
✅    35 terms (100.00%) are validated for name
✅    linked: FeatureSet(id='KoMiw60AiHAxeGi1qC0J', n=35, type='number', registry='bionty.CellMarker', hash='ldY9_GmptHLCcT7Nrpgo', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅    5 terms (100.00%) are validated for name
✅    linked: FeatureSet(id='q4FBAcjdMw0shx5fzkF8', n=5, registry='core.Feature', hash='oKSnskWJciJRncGiuqVP', modality_id='x1f8LxSS', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing file 'o1VABJjJIPLBkYyOTWF3' at '.lamindb/o1VABJjJIPLBkYyOTWF3.h5ad'
features = ln.Feature.lookup()
file.add_labels(facs, features.assay)
file.add_labels(lb.settings.species, features.species)
✅ linked new feature 'assay' together with new feature set FeatureSet(id='Akam6qgtJEwASMpbhkO0', n=1, registry='core.Feature', hash='YY8j0t9Tegc9KlmlHbdl', updated_at=2023-09-04 09:35:18, modality_id='x1f8LxSS', created_by_id='DzTjkKse')
💡 no file links to it anymore, deleting feature set FeatureSet(id='Akam6qgtJEwASMpbhkO0', n=1, registry='core.Feature', hash='YY8j0t9Tegc9KlmlHbdl', updated_at=2023-09-04 09:35:18, modality_id='x1f8LxSS', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='GCPDLHFsRYLLLxvFfMwM', n=2, registry='core.Feature', hash='ejD62nZpJW0aDQHFgpNn', updated_at=2023-09-04 09:35:18, modality_id='x1f8LxSS', created_by_id='DzTjkKse')
file.features
Features:
  var: FeatureSet(id='KoMiw60AiHAxeGi1qC0J', n=35, type='number', registry='bionty.CellMarker', hash='ldY9_GmptHLCcT7Nrpgo', updated_at=2023-09-04 09:35:18, created_by_id='DzTjkKse')
    CD20 (number)
    TCRgd (number)
    DNA1 (number)
    CD28 (number)
    CD127 (number)
    PD1 (number)
    CD11B (number)
    ICOS (number)
    CD27 (number)
    Ccr7 (number)
    ... 
  obs: FeatureSet(id='q4FBAcjdMw0shx5fzkF8', n=5, registry='core.Feature', hash='oKSnskWJciJRncGiuqVP', updated_at=2023-09-04 09:35:18, modality_id='x1f8LxSS', created_by_id='DzTjkKse')
    Dead (number)
    Cell_length (number)
    (Ba138)Dd (number)
    Bead (number)
    Time (number)
  external: FeatureSet(id='GCPDLHFsRYLLLxvFfMwM', n=2, registry='core.Feature', hash='ejD62nZpJW0aDQHFgpNn', updated_at=2023-09-04 09:35:18, modality_id='x1f8LxSS', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (1, bionty.ExperimentalFactor): 'fluorescence-activated cell sorting'

Check a few validated cell markers in .var:

file.features["var"].df().head(10)
id            name    synonyms              gene_symbol  ncbi_gene_id  uniprotkb_id  species_id  bionty_source_id  updated_at           created_by_id
cFJEI6e6wml3  CD20                          MS4A1        931           A0A024R507    uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
ljp5UfCF9HCi  TCRgd   TCRGAMMADELTA|TCRγδ   None         None          None          uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
YA5Ezh6SAy10  DNA1                          None         None          None          uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
CLFUvJpioHoA  CD28                          CD28         940           B4E0L1        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
hVNEgxlcDV10  CD127                         IL7R         3575          P16871        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
2VeZenLi2dj5  PD1     PID1|PD-1|PD 1        PDCD1        5133          A0A0M3M0G7    uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
N2F6Qv9CxJch  CD11B                         ITGAM        3684          P11215        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
0vAls2cmLKWq  ICOS                          ICOS         29851         Q53QY6        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
uThe3c0V3d4i  CD27                          CD27         939           P26842        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
sYcK7uoWCtco  Ccr7                          CCR7         1236          P32248        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse

FlowIO sample#

Let’s transform, validate and register another flow file.

Transform #

There are no further transformations necessary.

adata2 = readfcs.read(ln.dev.datasets.file_fcs())

Validate #

We’d like to track all features in .var, so we register them:

adata2.var.index = lb.CellMarker.bionty().standardize(adata2.var.index)
💡 standardized 14/16 terms
markers = lb.CellMarker.from_values(adata2.var.index, "name")
ln.save(markers)
✅ loaded 10 CellMarker records matching name: 'CD3', 'CD28', 'CD8', 'Cd4', 'CD57', 'Cd14', 'Cd19', 'CD27', 'Ccr7', 'CD127'
✅ created 4 CellMarker records from Bionty matching name: 'CCR5', 'CD45RO', 'Ki67', 'SSC-A'
❗ did not create CellMarker records for 2 non-validated names: 'FSC-A', 'FSC-H'

Standardize synonyms against the registry and validate again. The two forward-scatter channels, FSC-A and FSC-H, are not cell markers and remain non-validated:

adata2.var.index = lb.CellMarker.standardize(adata2.var.index)
💡 standardized 14/16 terms
lb.CellMarker.validate(adata2.var.index, "name");
✅ 14 terms (87.50%) are validated for name
❗ 2 terms (12.50%) are not validated for name: FSC-A, FSC-H

Register #

file2 = ln.File.from_anndata(
    adata2, description="My fcs file", field=lb.CellMarker.name
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/P8Yh74qO8INeQVa1g83P.h5ad')
💡 parsing feature names of X stored in slot 'var'
✅    14 terms (87.50%) are validated for name
❗    2 terms (12.50%) are not validated for name: FSC-A, FSC-H
✅    linked: FeatureSet(id='8QaSMOFPSi5nPd9uFyvl', n=14, type='number', registry='bionty.CellMarker', hash='npy5P7AYbjKLInpXlNvb', created_by_id='DzTjkKse')
file2.save()
✅ saved 1 feature set for slot: 'var'
✅ storing file 'P8Yh74qO8INeQVa1g83P' at '.lamindb/P8Yh74qO8INeQVa1g83P.h5ad'
file2.add_labels(facs, features.assay)
file2.add_labels(lb.settings.species, features.species)
✅ linked new feature 'assay' together with new feature set FeatureSet(id='EWhKX6hhz1Qw4Fl6xN68', n=1, registry='core.Feature', hash='YY8j0t9Tegc9KlmlHbdl', updated_at=2023-09-04 09:35:21, modality_id='x1f8LxSS', created_by_id='DzTjkKse')
✅ loaded: FeatureSet(id='GCPDLHFsRYLLLxvFfMwM', n=2, registry='core.Feature', hash='ejD62nZpJW0aDQHFgpNn', updated_at=2023-09-04 09:35:18, modality_id='x1f8LxSS', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='GCPDLHFsRYLLLxvFfMwM', n=2, registry='core.Feature', hash='ejD62nZpJW0aDQHFgpNn', updated_at=2023-09-04 09:35:21, modality_id='x1f8LxSS', created_by_id='DzTjkKse')
file2.features
Features:
  var: FeatureSet(id='8QaSMOFPSi5nPd9uFyvl', n=14, type='number', registry='bionty.CellMarker', hash='npy5P7AYbjKLInpXlNvb', updated_at=2023-09-04 09:35:21, created_by_id='DzTjkKse')
    Cd14 (number)
    Ccr7 (number)
    Cd4 (number)
    CD3 (number)
    SSC-A (number)
    Cd19 (number)
    CD8 (number)
    Ki67 (number)
    CD57 (number)
    CD28 (number)
    ... 
  external: FeatureSet(id='GCPDLHFsRYLLLxvFfMwM', n=2, registry='core.Feature', hash='ejD62nZpJW0aDQHFgpNn', updated_at=2023-09-04 09:35:21, modality_id='x1f8LxSS', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (1, bionty.ExperimentalFactor): 'fluorescence-activated cell sorting'
file2.view_flow()
https://d33wubrfki0l68.cloudfront.net/92d0caef620dc5fcae1941481b4795e56215b553/7d23f/_images/80d6a1b91c305aa99fab24170da87c17ce5618d7c68f19e9f3607ec2f1cdd815.svg

Query by cell markers #

Which datasets have CD14 in the flow panel?

cell_markers = lb.CellMarker.lookup()
cell_markers.cd14
CellMarker(id='roEbL8zuLC5k', name='Cd14', synonyms='', gene_symbol='CD14', ncbi_gene_id='4695', uniprotkb_id='O43678', updated_at=2023-09-04 09:35:15, species_id='uHJU', bionty_source_id='IfAX', created_by_id='DzTjkKse')
panels_with_cd14 = ln.FeatureSet.filter(cell_markers=cell_markers.cd14).all()
ln.File.filter(feature_sets__in=panels_with_cd14).df()
id                    storage_id  key   suffix  accessor  description  version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
P8Yh74qO8INeQVa1g83P  Gl194NAt    None  .h5ad   AnnData   My fcs file  None     6876232   Cf4Fhfw_RDMtKd5amM6Gtw  md5        OWuTtS4SAponz8  FmS4pujHMyXa11urgfmA  None                2023-09-04 09:35:21  DzTjkKse
o1VABJjJIPLBkYyOTWF3  Gl194NAt    None  .h5ad   AnnData   Alpert19     None     33367624  14w5ElNsR_MqdiJtvnS1aw  md5        OWuTtS4SAponz8  FmS4pujHMyXa11urgfmA  None                2023-09-04 09:35:18  DzTjkKse
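
The query above runs in two steps: first find the feature sets (panels) that contain the marker, then find the files linked to any of those feature sets. A minimal sketch of that lookup with plain dictionaries follows. The registries, IDs, and helper names are hypothetical stand-ins and are not the lamindb data model.

```python
# Hypothetical in-memory stand-ins for the FeatureSet and File registries.
feature_sets = {
    "panel_a": {"Cd14", "CD3", "CD8"},
    "panel_b": {"Cd14", "CD20"},
    "panel_c": {"CD56"},
}
files = {
    "file_1": ["panel_a"],
    "file_2": ["panel_b"],
    "file_3": ["panel_c"],
}


def feature_sets_with(marker):
    """Step 1: ids of all feature sets that contain the marker."""
    return {fs_id for fs_id, markers in feature_sets.items() if marker in markers}


def files_linked_to(fs_ids):
    """Step 2: ids of all files linked to at least one of the feature sets."""
    return sorted(f for f, linked in files.items() if fs_ids.intersection(linked))


panel_ids = feature_sets_with("Cd14")
matching_files = files_linked_to(panel_ids)
print(matching_files)  # ['file_1', 'file_2']
```

Keeping the marker-to-panel link separate from the panel-to-file link means a panel shared by many files is stored once and matched once.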

Shared cell markers between two files:

files = ln.File.filter(feature_sets__in=panels_with_cd14, species__name="human").list()
file1, file2 = files[0], files[1]
file1_markers = file1.features["var"]
file2_markers = file2.features["var"]

shared_markers = file1_markers & file2_markers
shared_markers.list("name")
['Cd14', 'Ccr7', 'Cd4', 'CD3', 'Cd19', 'CD8', 'CD57', 'CD28', 'CD127', 'CD27']
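
The `&` between the two feature sets above behaves like a set intersection. The same idea with plain Python sets is sketched below. The two panels are made-up examples, not the panels registered in this notebook.

```python
# Two made-up marker panels (illustrative only).
panel_1 = ["Cd14", "Ccr7", "Cd4", "CD3", "SSC-A", "Ki67"]
panel_2 = ["CD20", "Cd14", "Cd4", "CD3", "Ccr7", "CD56"]

# Set intersection: markers present in both panels (unordered).
shared = set(panel_1) & set(panel_2)

# Keep the order of the first panel for a reproducible listing.
shared_ordered = [marker for marker in panel_1 if marker in shared]
print(shared_ordered)  # ['Cd14', 'Ccr7', 'Cd4', 'CD3']
```

Because a set has no defined order, comparisons of shared markers are safer when done set against set rather than list against list.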

Flow marker registry#

Check out your CellMarker registry:

lb.CellMarker.filter().df()
id            name    synonyms              gene_symbol  ncbi_gene_id  uniprotkb_id  species_id  bionty_source_id  updated_at           created_by_id
cFJEI6e6wml3  CD20                          MS4A1        931           A0A024R507    uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
ljp5UfCF9HCi  TCRgd   TCRGAMMADELTA|TCRγδ   None         None          None          uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
YA5Ezh6SAy10  DNA1                          None         None          None          uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
CLFUvJpioHoA  CD28                          CD28         940           B4E0L1        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
hVNEgxlcDV10  CD127                         IL7R         3575          P16871        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
2VeZenLi2dj5  PD1     PID1|PD-1|PD 1        PDCD1        5133          A0A0M3M0G7    uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
N2F6Qv9CxJch  CD11B                         ITGAM        3684          P11215        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
0vAls2cmLKWq  ICOS                          ICOS         29851         Q53QY6        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
uThe3c0V3d4i  CD27                          CD27         939           P26842        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
sYcK7uoWCtco  Ccr7                          CCR7         1236          P32248        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
HEK41hvaIazP  Cd4                           CD4          920           B4DT49        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
CR7DAHxybgyi  CD38                          CD38         952           B4E006        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
8OhpfB7wwV32  Cd19                          CD19         930           P15391        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
h4rkCALR5WfU  CD56                          NCAM1        4684          P13591        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
0qCmUijBeByY  CD94                          KLRD1        3824          Q13241        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
k0zGbSgZEX3q  HLADR   HLA‐DR|HLA-DR|HLA DR  None         None          None          uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
gEfe8qTsIHl0  CD24                          CD24         100133941     B6EC88        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
yCyTIVxZkIUz  DNA2                          DNA2         1763          P51530        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
roEbL8zuLC5k  Cd14                          CD14         4695          O43678        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
a624IeIqbchl  CD45RA                        None         None          None          uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
a4hvNp34IYP0  CD3                           None         None          None          uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
L0m6f7FPiDeg  CD86                          CD86         942           A8K632        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
ttBc0Fs01sYk  CD8                           CD8A         925           P01732        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
fpPkjlGv15C9  Ccr6                          CCR6         1235          P51684        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
0evamYEdmaoY  Igd                           None         None          None          uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
c3dZKHFOdllB  CD33                          CD33         945           P20138        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
4EojtgN0CjBH  CD161                         KLRB1        3820          Q12918        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
lRZYuH929QDw  CD85j                         None         None          None          uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
agQD0dEzuoNA  CXCR3                         CXCR3        2833          P49682        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
n40112OuX7Cq  CD123                         IL3RA        3563          P26951        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
4uiPHmCPV5i1  CXCR5                         CXCR5        643           A0N0R2        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
L0WKZ3fufq0J  CD11c                         ITGAX        3687          P20702        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
Nb2sscq9cBcB  CD57                          B3GAT1       27087         Q9P2W7        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
bspnQ0igku6c  CD16                          FCGR3A       2215          O75015        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
50v4SaR2m5zQ  CD25                          IL2RA        3559          P01589        uHJU        IfAX              2023-09-04 09:35:15  DzTjkKse
XvpJ6oL3SG7w  CD45RO                        None         None          None          uHJU        IfAX              2023-09-04 09:35:20  DzTjkKse
VZBURNy04vBi  SSC-A   SSC A|SSCA            None         None          None          uHJU        IfAX              2023-09-04 09:35:20  DzTjkKse
Qa4ozz9tyesQ  Ki67    Ki-67|KI 67           None         None          None          uHJU        IfAX              2023-09-04 09:35:20  DzTjkKse
UMsp5g0fgMwY  CCR5                          CCR5         1234          P51681        uHJU        IfAX              2023-09-04 09:35:20  DzTjkKse
# a few tests
assert set(shared_markers.list("name")) == set(
    [
        "Ccr7",
        "CD3",
        "Cd14",
        "Cd19",
        "CD127",
        "CD27",
        "CD28",
        "CD8",
        "Cd4",
        "CD57",
    ]
)
ln.File.filter(feature_sets__in=panels_with_cd14).exists()
True
# clean up test instance
!lamin delete --force test-flow
!rm -r test-flow
💡 deleting instance testuser1/test-flow
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-flow.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-flow