Integrate scRNA-seq datasets

scRNA-seq data integration combines data from several single-cell RNA sequencing experiments so that they can be analyzed jointly to uncover shared or distinct biological patterns.

Here, we'll demonstrate how to query two scRNA-seq datasets by registered metadata, such as cell types, and then integrate them.

Setup

!lamin load test-scrna
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅ loaded instance: testuser1/test-scrna

import lamindb as ln
import lnschema_bionty as lb
import anndata as ad
✅ loaded instance: testuser1/test-scrna (lamindb 0.52.1)
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.52.1 lnschema_bionty==0.30.3
✅ saved: Transform(id='agayZTonayqAz8', name='Integrate scRNA-seq datasets', short_name='scrna2', version='0', type=notebook, updated_at=2023-09-04 09:34:31, created_by_id='DzTjkKse')
✅ saved: Run(id='Xg3ti7BLaK33Szq3Z5CE', run_at=2023-09-04 09:34:31, transform_id='agayZTonayqAz8', created_by_id='DzTjkKse')

Access

Query files by provenance metadata

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).search("register scrna")
name                                    id              __ratio__
Validate & register scRNA-seq datasets  Nv48yAceNSh8z8  53.846154
Integrate scRNA-seq datasets            agayZTonayqAz8  47.619048
transform = ln.Transform.filter(id="Nv48yAceNSh8z8").one()
ln.File.filter(transform=transform).df()
id                    storage_id  key   suffix  accessor  description            version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
Daaf9uCsQE8YxbVaUdt6  Szavfu1U    None  .h5ad   AnnData   Conde22                None     28049505  WEFcMZxJNmMiUOFrcSTaig  md5        Nv48yAceNSh8z8  e7dMSW0bVIQUINDGufH3  None                2023-09-04 09:34:00  DzTjkKse
S4DBVuVCt1xuRFb06uIq  Szavfu1U    None  .h5ad   AnnData   10x reference pbmc68k  None     663138    ezj0ByeaJEju69Ka3vkwFw  md5        Nv48yAceNSh8z8  e7dMSW0bVIQUINDGufH3  None                2023-09-04 09:34:24  DzTjkKse
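
The `__ratio__` column returned by `search` above ranks transforms by how closely their names match the query string. As a rough illustration of how such a score can arise, the sketch below ranks the same two names with the standard-library `difflib.SequenceMatcher`. This is an assumption made for illustration: the scorer lamindb uses internally is not shown in this notebook and may differ, and `rank_by_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def rank_by_similarity(query, names):
    """Return (name, score) pairs sorted by descending similarity (0 to 100)."""
    scored = [
        # Compare case-insensitively; ratio() is 2 * matches / total length.
        (name, 100 * SequenceMatcher(None, query.lower(), name.lower()).ratio())
        for name in names
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


names = [
    "Integrate scRNA-seq datasets",
    "Validate & register scRNA-seq datasets",
]
ranked = rank_by_similarity("register scrna", names)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

With this scorer, the name that contains the full query string ranks first, which agrees with the ordering in the search result above.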

Query files based on biological metadata

assays = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
cell_types = lb.CellType.lookup()
query = ln.File.filter(
    experimental_factors=assays.single_cell_rna_sequencing,
    species=species.human,
    cell_types=cell_types.conventional_dendritic_cell,
)
query.df()
id                    storage_id  key   suffix  accessor  description            version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
Daaf9uCsQE8YxbVaUdt6  Szavfu1U    None  .h5ad   AnnData   Conde22                None     28049505  WEFcMZxJNmMiUOFrcSTaig  md5        Nv48yAceNSh8z8  e7dMSW0bVIQUINDGufH3  None                2023-09-04 09:34:00  DzTjkKse
S4DBVuVCt1xuRFb06uIq  Szavfu1U    None  .h5ad   AnnData   10x reference pbmc68k  None     663138    ezj0ByeaJEju69Ka3vkwFw  md5        Nv48yAceNSh8z8  e7dMSW0bVIQUINDGufH3  None                2023-09-04 09:34:24  DzTjkKse
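
Passing several keyword arguments to `filter`, as in the query above, narrows the result to files that satisfy every condition at once. The sketch below mimics that behaviour on plain dictionaries. It is not lamindb code: the records are simplified, the third record is invented to show a non-match, and `filter_records` is a hypothetical helper.

```python
# Simplified stand-ins for registered files; labels are stored as sets.
records = [
    {
        "description": "Conde22",
        "experimental_factors": {"single-cell RNA sequencing"},
        "species": {"human"},
        "cell_types": {"conventional dendritic cell", "regulatory T cell"},
    },
    {
        "description": "10x reference pbmc68k",
        "experimental_factors": {"single-cell RNA sequencing"},
        "species": {"human"},
        "cell_types": {"conventional dendritic cell", "cytotoxic T cell"},
    },
    {
        # Invented record, included only to show a file that is filtered out.
        "description": "hypothetical mouse dataset",
        "experimental_factors": {"single-cell RNA sequencing"},
        "species": {"mouse"},
        "cell_types": {"conventional dendritic cell"},
    },
]


def filter_records(records, **criteria):
    """Return records whose label sets contain every requested label (logical AND)."""
    return [
        record
        for record in records
        if all(label in record.get(field, set()) for field, label in criteria.items())
    ]


matches = filter_records(
    records,
    experimental_factors="single-cell RNA sequencing",
    species="human",
    cell_types="conventional dendritic cell",
)
print([record["description"] for record in matches])
```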

Transform

Compare gene sets

Get file objects:

file1, file2 = query.list()
file1.describe()
💡 File(id='Daaf9uCsQE8YxbVaUdt6', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-09-04 09:34:00)

Provenance:
  🗃️ storage: Storage(id='Szavfu1U', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-04 09:34:29, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type='notebook', updated_at=2023-09-04 09:34:24, created_by_id='DzTjkKse')
  👣 run: Run(id='e7dMSW0bVIQUINDGufH3', run_at=2023-09-04 09:33:17, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-04 09:34:29)
Features:
  var: FeatureSet(id='PxAuIGFMKq6tBtlAQT9R', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-09-04 09:33:56, modality_id='NIYnYOo8', created_by_id='DzTjkKse')
    LIMASI (number)
    DAXX (number)
    CXCL8 (number)
    SLC43A3 (number)
    MTOR (number)
    None (number)
    MIR3936HG (number)
    PIGY (number)
    THOP1 (number)
    None (number)
    ... 
  obs: FeatureSet(id='vjlYQsrjl0Qk1gTkwn5v', n=4, registry='core.Feature', hash='FB8NM5R-dAp_lUBKW16U', updated_at=2023-09-04 09:34:00, modality_id='UouDKKfD', created_by_id='DzTjkKse')
    🔗 cell_type (32, bionty.CellType): 'regulatory T cell', 'germinal center B cell', 'lymphocyte', 'megakaryocyte', 'non-classical monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'plasmablast', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'dendritic cell, human', 'gamma-delta T cell', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thymus', 'caecum', 'lung', 'thoracic lymph node', 'sigmoid colon', 'ileum', 'duodenum', 'liver', 'mesenteric lymph node', ...
    🔗 donor (12, core.Label): 'A36', 'A31', 'D503', '621B', 'A35', '582C', 'A29', 'A52', '637C', 'A37', ...
file1.view_flow()
https://d33wubrfki0l68.cloudfront.net/b9e9a6429fce78ab3d6266dd4a59b38e92e5774a/975c5/_images/37b7202bdd5dda35d4134870614a5007a6f182bb2ea383965ec24799df7f34a2.svg
file2.describe()
💡 File(id='S4DBVuVCt1xuRFb06uIq', suffix='.h5ad', accessor='AnnData', description='10x reference pbmc68k', size=663138, hash='ezj0ByeaJEju69Ka3vkwFw', hash_type='md5', updated_at=2023-09-04 09:34:24)

Provenance:
  🗃️ storage: Storage(id='Szavfu1U', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-04 09:34:29, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type='notebook', updated_at=2023-09-04 09:34:24, created_by_id='DzTjkKse')
  👣 run: Run(id='e7dMSW0bVIQUINDGufH3', run_at=2023-09-04 09:33:17, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-04 09:34:29)
Features:
  var: FeatureSet(id='awdEqckqgyaMHEsV80gw', n=753, type='number', registry='bionty.Gene', hash='-FY8VK1f6T3U_MXzH_pj', updated_at=2023-09-04 09:34:24, created_by_id='DzTjkKse')
    EIF3D (number)
    DAXX (number)
    COX6A1 (number)
    MED28 (number)
    SERPINB6 (number)
    ARMH1 (number)
    CD3D (number)
    IGHA1 (number)
    MRPL18 (number)
    NDUFA8 (number)
    ... 
  obs: FeatureSet(id='oR5xshWA4l6b4yE6ANM9', n=1, registry='core.Feature', hash='TBgy-69f02DeCujyo_kM', updated_at=2023-09-04 09:34:24, modality_id='UouDKKfD', created_by_id='DzTjkKse')
    🔗 cell_type (9, bionty.CellType): 'conventional dendritic cell', 'B cell, CD19-positive', 'CD38-negative naive B cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD14-positive, CD16-negative classical monocyte', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'dendritic cell', 'cytotoxic T cell'
  external: FeatureSet(id='jnBKcYYcJZ2pme0oVlqw', n=2, registry='core.Feature', hash='mmZjEmOkPr0wfDr63GXa', updated_at=2023-09-04 09:34:24, modality_id='UouDKKfD', created_by_id='DzTjkKse')
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
    🔗 species (1, bionty.Species): 'human'
file2.view_flow()
https://d33wubrfki0l68.cloudfront.net/f9a7c27ec847f7241232c98bd5a28bc9484240a4/f2f68/_images/e610ee4b30a4c8c3913c6456e2f02d918b23cefa8435e7c629f09a7e7e1770c4.svg

Load files into memory:

file1_adata = file1.load()
file2_adata = file2.load()
💡 adding file Daaf9uCsQE8YxbVaUdt6 as input for run Xg3ti7BLaK33Szq3Z5CE, adding parent transform Nv48yAceNSh8z8
💡 adding file S4DBVuVCt1xuRFb06uIq as input for run Xg3ti7BLaK33Szq3Z5CE, adding parent transform Nv48yAceNSh8z8
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1840: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

The shared genes can be computed from the registered feature sets alone, without loading the files:

file1_genes = file1.features["var"]
file2_genes = file2.features["var"]

shared_genes = file1_genes & file2_genes
len(shared_genes)
748
shared_genes.list("symbol")[:10]
['DAXX',
 'GYPC',
 'GIMAP7',
 'CTSH',
 'PLPP5',
 'S100A4',
 'NFE2',
 'FTL',
 'PAXBP1',
 'LRRC25']
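
The `&` operator above intersects the two registered gene sets. The same idea can be sketched with ordinary Python sets. The symbols below are only the first few that `describe()` lists for each file, so this toy intersection is far smaller than the real one. The coverage figures reuse the counts reported above (36503 and 753 genes, 748 of them shared).

```python
# First few gene symbols listed by describe() for each file (toy subsets).
file1_symbols = {"LIMASI", "DAXX", "CXCL8", "SLC43A3", "MTOR"}
file2_symbols = {"EIF3D", "DAXX", "COX6A1", "MED28", "SERPINB6"}

# Set intersection keeps symbols present in both files.
shared_symbols = file1_symbols & file2_symbols
# Set difference shows what would be lost from file2 by restricting to shared genes.
only_in_file2 = file2_symbols - file1_symbols
print(sorted(shared_symbols))
print(sorted(only_in_file2))

# Fraction of each file's genes that is shared, using the counts reported above.
n_file1, n_file2, n_shared = 36503, 753, 748
coverage_file1 = n_shared / n_file1
coverage_file2 = n_shared / n_file2
print(f"shared genes cover {coverage_file1:.1%} of file1 and {coverage_file2:.1%} of file2")
```

Since 748 of the 753 genes in the smaller dataset are shared, restricting both datasets to the shared genes discards very little from it.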

Compare cell types

file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()

shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names
['CD16-positive, CD56-dim natural killer cell, human',
 'conventional dendritic cell']

We can now subset the two datasets by shared cell types:

file1_adata_subset = file1_adata[
    file1_adata.obs["cell_type"].isin(shared_celltypes_names)
]

file2_adata_subset = file2_adata[
    file2_adata.obs["cell_type"].isin(shared_celltypes_names)
]
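
Before integrating, it can be worth checking that each subset still contains cells of every shared type. The sketch below reproduces the `isin`-style mask on plain lists. The per-cell annotations are made up for illustration and stand in for `adata.obs["cell_type"]`, and `subset_by_cell_type` is a hypothetical helper.

```python
from collections import Counter

shared_celltypes_names = [
    "CD16-positive, CD56-dim natural killer cell, human",
    "conventional dendritic cell",
]

# Made-up per-cell annotations, one entry per cell.
file1_cell_types = [
    "regulatory T cell",
    "conventional dendritic cell",
    "CD16-positive, CD56-dim natural killer cell, human",
    "lymphocyte",
]
file2_cell_types = [
    "conventional dendritic cell",
    "cytotoxic T cell",
    "conventional dendritic cell",
    "CD16-positive, CD56-dim natural killer cell, human",
]


def subset_by_cell_type(cell_types, keep):
    """Return the annotations of cells whose type is in `keep`, preserving order."""
    keep = set(keep)
    mask = [cell_type in keep for cell_type in cell_types]  # boolean mask, like isin
    return [cell_type for cell_type, kept in zip(cell_types, mask) if kept]


file1_counts = Counter(subset_by_cell_type(file1_cell_types, shared_celltypes_names))
file2_counts = Counter(subset_by_cell_type(file2_cell_types, shared_celltypes_names))

# Every shared cell type should be represented in both subsets.
all_present = all(
    file1_counts[name] > 0 and file2_counts[name] > 0
    for name in shared_celltypes_names
)
print(file1_counts, file2_counts, all_present)
```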
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna