Integrate scRNA-seq datasets#
scRNA-seq data integration is the process of jointly analyzing data from several scRNA-seq experiments to uncover shared or distinct biological patterns.
Here, we'll demonstrate how to query two scRNA-seq datasets by registered metadata, such as cell types, so that they can then be integrated.
Setup#
!lamin load test-scrna
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅ loaded instance: testuser1/test-scrna
import lamindb as ln
import lnschema_bionty as lb
import anndata as ad
✅ loaded instance: testuser1/test-scrna (lamindb 0.52.1)
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.52.1 lnschema_bionty==0.30.3
✅ saved: Transform(id='agayZTonayqAz8', name='Integrate scRNA-seq datasets', short_name='scrna2', version='0', type=notebook, updated_at=2023-09-04 09:34:31, created_by_id='DzTjkKse')
✅ saved: Run(id='Xg3ti7BLaK33Szq3Z5CE', run_at=2023-09-04 09:34:31, transform_id='agayZTonayqAz8', created_by_id='DzTjkKse')
Access#
Query files by provenance metadata#
users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).search("register scrna")
name | id | __ratio__ |
---|---|---|
Validate & register scRNA-seq datasets | Nv48yAceNSh8z8 | 53.846154 |
Integrate scRNA-seq datasets | agayZTonayqAz8 | 47.619048 |
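The `__ratio__` column is a similarity score between the search string and each transform name. Both values above are consistent with a normalized longest-common-subsequence similarity on lowercased strings, scaled to 0-100. That reading is an inference from the two numbers, not a statement about how lamindb computes the score, and the helper names `lcs_length` and `similarity_ratio` are hypothetical. A minimal sketch under that assumption:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def similarity_ratio(query: str, name: str) -> float:
    """Score in [0, 100]: twice the LCS length over the combined string length."""
    a, b = query.lower(), name.lower()
    if not a and not b:
        return 100.0
    return 100 * 2 * lcs_length(a, b) / (len(a) + len(b))


# Transform names copied from the search result above.
names = [
    "Validate & register scRNA-seq datasets",
    "Integrate scRNA-seq datasets",
]
scores = {name: round(similarity_ratio("register scrna", name), 6) for name in names}

# Print the best match first, as the search result does.
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{scores[name]:.6f}  {name}")
```

The first name contains the whole query as a substring, which is why it ranks higher even though it is the longer string.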
transform = ln.Transform.filter(id="Nv48yAceNSh8z8").one()
ln.File.filter(transform=transform).df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Daaf9uCsQE8YxbVaUdt6 | Szavfu1U | None | .h5ad | AnnData | Conde22 | None | 28049505 | WEFcMZxJNmMiUOFrcSTaig | md5 | Nv48yAceNSh8z8 | e7dMSW0bVIQUINDGufH3 | None | 2023-09-04 09:34:00 | DzTjkKse |
S4DBVuVCt1xuRFb06uIq | Szavfu1U | None | .h5ad | AnnData | 10x reference pbmc68k | None | 663138 | ezj0ByeaJEju69Ka3vkwFw | md5 | Nv48yAceNSh8z8 | e7dMSW0bVIQUINDGufH3 | None | 2023-09-04 09:34:24 | DzTjkKse |
Query files based on biological metadata#
assays = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
cell_types = lb.CellType.lookup()
query = ln.File.filter(
experimental_factors=assays.single_cell_rna_sequencing,
species=species.human,
cell_types=cell_types.conventional_dendritic_cell,
)
query.df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Daaf9uCsQE8YxbVaUdt6 | Szavfu1U | None | .h5ad | AnnData | Conde22 | None | 28049505 | WEFcMZxJNmMiUOFrcSTaig | md5 | Nv48yAceNSh8z8 | e7dMSW0bVIQUINDGufH3 | None | 2023-09-04 09:34:00 | DzTjkKse |
S4DBVuVCt1xuRFb06uIq | Szavfu1U | None | .h5ad | AnnData | 10x reference pbmc68k | None | 663138 | ezj0ByeaJEju69Ka3vkwFw | md5 | Nv48yAceNSh8z8 | e7dMSW0bVIQUINDGufH3 | None | 2023-09-04 09:34:24 | DzTjkKse |
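The query above combines three conditions, and both returned files satisfy all of them. The pure-Python sketch below mimics that "match every condition" behavior on an in-memory list. The record layout, the `filter_records` helper, and the third (mouse) record are hypothetical and only illustrate the logic; they say nothing about how lamindb stores or queries metadata.

```python
# Hypothetical stand-in for a file registry: each record maps a metadata
# field to the set of labels the file is annotated with.
records = [
    {
        "description": "Conde22",
        "experimental_factors": {"single-cell RNA sequencing"},
        "species": {"human"},
        "cell_types": {"conventional dendritic cell", "regulatory T cell"},
    },
    {
        "description": "10x reference pbmc68k",
        "experimental_factors": {"single-cell RNA sequencing"},
        "species": {"human"},
        "cell_types": {"conventional dendritic cell", "cytotoxic T cell"},
    },
    {
        # Made-up record that should NOT match the query below.
        "description": "hypothetical mouse dataset",
        "experimental_factors": {"single-cell RNA sequencing"},
        "species": {"mouse"},
        "cell_types": {"conventional dendritic cell"},
    },
]


def filter_records(records, **criteria):
    """Keep records annotated with every requested label (logical AND)."""
    return [
        record
        for record in records
        if all(label in record[field] for field, label in criteria.items())
    ]


matches = filter_records(
    records,
    experimental_factors="single-cell RNA sequencing",
    species="human",
    cell_types="conventional dendritic cell",
)
print([record["description"] for record in matches])
```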
Transform#
Compare gene sets#
Get file objects:
file1, file2 = query.list()
file1.describe()
💡 File(id='Daaf9uCsQE8YxbVaUdt6', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-09-04 09:34:00)
Provenance:
storage: Storage(id='Szavfu1U', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-04 09:34:29, created_by_id='DzTjkKse')
transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type='notebook', updated_at=2023-09-04 09:34:24, created_by_id='DzTjkKse')
run: Run(id='e7dMSW0bVIQUINDGufH3', run_at=2023-09-04 09:33:17, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-04 09:34:29)
Features:
var: FeatureSet(id='PxAuIGFMKq6tBtlAQT9R', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-09-04 09:33:56, modality_id='NIYnYOo8', created_by_id='DzTjkKse')
LIMASI (number)
DAXX (number)
CXCL8 (number)
SLC43A3 (number)
MTOR (number)
None (number)
MIR3936HG (number)
PIGY (number)
THOP1 (number)
None (number)
...
obs: FeatureSet(id='vjlYQsrjl0Qk1gTkwn5v', n=4, registry='core.Feature', hash='FB8NM5R-dAp_lUBKW16U', updated_at=2023-09-04 09:34:00, modality_id='UouDKKfD', created_by_id='DzTjkKse')
cell_type (32, bionty.CellType): 'regulatory T cell', 'germinal center B cell', 'lymphocyte', 'megakaryocyte', 'non-classical monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'plasmablast', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'dendritic cell, human', 'gamma-delta T cell', ...
assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
tissue (17, bionty.Tissue): 'blood', 'thymus', 'caecum', 'lung', 'thoracic lymph node', 'sigmoid colon', 'ileum', 'duodenum', 'liver', 'mesenteric lymph node', ...
donor (12, core.Label): 'A36', 'A31', 'D503', '621B', 'A35', '582C', 'A29', 'A52', '637C', 'A37', ...
file1.view_flow()
file2.describe()
💡 File(id='S4DBVuVCt1xuRFb06uIq', suffix='.h5ad', accessor='AnnData', description='10x reference pbmc68k', size=663138, hash='ezj0ByeaJEju69Ka3vkwFw', hash_type='md5', updated_at=2023-09-04 09:34:24)
Provenance:
storage: Storage(id='Szavfu1U', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-04 09:34:29, created_by_id='DzTjkKse')
transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type='notebook', updated_at=2023-09-04 09:34:24, created_by_id='DzTjkKse')
run: Run(id='e7dMSW0bVIQUINDGufH3', run_at=2023-09-04 09:33:17, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-04 09:34:29)
Features:
var: FeatureSet(id='awdEqckqgyaMHEsV80gw', n=753, type='number', registry='bionty.Gene', hash='-FY8VK1f6T3U_MXzH_pj', updated_at=2023-09-04 09:34:24, created_by_id='DzTjkKse')
EIF3D (number)
DAXX (number)
COX6A1 (number)
MED28 (number)
SERPINB6 (number)
ARMH1 (number)
CD3D (number)
IGHA1 (number)
MRPL18 (number)
NDUFA8 (number)
...
obs: FeatureSet(id='oR5xshWA4l6b4yE6ANM9', n=1, registry='core.Feature', hash='TBgy-69f02DeCujyo_kM', updated_at=2023-09-04 09:34:24, modality_id='UouDKKfD', created_by_id='DzTjkKse')
cell_type (9, bionty.CellType): 'conventional dendritic cell', 'B cell, CD19-positive', 'CD38-negative naive B cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD14-positive, CD16-negative classical monocyte', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'dendritic cell', 'cytotoxic T cell'
external: FeatureSet(id='jnBKcYYcJZ2pme0oVlqw', n=2, registry='core.Feature', hash='mmZjEmOkPr0wfDr63GXa', updated_at=2023-09-04 09:34:24, modality_id='UouDKKfD', created_by_id='DzTjkKse')
assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
species (1, bionty.Species): 'human'
file2.view_flow()
Load files into memory:
file1_adata = file1.load()
file2_adata = file2.load()
💡 adding file Daaf9uCsQE8YxbVaUdt6 as input for run Xg3ti7BLaK33Szq3Z5CE, adding parent transform Nv48yAceNSh8z8
💡 adding file S4DBVuVCt1xuRFb06uIq as input for run Xg3ti7BLaK33Szq3Z5CE, adding parent transform Nv48yAceNSh8z8
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1840: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
Here we compute shared genes without loading files:
file1_genes = file1.features["var"]
file2_genes = file2.features["var"]
shared_genes = file1_genes & file2_genes
len(shared_genes)
748
shared_genes.list("symbol")[:10]
['DAXX',
'GYPC',
'GIMAP7',
'CTSH',
'PLPP5',
'S100A4',
'NFE2',
'FTL',
'PAXBP1',
'LRRC25']
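Conceptually, `file1_genes & file2_genes` keeps the genes present in both feature sets. The same idea can be shown with plain Python sets, here using only the first few symbols printed by `describe()` above. The real feature sets hold 36,503 and 753 genes, so this toy result is far smaller than the 748 shared genes reported by the registry.

```python
# Symbols copied from the truncated describe() listings above;
# entries printed as None are skipped.
file1_symbols = {
    "LIMASI", "DAXX", "CXCL8", "SLC43A3", "MTOR", "MIR3936HG", "PIGY", "THOP1",
}
file2_symbols = {
    "EIF3D", "DAXX", "COX6A1", "MED28", "SERPINB6",
    "ARMH1", "CD3D", "IGHA1", "MRPL18", "NDUFA8",
}

# Set intersection: symbols that occur in both listings.
shared_symbols = file1_symbols & file2_symbols
print(sorted(shared_symbols))

# Share of the smaller dataset's genes that is also found in the larger one,
# using the counts reported above (748 shared out of 753).
shared_fraction = 748 / 753
print(f"{shared_fraction:.1%}")
```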
Compare cell types#
file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()
shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names
['CD16-positive, CD56-dim natural killer cell, human',
'conventional dendritic cell']
We can now subset the two datasets by shared cell types:
file1_adata_subset = file1_adata[
file1_adata.obs["cell_type"].isin(shared_celltypes_names)
]
file2_adata_subset = file2_adata[
file2_adata.obs["cell_type"].isin(shared_celltypes_names)
]
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
💡 deleting instance testuser1/test-scrna
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna