Multi-modal

Here, we’ll showcase how to curate and register ECCITE-seq data from Papalexi21 in the form of MuData objects.

ECCITE-seq is designed to enable interrogation of single-cell transcriptomes together with surface protein markers in the context of CRISPR screens.

MuData objects build on top of AnnData objects to store multimodal data.

# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-multimodal --schema bionty
💡 connected lamindb: testuser1/test-multimodal
import lamindb as ln
import bionty as bt
💡 connected lamindb: testuser1/test-multimodal
mdata = ln.core.datasets.mudata_papalexi21_subset()
mdata
MuData object with n_obs × n_vars = 200 × 300
  obs:	'perturbation', 'replicate'
  var:	'name'
  4 modalities
    rna:	200 x 173
      obs:	'nCount_RNA', 'nFeature_RNA', 'percent.mito'
      var:	'name'
    adt:	200 x 4
      obs:	'nCount_ADT', 'nFeature_ADT'
      var:	'name'
    hto:	200 x 12
      obs:	'nCount_HTO', 'nFeature_HTO', 'technique'
      var:	'name'
    gdo:	200 x 111
      obs:	'nCount_GDO'
      var:	'name'
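The summary printed above reports 300 variables spread over four modalities. As a quick sanity check, and assuming the global variable axis is simply the concatenation of the per-modality variables (which is what the printed shapes suggest), the counts can be added up by hand. The dictionary below is typed in from the printed output; it is not read from `mdata`.

```python
# Per-modality shapes (n_obs, n_vars), copied by hand from the MuData summary.
modality_shapes = {
    "rna": (200, 173),
    "adt": (200, 4),
    "hto": (200, 12),
    "gdo": (200, 111),
}

# Collect the distinct n_obs values; a single value means all modalities
# cover the same number of cells.
n_obs_values = {n_obs for n_obs, _ in modality_shapes.values()}

# Assumption: the global n_vars is the sum of the per-modality n_vars.
total_n_vars = sum(n_vars for _, n_vars in modality_shapes.values())

print(n_obs_values)   # {200}
print(total_n_vars)   # 300
```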

Validate annotations

curate = ln.Curate.from_mudata(
    mdata,
    var_index={
        "rna": bt.Gene.symbol,  # gene expression
        "adt": bt.CellMarker.name,  # antibody-derived tags reflecting surface proteins
        "hto": ln.Feature.name,  # cell hashing
        "gdo": ln.Feature.name,  # guide RNAs
    },
    categoricals={
        "perturbation": ln.ULabel.name,  # shared categorical
        "replicate": ln.ULabel.name,  # shared categorical
        "hto:technique": bt.ExperimentalFactor.name,  # modality-specific categorical
    },
    organism="human",
)
✅ added 2 records with Feature.name for columns: 'perturbation', 'replicate'
❗ 45 non-validated categories are not saved in Feature.name: ['adt:MULTI_ID', 'gdo:nCount_GDO', 'gdo:HTO_classification', 'gdo:MULTI_ID', 'adt:HTO_classification', 'gdo:Phase', 'rna:nFeature_RNA', 'hto:technique', 'hto:nFeature_HTO', 'hto:replicate', 'adt:gene_target', 'gdo:guide_ID', 'hto:Phase', 'adt:guide_ID', 'rna:percent.mito', 'hto:S.Score', 'gdo:orig.ident', 'adt:G2M.Score', 'rna:nCount_RNA', 'gdo:G2M.Score', 'adt:S.Score', 'hto:perturbation', 'adt:nCount_ADT', 'adt:percent.mito', 'gdo:gene_target', 'hto:percent.mito', 'hto:guide_ID', 'hto:gene_target', 'hto:G2M.Score', 'gdo:percent.mito', 'hto:MULTI_ID', 'adt:orig.ident', 'adt:perturbation', 'adt:nFeature_ADT', 'gdo:perturbation', 'adt:NT', 'adt:Phase', 'hto:orig.ident', 'gdo:NT', 'gdo:S.Score', 'hto:NT', 'gdo:replicate', 'adt:replicate', 'hto:HTO_classification', 'hto:nCount_HTO']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
❗ 2 non-validated categories are not saved in Feature.name: ['nFeature_ADT', 'nCount_ADT']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
❗ 3 non-validated categories are not saved in Feature.name: ['nFeature_RNA', 'nCount_RNA', 'percent.mito']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
❗ 1 non-validated categories are not saved in Feature.name: ['nCount_GDO']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
✅ added 1 record with Feature.name for columns: 'technique'
❗ 2 non-validated categories are not saved in Feature.name: ['nFeature_HTO', 'nCount_HTO']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
✅ added 100 records from public with Gene.symbol for var_index: 'SH2D6', 'MEF2C-AS2', 'ARHGAP26-AS1', 'GABRA1', 'H4C12', 'HLA-DQB1-AS1', 'SPACA1', 'VNN1', 'CTAGE15', 'PFKFB1', 'TRPC5', 'RBPMS-AS1', 'CA8', 'CSMD3', 'ZNF483', 'AK8', 'TMEM72-AS1', 'ARAP1-AS2', 'CRYAB', 'DNAI7', ...
❗ 84 non-validated categories are not saved in Gene.symbol: ['RP5-827C21.6', 'XX-CR54.1', 'RP11-379B18.5', 'RP11-778D9.12', 'RP11-703G6.1', 'AC005150.1', 'RP11-717H13.1', 'CTC-498J12.1', 'RP11-524H19.2', 'AC006042.7', 'AC002066.1', 'AC073934.6', 'RP11-268G12.1', 'U52111.14', 'RP11-235C23.5', 'RP11-12J10.3', 'RP11-324E6.9', 'RP11-187A9.3', 'RP11-365N19.2', 'RP11-346D14.1', 'RP11-265N6.2', 'CTD-3065B20.2', 'RP11-304L19.11', 'AC026471.6', 'AC091132.1', 'RP11-138C9.1', 'RP11-75C10.9', 'RP11-835E18.5', 'RP11-760N9.1', 'RP11-17J14.2', 'CTD-3193O13.8', 'AC004019.13', 'RP11-465N4.4', 'RP11-434D9.1', 'RP11-325L7.1', 'RP11-134K13.4', 'RP5-855F16.1', 'RP3-327A19.5', 'RP11-546K22.3', 'RP11-473O4.4', 'RP13-582O9.7', 'RP11-12D24.10', 'RP11-120C12.3', 'RP11-80H5.7', 'RP11-496I9.1', 'AP000442.4', 'RP11-867G23.3', 'RP11-113K21.4', 'RP11-745O10.2', 'RP11-335O4.3', 'RP11-408E5.4', 'AE000662.93', 'AL132989.1', 'RP11-973N13.4', 'RP11-982M15.2', 'RP11-32B5.7', 'RP1-1J6.2', 'RP3-337O18.9', 'AC011558.5', 'CTA-373H7.7', 'RP11-415J8.5', 'AC092687.5', 'RP11-532F6.4', 'RP11-146I2.1', 'RP11-624M8.1', 'RP11-219B4.7', 'RP11-9M16.2', 'RP11-247A12.8', 'RP11-536K7.5', 'RP11-186N15.3', 'RP11-152H18.3', 'CTD-3012A18.1', 'CTD-2562J17.2', 'RP11-136I14.5', 'RP11-110I1.14', 'RP11-2H8.2', 'RP11-307N16.6', 'RP11-3D4.2', 'RP11-231C14.4', 'CTB-134F13.1', 'RP11-403P17.5', 'RP11-214C8.2', 'CTB-31O20.9', 'AC092295.4']!
      → to lookup categories, use lookup().var_index
      → to save, run add_new_from_var_index
✅ added 4 records from public with CellMarker.name for var_index: 'CD86', 'PDL1', 'PDL2', 'CD366'
❗ 12 non-validated categories are not saved in Feature.name: ['rep1-tx', 'rep1-ctrl', 'rep2-tx', 'rep2-ctrl', 'PDL1g1-tx', 'PDL1g1-ctrl', 'PDL1g2-tx', 'PDL1g2-ctrl', 'rep3-tx', 'rep3-ctrl', 'rep4-tx', 'rep4-ctrl']!
      → to lookup categories, use lookup().var_index
      → to save, run add_new_from_var_index
❗ 111 non-validated categories are not saved in Feature.name: ['eGFPg1', 'CUL3g1', 'CUL3g2', 'CUL3g3', 'CMTM6g1', 'CMTM6g2', 'CMTM6g3', 'NTg1', 'NTg2', 'NTg3', 'NTg4', 'NTg5', 'NTg7', 'PDL1g1', 'PDL1g2', 'PDL1g3', 'ATF2g1', 'ATF2g2', 'ATF2g3', 'ATF2g4', 'BRD4g1', 'BRD4g2', 'BRD4g3', 'BRD4g4', 'CAV1g1', 'CAV1g2', 'CAV1g3', 'CAV1g4', 'CD86g1', 'CD86g2', 'CD86g3', 'CD86g4', 'ETV7g1', 'ETV7g2', 'ETV7g3', 'ETV7g4', 'IFNGR1g1', 'IFNGR1g2', 'IFNGR1g3', 'IFNGR1g4', 'IFNGR2g1', 'IFNGR2g2', 'IFNGR2g3', 'IFNGR2g4', 'IRF1g1', 'IRF1g2', 'IRF1g3', 'IRF1g4', 'IRF7g1', 'IRF7g2', 'IRF7g3', 'IRF7g4', 'JAK2g1', 'JAK2g2', 'JAK2g3', 'JAK2g4', 'MARCH8g1', 'MARCH8g2', 'MARCH8g3', 'MARCH8g4', 'MYCg1', 'MYCg2', 'MYCg3', 'MYCg4', 'NFKBIAg1', 'NFKBIAg2', 'NFKBIAg3', 'NFKBIAg4', 'PDCD1LG2g1', 'PDCD1LG2g2', 'PDCD1LG2g3', 'PDCD1LG2g4', 'POU2F2g1', 'POU2F2g2', 'POU2F2g3', 'POU2F2g4', 'SMAD4g1', 'SMAD4g2', 'SMAD4g3', 'SMAD4g4', 'SPI1g1', 'SPI1g2', 'SPI1g3', 'SPI1g4', 'STAT1g1', 'STAT1g2', 'STAT1g3', 'STAT1g4', 'STAT2g1', 'STAT2g2', 'STAT2g3', 'STAT2g4', 'STAT3g1', 'STAT3g2', 'STAT3g3', 'STAT3g4', 'STAT5Ag1', 'STAT5Ag2', 'STAT5Ag3', 'STAT5Ag4', 'TNFRSF14g1', 'TNFRSF14g2', 'TNFRSF14g3', 'TNFRSF14g4', 'UBE2L6g1', 'UBE2L6g2', 'UBE2L6g3', 'UBE2L6g4', 'NTg8', 'NTg9', 'NTg10']!
      → to lookup categories, use lookup().var_index
      → to save, run add_new_from_var_index
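The `categoricals` mapping passed to `from_mudata` mixes two kinds of keys: plain column names such as `perturbation`, which refer to the shared `obs` table, and prefixed names such as `hto:technique`, which refer to a column inside one modality. The helper below is a hypothetical illustration of that naming convention using only the standard library; `split_key` is a name invented for this sketch and is not lamindb's own parsing code.

```python
from typing import Optional, Tuple


def split_key(key: str) -> Tuple[Optional[str], str]:
    """Split 'modality:column' into (modality, column).

    A key without a colon is treated as a shared column, marked by None.
    """
    modality, sep, column = key.partition(":")
    if not sep:
        return None, key
    return modality, column


# Keys taken from the categoricals mapping in the call above.
for key in ["perturbation", "replicate", "hto:technique"]:
    print(key, "->", split_key(key))
```

The same `modality:column` pattern shows up in the warning about non-validated names, for example `adt:MULTI_ID` or `gdo:guide_ID`.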
# add new gene symbols from the ['rna'].var.index
curate.add_new_from_var_index("rna")

# add new categories from the hto and gdo var.index
curate.add_new_from_var_index("hto")
curate.add_new_from_var_index("gdo")

# optional: register additional columns we'd like to curate
curate.add_new_from_columns(modality="rna")
curate.add_new_from_columns(modality="adt")
curate.add_new_from_columns(modality="hto")
curate.add_new_from_columns(modality="gdo")
✅ added 84 records with Gene.symbol for var_index: 'RP5-827C21.6', 'XX-CR54.1', 'RP11-379B18.5', 'RP11-778D9.12', 'RP11-703G6.1', 'AC005150.1', 'RP11-717H13.1', 'CTC-498J12.1', 'RP11-524H19.2', 'AC006042.7', 'AC002066.1', 'AC073934.6', 'RP11-268G12.1', 'U52111.14', 'RP11-235C23.5', 'RP11-12J10.3', 'RP11-324E6.9', 'RP11-187A9.3', 'RP11-365N19.2', 'RP11-346D14.1', ...
✅ added 12 records with Feature.name for var_index: 'rep1-tx', 'rep1-ctrl', 'rep2-tx', 'rep2-ctrl', 'PDL1g1-tx', 'PDL1g1-ctrl', 'PDL1g2-tx', 'PDL1g2-ctrl', 'rep3-tx', 'rep3-ctrl', 'rep4-tx', 'rep4-ctrl'
✅ added 111 records with Feature.name for var_index: 'eGFPg1', 'CUL3g1', 'CUL3g2', 'CUL3g3', 'CMTM6g1', 'CMTM6g2', 'CMTM6g3', 'NTg1', 'NTg2', 'NTg3', 'NTg4', 'NTg5', 'NTg7', 'PDL1g1', 'PDL1g2', 'PDL1g3', 'ATF2g1', 'ATF2g2', 'ATF2g3', 'ATF2g4', ...
✅ added 3 records with Feature.name for rna obs columns: 'nCount_RNA', 'nFeature_RNA', 'percent.mito'
✅ added 2 records with Feature.name for adt obs columns: 'nCount_ADT', 'nFeature_ADT'
✅ added 2 records with Feature.name for hto obs columns: 'nCount_HTO', 'nFeature_HTO'
✅ added 1 record with Feature.name for gdo obs columns: 'nCount_GDO'
curate.validate()
✅ rna_var_index is validated against Gene.symbol
✅ adt_var_index is validated against CellMarker.name
✅ hto_var_index is validated against Feature.name
✅ gdo_var_index is validated against Feature.name
💡 mapping perturbation on ULabel.name
❗    2 terms are not validated: 'Perturbed', 'NT'
      → save terms via .add_new_from('perturbation')
💡 mapping replicate on ULabel.name
❗    3 terms are not validated: 'rep3', 'rep1', 'rep2'
      → save terms via .add_new_from('replicate')
💡 mapping technique on ExperimentalFactor.name
❗    found 1 terms validated terms: ['cell hashing']
      → save terms via .add_validated_from('technique')
✅ technique is validated against ExperimentalFactor.name
False
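`validate()` returns `False` at this point because some categories are not yet known to the registries. Conceptually, each value is compared against the set of registered names, and the check passes only when nothing is left over. The following is a toy model of that idea using plain Python sets; the function `check_values` and the empty starting registry are assumptions made for illustration, and lamindb's actual validation is more involved.

```python
def check_values(values, registered):
    """Return (is_valid, missing), where missing lists values absent from `registered`."""
    missing = sorted(set(values) - set(registered))
    return len(missing) == 0, missing


# Values reported for the 'perturbation' column in the validation log.
perturbation_values = ["Perturbed", "NT"]

# Assumption for this sketch: no ULabel names have been registered yet.
registered_ulabels = set()

ok, missing = check_values(perturbation_values, registered_ulabels)
print(ok, missing)

# Once the missing names are registered, the same check passes.
registered_ulabels.update(missing)
ok_after, missing_after = check_values(perturbation_values, registered_ulabels)
print(ok_after, missing_after)
```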
# add validated and new categories
curate.add_new_from("perturbation")
curate.add_new_from("replicate")
curate.add_validated_from("technique", modality="hto")
✅ added 2 records with ULabel.name for perturbation: 'Perturbed', 'NT'
✅ added 3 records with ULabel.name for replicate: 'rep3', 'rep1', 'rep2'
✅ added 1 record from public with ExperimentalFactor.name for technique: 'cell hashing'
curate.validate()
✅ rna_var_index is validated against Gene.symbol
✅ adt_var_index is validated against CellMarker.name
✅ hto_var_index is validated against Feature.name
✅ gdo_var_index is validated against Feature.name
✅ perturbation is validated against ULabel.name
✅ replicate is validated against ULabel.name
✅ technique is validated against ExperimentalFactor.name
True

Register curated artifact

artifact = curate.save_artifact(description="Sub-sampled MuData from Papalexi21")
❗ no run & transform get linked, consider calling ln.track()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/4PtGqkJOxdPNcR3SizdY.h5mu')
✅ storing artifact '4PtGqkJOxdPNcR3SizdY' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal/.lamindb/4PtGqkJOxdPNcR3SizdY.h5mu'
💡 you can auto-track these data as a run input by calling `ln.track()`
✅ loaded 2 Feature records matching name: 'perturbation', 'replicate'
❗ did not create Feature records for 45 non-validated names: 'adt:G2M.Score', 'adt:HTO_classification', 'adt:MULTI_ID', 'adt:NT', 'adt:Phase', 'adt:S.Score', 'adt:gene_target', 'adt:guide_ID', 'adt:nCount_ADT', 'adt:nFeature_ADT', 'adt:orig.ident', 'adt:percent.mito', 'adt:perturbation', 'adt:replicate', 'gdo:G2M.Score', 'gdo:HTO_classification', 'gdo:MULTI_ID', 'gdo:NT', 'gdo:Phase', 'gdo:S.Score', ...
💡 parsing feature names of X stored in slot 'var'
✅    161 terms (93.10%) are validated for symbol
❗    12 terms (6.90%) are not validated for symbol: CTC-467M3.1, HIST1H4K, CASC1, LARGE, NBPF16, C1orf65, IBA57-AS1, KIAA1239, TMEM75, AP003419.16, FAM65C, C14orf177
✅    linked: FeatureSet(uid='TaxmtIcrLnBBppJ4Hw10', n=172, dtype='float', registry='bionty.Gene', hash='y1Qo897t3gp9S3it4dz6', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅    3 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='6Kzka3aKTJl6y9fQCFm4', n=3, registry='Feature', hash='FYda-45Zb5SzhryU1Bjg', created_by_id=1)
💡 parsing feature names of X stored in slot 'var'
✅    4 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='n8GRsGVK1Y14FzDKE6rX', n=4, dtype='float', registry='bionty.CellMarker', hash='o8EDT805HnP0Fmk4uZ9e', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅    2 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='IezWh7MQRbhdknJISqPR', n=2, registry='Feature', hash='cGffD98oe4NiAEVzivGW', created_by_id=1)
💡 parsing feature names of X stored in slot 'var'
✅    12 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='yaiw6vNtA3LfAQE8I54c', n=12, dtype='float', registry='Feature', hash='2Zj96aEN3NOmGzWFOUcm', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅    3 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='AYCpsZDuipoTAO3kKiLH', n=3, registry='Feature', hash='sZaszdacI5nt2t4xaVr8', created_by_id=1)
💡 parsing feature names of X stored in slot 'var'
✅    111 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='ylCuW7IOMQtcFLAojAtQ', n=111, dtype='float', registry='Feature', hash='9-DEEW2MDh2sLPxtTDQ1', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅    1 term (100.00%) is validated for name
✅    linked: FeatureSet(uid='HeZK9mBIvotiwRTmJ8j7', n=1, registry='Feature', hash='2h-YnMpBbZpOF-p4ntvz', created_by_id=1)
✅ saved 9 feature sets for slots: 'obs','['rna'].var','['rna'].obs','['adt'].var','['adt'].obs','['hto'].var','['hto'].obs','['gdo'].var','['gdo'].obs'
artifact.describe()
Artifact(uid='4PtGqkJOxdPNcR3SizdY', description='Sub-sampled MuData from Papalexi21', suffix='.h5mu', type='dataset', accessor='MuData', size=545560, hash='bDhIaWgBTQTbk5BFV6kORw', hash_type='md5', n_observations=200, visibility=1, key_is_virtual=True, updated_at='2024-07-21 16:24:15 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal'
  Labels
    .experimental_factors = 'cell hashing'
    .ulabels = 'Perturbed', 'NT', 'rep3', 'rep1', 'rep2'
  Features
    'perturbation' = 'Perturbed', 'NT'
    'replicate' = 'rep3', 'rep1', 'rep2'
    'technique' = 'cell hashing'
  Feature sets
    'obs' = 'perturbation', 'replicate'
    '['rna'].var' = 'SH2D6', 'ARHGAP26-AS1', 'GABRA1', 'HLA-DQB1-AS1', 'SPACA1', 'VNN1', 'CTAGE15', 'PFKFB1', 'TRPC5', 'RBPMS-AS1', 'CA8', 'CSMD3', 'ZNF483'
    '['rna'].obs' = 'nFeature_RNA', 'percent.mito', 'nCount_RNA'
    '['adt'].var' = 'CD86', 'PDL1', 'PDL2', 'CD366'
    '['adt'].obs' = 'nCount_ADT', 'nFeature_ADT'
    '['hto'].var' = 'rep1-tx', 'rep1-ctrl', 'rep2-tx', 'rep2-ctrl', 'PDL1g1-tx', 'PDL1g1-ctrl', 'PDL1g2-tx', 'PDL1g2-ctrl', 'rep3-tx', 'rep3-ctrl', 'rep4-tx', 'rep4-ctrl'
    '['hto'].obs' = 'technique', 'nCount_HTO', 'nFeature_HTO'
    '['gdo'].var' = 'eGFPg1', 'CUL3g1', 'CUL3g2', 'CUL3g3', 'CMTM6g1', 'CMTM6g2', 'CMTM6g3', 'NTg1', 'NTg2', 'NTg3', 'NTg4', 'NTg5', 'NTg7', 'PDL1g1', 'PDL1g2', 'PDL1g3', 'ATF2g1', 'ATF2g2', 'ATF2g3', 'ATF2g4'
    '['gdo'].obs' = 'nCount_GDO'
# clean up test instance
!rm -r test-multimodal
!lamin delete --force test-multimodal
💡 deleting instance testuser1/test-multimodal