Cell BLAST tutorial

[1]:
import time
import warnings
import anndata
import Cell_BLAST as cb

warnings.filterwarnings("ignore")
cb.config.N_JOBS = 4
cb.config.RANDOM_SEED = 0

Preparing database

In this tutorial, we demonstrate how to perform Cell BLAST based on DIRECTi models.

Again, we use the human pancreatic islet dataset Baron_human as an example.

[2]:
baron_human = anndata.read_h5ad("Baron_human.h5ad")
[3]:
%%capture
axes = cb.data.find_variable_genes(baron_human, grouping="donor")

Cell BLAST uses multiple models to increase specificity.

Here we first train 4 DIRECTi models, each with a different random seed.

Please refer to the accompanying DIRECTi notebook for more detailed introduction to model training.

[4]:
%%capture
start_time=time.time()
models = []
for i in range(4):
    models.append(cb.directi.fit_DIRECTi(
        baron_human, genes=baron_human.var.query("variable_genes").index,
        latent_dim=10, cat_dim=20, random_seed=i
    ))
[INFO] Cell BLAST: Using model path: /tmp/tmp3k89_jx1
[INFO] Cell BLAST: Using model path: /tmp/tmpm1gqyon4
[INFO] Cell BLAST: Using model path: /tmp/tmpj6zf6z9g
[INFO] Cell BLAST: Using model path: /tmp/tmpglx_f5fl
[5]:
print("Time elapsed: %.1fs" % (time.time() - start_time))
Time elapsed: 768.9s

Then we build a Cell BLAST “database” by feeding our previously trained models and the reference dataset.

[6]:
blast = cb.blast.BLAST(models, baron_human)
[INFO] Cell BLAST: Projecting to latent space...
[INFO] Cell BLAST: Fitting nearest neighbor trees...
[INFO] Cell BLAST: Sampling from posteriors...
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
[INFO] Cell BLAST: Generating empirical null distributions...

Like DIRECTi models, BLAST objects can be easily saved and loaded.

[7]:
blast.save("./baron_human_blast")
del blast
blast = cb.blast.BLAST.load("./baron_human_blast")
[INFO] Cell BLAST: Using model path: ./baron_human_blast/model_0
[INFO] Cell BLAST: Using model path: ./baron_human_blast/model_1
[INFO] Cell BLAST: Using model path: ./baron_human_blast/model_2
[INFO] Cell BLAST: Using model path: ./baron_human_blast/model_3
[INFO] Cell BLAST: Fitting nearest neighbor trees...

Querying

We load another human pancreatic islet dataset Lawlor to demonstrate the querying process.

Note that we do NOT perform data normalization or gene subsetting here. These should be internally handled by the BLAST object later in querying.

[8]:
lawlor = anndata.read_h5ad("Lawlor.h5ad")

To query the database, we first use the query() method to obtain initial hits in the reference database. This is done by efficient Euclidean distance based nearest neighbor search in the latent space. Nearest neighbors in the latent space of each model will be merged. Though highly efficient, latent space Euclidean distance is not the best metric to determine cell-cell similarity. To increase accuracy and specificity, we also compute posterior distribution distances as well as empirical p-values for these nearest neighbors.

[9]:
start_time = time.time()
lawlor_hits = blast.query(lawlor)
print("Time per query: %.1fms" % (
    (time.time() - start_time) * 1000 / lawlor.shape[0]
))
[INFO] Cell BLAST: Projecting to latent space...
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[INFO] Cell BLAST: Doing nearest neighbor search...
[INFO] Cell BLAST: Merging hits across models...
[INFO] Cell BLAST: Computing posterior distribution distances...
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[INFO] Cell BLAST: Computing empirical p-values...
Time per query: 72.5ms

Then we use reconcile_models() to pool together informarion from multiple models and filter() the initial hits to obtain significant hits.

[10]:
lawlor_hits = lawlor_hits.reconcile_models().filter(by="pval", cutoff=0.05)

Optionally, we may use the to_data_frames() method to extract detailed information about the query hits.

The return value is a python dict, with query cell names as keys and meta table of query hits as values.

[11]:
hits_dict = lawlor_hits[0:5].to_data_frames()
hits_dict.keys()
[11]:
odict_keys(['1st-61_S27', '1st-C11_S58', '1st-C13_S19', '1st-C15_S3', '1st-C18_S51'])
[12]:
hits_dict["1st-61_S27"]
[12]:
donor cell_type1 library organism dataset_name platform organ data_type cell_ontology_class cell_ontology_id n_genes n_counts __libsize__ hits dist pval
cell_id
human1_lib1.final_cell_0078 1 beta lib1 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 2898 8881 8881.0 77 11.782958 0.005310
human1_lib1.final_cell_0170 1 beta lib1 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 2478 7664 7664.0 169 8.440894 0.001088
human1_lib2.final_cell_0240 1 beta lib2 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 2096 6842 6842.0 807 14.762918 0.014100
human1_lib2.final_cell_0274 1 beta lib2 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 2030 5178 5178.0 841 12.932139 0.007728
human1_lib3.final_cell_0339 1 beta lib3 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 1981 7872 7872.0 1535 15.405832 0.016093
human2_lib1.final_cell_0254 2 beta lib1 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 2522 7224 7224.0 2190 10.192962 0.003189
human2_lib2.final_cell_0061 2 beta lib2 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 2963 6470 6470.0 2556 11.842522 0.005558
human2_lib2.final_cell_0398 2 beta lib2 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 1617 4644 4644.0 2893 17.342208 0.023050
human2_lib3.final_cell_0107 2 beta lib3 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 2293 6811 6811.0 3202 12.777907 0.007869
human2_lib3.final_cell_0215 2 beta lib3 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 2104 5411 5411.0 3310 12.077229 0.004958
human2_lib3.final_cell_0543 2 beta lib3 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 919 1718 1718.0 3638 15.293975 0.015793
human3_lib3.final_cell_0356 3 beta lib3 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 1961 7078 7078.0 5788 12.467091 0.007258
human3_lib4.final_cell_0739 3 beta lib4 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 1090 3331 3331.0 7105 14.733755 0.014210
human3_lib4.final_cell_0810 3 beta lib4 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 1160 3007 3007.0 7176 14.287045 0.012557
human4_lib3.final_cell_0117 4 beta lib3 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 2072 7516 7516.0 7984 11.545067 0.005538
human4_lib3.final_cell_0458 4 beta lib3 Homo sapiens Baron_human inDrop Pancreas raw type B pancreatic cell CL:0000169 1923 5317 5317.0 8325 15.428988 0.016008

Finally, we can use the annotate() method to obtain cell type predictions.

[13]:
lawlor_predictions = lawlor_hits.annotate("cell_ontology_class")

For the “Lawlor” dataset, we also have author provided “ground truth” cell type annotations.

By comparing with the “ground truth”, we see that the predictions are quite accurate.

[14]:
fig = cb.blast.sankey(
    lawlor.obs["cell_ontology_class"].values,
    lawlor_predictions.values.ravel(),
    title="Lawlor to Baron_human", tint_cutoff=2
)