Cell BLAST tutorial
[1]:
import time
import warnings
import anndata
import Cell_BLAST as cb
warnings.filterwarnings("ignore")
cb.config.N_JOBS = 4
cb.config.RANDOM_SEED = 0
Preparing database
In this tutorial, we demonstrate how to perform Cell BLAST based on DIRECTi models.
Again, we use the human pancreatic islet dataset Baron_human as an example.
[2]:
baron_human = anndata.read_h5ad("Baron_human.h5ad")
[3]:
%%capture
axes = cb.data.find_variable_genes(baron_human, grouping="donor")
Cell BLAST uses multiple models to increase specificity.
Here we first train 4 DIRECTi models, each with a different random seed.
Please refer to the accompanying DIRECTi notebook for more detailed introduction to model training.
[4]:
%%capture
start_time=time.time()
models = []
for i in range(4):
models.append(cb.directi.fit_DIRECTi(
baron_human, genes=baron_human.var.query("variable_genes").index,
latent_dim=10, cat_dim=20, random_seed=i
))
[INFO] Cell BLAST: Using model path: /tmp/tmp3k89_jx1
[INFO] Cell BLAST: Using model path: /tmp/tmpm1gqyon4
[INFO] Cell BLAST: Using model path: /tmp/tmpj6zf6z9g
[INFO] Cell BLAST: Using model path: /tmp/tmpglx_f5fl
[5]:
print("Time elapsed: %.1fs" % (time.time() - start_time))
Time elapsed: 768.9s
Then we build a Cell BLAST “database” by feeding our previously trained models and the reference dataset.
[6]:
blast = cb.blast.BLAST(models, baron_human)
[INFO] Cell BLAST: Projecting to latent space...
[INFO] Cell BLAST: Fitting nearest neighbor trees...
[INFO] Cell BLAST: Sampling from posteriors...
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
warnings.warn("Setting attributes on ParameterList is not supported.")
[INFO] Cell BLAST: Generating empirical null distributions...
Like DIRECTi models, BLAST objects can be easily saved and loaded.
[7]:
blast.save("./baron_human_blast")
del blast
blast = cb.blast.BLAST.load("./baron_human_blast")
[INFO] Cell BLAST: Using model path: ./baron_human_blast/model_0
[INFO] Cell BLAST: Using model path: ./baron_human_blast/model_1
[INFO] Cell BLAST: Using model path: ./baron_human_blast/model_2
[INFO] Cell BLAST: Using model path: ./baron_human_blast/model_3
[INFO] Cell BLAST: Fitting nearest neighbor trees...
Querying
We load another human pancreatic islet dataset Lawlor to demonstrate the querying process.
Note that we do NOT perform data normalization or gene subsetting here. These should be internally handled by the BLAST object later in querying.
[8]:
lawlor = anndata.read_h5ad("Lawlor.h5ad")
To query the database, we first use the query() method to obtain initial hits in the reference database. This is done by efficient Euclidean distance based nearest neighbor search in the latent space. Nearest neighbors in the latent space of each model will be merged. Though highly efficient, latent space Euclidean distance is not the best metric to determine cell-cell similarity. To increase accuracy and specificity, we also compute posterior distribution distances as well as empirical p-values for these nearest neighbors.
[9]:
start_time = time.time()
lawlor_hits = blast.query(lawlor)
print("Time per query: %.1fms" % (
(time.time() - start_time) * 1000 / lawlor.shape[0]
))
[INFO] Cell BLAST: Projecting to latent space...
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[INFO] Cell BLAST: Doing nearest neighbor search...
[INFO] Cell BLAST: Merging hits across models...
[INFO] Cell BLAST: Computing posterior distribution distances...
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
warnings.warn("Setting attributes on ParameterList is not supported.")
/home/lurw/anaconda3/envs/cbp/lib/python3.9/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
warnings.warn("Setting attributes on ParameterList is not supported.")
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[WARNING] Cell BLAST: 21 out of 769 variables are not found, will be set to zero!
[INFO] Cell BLAST: 'ADGRF5', 'ADGRL4', 'CD24', 'CEMIP', 'CXCL8', 'H19', 'HLA.B', 'HLA.DPA1', 'HLA.DPB1', 'HLA.DRA', 'HLA.DRB1', 'MIR4435.2HG', 'NEURL3', 'PCAT19', 'PECAM1', 'PRSS2', 'SMIM24', 'TPSB2', 'UCA1', 'UNQ6494', 'pk'
[INFO] Cell BLAST: Computing empirical p-values...
Time per query: 72.5ms
Then we use reconcile_models() to pool together informarion from multiple models and filter() the initial hits to obtain significant hits.
[10]:
lawlor_hits = lawlor_hits.reconcile_models().filter(by="pval", cutoff=0.05)
Optionally, we may use the to_data_frames() method to extract detailed information about the query hits.
The return value is a python dict, with query cell names as keys and meta table of query hits as values.
[11]:
hits_dict = lawlor_hits[0:5].to_data_frames()
hits_dict.keys()
[11]:
odict_keys(['1st-61_S27', '1st-C11_S58', '1st-C13_S19', '1st-C15_S3', '1st-C18_S51'])
[12]:
hits_dict["1st-61_S27"]
[12]:
donor | cell_type1 | library | organism | dataset_name | platform | organ | data_type | cell_ontology_class | cell_ontology_id | n_genes | n_counts | __libsize__ | hits | dist | pval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cell_id | ||||||||||||||||
human1_lib1.final_cell_0078 | 1 | beta | lib1 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 2898 | 8881 | 8881.0 | 77 | 11.782958 | 0.005310 |
human1_lib1.final_cell_0170 | 1 | beta | lib1 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 2478 | 7664 | 7664.0 | 169 | 8.440894 | 0.001088 |
human1_lib2.final_cell_0240 | 1 | beta | lib2 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 2096 | 6842 | 6842.0 | 807 | 14.762918 | 0.014100 |
human1_lib2.final_cell_0274 | 1 | beta | lib2 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 2030 | 5178 | 5178.0 | 841 | 12.932139 | 0.007728 |
human1_lib3.final_cell_0339 | 1 | beta | lib3 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 1981 | 7872 | 7872.0 | 1535 | 15.405832 | 0.016093 |
human2_lib1.final_cell_0254 | 2 | beta | lib1 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 2522 | 7224 | 7224.0 | 2190 | 10.192962 | 0.003189 |
human2_lib2.final_cell_0061 | 2 | beta | lib2 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 2963 | 6470 | 6470.0 | 2556 | 11.842522 | 0.005558 |
human2_lib2.final_cell_0398 | 2 | beta | lib2 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 1617 | 4644 | 4644.0 | 2893 | 17.342208 | 0.023050 |
human2_lib3.final_cell_0107 | 2 | beta | lib3 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 2293 | 6811 | 6811.0 | 3202 | 12.777907 | 0.007869 |
human2_lib3.final_cell_0215 | 2 | beta | lib3 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 2104 | 5411 | 5411.0 | 3310 | 12.077229 | 0.004958 |
human2_lib3.final_cell_0543 | 2 | beta | lib3 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 919 | 1718 | 1718.0 | 3638 | 15.293975 | 0.015793 |
human3_lib3.final_cell_0356 | 3 | beta | lib3 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 1961 | 7078 | 7078.0 | 5788 | 12.467091 | 0.007258 |
human3_lib4.final_cell_0739 | 3 | beta | lib4 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 1090 | 3331 | 3331.0 | 7105 | 14.733755 | 0.014210 |
human3_lib4.final_cell_0810 | 3 | beta | lib4 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 1160 | 3007 | 3007.0 | 7176 | 14.287045 | 0.012557 |
human4_lib3.final_cell_0117 | 4 | beta | lib3 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 2072 | 7516 | 7516.0 | 7984 | 11.545067 | 0.005538 |
human4_lib3.final_cell_0458 | 4 | beta | lib3 | Homo sapiens | Baron_human | inDrop | Pancreas | raw | type B pancreatic cell | CL:0000169 | 1923 | 5317 | 5317.0 | 8325 | 15.428988 | 0.016008 |
Finally, we can use the annotate() method to obtain cell type predictions.
[13]:
lawlor_predictions = lawlor_hits.annotate("cell_ontology_class")
For the “Lawlor” dataset, we also have author provided “ground truth” cell type annotations.
By comparing with the “ground truth”, we see that the predictions are quite accurate.
[14]:
fig = cb.blast.sankey(
lawlor.obs["cell_ontology_class"].values,
lawlor_predictions.values.ravel(),
title="Lawlor to Baron_human", tint_cutoff=2
)