@yuanzhou
Last active April 3, 2024 18:40
Indexing analysis 3/2/2024
1. Motivation
- Donor with only one sample but a large number of descendants (24798): https://portal.hubmapconsortium.org/browse/donor/3a0960c7cc8864dcc003165cef9ca040.json
- Donor with the biggest number of samples (14) and relatively small total descendants (101): https://portal.hubmapconsortium.org/browse/donor/4b257b9c1758a98af262c57bc0caa726.json
- Dataset with a lot of ancestors (168) and direct ancestors (72): https://portal.hubmapconsortium.org/browse/dataset/925886f2f50e70a639b95cecfaabac25.json
- Publication with a lot of ancestors (360): https://portal.hubmapconsortium.org/browse/publication/0de8528cf78f686a3e560c1860ddde63.json
2. Initial analysis/observation
- Each descendant/ancestor object contains only the neo4j properties; none seem to be generated by `on_read_trigger`
- There are lots of for loops in search-api just to rename the fields: https://github.com/hubmapconsortium/search-api/blob/889402736a10d44d6efb21cdfd931f398f9a4a60/src/hubmap_translator.py#L940-L957
- `generate_display_subtype()` is executed for each entity during index runtime, which seems unnecessary
3. In-depth analysis
A total of 67 fields get into ES. The majority are neo4j node properties; a few (6) are generated by entity-api `on_read_trigger`.
https://github.com/hubmapconsortium/search-api/blob/main/src/hubmap_translation/neo4j-to-es-attributes.json
[
"hubmap_id",
"entity_type",
"submission_id",
"lab_dataset_id",
"lab_tissue_sample_id",
"created_timestamp",
"uuid",
"registered_doi",
"contacts",
"creators",
"doi_url",
"published_timestamp",
"title",
"last_modified_timestamp",
"donor_metadata_status",
"sample_metadata_status",
"assay_metadata_status",
"data_metric_availability",
"data_processing_level",
"dataset_sign_off_status",
"rui_location",
"data_access_level",
"data_types",
"dataset_type",
"description",
"ingest_metadata",
"metadata",
"image_file_metadata",
"lab_donor_id",
"label",
"portal_metadata_upload_files",
"organ",
"organ_other",
"contains_human_genetic_sequences",
"protocol_url",
"group_uuid",
"created_by_user_displayname",
"created_by_user_email",
"status",
"visit",
"next_revision_uuid",
"next_revision_uuids",
"previous_revision_uuid",
"previous_revision_uuids",
"thumbnail_file",
"retraction_reason",
"sub_status",
"provider_info",
"dataset_info",
"contributors",
"tissue_type",
"dbgap_sra_experiment_url",
"dbgap_study_url",
"sample_category",
"publication_date",
"publication_doi",
"publication_url",
"publication_venue",
"volume",
"issue",
"pages_or_article_num",
"publication_status",
"omap_doi",
"error_message",
"associated_collection",
"creation_action",
"group_name"
]
In the above list, 6 are generated by entity-api `on_read_trigger`:
"associated_collection" (Publication only)
"creation_action"
"next_revision_uuid"
"next_revision_uuids"
"previous_revision_uuid"
"previous_revision_uuids"
Get all the neo4j node properties
```
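// Collect the distinct sets of property keys across all Entity nodes
MATCH (n:Entity)
WITH DISTINCT(keys(n)) as key_sets
// Flatten the key sets, then deduplicate into one list (requires the APOC plugin)
UNWIND(key_sets) as keys
RETURN apoc.coll.toSet(COLLECT(keys))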
```
Total 78 unique properties, including a few (7) that are deprecated and no longer in use
[
"dataset_type",
"registered_doi",
"published_user_email",
"published_user_sub",
"published_user_displayname",
"dbgap_sra_experiment_url",
"dbgap_study_url",
"status",
"published_timestamp",
"lab_dataset_id",
"pipeline_message",
"contributors",
"dataset_info",
"entity_type",
"doi_url",
"ingest_metadata",
"run_id",
"data_types",
"uuid",
"hubmap_id",
"ingest_id",
"created_by_user_sub",
"created_timestamp",
"last_modified_timestamp",
"contains_human_genetic_sequences",
"last_modified_user_email",
"last_modified_user_sub",
"created_by_user_displayname",
"last_modified_user_displayname",
"created_by_user_email",
"description",
"group_name",
"data_access_level",
"contacts",
"group_uuid",
"local_directory_rel_path",
"portal_metadata_upload_files",
"protocol_url",
"image_file_metadata",
"lab_donor_id",
"label",
"metadata",
"submission_id",
"next_identifier",
"protocol_info",
"protocol_file",
"sample_category",
"organ",
"organ_other",
"lab_tissue_sample_id",
"visit",
"rui_location",
"title",
"status_history",
"validation_message",
"thumbnail_file",
"antibodies",
"provider_info",
"displayname",
"hm_uuid",
"open_consent",
"creators",
"created_by_user_display_name",
"image_files",
"data_acess_level",
"local_directory_url_path",
"metadata_files",
"publication_date",
"publication_status",
"publication_doi",
"publication_venue",
"publication_url",
"last_modified_useremail",
"omap_doi",
"issue",
"pages_or_article_num",
"assigned_to_group_name",
"ingest_task"
]
In the above list, 7 are deprecated and no longer being used:
"created_by_user_display_name"
"displayname"
"hm_uuid"
"last_modified_useremail"
"local_directory_url_path"
"protocol_file"
"protocol_info"
Compare the two lists (with the [] and commas removed) using http://www.listdiff.com/compare-2-lists-difference-tool
List A: total 78 neo4j properties (including old deprecated ones)
"antibodies"
"assigned_to_group_name"
"contacts"
"contains_human_genetic_sequences"
"contributors"
"created_by_user_display_name"
"created_by_user_displayname"
"created_by_user_email"
"created_by_user_sub"
"created_timestamp"
"creators"
"data_access_level"
"data_acess_level"
"data_types"
"dataset_info"
"dataset_type"
"dbgap_sra_experiment_url"
"dbgap_study_url"
"description"
"displayname"
"doi_url"
"entity_type"
"group_name"
"group_uuid"
"hm_uuid"
"hubmap_id"
"image_file_metadata"
"image_files"
"ingest_id"
"ingest_metadata"
"ingest_task"
"issue"
"lab_dataset_id"
"lab_donor_id"
"lab_tissue_sample_id"
"label"
"last_modified_timestamp"
"last_modified_user_displayname"
"last_modified_user_email"
"last_modified_user_sub"
"last_modified_useremail"
"local_directory_rel_path"
"local_directory_url_path"
"metadata"
"metadata_files"
"next_identifier"
"omap_doi"
"open_consent"
"organ"
"organ_other"
"pages_or_article_num"
"pipeline_message"
"portal_metadata_upload_files"
"protocol_file"
"protocol_info"
"protocol_url"
"provider_info"
"publication_date"
"publication_doi"
"publication_status"
"publication_url"
"publication_venue"
"published_timestamp"
"published_user_displayname"
"published_user_email"
"published_user_sub"
"registered_doi"
"rui_location"
"run_id"
"sample_category"
"status"
"status_history"
"submission_id"
"thumbnail_file"
"title"
"uuid"
"validation_message"
"visit"
List B: total 67 properties (some are no longer in use) getting into ES during indexing
"assay_metadata_status"
"associated_collection"
"contacts"
"contains_human_genetic_sequences"
"contributors"
"created_by_user_displayname"
"created_by_user_email"
"created_timestamp"
"creation_action"
"creators"
"data_access_level"
"data_metric_availability"
"data_processing_level"
"data_types"
"dataset_info"
"dataset_sign_off_status"
"dataset_type"
"dbgap_sra_experiment_url"
"dbgap_study_url"
"description"
"doi_url"
"donor_metadata_status"
"entity_type"
"error_message"
"group_name"
"group_uuid"
"hubmap_id"
"image_file_metadata"
"ingest_metadata"
"issue"
"lab_dataset_id"
"lab_donor_id"
"lab_tissue_sample_id"
"label"
"last_modified_timestamp"
"metadata"
"next_revision_uuid"
"next_revision_uuids"
"omap_doi"
"organ"
"organ_other"
"pages_or_article_num"
"portal_metadata_upload_files"
"previous_revision_uuid"
"previous_revision_uuids"
"protocol_url"
"provider_info"
"publication_date"
"publication_doi"
"publication_status"
"publication_url"
"publication_venue"
"published_timestamp"
"registered_doi"
"retraction_reason"
"rui_location"
"sample_category"
"sample_metadata_status"
"status"
"sub_status"
"submission_id"
"thumbnail_file"
"tissue_type"
"title"
"uuid"
"visit"
"volume"
Comparison results:
28 only in List A:
"antibodies"
"assigned_to_group_name"
"created_by_user_display_name"
"created_by_user_sub"
"data_acess_level"
"displayname"
"hm_uuid"
"image_files"
"ingest_id"
"ingest_task"
"last_modified_user_displayname"
"last_modified_user_email"
"last_modified_user_sub"
"last_modified_useremail"
"local_directory_rel_path"
"local_directory_url_path"
"metadata_files"
"next_identifier"
"open_consent"
"pipeline_message"
"protocol_file"
"protocol_info"
"published_user_displayname"
"published_user_email"
"published_user_sub"
"run_id"
"status_history"
"validation_message"
17 only in List B:
"assay_metadata_status"
"associated_collection"
"creation_action"
"data_metric_availability"
"data_processing_level"
"dataset_sign_off_status"
"donor_metadata_status"
"error_message"
"next_revision_uuid"
"next_revision_uuids"
"previous_revision_uuid"
"previous_revision_uuids"
"retraction_reason"
"sample_metadata_status"
"sub_status"
"tissue_type"
"volume"
50 in both A and B:
"contacts"
"contains_human_genetic_sequences"
"contributors"
"created_by_user_displayname"
"created_by_user_email"
"created_timestamp"
"creators"
"data_access_level"
"data_types"
"dataset_info"
"dataset_type"
"dbgap_sra_experiment_url"
"dbgap_study_url"
"description"
"doi_url"
"entity_type"
"group_name"
"group_uuid"
"hubmap_id"
"image_file_metadata"
"ingest_metadata"
"issue"
"lab_dataset_id"
"lab_donor_id"
"lab_tissue_sample_id"
"label"
"last_modified_timestamp"
"metadata"
"omap_doi"
"organ"
"organ_other"
"pages_or_article_num"
"portal_metadata_upload_files"
"protocol_url"
"provider_info"
"publication_date"
"publication_doi"
"publication_status"
"publication_url"
"publication_venue"
"published_timestamp"
"registered_doi"
"rui_location"
"sample_category"
"status"
"submission_id"
"thumbnail_file"
"title"
"uuid"
"visit"
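The comparison above can also be reproduced without an external tool. Below is a minimal Python sketch using set operations. It uses a small illustrative subset of the two lists rather than the full 78 and 67 entries, and the variable names are hypothetical.

```python
# Illustrative subsets only; substitute the full List A and List B from above.
neo4j_properties = {"uuid", "hubmap_id", "status", "hm_uuid", "run_id"}
es_fields = {"uuid", "hubmap_id", "status", "creation_action", "volume"}

only_in_a = sorted(neo4j_properties - es_fields)  # in neo4j, never indexed
only_in_b = sorted(es_fields - neo4j_properties)  # indexed, not a neo4j property
in_both = sorted(neo4j_properties & es_fields)    # indexed straight from neo4j

print(only_in_a)
print(only_in_b)
print(in_both)
```

With the full lists, the three results should have 28, 17, and 50 entries respectively.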
4. Proposals based on findings
During indexing, all we need are the node properties from Neo4j directly plus a few `on_read_trigger` generated ones. But the current `GET /entities/<id>` returns ALL trigger-generated properties, which is unnecessary. The index procedure then has to remove the ones that are not specified in the mapping json, another unnecessary step, and rename the fields using a mapping json file with lots of for loops, which is repetitive and a waste of time.
To address the above issues, create a specialized endpoint in entity-api: `GET /documents/<id>`, which returns a subset of the regular entity json to be used as the index document. Of the trigger-generated fields, it includes only the 6 generated by `on_index_trigger`.
The end result is that during index runtime we no longer need to generate all the fields from entity-api (which takes more time) and then remove the unwanted ones via `entity_keys_rename()` so they don't get into ES. This improves performance and avoids unnecessary data fetching and manipulation.
In entity-api:
- Add a new endpoint `GET /documents/<id>` (based on the current `GET /entities/<id>` but with no property filtering needed)
- Introduce a new trigger type: `on_index_trigger`, which has the same trigger method as regular `on_read_trigger`
- Only specify this trigger for fields that are supposed to get into ES via search-api index procedure
- Add a specialized schema_manager method `get_complete_document_result()` which is very similar to `get_complete_entity_result()`, but uses `on_index_trigger` instead. For now do NOT modify `get_complete_entity_result()` just to isolate the impact in case we mess things up.
- For caching, use a different prefix on those index data, `cache_key = f'{_memcached_prefix}_complete_index_{entity_uuid}'`
- Introduce a new flag in schema yaml `indexed: false` (default is true) for indexing purposes. This allows us to remove the use of json mapping in search-api. When `exposed: false` is set there's no need to check `indexed: false`, since the field won't be exposed by entity-api anyway. When only `indexed: false` is set, the field still gets returned by the regular GET call, but not by `GET /documents/<id>`, since we won't index this field.
- Add a specialized schema_manager method `normalize_document_result_for_response()` based on the current `normalize_entity_result_for_response()` and integrate with the `indexed` flag.
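The interaction between `exposed` and `indexed` described above could look roughly like the sketch below. The schema dictionary is a hypothetical, simplified stand-in for the schema yaml, the flag assignments are illustrative rather than the real schema, and `normalize_document_result` is an assumed name, not the actual `normalize_document_result_for_response()` implementation.

```python
# Hypothetical, simplified stand-in for one entity type's schema yaml properties.
# Both flags default to True when omitted, as proposed above.
DATASET_PROPERTIES = {
    "uuid": {},
    "status": {},
    "local_directory_rel_path": {"indexed": False},  # returned by GET, not indexed
    "created_by_user_sub": {"exposed": False},       # never returned at all
}


def normalize_document_result(entity: dict, properties: dict) -> dict:
    """Keep only the fields that are both exposed and indexed."""
    document = {}
    for key, value in entity.items():
        flags = properties.get(key)
        if flags is None:
            continue  # not declared in the schema, dropped in this sketch
        if not flags.get("exposed", True):
            continue  # hidden fields never need the `indexed` check
        if not flags.get("indexed", True):
            continue  # exposed to regular GET calls but kept out of ES
        document[key] = value
    return document


entity = {
    "uuid": "abc",
    "status": "Published",
    "local_directory_rel_path": "some/path",
    "created_by_user_sub": "sub-id",
    "unknown_field": 1,
}
print(normalize_document_result(entity, DATASET_PROPERTIES))
```

Checking `exposed` first mirrors the note above: a field that is never exposed does not need an `indexed` flag at all.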
With the entity-api changes above, the following fields are no longer generated by entity-api only to be removed at search-api index runtime:
Donor (527):
- None, since no on_read_trigger
Sample (42371):
- direct_ancestor
Dataset (4801, including revisions):
- collections
- upload
- direct_ancestors
- local_directory_rel_path
Publication (9 published):
- None
The Collection and Upload index procedures are very different from the above, though similar to each other.
`Collection.datasets` and `Upload.datasets` are both generated by `on_read_trigger`. This can be time-consuming when a collection has lots of datasets. For instance, `3ae4ddfc175d768af5526a010bfe95aa` has 211 datasets, and the GET request takes 8 seconds to generate a 3.6MB payload.
- Rename `Collection.dataset_uuids` (used by POST only) to `Collection.dataset_uuids_to_link` (no side effects since Collection creation is not used by other services). Also update the trigger method to use this new field. (Karl? Since he made Collection creation/update using the generic POST/PUT)
- Rename `Collection.datasets` to `Collection.dataset_uuids`, returning only a list of uuids (requires updating the neo4j query and corresponding search-api tweaks)
- Rename `Upload.datasets` to `Upload.dataset_uuids`, returning only a list of uuids (requires updating the neo4j query and corresponding search-api tweaks)
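To illustrate the payload change, here is a minimal sketch assuming the datasets are available as a list of dicts with a `uuid` key. The helper name and sample data are hypothetical; in practice the neo4j query itself would return only the uuids, so the full objects would never be built.

```python
def to_dataset_uuids(datasets: list[dict]) -> list[str]:
    """Reduce full dataset objects to the list of uuids proposed above."""
    return [dataset["uuid"] for dataset in datasets]


# Two made-up dataset objects standing in for the current `Collection.datasets`.
datasets = [
    {"uuid": "uuid-1", "status": "Published", "title": "first"},
    {"uuid": "uuid-2", "status": "QA", "title": "second"},
]
print(to_dataset_uuids(datasets))
```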
Additionally, move the `display_subtype` field logic from search-api to entity-api. This change will allow us to take advantage of the entity-api caching.
- Donor: Fixed value "Donor"
- Upload: Fixed value "Data Upload"
- Dataset/Publication: `on_index_trigger` to generate from `dataset_type`
- Sample: if `sample_category == "organ"`, use the organ code. Otherwise, use the capitalized `sample_category`
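The rules above can be sketched as a single function. This is an illustration only: the function name is hypothetical, it does not follow the entity-api trigger method signature, and returning `dataset_type` unchanged for Dataset/Publication is an assumption about what "generate from `dataset_type`" means.

```python
def sketch_display_subtype(entity: dict) -> str:
    """Sketch of the display_subtype rules listed above."""
    entity_type = entity["entity_type"]
    if entity_type == "Donor":
        return "Donor"
    if entity_type == "Upload":
        return "Data Upload"
    if entity_type in ("Dataset", "Publication"):
        # Assumption: the dataset_type value is used as-is.
        return entity["dataset_type"]
    if entity_type == "Sample":
        if entity["sample_category"] == "organ":
            return entity["organ"]  # the organ code, per the rule above
        return entity["sample_category"].capitalize()
    raise ValueError(f"Unsupported entity type: {entity_type}")


print(sketch_display_subtype({"entity_type": "Sample", "sample_category": "block"}))
```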
In search-api:
- First confirm some of those [17 only in List B] fields are no longer being used
- Remove the use of this mapping json (https://github.com/hubmapconsortium/search-api/blob/main/src/hubmap_translation/neo4j-to-es-attributes.json) and let the entity-api schema yaml control which fields do not get into ES. Essentially we only need to rename "ingest_metadata" -> "metadata".
- Update the index runtime procedure to use the new `GET /documents/<id>` call to replace the `GET /entities/<id>`.
- Update the index runtime procedure for Collection and Upload to use the new fields `Collection.dataset_uuids` and `Upload.dataset_uuids`.
- Remove the use of `generate_display_subtype()` since the `display_subtype` will be returned by entity-api
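Once the mapping json is gone, the per-document work left in search-api could shrink to something like the sketch below. The function name is hypothetical, and the sketch assumes a document never carries both `ingest_metadata` and `metadata`; if it did, the existing `metadata` value would be overwritten.

```python
def prepare_index_document(document: dict) -> dict:
    """Apply the one remaining rename before sending the document to ES."""
    prepared = dict(document)  # shallow copy so the caller's dict is untouched
    if "ingest_metadata" in prepared:
        prepared["metadata"] = prepared.pop("ingest_metadata")
    return prepared


# Made-up document standing in for a `GET /documents/<id>` response.
source = {"uuid": "abc", "ingest_metadata": {"dag_provenance_list": []}}
print(prepare_index_document(source))
```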
5. Post-release actions
- Remove those deprecated neo4j properties, which also benefits the neo4j v5 upgrade
- Code profiling to measure the performance improvement for both individual index and full index
- Can we move away from the [:USES_DATA] workaround and link publication to a large number of datasets as direct ancestors?