# Using SearchSDK and EntitySDK to Create Visualizations

### Example: Querying the Search API for Datasets with the Data Types "CODEX" and "LC-MS-untargeted", Retrieving Info on the Corresponding Donors Via the Entity API, and Then Creating Visualizations of the Donor Info with Pandas and Plotly

We begin by importing **hubmap-sdk**, pandas, json, and plotly. Next, we create an instance of both **SearchSdk** and **EntitySdk**, passing each one the service URL of the web service it should talk to (the Search API and the Entity API, respectively). If no service URL is given, the SDK defaults to the PROD version of these web services. We will only be looking at publicly accessible endpoints and entities; however, if you had a Globus Groups token with access to additional privileges, you could pass it as an argument with **'token='**.

In [1]:
import hubmap_sdk
import pandas as pd
import json
from plotly import express as px

# Changing these pandas options would allow us to display an entire dataframe if we desired
pd.options.display.max_rows = None
pd.options.display.max_columns = None

search_instance = hubmap_sdk.SearchSdk(service_url='https://search.api.hubmapconsortium.org/')
entity_instance = hubmap_sdk.EntitySdk(service_url='https://entity.api.hubmapconsortium.org/')

Next, we need to prepare a search query. For this demonstration, we will search for datasets of the data type 'CODEX' and get back their UUIDs.

In [2]:
search_query = {
    "query": {
        "match": {
            "data_types": "CODEX"
        }
    },
    "size": 50,  # This value controls the number of results returned by the search api.
    "stored_fields": "_id"  # For our purposes, we only need the ID of each dataset.
}
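
Because the same query shape is reused later for another data type, it can help to build it with a small function. The sketch below is one way to do that; the name `build_data_type_query` and its parameters are hypothetical (they are not part of hubmap-sdk), and the function only reproduces the dictionary shown above.

```python
def build_data_type_query(data_type, size=50):
    """Return a search query matching datasets of one data type.

    Hypothetical helper (not part of hubmap-sdk); it only rebuilds the
    dictionary used in this notebook.
    """
    return {
        "query": {"match": {"data_types": data_type}},
        "size": size,            # maximum number of hits to return
        "stored_fields": "_id",  # only the ID of each dataset is needed
    }


search_query_example = build_data_type_query("CODEX", size=50)
```

Under these assumptions, `search_query_example` holds the same keys and values as **search_query**.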

Now that we have a query, we can use the **SearchSdk**. We pass the query as a dictionary object to the **search()** method of the search instance we created earlier. The output will be a dictionary, which we save as **results_dict**. 

The basic structure of the results dictionary is:

```python
{
    '_shards': {
        'failed':___,
        'skipped':___ ,
        'successful':___ ,
        'total':___ 
    },
    'hits': {
        'hits': [___],
        'max_score':___,
        'total': {
            'relation':___,
            'value':___
        }
    },
    'timed_out':___,
    'took':___
}
```

We can therefore access the list of hits by using the top-level key _'hits'_ and then the key one level down, also called _'hits'_. We will save this as **list_of_hits**. Lastly, we'll create an empty list, **list_of_datasets**, which we'll use in a moment.

In [3]:
results_dict = search_instance.search(search_query)
list_of_hits = results_dict["hits"]["hits"]
list_of_datasets = []
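
To make the nesting concrete, the sketch below walks a hand-written dictionary with the same shape as the structure shown above and collects the **'\_id'** of each hit. Every value in it, including the placeholder IDs, is invented for illustration and is not a real search result.

```python
# Hand-written stand-in for a search response; every value here is a placeholder.
sample_results = {
    "_shards": {"failed": 0, "skipped": 0, "successful": 1, "total": 1},
    "hits": {
        "hits": [
            {"_id": "placeholder-uuid-1"},
            {"_id": "placeholder-uuid-2"},
        ],
        "max_score": 1.0,
        "total": {"relation": "eq", "value": 2},
    },
    "timed_out": False,
    "took": 1,
}

# The top-level 'hits' holds summary info; the nested 'hits' is the list of matches.
sample_hits = sample_results["hits"]["hits"]
sample_ids = [hit["_id"] for hit in sample_hits]
print(sample_ids)
```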

For each dataset returned to us by the **SearchSdk**, we will retrieve its data via the **EntitySdk**. In a for-loop, we'll call the **get_entity_by_id()** method and pass it the uuid for each hit. This is given by the key **'\_id'**. We assign that output to **dataset** and then add each dataset to **list_of_datasets**. Now we have a list of all the dataset objects that match our search query.

In [4]:
for hit in list_of_hits:
    dataset = entity_instance.get_entity_by_id(hit["_id"])
    list_of_datasets.append(dataset)

Now that we have the datasets, we can find the donors that correspond to each dataset. We'll use EntitySdk's **get_ancestors()** method and pass in the uuid of the dataset. Every dataset should have a donor as one of its ancestors. We can filter through all the ancestors (returned as a list) and keep the ones with **entity_type** 'Donor'. We'll store these donors in the dictionary **donors**. Then we can use the info stored in donors to create a list of records containing the **uuid**, **Age**, and **Sex** of each donor. We serialize this list to JSON and use it to create a pandas dataframe **df**.

In [5]:
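# Collect each dataset's donor, keyed by uuid so a donor shared by several datasets is stored once
donors = {}
for each in list_of_datasets:
    ancestors = entity_instance.get_ancestors(each.uuid)
    for every in ancestors:
        if every.entity_type == "Donor":
            donors[every.uuid] = every

# Build one record per donor with its uuid, Age, and Sex
dataframe_json = []
for each in donors:
    series_json = {"uuid": donors[each].uuid}
    organ_donor_data = donors[each].metadata['organ_donor_data']
    for every in organ_donor_data:
        if every['grouping_concept_preferred_term'] == "Age":
            series_json['Age'] = every['data_value']
        if every['grouping_concept_preferred_term'] == "Sex":
            series_json['Sex'] = every['preferred_term']
    dataframe_json.append(series_json)

# Serialize the records to JSON and load them into a dataframe. Recent pandas versions
# deprecate passing a literal JSON string to read_json, so wrap it in StringIO.
from io import StringIO
json_from_dict = json.dumps(dataframe_json)
df = pd.read_json(StringIO(json_from_dict))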
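
The inner loop above can be isolated into a function that turns one donor's **organ_donor_data** list into a flat record. This is only a sketch: `extract_donor_fields` is a hypothetical name, the sample entries are invented, and it assumes each entry carries the keys used above (**grouping_concept_preferred_term**, **data_value**, **preferred_term**).

```python
def extract_donor_fields(uuid, organ_donor_data):
    """Build one flat record (uuid, Age, Sex) from a donor's metadata entries.

    Hypothetical helper mirroring the loop in the notebook; fields that are
    not present in the metadata are simply left out of the record.
    """
    record = {"uuid": uuid}
    for entry in organ_donor_data:
        term = entry.get("grouping_concept_preferred_term")
        if term == "Age":
            record["Age"] = entry["data_value"]      # value stored for the Age entry
        elif term == "Sex":
            record["Sex"] = entry["preferred_term"]  # label stored for the Sex entry
    return record


# Invented sample entries, only to show the expected shape.
sample_entries = [
    {"grouping_concept_preferred_term": "Age", "data_value": 42, "preferred_term": "Age"},
    {"grouping_concept_preferred_term": "Sex", "data_value": "0", "preferred_term": "Female"},
]
sample_record = extract_donor_fields("placeholder-uuid", sample_entries)
```

Separating the extraction this way makes it easier to add further fields later without touching the loop over donors.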

### Creating the Visualizations

Now that we have our dataframe, we can create visualizations from it using **plotly**. We can show the breakdown between male and female donors with a simple pie chart.

In [6]:
donor_sex_piechart = px.pie(df, names="Sex", color_discrete_sequence=["cadetblue", "blanchedalmond"])
donor_sex_piechart.update_layout(title="Male/Female Breakdown of Donors Corresponding to Datasets of type 'CODEX'", font=dict(size=13))

donor_sex_piechart.show()

We can also create a bar graph from the ages of the donors. First, we'll group the ages into ranges "0-10", "11-20", etc. This is achieved by creating a new list and then adding the grouped age value to the list for each record in the data frame. This new list gets added as its own column, and we create the bar graph using this new column. We can even incorporate the donor sex breakdown into the age bar graph and show the split for each age category.

In [7]:
age_ranges = []
for ind in df.index:
    if df['Age'][ind] >= 81:
        age_ranges.append("81+")
    elif df['Age'][ind] < 11:
        age_ranges.append("0-10")
    elif df['Age'][ind] < 21:
        age_ranges.append("11-20")
    elif df['Age'][ind] < 31:
        age_ranges.append("21-30")
    elif df['Age'][ind] < 41:
        age_ranges.append("31-40")
    elif df['Age'][ind] < 51:
        age_ranges.append("41-50")
    elif df['Age'][ind] < 61:
        age_ranges.append("51-60")
    elif df['Age'][ind] < 71:
        age_ranges.append("61-70")
    elif df['Age'][ind] < 81:
        age_ranges.append("71-80")

df['age_range'] = age_ranges
bargraph = px.bar(df, x="age_range", color="Sex", color_discrete_sequence=["lightskyblue", "lightpink"], category_orders={"age_range": ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81+"]}, height=400)
bargraph.update_layout(title="Age Range Frequency of Donors Corresponding to Datasets of type 'CODEX' Split by Gender", font=dict(size=13))
bargraph.show()
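
The if/elif chain above can also be expressed as a lookup against a sorted list of range boundaries. The sketch below uses the standard-library `bisect` module; `AGE_EDGES`, `AGE_LABELS`, and `age_range_label` are hypothetical names introduced here, and returning `None` for a missing age is an assumption about how such rows should be handled.

```python
import bisect

# Lower bound of every range after the first: ages below 11 fall in "0-10".
AGE_EDGES = [11, 21, 31, 41, 51, 61, 71, 81]
AGE_LABELS = ["0-10", "11-20", "21-30", "31-40", "41-50",
              "51-60", "61-70", "71-80", "81+"]


def age_range_label(age):
    """Map an age to the same range labels used by the if/elif chain."""
    if age is None or age != age:  # None or NaN: no age recorded
        return None
    # bisect_right counts how many edges are <= age, which is the label index.
    return AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)]


example_labels = [age_range_label(a) for a in (5, 11, 20, 80, 81)]
```

With the dataframe from this notebook, the new column could then be built with `df['Age'].map(age_range_label)` instead of the explicit loop.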

We can create similar visualizations now using other data types such as **LC-MS-untargeted** by tweaking the search query. 

In [8]:
search_query_LCMS = {
  "query": {
    "match": {
      "data_types": "LC-MS-untargeted"
    }
  },
    "size": 200,
    "stored_fields": "_id"
}

Running the same code as above but with the new search query, we can get visualizations of the Age and Sex breakdowns for donors of datasets of the type **LC-MS-untargeted**. This code is nearly identical to the code above, so there is no need to break it down. To avoid confusion with the variables above, we've prepended **LCMS_** to each variable name.

In [9]:
LCMS_results_dict = search_instance.search(search_query_LCMS)
LCMS_list_of_hits = LCMS_results_dict["hits"]["hits"]
LCMS_list_of_datasets = []
for hit in LCMS_list_of_hits:
    LCMS_dataset = entity_instance.get_entity_by_id(hit["_id"])
    LCMS_list_of_datasets.append(LCMS_dataset)
LCMS_donors = {}
for each in LCMS_list_of_datasets:
    LCMS_ancestors = entity_instance.get_ancestors(each.uuid)
    for every in LCMS_ancestors:
        if every.entity_type == "Donor":
            LCMS_donors[every.uuid] = every
LCMS_dataframe_json = []
for each in LCMS_donors:
    LCMS_series_json = {"uuid": LCMS_donors[each].uuid}
    LCMS_organ_donor_data = LCMS_donors[each].metadata['organ_donor_data']
    for every in LCMS_organ_donor_data:
        if every['grouping_concept_preferred_term'] == "Age":
            LCMS_series_json['Age'] = every['data_value']
        if every['grouping_concept_preferred_term'] == "Sex":
            LCMS_series_json['Sex'] = every['preferred_term']
    LCMS_dataframe_json.append(LCMS_series_json)
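LCMS_json_from_dict = json.dumps(LCMS_dataframe_json)
# Recent pandas versions deprecate passing a literal JSON string to read_json, so wrap it in StringIO
from io import StringIO
LCMS_df = pd.read_json(StringIO(LCMS_json_from_dict))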
LCMS_donor_sex_piechart = px.pie(LCMS_df, names="Sex", color_discrete_sequence=["aquamarine", "chartreuse"])
LCMS_donor_sex_piechart.update_layout(title="Male/Female Breakdown of Donors Corresponding to Datasets of type 'LC-MS-untargeted'", font=dict(size=13))
LCMS_donor_sex_piechart.show()
LCMS_age_ranges = []
for ind in LCMS_df.index:
    if LCMS_df['Age'][ind] >= 81:
        LCMS_age_ranges.append("81+")
    elif LCMS_df['Age'][ind] < 11:
        LCMS_age_ranges.append("0-10")
    elif LCMS_df['Age'][ind] < 21:
        LCMS_age_ranges.append("11-20")
    elif LCMS_df['Age'][ind] < 31:
        LCMS_age_ranges.append("21-30")
    elif LCMS_df['Age'][ind] < 41:
        LCMS_age_ranges.append("31-40")
    elif LCMS_df['Age'][ind] < 51:
        LCMS_age_ranges.append("41-50")
    elif LCMS_df['Age'][ind] < 61:
        LCMS_age_ranges.append("51-60")
    elif LCMS_df['Age'][ind] < 71:
        LCMS_age_ranges.append("61-70")
    elif LCMS_df['Age'][ind] < 81:
        LCMS_age_ranges.append("71-80")

LCMS_df['age_range'] = LCMS_age_ranges
LCMS_bargraph = px.bar(LCMS_df, x="age_range", color="Sex", color_discrete_sequence=["lightpink", "lightskyblue"], category_orders={"age_range": ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81+"]}, height=400)
LCMS_bargraph.update_layout(title="Age Range Frequency of Donors Corresponding to Datasets of type 'LC-MS-untargeted' Split by Gender", font=dict(size=13))
LCMS_bargraph.show()
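
Since the two runs differ only in the data type and the result size, the search-and-ancestors portion could be wrapped in a function rather than copied. The sketch below is one possible refactoring; `collect_donors` is a hypothetical name, and it relies only on the calls already used in this notebook (**search()**, **get_entity_by_id()**, **get_ancestors()**).

```python
def collect_donors(search_client, entity_client, data_type, size=50):
    """Return {donor uuid: donor entity} for datasets of one data type.

    Hypothetical helper: it chains the same calls used in the notebook
    (search, get_entity_by_id, get_ancestors) and nothing else.
    """
    query = {
        "query": {"match": {"data_types": data_type}},
        "size": size,
        "stored_fields": "_id",
    }
    hits = search_client.search(query)["hits"]["hits"]

    donors = {}
    for hit in hits:
        dataset = entity_client.get_entity_by_id(hit["_id"])
        for ancestor in entity_client.get_ancestors(dataset.uuid):
            if ancestor.entity_type == "Donor":
                donors[ancestor.uuid] = ancestor  # dict keys de-duplicate donors
    return donors
```

For example, `collect_donors(search_instance, entity_instance, "LC-MS-untargeted", size=200)` would stand in for the first part of the cell above, without the need for prefixed variable names.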

### Conclusions 

These visualizations can help us draw conclusions from the data. Notice that the donors for LC-MS-untargeted datasets appear to include a higher percentage of females than the donors for CODEX datasets. They also seem to be generally older. We could just as easily choose other graphs to represent the data differently. If we wanted other data points besides age and sex, we would look for additional values of **grouping_concept_preferred_term** and add them to our dataframe. If we wanted to view something besides donors, we could use a different EntitySdk method; for example, to view revisions of the datasets rather than their ancestors, we could use the **get_previous_revisions()** method. We could also search for different **data_types** entirely, search datasets on properties other than **data_types**, or even search for entities other than datasets, all by modifying our search query. These tools and APIs are flexible, and we can view almost any statistics we can imagine.
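
To back a visual impression such as the sex breakdown with numbers, the share of each category can be computed directly from the per-donor records. The sketch below is a minimal example; `sex_share` is a hypothetical name and the records passed to it are invented, so the resulting fractions say nothing about the real datasets.

```python
from collections import Counter


def sex_share(records):
    """Return the fraction of donors per recorded sex.

    `records` is a list of dicts with a "Sex" key, like the per-donor
    records built earlier; donors without a recorded sex are skipped.
    """
    counts = Counter(r["Sex"] for r in records if r.get("Sex"))
    total = sum(counts.values())
    return {sex: n / total for sex, n in counts.items()} if total else {}


# Invented records, only to show the calculation.
example_share = sex_share([
    {"Sex": "Female"}, {"Sex": "Male"}, {"Sex": "Female"}, {"Sex": "Female"},
])
```

Applied to **dataframe_json** and **LCMS_dataframe_json**, the two results could then be compared side by side.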