Parsing XMLs in Python safely

Suppose we want to extract the values of a specific field from XML documents served on the web. For example, let’s extract the units of some of the attributes that are available in a ScienceBase item. Here’s how we can do this safely using defusedxml and AsyncRetriever:

import async_retriever as ar
import defusedxml.ElementTree as ET
import pandas as pd

# A list of URLs and query parameters of the target XMLs
urls, kwds = zip(
    *[
        (
            "https://www.sciencebase.gov/catalog/file/get/57867b1be4b0e02680c14ff6",
            {"f": "__disk__e9/8b/ec/e98becc07c2de2b27396f1baefb29fa233b8f0f7"},
        ),
        (
            "https://www.sciencebase.gov/catalog/file/get/5785595ce4b0e02680bf2fd8",
            {"f": "__disk__e7/8b/7d/e78b7dbf4c31e60de6696697e62c04ee688a56d3"},
        ),
        (
            "https://www.sciencebase.gov/catalog/file/get/57dafd3ae4b090824ffc32f1",
            {"f": "__disk__9b/e6/04/9be604f55425b2691158260123a914cd0efae0da"},
        ),
        (
            "https://www.sciencebase.gov/catalog/file/get/573b5344e4b0dae0d5e3ad9c",
            {"f": "__disk__9f/5c/50/9f5c50b2f0da613b0969c574716d13b67903e274"},
        ),
        (
            "https://www.sciencebase.gov/catalog/file/get/57ffb392e4b0824b2d16f4c6",
            {"f": "__disk__dc/e8/d6/dce8d6f3b41e9ec33f9364a352e5e73055cfdc92"},
        ),
    ]
)

# Async retrieval
xmls = ar.retrieve(urls, "text", [{"params": p} for p in kwds])

# Parse the results and convert them to a pandas DataFrame
units = []
for xml in xmls:
    root = ET.fromstring(xml)
    for item in root.findall("./eainfo/detailed/attr"):
        label = item.findtext("attrlabl")
        domain = item.find("attrdomv")
        if domain is None:
            # Skip attributes without a domain; iterating over None would fail
            continue
        # Units are listed under the range domain (rdom) of an attribute
        for u in domain.findall("rdom/attrunit"):
            if u.text:
                units.append((label, u.text.lower()))
units = pd.DataFrame(units, columns=["attr", "unit"])

The first five rows of the resulting dataframe are shown below:

        attr     unit
0  CAT_HLR_1  percent
1  CAT_HLR_2  percent
2  CAT_HLR_3  percent
3  CAT_HLR_4  percent
4  CAT_HLR_5  percent
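
To see what the path expressions in the parsing loop match without any network access, the traversal can be tried on a small inline document. The following is only a minimal sketch under stated assumptions: the `SAMPLE` record and the `extract_units` helper are hypothetical and written for illustration, and the sample is a heavily trimmed stand-in rather than actual ScienceBase metadata. The fallback to the standard-library parser is there only so the sketch runs where defusedxml is not installed; it is acceptable here because the input is a trusted, hard-coded string, and it should not be used for XML fetched from the web.

```python
try:
    import defusedxml.ElementTree as ET  # hardened parser used in this tutorial
except ImportError:
    # Assumption: fallback only so the sketch runs without defusedxml.
    # The sample below is a trusted, hard-coded string.
    import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed metadata record (not taken from ScienceBase)
SAMPLE = """
<metadata>
  <eainfo>
    <detailed>
      <attr>
        <attrlabl>CAT_EXAMPLE_1</attrlabl>
        <attrdomv>
          <rdom>
            <rdommin>0</rdommin>
            <rdommax>100</rdommax>
            <attrunit>Percent</attrunit>
          </rdom>
        </attrdomv>
      </attr>
      <attr>
        <attrlabl>COMID</attrlabl>
        <attrdomv>
          <udom>Unique identifier, no unit</udom>
        </attrdomv>
      </attr>
    </detailed>
  </eainfo>
</metadata>
""".strip()


def extract_units(xml_text):
    """Return (attribute label, lower-cased unit) pairs from one XML string."""
    root = ET.fromstring(xml_text)
    pairs = []
    for attr in root.findall("./eainfo/detailed/attr"):
        label = attr.findtext("attrlabl")
        domain = attr.find("attrdomv")
        if domain is None:
            # No domain information for this attribute, so nothing to extract
            continue
        # Only range domains (rdom) carry a unit; other domain types are skipped
        for unit in domain.findall("rdom/attrunit"):
            if unit.text:
                pairs.append((label, unit.text.lower()))
    return pairs


print(extract_units(SAMPLE))
# prints [('CAT_EXAMPLE_1', 'percent')]
```

The second attribute in the sample has no `rdom` child, so it contributes no row, which mirrors why only attributes with a range domain appear in the dataframe.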