The Lord of the Rings

# XML Basics with Python ## **Introduction to XML** XML stands for Extensible Markup Language. It's a text-based markup language derived from Standard Generalized Markup Language (SGML). XML tags identify the data and are used to store and organize the data, rather than specifying how to display it. It is designed to be self-descriptive and simple, yet robust and powerful. XML is a way to structure data for sharing between applications, making it an important part of web services, including SOAP, RSS, and RESTful APIs. **XML vs HTML** You might have heard of HTML (Hyper Text Markup Language), another markup language used primarily for creating web pages. While they share some similarities, they also have some crucial differences: * XML was designed to store and transport data, while HTML was designed to display data. * In XML, tags are not predefined; you must define your own tags, whereas in HTML, tags are predefined. * XML is case sensitive, whereas HTML is not. ### **XML Syntax** XML has a simple syntax and uses a set of rules for encoding documents in a format that is both human-readable and machine-readable. **Example:** Let's look at an XML document that contains information about a book: ```xml The Lord of the Rings J.R.R. Tolkien 1954 ``` This XML document includes the following components: * Prolog: `` - This is the XML prolog, it defines the XML version and the character encoding used in the document. * Root Element: `...` - This is the root element of the XML document, and it contains all other elements. * Child Elements: ``, `<author>`, and `<year>` - These are child elements of the `<book>` element. ### **XML Elements** XML elements form the building blocks of an XML document. They are defined by a start tag and an end tag with the content inserted in between. **Example:** In the XML document above, `<title>The Lord of the Rings` is an element. Here, `` is the start tag, `The Lord of the Rings` is the content, and `` is the end tag. **XML Attributes** XML elements can have attributes, which provide additional information about the element. **Example:** ```xml xmlCopy code The Lord of the Rings J.R.R. Tolkien 1954 ``` Here, the `book` element has an attribute `category` with a value of `fantasy`, and the `title` element has an attribute `lang` with a value of `en`. ## **Advanced XML Concepts** ### **XML Document Structure** XML documents form a tree-like structure that starts at "the root" and branches to "the leaves". This structure is called the XML tree. Elements can nest other elements, forming parent-child relationships. For example: ```xml Gambardella, Matthew XML Developer's Guide Computer 44.95 2000-10-01 An in-depth look at creating applications with XML. ``` In this example, `catalog` is the root element, `book` is a child element of `catalog`, and `author`, `title`, `genre`, `price`, `publish_date`, and `description` are child elements of `book`. ### **XML Validation** It's crucial for an XML document to be properly formatted or "well-formed". Beyond being well-formed, an XML document can also be valid. A valid XML document is one that conforms to a specific Document Type Definition (DTD) or XML Schema Definition (XSD). ```xml ]> Tove Jani Reminder Don't forget me this weekend! ``` In this example, we use a DTD to specify the structure of the `note` element. This ensures the `note` element always contains `to`, `from`, `heading`, and `body` elements in this specific order. ### **XML Namespaces** XML namespaces are used to avoid element name conflicts. When using prefixes in XML, a namespace for the prefix must be defined. The namespace can be defined by an xmlns attribute in the start tag of an element. ```xml Apples Bananas African Coffee Table 80 120 ``` In this example, we define two XML namespaces, one for HTML and one for furniture elements. This ensures there is no name collision between the two `table` elements. ## **XPath and XQuery** XPath and XQuery are languages used to select nodes from an XML document. XPath stands for XML Path Language, while XQuery stands for XML Query Language. They both provide ways to navigate through elements and attributes in an XML document. **XPath** XPath uses path expressions to select nodes in an XML document. The node is selected by following a path or steps. Here are some useful XPath expressions: * `/` Selects from the root node * `nodename` Selects nodes in the document with the name "nodename" * `//` Selects nodes in the document from the current node that match the selection no matter where they are * `.` Selects the current node * `..` Selects the parent of the current node * `@` Selects attributes Here is an example of using XPath in Python with the `lxml` library: ```python from lxml import etree # parse XML tree = etree.parse('books.xml') # select all 'book' nodes books = tree.xpath('//book') for book in books: print(etree.tostring(book)) # select 'title' of first 'book' title = tree.xpath('//book[1]/title/text()')[0] print('Title: ', title) ``` **XQuery** XQuery is used to extract data from XML documents. It can also be used for computations, such as arithmetic and string manipulations. It is a fully featured and powerful language, but it can be complex for beginners. Python's support for XQuery is limited, but you can use it in other contexts, such as in databases that support XML, like eXist-db. Here is an example of an XQuery: ```xml for $x in doc("books.xml")/catalog/book where $x/price > 30 return $x/title ``` This query will return the titles of all books in the `books.xml` document where the price is greater than 30. ## **Working with XML in Python** ### **Python XML Parsing** Python provides a number of modules for parsing XML, including the built-in `xml.etree.ElementTree` module (commonly shortened to `ElementTree` or `ET`), and the third-party `lxml` and `xmltodict` modules. Here's how to parse an XML document using `ElementTree`: ```python import xml.etree.ElementTree as ET tree = ET.parse('items.xml') root = tree.getroot() ``` In this example, the `parse` function parses the XML file and returns an `ElementTree` object. We then use the `getroot` method to get the root `Element` object of the XML document. ### **Navigating XML with ElementTree** Once you've parsed an XML document, you can navigate through its elements to read or modify its data. ```python for child in root: print(child.tag, child.attrib) for elem in root.iter('description'): print(elem.text) ``` In the first example, we iterate over the child elements of the root element, printing each element's tag name and attribute dictionary. In the second example, we use the `iter` method to find all 'description' elements at any depth within the root element. ### **Modifying XML with ElementTree** `ElementTree` allows you to modify XML by adding, modifying, or deleting elements and attributes. After making changes, you can write the modified XML tree back to a file. ```python for price in root.iter('price'): price.text = '9.99' tree.write('items.xml') ``` In this example, we iterate over all 'price' elements and change their text to '9.99'. We then write the modified XML tree back to the file 'items.xml'. ### **Parsing XML with xmltodict** `xmltodict` is a Python module that makes working with XML feel more like working with JSON (and therefore with Python dictionaries). It can parse an XML document into a dictionary, and serialize a dictionary into an XML document. ```python import xmltodict with open('items.xml') as fd: doc = xmltodict.parse(fd.read()) print(doc['root']['book']['title']) ``` In this example, we open the XML file and read its contents, then pass the contents to `xmltodict.parse`, which returns a Python dictionary. We can then access XML elements as we would dictionary keys. ### **XPath with Python lxml** XPath is a language for navigating XML documents and selecting elements. It's more powerful and flexible than the basic navigation methods provided by ElementTree. `lxml` is a Python XML processing library that is compatible with ElementTree but also supports XPath and other XML technologies. To use XPath with `lxml`, first parse your XML document using `lxml.etree` instead of `xml.etree.ElementTree`: ```python from lxml import etree tree = etree.parse('items.xml') root = tree.getroot() ``` Now you can use the `xpath` method to execute XPath expressions: ```python # select all 'book' elements books = root.xpath('//book') # select 'book' elements with a 'category' attribute of 'cooking' cooking_books = root.xpath('//book[@category="cooking"]') # select the 'title' element of the first 'book' element title = root.xpath('//book[1]/title')[0] ``` In the first example, `//book` selects all 'book' elements in the document. In the second example, `//book[@category="cooking"]` selects 'book' elements that have a 'category' attribute with a value of 'cooking'. In the third example, `//book[1]/title` selects the 'title' element of the first 'book' element. ### **Generating XML with ElementTree and lxml** Both ElementTree and lxml allow you to generate new XML documents from scratch. Here's how to create an XML document using ElementTree: ```python root = ET.Element('root') book = ET.SubElement(root, 'book') book.set('category', 'cooking') title = ET.SubElement(book, 'title') title.text = 'Everyday Italian' ET.ElementTree(root).write('items.xml') ``` This example creates a root 'root' element, then creates a 'book' subelement with a 'category' attribute, and a 'title' subelement with a text value. Finally, it writes the XML document to a file. Here's the equivalent code using lxml: ```python root = etree.Element('root') book = etree.SubElement(root, 'book') book.set('category', 'cooking') title = etree.SubElement(book, 'title') title.text = 'Everyday Italian' tree = etree.ElementTree(root) tree.write('items.xml', pretty_print=True) ``` This code does the same thing, but with the added benefit of the `pretty_print` option, which formats the XML with indentation for easier reading. ## **Working with XML Namespaces** XML namespaces are used to avoid name conflicts in XML documents. They are declared using the `xmlns` attribute in the start tag of an element. Here is an example: ```xml Apples Bananas African Coffee Table 80 120 ``` In this XML document, two namespaces are defined: '' and ''. They are assigned the prefixes 'h' and 'f', respectively. These prefixes are used to qualify the names of elements in these namespaces. Here is how to parse this XML document with lxml and access elements in these namespaces: ```python from lxml import etree # define namespace map nsmap = {'h': 'http://www.w3.org/TR/html4/', 'f': 'http://www.w3schools.com/furniture'} # parse XML tree = etree.parse('namespaces.xml') root = tree.getroot() # select 'table' elements in the 'http://www.w3.org/TR/html4/' namespace html_tables = root.xpath('//h:table', namespaces=nsmap) # select 'table' elements in the 'http://www.w3schools.com/furniture' namespace furniture_tables = root.xpath('//f:table', namespaces=nsmap) ``` The `xpath` method's `namespaces` argument is used to provide a mapping from namespace prefixes to namespace URIs. This mapping is used to resolve the namespace prefixes in the XPath expression. ## **JSON Conversions** JSON (JavaScript Object Notation) is a lightweight data-interchange format that's easy for humans to read and write and easy for machines to parse and generate. Given its compatibility with many languages and its simpler structure, JSON has become the de facto standard for server-client data exchange, replacing XML in many instances. However, there are still numerous occasions when data stored as XML needs to be processed in a JSON-friendly environment or vice versa. Let's see how we can achieve this in Python. We'll use the `xmltodict` module to convert XML data to JSON. This module essentially makes working with XML feel like you are working with JSON, as it maps XML data to Python's dictionary structure, which can then easily be converted to JSON. Here's an example: ```python import json import xmltodict def convert(xml_str): data_dict = xmltodict.parse(xml_str) json_data = json.dumps(data_dict) return json_data xml = """ Everyday Italian Giada De Laurentiis 2005 30.00 Harry Potter J K. Rowling 2005 29.99 """ print(convert(xml)) ``` Just as we converted XML to JSON in the previous section, we may also need to convert JSON to XML. We'll continue to use the `xmltodict` module for this conversion, as it provides a method `unparse()` for the reverse conversion. Let's take a look at how to convert JSON to XML: ```python import json import xmltodict def convert(json_str): data_dict = json.loads(json_str) xml_data = xmltodict.unparse(data_dict) return xml_data json_data = """ { "bookstore": { "book": [ { "@category": "COOKING", "title": { "@lang": "en", "#text": "Everyday Italian" }, "author": "Giada De Laurentiis", "year": "2005", "price": "30.00" }, { "@category": "CHILDREN", "title": { "@lang": "en", "#text": "Harry Potter" }, "author": "J K. Rowling", "year": "2005", "price": "29.99" } ] } } """ print(convert(json_data)) ``` In this example, we're first converting the JSON string to a Python dictionary using `json.loads()`, then converting the dictionary to an XML string with `xmltodict.unparse()`.