Thu Mar 23

Introduction to XML

If you are a software developer, you are probably already familiar with XML. XML (Extensible Markup Language) is a widely used format for organizing and sharing data in a structured manner. This article does not cover every detail of XML, but it should give you enough background for what we discuss in another article about [XXE injection](https://binaryte.com/blog/xml-external-entities-xxe-injection).

What is XML?

XML is designed to be readable by both humans and machines. It is primarily used for transporting data and is sometimes used for storage. Although it looks similar to HTML, the two differ in how they use tags.

XML structure

XML metadata

At the very top of an XML file, you will usually find something like this.

<?xml version="1.0" encoding="UTF-8"?>

This is the XML declaration (referred to in this article as metadata), where you define the XML version and the character encoding of the document.
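As a small illustration, Python's standard library can write this first line for us. This is only a sketch: the element name `root` is a placeholder, and the `xml_declaration` argument assumes Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET

# Build a tiny document; the element name "root" is only a placeholder.
root = ET.Element("root")

# Ask the serializer to emit the XML declaration explicitly.
document = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
print(document.decode("utf-8"))
```

The serializer writes the declaration with single quotes, which XML accepts just as well as double quotes.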

Root element

XML documents are structured like a “tree” with exactly one “root”, which can contain several “branches” and “leaves”.
Example:

<root>
  <branch1>
    <leaf1></leaf1>
    <leaf2></leaf2>
  </branch1>
  <branch2></branch2>
</root>
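To see the tree shape from a program's point of view, here is a minimal sketch that parses the example with Python's standard `xml.etree.ElementTree` module and lists each element indented by its depth. The helper name `outline` is our own and not part of any library.

```python
import xml.etree.ElementTree as ET

xml_text = """<root>
  <branch1>
    <leaf1></leaf1>
    <leaf2></leaf2>
  </branch1>
  <branch2></branch2>
</root>"""


def outline(element, depth=0):
    """Return the element names in document order, indented by depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(xml_text)
print("\n".join(outline(root)))
```

Iterating over an element yields its direct children, so the recursion mirrors the root, branch, and leaf structure described above.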

No predefined tags

Unlike HTML, XML doesn’t use predefined tags, so you can name your tags with arbitrary words. HTML uses tags like <p> and <h1>, while in XML you can name a tag anything you want, like <name> or <city>. However, every XML element must be closed, either with a matching closing tag or as a self-closing tag such as <city/>.

**Case sensitive**

As mentioned before, you can give a tag any name in XML. However, tag names are case sensitive, so the opening and closing tags must match exactly. The tag <name> must be closed with </name>; you can’t use something like </Name>.
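A quick way to check this rule is to hand both variants to a parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `is_well_formed` is a name we made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<name>Bob</name>"))  # matching case
print(is_well_formed("<name>Bob</Name>"))  # closing tag differs in case
```

The second document is rejected because the parser treats `name` and `Name` as two different element names.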

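**Using entity references**

Some characters have special meaning in XML. To use them as literal text, you write the corresponding entity reference instead. Strictly speaking, only < and & must always be escaped in element content; the other three are required only in certain contexts (for example, a quote character inside an attribute value delimited by the same quote), but escaping all five is always safe.

| Character | Entity reference |
| --- | --- |
| < | &lt; |
| > | &gt; |
| & | &amp; |
| ' | &apos; |
| " | &quot; |

Example:

<object>Bob & Alice's car</object> <!--not allowed: raw ampersand-->
<object>Bob &amp; Alice&apos;s car</object> <!--allowed-->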
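In practice you rarely escape by hand. As a sketch, Python's standard `xml.sax.saxutils.escape` replaces `&`, `<`, and `>` for us, and accepts an extra mapping for the quote characters. The sample text is made up for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample text chosen for illustration only.
raw_text = "Bob & Alice's car"

# escape() handles &, < and > by default; the dict adds the apostrophe.
escaped = escape(raw_text, {"'": "&apos;"})
document = "<object>" + escaped + "</object>"

# Parsing turns the entity references back into the original characters.
parsed_text = ET.fromstring(document).text

print(escaped)
print(parsed_text)
```

The round trip shows that entity references only exist in the serialized form; after parsing, the application sees the plain characters again.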

Using attribute values

The attribute value must be enclosed in either single or double quotes. An element may have one or more attributes (e.g. <name gender="male" birth="07-07-2007">Bob</name>).
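Once parsed, attributes are available as simple key and value pairs. Here is a minimal sketch with Python's standard `xml.etree.ElementTree`, using the same element as the example above.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring('<name gender="male" birth="07-07-2007">Bob</name>')

# Single quotes around the values are parsed to exactly the same result.
same_element = ET.fromstring("<name gender='male' birth='07-07-2007'>Bob</name>")

print(element.text)           # the element content
print(element.attrib)         # all attributes as a dictionary
print(element.get("gender"))  # one attribute by name
```

Attribute values are always strings; any conversion to dates or numbers is left to the application.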

XML Entity and DTD

XML Internal Entities

Previously, we talked about entity references. You can think of an entity like a variable in a programming language. The five entities mentioned earlier are predefined entities. If there are predefined entities, can we also define entities of our own? The answer is yes.

By using a DTD (Document Type Definition), you can define your own entities. The DTD is placed after the XML metadata line and before the root element. Let’s see how it works with the example below.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE person [ <!ENTITY name "Bob"> ]>

<person>
    <name>&name;</name>
    <gender>Male</gender>
</person>

The DOCTYPE declaration tells the XML parser that a DTD follows. Inside it, the ENTITY declaration defines our own entity called “name” with the value “Bob”. In the XML element, we reference the entity by writing its name between an ampersand (&) and a semicolon (;).
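To watch the substitution happen, we can parse a document like this one and read the element back. This is a sketch using Python's standard `xml.etree.ElementTree`, which expands internal entities declared in the document's own DTD; the XML declaration line is left out only to keep the string short.

```python
import xml.etree.ElementTree as ET

xml_text = """<!DOCTYPE person [ <!ENTITY name "Bob"> ]>
<person>
    <name>&name;</name>
    <gender>Male</gender>
</person>"""

person = ET.fromstring(xml_text)

# The parser has already replaced &name; with the declared value.
expanded_name = person.find("name").text
print(expanded_name)
```

The application never sees `&name;` itself, only the replacement text, which is exactly why entities behave like variables.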

XML External Entities

An entity can refer to many kinds of data. So far, we have only used it to hold a simple value like “Bob”. Because that value is defined inside the document itself, it is considered an internal entity.

An entity’s content does not have to be defined inside the document; it can also be loaded from outside. This is what we call an external entity. Its content is identified by a URI, which can point to a local file (e.g. “file:///path/to/secret.txt”) or to a remote resource (e.g. “https://example.com/”).

For example, have a look at the XML below.

<?xml version="1.0" ?>
<!DOCTYPE uri [ <!ENTITY secret SYSTEM "file:///path/to/secret.txt"> ]>
<uri>&secret;</uri>

The keyword SYSTEM tells the XML parser that the entity is external. It is followed by the URI (Uniform Resource Identifier) of the resource to load; a URL (Uniform Resource Locator) is one kind of URI. You can then reference this entity the same way as in the previous example.
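Whether an external entity is actually loaded depends on the parser and its configuration. As one data point, Python's documentation states that `xml.etree.ElementTree` does not expand external entities and raises a `ParseError` when one occurs. The sketch below assumes that documented behavior; the helper `parse_or_explain` is our own name.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0" ?>
<!DOCTYPE uri [ <!ENTITY secret SYSTEM "file:///path/to/secret.txt"> ]>
<uri>&secret;</uri>"""


def parse_or_explain(text):
    """Return the root element's text, or the reason the parser refused."""
    try:
        return ET.fromstring(text).text
    except ET.ParseError as error:
        return "rejected: " + str(error)


result = parse_or_explain(xml_text)
print(result)
```

Other parsers may be configured to fetch the referenced resource instead, so this refusal should not be assumed for XML libraries in general.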

XML Parameter Entities

Previously, we learned how internal and external entities work. Unlike both, a parameter entity can only be referenced inside the DTD. It is declared with a % (percent sign) before its name and referenced with % instead of an ampersand (&).

<?xml version="1.0" ?>
<!DOCTYPE text [
  <!ENTITY % uri "<!ENTITY secret SYSTEM 'file:///path/to/secret.txt'>">
  %uri;
]>
<text>&secret;</text>

After the parser expands %uri;, this is equivalent to the following.

<?xml version="1.0" ?>
<!DOCTYPE text [
  <!ENTITY % uri "<!ENTITY secret SYSTEM 'file:///path/to/secret.txt'>">
  <!ENTITY secret SYSTEM 'file:///path/to/secret.txt'>
]>
<text>&secret;</text>

Note that the percent sign (%) must be followed by a space before the name of the parameter entity (e.g. <!ENTITY % uri "blah"> is valid, <!ENTITY %uri "blah"> is not).
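To make the substitution concrete, here is a toy sketch that imitates it with plain text replacement. It is not a real XML parser: the regular expression and the function `expand_parameter_entities` are our own simplifications, and they only handle double-quoted values like the ones in the example.

```python
import re

# Matches declarations such as <!ENTITY % uri "...">. The whitespace after
# the percent sign is required, so "<!ENTITY %uri ..." does not match.
PARAMETER_ENTITY = re.compile(r'<!ENTITY\s+%\s+(\w+)\s+"([^"]*)"\s*>')


def expand_parameter_entities(dtd_text):
    """Replace each %name; reference with its declared replacement text."""
    declared = dict(PARAMETER_ENTITY.findall(dtd_text))
    return re.sub(
        r"%(\w+);",
        lambda match: declared.get(match.group(1), match.group(0)),
        dtd_text,
    )


dtd_text = """<!ENTITY % uri "<!ENTITY secret SYSTEM 'file:///path/to/secret.txt'>">
%uri;"""

expanded = expand_parameter_entities(dtd_text)
print(expanded)
```

The printed result has the `secret` declaration where `%uri;` used to be, which is the textual equivalence the two listings above describe.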

**Summary**

Now we have covered the basics of XML, which are useful for anyone working in information technology. XML is widely used on the web for web services and RSS feeds, and for data exchange between different systems and programming languages. In addition, XML is the foundation of other technologies such as SOAP and SVG.