Observatories and Research Facilities for EUropean Seismology
Volume 6, no 2 December 2004 Orfeus Newsletter

QuakeML - An XML schema for seismology

Danijel Schorlemmer, Adrian Wyss, Silvio Maraini, Stefan Wiemer, and Manfred Baer

Swiss Seismological Service, ETH Zürich, Switzerland


Abstract

We propose an extensible format definition for seismic data (QuakeML) using XML, the eXtensible Markup Language. Sharing data and seismic information efficiently is one of the most important issues for research and observational seismology in the future. Therefore, the seismological community needs a flexible, machine-independent representation of seismic data to meet the needs of increased interconnectivity and real-time data exchange.

Introduction

Seismic data consist of epicenter parameters, station parameters, seismic phase readings, macroseismic observations, shake maps, seismic waveforms, velocity models of the Earth's structure, etc. Nowadays these data are generally represented either in binary or in fixed-column (ASCII) format. Both representations are inflexible and, in the case of binary data, machine dependent. In addition, they are typically not designed with standardized reading in mind, e.g., through the use of consistent separators. While in the past computer speed and storage capacity were strong arguments in favor of binary representation of data, they are no longer limiting factors today.

The eXtensible Markup Language (XML) is playing an increasingly important role in the exchange of a variety of data. Many business applications, especially stock exchanges, rely on XML as their backbone for data interchange. Due to its extensible definition capabilities, its wide acceptance and the large number of existing utilities and libraries for XML, the definition of a 'QuakeML' standard to form a structured representation of all types of seismological data will be of great value.

Such a 'QuakeML' standard, properly defined as a multi-layer definition, could provide the community with one single standard format covering parameter, phase, and waveform data, according to the needs of the user. We propose a 3-layer definition of QuakeML: Layer 1 provides parameter data such as multiple hypocenter locations (e.g., automatic and manual locations), date/time, and magnitude. Additional information, e.g., improved quality information, can optionally be added. This layer is suited to seismic network bulletins and earthquake statistics research. Layer 2 adds pick times and related information and is thus suitable for tomographic studies or extended network bulletins. Layer 3 contains waveforms, making it the most comprehensive description of an earthquake.

Why XML?

One of the major problems related to data exchange in seismology arises from the different needs in storing information. Although many earthquake parameters are common to most earthquake catalogs, these catalogs differ in their selection and format of stored parameters, making the definition of a practical common format for earthquake data almost impossible until now. To achieve one format for seismological data interchange, the underlying technique must allow for user-specific extensions without compromising the format definition or making the data files unreadable for other users. This requirement rules out tab-separated or column-oriented ASCII files.

A future, more versatile format should meet additional requirements. We propose to use open standards only in order to make the implementation platform and system independent. Furthermore, open source software and multi-platform tools should be available for working with data in the new format. This is important in order to assure royalty-free access to software that is needed to work with the new format.

We selected XML for the format definition because it meets all our requirements and is already widely used in scientific and especially business applications. The switch to an XML data representation offers several advantages. The seismological community is traditionally quick to reconsider its computational setup, procedures and data handling as new technologies emerge. With the omnipresence of the Internet, data exchange has become a natural and easy procedure. Remarkably, however, we still use the old data formats and data exchange procedures. In recent years, the World Wide Web Consortium (W3C) has developed numerous standards and recommendations for data representation and handling (see Appendix). They reflect the increasingly recognized need for easy and flexible data exchange. XML is at the center of these technologies. It is more than a meta-language for describing object-oriented data representations designed for use on the Internet: XML is probably the most flexible data representation available. Its main advantages are:
  • Tagged ASCII-files: Any XML-file is a plain ASCII-file. The information is coded with tags. This makes XML-files human (and machine) readable and platform independent.
  • XML Schema (XSD): Schemas, themselves expressed in XML, provide a comprehensive format definition language for describing custom XML formats. They can be used to validate XML files with a parser.
  • Parser: A parser is a program that analyzes the grammatical structure of an input with respect to a given formal grammar, here the schema. Open source parsing and validating tools are available for many platforms as well as for many programming languages. Most XML parsers use the platform- and language-neutral interfaces Simple API for XML (SAX) or the Document Object Model (DOM) (Wood et al., 1999) to parse an XML document into objects of a programming language. A great variety of such parsers exists for most programming languages, e.g., Xerces (Java, C++), Expat (C), the XML-Fortran project implementing SAX (Fortran), and Open XML (Delphi, Pascal), each offering a professional toolkit for working with XML-files.
  • Individual extensibility: Any XML-definition can readily integrate additional data. This makes individual extensions of QuakeML possible without compromising the validity. Considering the aforementioned layers as extensions, any program dealing with a certain layer of our definition can use any catalog with higher layer definitions. For example, import routines supporting layer 1 can without any modification import layer 2 files while ignoring the additional data fields.
  • Stylesheet transformation (XSLT): With XSLT, any XML-file can be transformed into another XML-file (e.g., separating certain values, performing queries), into HTML-pages for websites or web applications, or into simple ASCII-files (CSV style) for importing data into existing programs. XSLT processors (e.g., Xalan for Java/C++ and Saxon for Java) use eXtensible Stylesheet Language (XSL) files as instructions. Using XSL-FO (formatting objects), PDF output is possible. No complex programs have to be written for transforming the information into web-suitable formats.
  • Binding: Binding provides a fast and convenient way to bind XML Schemas to a programming language's object model, making it easy for developers to incorporate XML data and processing functions into applications. The binding API translates the XML schema definitions into an object model of a programming language. Several binding APIs are available: e.g., JAXB (Java), the Castor Project (Java), and LMX (C++).
In general, XML data files can be used to store data. When dealing with relatively large amounts of data, as is common in seismological observatories, simple file handling becomes unsuitable. In this case, the use of XML databases should be considered. Even SQL databases can be used for data storage, either by developing suitable import and export filters, or by using an XML-wrapper component that converts XML-files into relational data structures. In the latter case, the SQL database behaves like an XML database. Many database applications provide XML support nowadays. Considering that most observatories that use databases are already storing their data in SQL databases, import and export filters seem to be the appropriate solution. QuakeML itself is mainly designed for information interchange, not as a storage format.

QuakeML

The QuakeML definition, described in the XSD schema language, is divided into several layers, see Figure 1. Layer 1, the basic layer, contains the necessary earthquake parameters as used in earthquake catalogs or bulletins and optionally basic quality descriptions. A preliminary layer 1 definition has been completed at the Swiss Seismological Service (SED) and is described in this article. Layer 2 is designed to extend layer 1 with pick information without changing any definition already made for layer 1. This work is in progress at the SED. Layer 2 is meant to be used for tomographic studies or extended earthquake catalogs with pick information. Layer 3 again extends layer 2 by adding waveforms. Here we propose to use the XML-SEED definition (Tsuboi and Morino, 2004) and hence incorporate XML-SEED into QuakeML.

Figure 1: Multi-layer approach. Every layer consists of mandatory and optional data fields. Layer 2 is an extension to layer 1.

A closer look at layer 1: Earthquake parameters

Every event in QuakeML consists of one or more locations. This makes it possible to manage multiple locations from different sources and also to store all locations, from the first automatic location through manual to revised locations. Every location consists of a unique identifier, origin time, latitude, longitude, depth, and magnitude. It is additionally accompanied by information about the author of the location, the type of event, and the region, to match the needs of earthquake bulletins. The origin time is separated into year, month, day, hour, minute, and second. Although an XML data type for date/time information exists, we chose to separate the values to facilitate storing historic catalogs, where part of this information may not be available. With the XML data type, complete date/time information would have to be given.

Usually, earthquake parameters in bulletins come as a plain ASCII file, see Figure 2.
DATE       TIME (UTC)  LAT   LON      Z   MAG    T  AUTHOR REGION

2004/09/28 17:15:24.0  35.8N 120.4W   7   M 6.0  M  NEIC   CENTRAL CALIFORNIA
2004/09/29 17:10:04.0  36.0N 120.5W  11   M 5.0  M  NEIC   CENTRAL CALIFORNIA
2004/09/30 18:54:28.0  36.0N 120.5W  10   M 5.0  M  NEIC   CENTRAL CALIFORNIA
Figure 2: Parkfield earthquake and two aftershocks in a plain ASCII representation.

The plain ASCII example data of Figure 2 may be translated into a QuakeML representation (Figure 3).
<quakeml>
  <event unique_id="EV_01">
    <location main="true" unique_id="LOC_01" analysis-type="M">
      <origin-date timezone="00:00">
        <year>2004</year>
        <month>09</month>
        <day>28</day>
        <hour>17</hour>
        <minute>15</minute>
        <seconds>24.0</seconds>
      </origin-date>
      <latitude error="0">35.8</latitude>
      <longitude error="0">-120.4</longitude>
      <depth unit="km" error="0">7</depth>
      <magnitude unit="M" error="0">6.0</magnitude>
      <region>CENTRAL CALIFORNIA</region>
      <author>NEIC</author>
    </location>
  </event>

  <event unique_id="EV_02">...</event>
  <event unique_id="EV_03">...</event>

</quakeml>
Figure 3: A simplified QuakeML data example.
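The translation from Figure 2 to Figure 3 can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the converter used at the SED: the function names are hypothetical, the depth unit is assumed to be km as in Figure 3, and the error attributes are left out because the bulletin line carries no error information.

```python
import xml.etree.ElementTree as ET


def signed_coordinate(text):
    """Turn '35.8N' or '120.4W' into a signed decimal degree value."""
    value, hemisphere = float(text[:-1]), text[-1].upper()
    return -value if hemisphere in ("S", "W") else value


def bulletin_line_to_event(line, number):
    """Build an <event> element from one bulletin line laid out as in Figure 2."""
    # Nine whitespace-separated fields, then the region (which may contain spaces).
    (date, time, lat, lon, depth,
     mag_type, mag, analysis, author, region) = line.split(None, 9)

    event = ET.Element("event", {"unique_id": f"EV_{number:02d}"})
    location = ET.SubElement(event, "location", {
        "main": "true",
        "unique_id": f"LOC_{number:02d}",
        "analysis-type": analysis,
    })
    # The bulletin times are UTC, written as in Figure 3.
    origin = ET.SubElement(location, "origin-date", {"timezone": "00:00"})
    tags = ("year", "month", "day", "hour", "minute", "seconds")
    for tag, value in zip(tags, date.split("/") + time.split(":")):
        ET.SubElement(origin, tag).text = value

    ET.SubElement(location, "latitude").text = str(signed_coordinate(lat))
    ET.SubElement(location, "longitude").text = str(signed_coordinate(lon))
    ET.SubElement(location, "depth", {"unit": "km"}).text = depth  # unit assumed
    ET.SubElement(location, "magnitude", {"unit": mag_type}).text = mag
    ET.SubElement(location, "region").text = region.strip()
    ET.SubElement(location, "author").text = author
    return event


line = "2004/09/28 17:15:24.0  35.8N 120.4W   7   M 6.0  M  NEIC   CENTRAL CALIFORNIA"
root = ET.Element("quakeml")
root.append(bulletin_line_to_event(line, 1))
print(ET.tostring(root, encoding="unicode"))
```

The sketch relies on the column order of Figure 2; a bulletin with a different layout would need its own field list.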

The XML representation can now be validated against an XML Schema definition. Schemas support rigorous definitions by making it possible to define constraints for every data parameter and to provide metadata such as physical units.

As can be seen, the location tag contains several attributes. The first attribute indicates whether or not the location is the prioritized main solution for an event among other solutions. The second attribute is a unique identifier. The last attribute classifies the location as an automatically or manually derived solution. The first parameter tag of the location tag is the origin date, consisting of several sub-tags. The origin date tag has a time zone attribute. The geographic coordinates come with an error attribute. Latitude and longitude have to be floating-point values in the ranges from -90 to 90 and from -180 to 180, respectively. The depth is also a floating-point value. Its physical unit is held in an attribute. The magnitude tag comes with two attributes: one is a simple error value and the other indicates the magnitude type. The last two tags contain a place name for the location and the author of the origin. As mentioned before, this selection of parameters may be extended very easily, as described in the next section.
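How an application might use these attributes and constraints is sketched below in Python with the standard library parser. The sketch picks the main location of an event and checks the coordinate ranges by hand, which is what a validating parser would do from the schema. The function names, the second (non-main) location with its coordinates, and the analysis-type value "A" for an automatic solution are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# LOC_00 and the value "A" are invented for this example.
QUAKEML = """
<quakeml>
  <event unique_id="EV_01">
    <location main="false" unique_id="LOC_00" analysis-type="A">
      <latitude error="0">35.9</latitude>
      <longitude error="0">-120.3</longitude>
    </location>
    <location main="true" unique_id="LOC_01" analysis-type="M">
      <latitude error="0">35.8</latitude>
      <longitude error="0">-120.4</longitude>
      <depth unit="km" error="0">7</depth>
      <magnitude unit="M" error="0">6.0</magnitude>
    </location>
  </event>
</quakeml>
"""


def main_location(event):
    """Return the location flagged as main solution, or None if there is none."""
    for location in event.findall("location"):
        # A missing 'main' attribute is treated as true (schema default).
        if location.get("main", "true") in ("true", "1"):
            return location
    return None


def checked_coordinates(location):
    """Return (latitude, longitude) after checking the allowed ranges."""
    lat = float(location.findtext("latitude"))
    lon = float(location.findtext("longitude"))
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude {lat} outside [-90, 90]")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude {lon} outside [-180, 180]")
    return lat, lon


root = ET.fromstring(QUAKEML)
best = main_location(root.find("event"))
print(best.get("unique_id"), checked_coordinates(best))
```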

Collecting earthquake lists, presenting them on the Internet and sending alarms is the goal of many web applications. To demonstrate the power of QuakeML, we developed such a web application at www.quakeml.ethz.ch. The core of this application is a QuakeML file as shown in Figure 3; the corresponding XML schema is illustrated in the appendix.

QuakeML fits: Extension and customization

As mentioned before, our basic QuakeML definition can be extended with additional data fields (tags) to customize it according to different needs. An example of such a customization is discussed below. Because of a Switzerland-specific Cartesian coordinate system used at the SED, we had to extend the data model as illustrated in Figure 4 and Figure 5. This extension is based on the original QuakeML schema; the extended data file can nevertheless be read by any application that implements import according to the original QuakeML schema, because the additional data will simply be ignored.
<location main="true" unique_id="LOC999" analysis-type="M">
  <origin-date timezone="00:00">
    <year>2003</year>
    <month>02</month>
    <day>22</day>
    <hour>20</hour>
    <minute>41</minute>
  </origin-date>
  <latitude>48.4</latitude>
  <longitude>6.5</longitude>
  <magnitude unit="ML">5.5</magnitude>
  <region>FRANCE</region>
  <author>SED</author>
</location>

<my_location main="true" unique_id="LOC999" analysis-type="M">
  <origin-date timezone="00:00">
    <year>2004</year>
    <month>06</month>
    <day>21</day>
    <hour>23</hour>
    <minute>10</minute>
  </origin-date>
  <latitude>47.503</latitude>
  <longitude>7.711</longitude>
  <magnitude unit="ML">5.5</magnitude>
  <region>FRANCE</region>
  <author>SED</author>
  <swissX>620</swissX>
  <swissY>261</swissY>
</my_location>
Figure 4: Simplified XML data and its extension.

Figure 5: XML Schema of the type 'location' and an example of a user-specific extension.
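Why an importer written for the original definition is not disturbed by such an extension can be shown with a short Python sketch. The reader below asks only for the tags it knows from layer 1, so the Swiss coordinates are never touched. The function name and the list of tags are assumptions for this example; the data are the extended location of Figure 4 without its origin date.

```python
import xml.etree.ElementTree as ET

EXTENDED = """
<my_location main="true" unique_id="LOC999" analysis-type="M">
  <latitude>47.503</latitude>
  <longitude>7.711</longitude>
  <magnitude unit="ML">5.5</magnitude>
  <region>FRANCE</region>
  <author>SED</author>
  <swissX>620</swissX>
  <swissY>261</swissY>
</my_location>
"""

# Tags a layer 1 importer knows about (assumed subset for illustration).
LAYER1_TAGS = ("latitude", "longitude", "depth", "magnitude", "region", "author")


def read_layer1(location):
    """Collect the known layer 1 fields; unknown child tags are never looked at."""
    record = {"unique_id": location.get("unique_id")}
    for tag in LAYER1_TAGS:
        record[tag] = location.findtext(tag)  # None when the tag is absent
    return record


record = read_layer1(ET.fromstring(EXTENDED))
print(record)
```

The reader takes the element it is handed regardless of its name. An importer that searches a file for elements named location would additionally have to be told about the name my_location.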

Creating a data format

Software development is a complicated process and becomes increasingly complex with the number of people involved. A plan is essential for a system like a new data format; without it, the software development process could spin out of control. Having a conceptual blueprint helps solve problems not only during the initial development stage but also when maintaining or revising the format later.

The Unified Modeling Language (UML) (Rumbaugh et al., 1998) is a widely accepted graphical language for visualizing, specifying, constructing and documenting the structure of a software system or data model. It can be used as a blueprint for the development of QuakeML.

The following section describes a possible path from modeling seismic data to an XML Schema format. We show a three-level design approach (see Figure 6) according to Routledge et al. (2002). These levels are software development levels and have nothing to do with the proposed information layers of QuakeML.

Figure 6: Three level design approach according to Routledge et al. (2002).

The first level is the conceptual level. With the help of a Use Case Diagram we group the seismic data and find their relationships. The result is a first structuring of a seismic data format as can be seen in Figure 7. A Use Case may become a future class or a package in the Class Diagrams. Because XML allows hierarchical data structures, we used arrows to indicate this hierarchy.

Figure 7: The conceptual level. A QuakeML document can contain one or more events, while each event can contain one or more locations.

A class is an abstraction of a real-world object. In the UML Class Diagram Model we show the abstract states and relationships of real objects. It is a very close representation of the real source code and forms the logical level of the three-level design approach. This level consists of three steps according to Bird et al. (2000). In the first step we create simple data types. These data types serve as building blocks. A geographical coordinate, for example, could be such a building block. In the next step we model more complex types, grouping major facts. These complex data types may correspond to the use cases of the conceptual level. A seismic location and its parameters, for example, could be such a complex data type. In the last step of the logical level we create elements derived from the complex types and build the relations between these elements.

Figure 8: UML Class Diagram according to the logical level.

In the physical model we translate the logical model into an implementation language, in this case XML Schema (Figure 9).
<!-- location -->
<xs:complexType name="type_location">
  <xs:sequence>
    <xs:element name="origin-date" type="type_date" minOccurs="0"/>
    <xs:element name="latitude" type="types:type_LatLon" minOccurs="0"/>
    <xs:element name="longitude" type="types:type_LatLon" minOccurs="0"/>
    <xs:element name="depth" type="types:type_Depth" minOccurs="0"/>
    <xs:element name="magnitude" type="types:type_Magnitude" minOccurs="0"/>
    <xs:element name="region" type="xs:string" minOccurs="0"/>
    <xs:element name="author" type="xs:string" minOccurs="0"/>
    <xs:element name="methode" type="xs:string" minOccurs="0"/>
  </xs:sequence>
  <xs:attribute name="main" type="xs:boolean" default="true"/>
  <xs:attribute name="unique_id" type="xs:ID"/>
  <xs:attribute name="analysis-type" type="types:enu_LocType" use="optional" default="M"/>
</xs:complexType>
Figure 9: XML Schema representation of the logical model for event locations.
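A schema such as the one in Figure 9 maps naturally onto a class of a programming language. The Python sketch below is a hand-written stand-in for what a binding tool would generate. The class and function names are hypothetical, the numeric fields are assumed to be floating-point values, and the origin date is omitted for brevity. Every element is optional, mirroring minOccurs="0", and the attribute defaults follow the schema.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class Location:
    """Hand-written counterpart of type_location (origin date left out)."""
    unique_id: Optional[str] = None
    main: bool = True            # schema default: true
    analysis_type: str = "M"     # schema default: M
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    depth: Optional[float] = None
    magnitude: Optional[float] = None
    region: Optional[str] = None
    author: Optional[str] = None
    methode: Optional[str] = None  # spelled as in the schema


def _optional_float(element, tag):
    text = element.findtext(tag)
    return None if text is None else float(text)


def bind_location(element):
    """Fill a Location object from a <location> element."""
    return Location(
        unique_id=element.get("unique_id"),
        main=element.get("main", "true") in ("true", "1"),
        analysis_type=element.get("analysis-type", "M"),
        latitude=_optional_float(element, "latitude"),
        longitude=_optional_float(element, "longitude"),
        depth=_optional_float(element, "depth"),
        magnitude=_optional_float(element, "magnitude"),
        region=element.findtext("region"),
        author=element.findtext("author"),
        methode=element.findtext("methode"),
    )


LOCATION = """
<location main="true" unique_id="LOC999" analysis-type="M">
  <latitude>48.4</latitude>
  <longitude>6.5</longitude>
  <magnitude unit="ML">5.5</magnitude>
  <region>FRANCE</region>
  <author>SED</author>
</location>
"""

location = bind_location(ET.fromstring(LOCATION))
print(location)
```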

Advantages

Below we highlight some of the advantages of XML and QuakeML, respectively:

Historic catalogs

Missing data in catalogs of historic seismicity introduce problems when using fixed-column formats. Often only the day but not the time of a historic event is known. In fixed-column formats, this lack of information requires either special codes to mark the missing data or blank columns, which creates problems in the design of import filters. Additional semantic rules are required. Likewise, errors in time and location can be large (years in historic catalogs or even thousands of years in paleoseismic analyses), sometimes exceeding the field widths originally anticipated in the fixed-column format. In QuakeML, any information except the year is optional and can be extended by error information of any length and precision.
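Reading such an incomplete origin date needs no special codes: an importer simply collects the sub-tags that are present. The following Python sketch illustrates this with an invented historic entry for which only year, month, and day are given. The function names and all values in the sample are made up for the example.

```python
import xml.etree.ElementTree as ET

# Invented historic entry: the time of day is unknown and therefore absent.
HISTORIC = """
<location unique_id="LOC_H1">
  <origin-date>
    <year>1650</year>
    <month>09</month>
    <day>21</day>
  </origin-date>
  <region>EXAMPLE REGION</region>
</location>
"""

DATE_TAGS = ("year", "month", "day", "hour", "minute", "seconds")


def origin_parts(location):
    """Return the origin-date fields that are present, as numbers."""
    origin = location.find("origin-date")
    parts = {}
    if origin is None:
        return parts
    for tag in DATE_TAGS:
        text = origin.findtext(tag)
        if text is not None:
            parts[tag] = float(text) if tag == "seconds" else int(text)
    return parts


def resolution(parts):
    """Name of the finest date field that is known, or None."""
    known = [tag for tag in DATE_TAGS if tag in parts]
    return known[-1] if known else None


parts = origin_parts(ET.fromstring(HISTORIC))
print(parts, resolution(parts))
```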

An additional possible layer, which we have not defined so far, could store macroseismic information, thus extending layer 1 information for historic catalogs. For each event, all macroseismic observations could be grouped together. Due to the flexibility of XML, this extension could also be used for modern catalogs to add macroseismic information if available. Commonly, the sizes of historic earthquakes are described through epicentral intensities instead of magnitudes. Compiling a catalog with only epicentral intensities per event instead of magnitudes would not compromise the QuakeML format, and programs capable of importing layer 1 catalogs would be able to import such a catalog.

Different interpretations of the historic information can lead to totally different locations of an event. The ability of QuakeML to store multiple locations per event encourages keeping all available information in the catalog without losing the possibility of easily importing the data. Combined catalogs, containing modern earthquake data as well as historic events, are virtually the same from a technical point of view. So far, we do not know of any catalog format definition that offers this flexibility.

XSLT

The eXtensible Stylesheet Language Transformations (XSLT) standard rounds off the concept of separating data from its presentation. While the data are stored in XML files, these files are not meant for presenting the data. This task can be accomplished very conveniently using stylesheets (XSL) and XSLT. The main advantage of this approach is the availability of fully developed XSLT engines. Only the stylesheets (again XML files) need to be designed. With these stylesheets, the XSLT processor can generate almost any desired target format:
  • XML: It can again generate XML files, thereby performing queries on the original XML file or simply reshaping the files. Sorting is also possible.
  • HTML: A very common target format for presenting data on the web is HTML. Because HTML is also a 'tag'-format, transformation from XML to HTML is easily realized.
  • ASCII: For importing QuakeML data into existing software, which expects ASCII-files of specific format, these ASCII-files can also be generated with XSLT. The respective stylesheet can also contain sorting or querying capabilities. This output format can be considered the interface between modern XML representation of data and the 'old' more restricted ASCII-files.
  • SVG: The Scalable Vector Graphics (SVG) format is increasingly important for geographic information systems as well as in web-based applications. It is, like XSL or XSD, again an XML definition and describes vector images. Native SVG image editors are available and web-browsers are currently introducing SVG support.
  • RSS: News feeds (e.g., at the USGS) are implemented using RSS (again an XML definition). With XSLT, any earthquake information can readily be transformed to RSS.
  • PDF etc.: Even non-'tag' formats are possible with XSLT and associated toolkits. Using XSL-FO (formatting objects), PDF output is possible. Generation of JPG images is also possible, e.g., plotting a marker at a certain position in an image.
This incomplete list highlights some of the possibilities of XSLT. XSLT-processors are readily available for many platforms (some under open-source licenses) and do not require development on the user's side. With XSLT, any earthquake information in QuakeML can automatically be transformed into multiple formats, for web presentation, news feeds, bulletins etc.
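To make the ASCII target concrete, the sketch below produces a CSV listing sorted by magnitude from a small QuakeML file. It is written in plain Python with the standard library and is not XSLT; it only imitates by hand what a stylesheet with a sort instruction would instruct an XSLT processor to do. The function name and the choice of columns are assumptions, and the sketch assumes that every location carries a magnitude.

```python
import csv
import io
import xml.etree.ElementTree as ET

QUAKEML = """
<quakeml>
  <event unique_id="EV_01">
    <location main="true" unique_id="LOC_01" analysis-type="M">
      <latitude>35.8</latitude>
      <longitude>-120.4</longitude>
      <magnitude unit="M">6.0</magnitude>
    </location>
  </event>
  <event unique_id="EV_02">
    <location main="true" unique_id="LOC_02" analysis-type="M">
      <latitude>36.0</latitude>
      <longitude>-120.5</longitude>
      <magnitude unit="M">5.0</magnitude>
    </location>
  </event>
</quakeml>
"""


def quakeml_to_csv(xml_text):
    """Return a CSV listing of all locations, sorted by ascending magnitude."""
    root = ET.fromstring(xml_text)
    rows = []
    for event in root.findall("event"):
        for location in event.findall("location"):
            rows.append([
                event.get("unique_id"),
                location.findtext("latitude"),
                location.findtext("longitude"),
                location.findtext("magnitude"),
            ])
    # Assumes every location has a magnitude; float(None) would fail otherwise.
    rows.sort(key=lambda row: float(row[3]))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["event", "latitude", "longitude", "magnitude"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = quakeml_to_csv(QUAKEML)
print(csv_text)
```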

Individual extensions of QuakeML

Any XML definition can be extended in two ways: by including additional, user-specific fields (tags) as described above, or by embedding it in another XML definition. A QuakeML file can be included in any given XML file, which is particularly useful for generating earthquake alarms or event notifications in XML. In this case, an XML wrapper describing the alarm includes a full QuakeML description of one or more events with their respective locations. This QuakeML part can then easily be extracted using XSLT.
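The embedding of QuakeML in a wrapper can be illustrated with a small Python sketch that extracts the QuakeML part again, here with the standard library parser instead of XSLT. The wrapper tags alarm and message and the function name are hypothetical; no alarm format is defined in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical alarm wrapper around an ordinary QuakeML document.
ALARM = """
<alarm>
  <message>Automatic notification</message>
  <quakeml>
    <event unique_id="EV_01">
      <location main="true" unique_id="LOC_01" analysis-type="M">
        <magnitude unit="M">6.0</magnitude>
        <region>CENTRAL CALIFORNIA</region>
      </location>
    </event>
  </quakeml>
</alarm>
"""


def extract_quakeml(wrapper_text):
    """Return the first <quakeml> element found in the document, or None."""
    wrapper = ET.fromstring(wrapper_text)
    # iter() also visits the root, so a bare QuakeML file is handled as well.
    return next(wrapper.iter("quakeml"), None)


quakeml = extract_quakeml(ALARM)
event_ids = [event.get("unique_id") for event in quakeml.findall("event")]
print(event_ids)
```

Because the extracted element is a complete QuakeML tree, it can be handed unchanged to any routine written for stand-alone QuakeML files.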

Summary

While the use of the internet and the web has quickly become indispensable for the seismological community, the use of other modern techniques such as XML has thus far been largely ignored.

We believe that our community should make use of these new technologies, and we hope that this article, introducing the advantages of XML itself and its potential use in seismology, encourages seismologists to consider XML representations of their data. At the Swiss Seismological Service, we are creating shakemaps based on XML data and have started to implement earthquake parameter data, which will be extended with a new alarm system based on XML. Our QuakeML definition is meant to match the requirements of most seismological networks. Because implementing extensions remains an easy task and does not compromise the format, all networks could readily develop a QuakeML definition for their needs that is immediately readable by other users.

The value of QuakeML increases in proportion to the number of users and observatories applying it. Our vision and hope are that, because of its intrinsic advantages over previous seismological formats, QuakeML will become universally used.

Appendix

The History of XML

The eXtensible Markup Language (XML) is one piece of a long evolution started in the 1970s by Goldfarb (1973), E. Mosher, and R. Lorie at IBM. They worked on an integrated text processing system and invented the Standard Generalized Markup Language (SGML) (Goldfarb, 1973), which was adopted by the ISO in 1986 (SGML, 1986). SGML is an extremely powerful and extensible tool for semantic markup, which is particularly useful for cataloging and indexing data. It can be used to create other markup languages, for example HTML or XML. The shortcoming of SGML is its complexity, especially for the everyday uses of the web. In 1989, T. Berners-Lee at CERN wrote an internal proposal about information management (Berners-Lee, 1989), in which he discusses the problem of losing information in large-scale projects. HTML was therefore originally designed to provide a very simple version of SGML. The same need for simplicity led to the development of XML. In 1996, discussions began which focused on how to define a markup language with the power and extensibility of SGML but with the simplicity of HTML (Bray and Sperberg-McQueen, 1996). Like HTML, XML spread like wildfire. To control this evolution, the World Wide Web Consortium was founded in 1994 by T. Berners-Lee.

The world of XML can be divided into three parts. The first are the XML Accessories, languages intended for wide use to extend the capabilities specified in XML. An example of an XML accessory is the XML Schema language, which extends the definition capability of the XML DTD. XML Transducers are languages intended for transducing some input XML data into some output form. Examples of XML transducers are the stylesheet languages CSS and XSL. XML Applications are languages which define constraints for a class of XML data for some special application area. An example of an XML application is MathML, defined for mathematical data (Figure 10).

Figure 10: Diagram showing the development and dependencies of SGML related technologies. XML and HTML are subsets of SGML.

XSL is an official recommendation of the World Wide Web Consortium (W3C). It provides a language to transform XML documents into other formats. These can be an HTML document, another XML document, a PDF, a JPEG file, or any other format. The XSL stylesheet defines the transformation and an XSLT processor does the work. A first proposal (Adler et al., 1997) was submitted to the W3C in 1997 and became a recommendation in 1999.

XML Schema was submitted to the W3C in 1999 by Malhotra and Maloney (1999) and is maintained by another working group. It has become a recommendation (Fallside, 2000). Schemas serve for describing the structure and constraining the contents of XML documents.

References

  • Adler, S., et al., A proposal for XSL, Tech. rep., W3C, 1997.
  • Berners-Lee, T., Information management: A proposal, Tech. rep., CERN, 1989.
  • Bird, L., Goodchild, A., and Halpin, T., Object role modelling and XML-Schema, Tech. rep., The University of Queensland, Australia, 2000.
  • Bray, T., and Sperberg-McQueen, C. M., Extensible Markup Language (XML), Working draft, W3C, 1996.
  • Fallside, D. C., XML Schema Part 0: Primer, Working draft, W3C, 2000.
  • Goldfarb, C. F., Design considerations for integrated text processing systems, IBM Cambridge Scientific Center Technical Report, 320-2094, 1973.
  • Malhotra, A., and Maloney, M., XML Schema requirements, Tech. rep., W3C, 1999.
  • Routledge, N., Bird, L., and Goodchild, A., UML and XML Schema, Tech. rep., University of Queensland, Australia, 2002.
  • Rumbaugh, J., Jacobson, I., and Booch, G., Unified Modeling Language Reference Manual, Addison-Wesley, 1998.
  • SGML, ISO 8879, 1986.
  • Tsuboi, S., and Morino, S., Conversion of SEED format to XML representation for a new standard of seismic waveform exchange, ORFEUS Electronic Newsletter, in this volume, 2004.
  • Wood, L., Apparao, V., Champion, M., Hors, A. L., Pixley, T., Robie, J., Sharpe, P., and Wilson, C., Document Object Model (DOM) Level 2 Specification Version 1.0, Working draft, W3C, 1999.
Copyright © 2004. Orfeus. All rights reserved.