QuakeML - An XML schema for seismology
Danijel Schorlemmer, Adrian Wyss, Silvio Maraini, Stefan Wiemer, and Manfred Baer
Swiss Seismological Service, ETH Zürich, Switzerland
We propose an extensible format definition for seismic data
(QuakeML) using XML,
the eXtensible Markup Language. Sharing data and seismic information efficiently
is one of the most important issues for research and observational seismology in
the future. Therefore, the seismological community needs a flexible, machine
independent representation of seismic data to match with the needs of increased
interconnectivity and real-time data exchange.
Seismic data consist of epicenter parameters, station parameters, seismic phase
readings, macroseismic observations, shake maps, seismic waveforms, velocity
models of the Earth's structure, etc. Nowadays these data are generally represented
either in binary or fixed column (ASCII) format. Both of these data
representations are somewhat inflexible and, in the case of binary data, they are
machine dependent. In addition, they are typically not designed with standardized
reading in mind, e.g., through the use of consistent separators. While in the past
computer speed and storage capacities were a strong argument in favor of binary
representation of data, they are no longer a limiting factor today.
The eXtensible Markup Language (XML) is playing an increasingly important role in
the exchange of a variety of data. Many business applications, especially stock
exchanges, rely on XML as their backbone for data interchange. Due to its
extensible definition capabilities, its wide acceptance and the existing large
number of utilities and libraries for XML, the definition of a 'QuakeML' standard to
form a structured representation of all types of seismological data will be of great value.
Such a 'QuakeML' standard, properly defined as a multi-layer definition, could
provide the community with one single standard format covering parameter, phase,
and waveform data, according to the needs of the user. We propose a 3-layer
definition of QuakeML: Layer 1 provides parameter data like multiple hypocenter
location (e.g., automatic and manual locations), date/time, and magnitude. Additional
information, e.g., improved quality information, can optionally be added. This layer
is suitable for seismic network bulletins and earthquake statistics research. Layer 2
adds pick times and related information and is thus suitable for tomographic studies or
extended network bulletins. Layer 3 contains waveforms, making it the most
comprehensive description of an earthquake.
One of the major problems related to data exchange in seismology arises from the
different needs in storing information. Although many earthquake parameters are common
to most earthquake catalogs, these catalogs differ in their selection and format of
stored parameters, making the definition of a practical common format for earthquake
data almost impossible until now.
For achieving one format for seismological data interchange, the underlying technique
must allow for user-specific extensions without compromising the format definition or
making the data files unreadable for other users. This requirement rules out
the use of tab-separated or column-oriented ASCII files.
A future, more versatile format should meet additional requirements. We propose to use
open standards only in order to make the implementation platform and system
independent. Furthermore, open source software and multi-platform tools should be
available for working with data in the new format. This is important in order to assure
royalty-free access to software that is needed to work with the new format.
We selected XML for the format definition because it meets all our requirements and it
is already widely used in scientific and especially business applications. The switch
to an XML data representation offers several advantages. The seismological community
is traditionally quick to reconsider the computational setup, procedures and data
handling as new technologies emerge. With the omnipresence of the Internet, data
exchange has become a natural and easy procedure. Remarkably, however, we still use the
old data formats and data exchange procedures. In recent years, the
World Wide Web Consortium (W3C) developed numerous
standards and recommendations for data representation and handling (see Appendix). They
reflect the increasingly recognized needs for easy and flexible data exchange.
XML is at the center of these technologies. It is not only a
meta-language for describing object-oriented data representations designed for use on
the Internet; it is also probably the most flexible data representation available. Its
main advantages are:
- Tagged ASCII-files: Any XML-file is a plain ASCII-file. The
information is coded with tags. This makes XML-files human (and machine) readable
and platform independent.
- XML Schema (XSD): Schemas, themselves expressed in XML, provide a
comprehensive format definition language for describing custom XML formats. They can be
used to validate XML files with a parser.
- Parser: A parser is a program that analyses the grammatical
structure of an input, with respect to a given formal grammar, here the schema.
Open source parsing and validating tools are available for many platforms as well as
for many programming languages. Most XML parsers use the platform- and
language-neutral interfaces Simple API for XML (SAX) or the Document Object Model (DOM)
(Wood et al., 1999) to parse an XML document into
objects of a programming language. A great variety of such parsers exists for most
programming languages: e.g., Xerces (Java, C++), Expat (C), the XML-Fortran project
implementing SAX (Fortran), and Open XML (Delphi, Pascal), each offering
a professional toolkit for working with XML-files.
- Individual extensibility: Any XML-definition can readily integrate
additional data. This makes individual extensions of QuakeML possible without
compromising the validity. Considering the aforementioned layers as extensions, any
program dealing with a certain layer of our definition can use any catalog with higher
layer definitions. For example, import routines supporting layer 1 can without any
modification import layer 2 files while ignoring the additional data fields.
- Stylesheet transformation (XSLT): With
XSLT, any XML-file can be transformed into another
XML-file (e.g., separating certain values, performing queries), into HTML-pages for websites
or web applications, or into simple ASCII-files (CSV style) for importing data into existing
programs. XSLT processors (e.g., Xalan
for Java/C++ and Saxon for Java) use eXtensible
Stylesheet Language (XSL) files as instructions. Using XSL-FO (formatting objects), PDF output
is possible. No complex programs have to be written for transforming the information into
web-suitable formats.
- Binding: Binding provides a fast and convenient way to bind XML
Schemas to a programming language's object model, making it easy for developers to
incorporate XML data and processing functions into applications. The binding API
translates the XML schema definitions into an object model of a programming language.
Several binding APIs are available: e.g., JAXB
(Java), the Castor Project (Java), and
LMX (C++).
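The event-driven SAX interface mentioned in the list above can be illustrated with a minimal sketch. It uses Python's standard `xml.sax` module rather than any of the toolkits named in the text, and the tag names are taken from the layer 1 example shown later in Figure 3. The handler class `MagnitudeHandler` and the miniature catalog are illustrative assumptions, not part of QuakeML.

```python
import xml.sax

# Hypothetical miniature catalog; tag names follow the layer 1 example (Figure 3).
QUAKEML = b"""<quakeml>
  <event unique_id="EV_01">
    <location main="true" unique_id="LOC_01">
      <magnitude unit="M">6.0</magnitude>
    </location>
  </event>
  <event unique_id="EV_02">
    <location main="true" unique_id="LOC_02">
      <magnitude unit="M">5.0</magnitude>
    </location>
  </event>
</quakeml>"""


class MagnitudeHandler(xml.sax.ContentHandler):
    """Collect the text of every <magnitude> element as a float."""

    def __init__(self):
        super().__init__()
        self.magnitudes = []
        self._buffer = None  # list of text chunks while inside <magnitude>

    def startElement(self, name, attrs):
        if name == "magnitude":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver element text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "magnitude":
            self.magnitudes.append(float("".join(self._buffer)))
            self._buffer = None


handler = MagnitudeHandler()
xml.sax.parseString(QUAKEML, handler)
print(handler.magnitudes)
```

A SAX handler never holds the whole document in memory, which is why this style suits large catalogs; a DOM-style parser would instead build the full tree first.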
In general, XML-data files can be used to store data. When dealing with relatively large
amounts of data, as commonly done in seismological observatories, simple file handling
becomes unsuitable. In this case, the use of XML databases should be considered. Even
SQL-databases can be used for data storage, either by developing suitable import and export
filters, or by using an XML-wrapper component that converts XML-files into relational data
structures. In the latter case, the SQL database behaves like an XML database. Many database
applications provide XML support nowadays. Considering the fact that most observatories that
use databases are already storing their data in SQL databases, import and export filters seem
to be the appropriate solution. QuakeML itself is mainly designed for information interchange,
not as a storage format.
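An import filter of the kind described above might look like the following sketch. It uses Python's standard `sqlite3` and `xml.etree.ElementTree` modules as stand-ins for an observatory's actual database and parser. The table layout, the column names, and the function `import_quakeml` are assumptions made for illustration, and the depth column assumes the unit is kilometers as in Figure 3.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Sample input following the layer 1 example of Figure 3.
QUAKEML = """<quakeml>
  <event unique_id="EV_01">
    <location main="true" unique_id="LOC_01" analysis-type="M">
      <latitude error="0">35.8</latitude>
      <longitude error="0">-120.4</longitude>
      <depth unit="km" error="0">7</depth>
      <magnitude unit="M" error="0">6.0</magnitude>
      <author>NEIC</author>
    </location>
  </event>
</quakeml>"""


def _number(element, tag):
    """Return the child's text as a float, or None if the child is absent."""
    text = element.findtext(tag)
    return float(text) if text is not None else None


def import_quakeml(xml_text, connection):
    """Import filter: copy every <location> into a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS location ("
        "event_id TEXT, location_id TEXT, latitude REAL, "
        "longitude REAL, depth_km REAL, magnitude REAL, author TEXT)"
    )
    root = ET.fromstring(xml_text)
    for event in root.findall("event"):
        for loc in event.findall("location"):
            connection.execute(
                "INSERT INTO location VALUES (?, ?, ?, ?, ?, ?, ?)",
                (event.get("unique_id"), loc.get("unique_id"),
                 _number(loc, "latitude"), _number(loc, "longitude"),
                 _number(loc, "depth"), _number(loc, "magnitude"),
                 loc.findtext("author")),
            )
    connection.commit()


conn = sqlite3.connect(":memory:")
import_quakeml(QUAKEML, conn)
rows = conn.execute(
    "SELECT event_id, latitude, magnitude, author FROM location"
).fetchall()
print(rows)
```

Optional elements that are missing from a location simply end up as NULL in the table, so the filter does not need special codes for absent values.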
The QuakeML definition, described in the XSD schema language, is divided into several layers,
see Figure 1. Layer 1, the basic layer, contains the necessary earthquake parameters as used
in earthquake catalogs or bulletins and optionally basic quality descriptions. A preliminary
layer 1 definition has been completed at the Swiss Seismological Service (SED) and is
described in this article. Layer 2 is designed to extend layer 1 with pick information without
changing any definition already made for layer 1. This work is in progress at the SED. Layer 2
is meant to be used for tomographic studies or extended earthquake catalogs with pick
information. Layer 3 again extends layer 2 by adding waveforms. Here we propose to use the
XML-SEED definition (Tsuboi and Morino, 2004) and hence incorporate XML-SEED into QuakeML.
Figure 1: Multi-layer approach. Every layer consists of mandatory and optional
data fields. Layer 2 is an extension to layer 1.
A closer look at layer 1: Earthquake parameters
Every event in QuakeML consists of one or more locations. This offers the possibility to manage
multiple locations from different sources and also to store all locations, from the first
automatic through manual to revised locations. Every location consists of a unique identifier,
origin time, latitude, longitude, depth, and magnitude. It is additionally accompanied by
information about the author of this location, the type of event, and the region for matching
the needs of earthquake bulletins. The origin time is separated into year, month, day, hour,
minute, and second. Although an XML data type for date/time information exists, we choose to
separate the values to facilitate storing of historic catalogs where a part of this information
may not be available. When using the XML data type, complete date/time information would have to
be given.
Usually, earthquake parameters in bulletins come as a plain ASCII file, see Figure 2.
DATE TIME (UTC) LAT LON Z MAG T AUTHOR REGION
2004/09/28 17:15:24.0 35.8N 120.4W 7 M 6.0 M NEIC CENTRAL CALIFORNIA
2004/09/29 17:10:04.0 36.0N 120.5W 11 M 5.0 M NEIC CENTRAL CALIFORNIA
2004/09/30 18:54:28.0 36.0N 120.5W 10 M 5.0 M NEIC CENTRAL CALIFORNIA
Figure 2: Parkfield earthquake and two aftershocks in a plain ASCII representation.
The plain ASCII example data of Figure 2 may be translated into a QuakeML representation
(Figure 3).
<quakeml>
<event unique_id="EV_01">
<location main="true" unique_id="LOC_01" analysis-type="M">
<origin-date timezone="00:00">
<year>2004</year>
<month>09</month>
<day>28</day>
<hour>17</hour>
<minute>15</minute>
<seconds>24.0</seconds>
</origin-date>
<latitude error="0">35.8</latitude>
<longitude error="0">-120.4</longitude>
<depth unit="km" error="0">7</depth>
<magnitude unit="M" error="0">6.0</magnitude>
<region>CENTRAL CALIFORNIA</region>
<author>NEIC</author>
</location>
</event>
<event unique_id="EV_02">...</event>
<event unique_id="EV_03">...</event>
</quakeml>
Figure 3: A simplified QuakeML data example.
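The translation from Figure 2 to Figure 3 can be sketched in a few lines of Python using the standard `xml.etree.ElementTree` module. The column interpretation is an assumption: the bulletin line is split on whitespace, and the single-letter token after the depth, which the Figure 2 header does not name, is skipped. The helper names `signed_degrees` and `bulletin_line_to_event` are hypothetical, and the depth unit is assumed to be kilometers as in Figure 3.

```python
import xml.etree.ElementTree as ET


def signed_degrees(value):
    """Turn '35.8N' or '120.4W' into a signed decimal string."""
    number, hemisphere = float(value[:-1]), value[-1].upper()
    return str(-number if hemisphere in ("S", "W") else number)


def bulletin_line_to_event(line, event_id, location_id):
    """Build an <event> element from one whitespace-separated bulletin line."""
    # Assumed layout: date, time, lat, lon, depth, unnamed flag (skipped),
    # magnitude, magnitude type, author, and the region as the remainder.
    (date, time, lat, lon, depth, _flag,
     mag, mag_type, author, region) = line.split(None, 9)
    year, month, day = date.split("/")
    hour, minute, seconds = time.split(":")

    event = ET.Element("event", {"unique_id": event_id})
    loc = ET.SubElement(event, "location",
                        {"main": "true", "unique_id": location_id})
    origin = ET.SubElement(loc, "origin-date", {"timezone": "00:00"})
    for tag, text in zip(("year", "month", "day", "hour", "minute", "seconds"),
                         (year, month, day, hour, minute, seconds)):
        ET.SubElement(origin, tag).text = text
    ET.SubElement(loc, "latitude").text = signed_degrees(lat)
    ET.SubElement(loc, "longitude").text = signed_degrees(lon)
    ET.SubElement(loc, "depth", {"unit": "km"}).text = depth
    ET.SubElement(loc, "magnitude", {"unit": mag_type}).text = mag
    ET.SubElement(loc, "region").text = region.strip()
    ET.SubElement(loc, "author").text = author
    return event


line = "2004/09/28 17:15:24.0 35.8N 120.4W 7 M 6.0 M NEIC CENTRAL CALIFORNIA"
event = bulletin_line_to_event(line, "EV_01", "LOC_01")
print(ET.tostring(event, encoding="unicode"))
```

The sketch omits the error attributes and the analysis type because the bulletin line does not carry that information.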
The XML representation can now be validated against an XML Schema definition. Schemas
support rigorous definitions by offering the possibility of defining constraints
for every data parameter and of providing metadata like physical units.
As can be seen, the location tag contains several attributes. The first attribute holds
the information of whether or not the location is the prioritized main solution for an
event among other solutions. The second attribute is a unique identifier. The last
attribute classifies the location as an automatically or manually derived solution. The first
parameter tag of the location tag is the origin date, consisting of several sub-tags. The
origin date tag has a time zone attribute. The geographic coordinates come with an error
attribute. Their values have to be floating-point numbers in the ranges from -90 to 90 for
latitude and from -180 to 180 for longitude. The depth is also a floating-point value. Its
physical unit is held in an attribute. The magnitude tag comes with two attributes: one
indicates the magnitude type and the other is a simple error value. The last two tags contain
a place name of the location and the author of the origin. As mentioned before, this given
selection of parameters may be extended very easily, as described in the next section.
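In practice a validating parser would enforce such constraints from the schema. As a rough illustration of what that check amounts to, the following hand-written Python sketch tests the coordinate ranges (latitude from -90 to 90, longitude from -180 to 180) on a location element. It is a stand-in, not XSD validation, and the function name `check_location` is an assumption.

```python
import xml.etree.ElementTree as ET

LOCATION = """<location main="true" unique_id="LOC_01" analysis-type="M">
  <latitude error="0">35.8</latitude>
  <longitude error="0">-120.4</longitude>
  <depth unit="km" error="0">7</depth>
  <magnitude unit="M" error="0">6.0</magnitude>
</location>"""

# Allowed ranges in decimal degrees: latitude -90..90, longitude -180..180.
RANGES = {"latitude": (-90.0, 90.0), "longitude": (-180.0, 180.0)}


def check_location(location):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for tag, (low, high) in RANGES.items():
        text = location.findtext(tag)
        if text is None:
            continue  # treated as optional in this sketch
        try:
            value = float(text)
        except ValueError:
            problems.append(f"{tag}: {text!r} is not a number")
            continue
        if not low <= value <= high:
            problems.append(f"{tag}: {value} outside [{low}, {high}]")
    return problems


good = check_location(ET.fromstring(LOCATION))
bad = check_location(ET.fromstring(
    "<location><latitude>35.8</latitude><longitude>200</longitude></location>"
))
print(good)
print(bad)
```

With a schema, these rules live in the format definition itself, so every importer applies the same checks without duplicating code like this.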
Collecting earthquake lists, presenting them on the Internet, and sending alarms is the goal
of many web applications. To demonstrate the power of QuakeML, we developed such a web
application at www.quakeml.ethz.ch. The core of
this application is a QuakeML file as shown in Figure 3 and the corresponding XML schema
is illustrated in the appendix.
QuakeML fits: Extension and customization
As mentioned before, our basic QuakeML definition can be extended with additional data fields
(tags) to customize it according to different needs. An example of such a customization is
discussed below. Because of a Switzerland-specific Cartesian coordinate system used at the SED,
we had to extend the data model as illustrated in Figure 4 and Figure 5. This extension is
based on the original QuakeML schema; however, the extended data file can be read by any
application that implements import according to the original QuakeML schema, because the
additional data will simply be ignored.
<location main="true" unique_id="LOC999" analysis-type="M">
<origin-date timezone="00:00">
<year>2003</year>
<month>02</month>
<day>22</day>
<hour>20</hour>
<minute>41</minute>
</origin-date>
<latitude>48.4</latitude>
<longitude>6.5</longitude>
<magnitude unit="ML">5.5</magnitude>
<region>FRANCE</region>
<author>SED</author>
</location>
<my_location main="true" unique_id="LOC999" analysis-type="M">
<origin-date timezone="00:00">
<year>2004</year>
<month>06</month>
<day>21</day>
<hour>23</hour>
<minute>10</minute>
</origin-date>
<latitude>47.503</latitude>
<longitude>7.711</longitude>
<magnitude unit="ML">5.5</magnitude>
<region>FRANCE</region>
<author>SED</author>
<swissX>620</swissX>
<swissY>261</swissY>
</my_location>
Figure 4: Simplified XML data and its extension.
Figure 5: XML Schema of the type 'location' and an example of a user-specific extension.
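How an importer written for the original definition can skip unknown extension fields is sketched below in Python. The sketch assumes that the extension fields `swissX` and `swissY` appear as additional children of an element the importer already knows; the element name `location`, the field list `LAYER1_FIELDS`, and the function `import_layer1` are illustrative choices, not part of the QuakeML definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical extended record: known layer 1 fields plus the two Swiss
# coordinate fields of Figure 4 as extra child elements.
EXTENDED = """<location main="true" unique_id="LOC999" analysis-type="M">
  <latitude>47.503</latitude>
  <longitude>7.711</longitude>
  <magnitude unit="ML">5.5</magnitude>
  <author>SED</author>
  <swissX>620</swissX>
  <swissY>261</swissY>
</location>"""

# Fields this importer knows about; anything else is never looked at.
LAYER1_FIELDS = ("latitude", "longitude", "depth", "magnitude", "region", "author")


def import_layer1(location):
    """Read only the known fields that are present; skip everything else."""
    return {tag: location.findtext(tag)
            for tag in LAYER1_FIELDS
            if location.find(tag) is not None}


record = import_layer1(ET.fromstring(EXTENDED))
print(record)
```

Because the importer asks for fields by name instead of by column position, extra tags do not shift or corrupt the values it reads.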
Software development is a complicated process and becomes increasingly complex with the number
of people involved. A plan is essential for a system such as a new data format;
without it, the software development process could spin out of control. Having a conceptual
blueprint helps to solve problems not only during the initial development stage but also when
maintaining or revising the design.
The Unified Modeling Language (UML) (Rumbaugh et al., 1998) is a widely accepted graphical
language for visualizing, specifying, constructing and documenting the structure of a software
system or data model. It can be used as a blueprint for the development of QuakeML.
The following section describes a possible path from a model of seismic data to an XML Schema
format. We show a three-level design approach (see Figure 6) according to Routledge et al. (2002).
These levels are software development levels and have nothing to do with the proposed information
layers of QuakeML.
Figure 6: Three level design approach according to Routledge et al. (2002).
The first level is the conceptual level. With the help of a Use Case Diagram we group the
seismic data and find their relationships. The result is a first structuring of a seismic
data format as can be seen in Figure 7. A Use Case may become a future class or a package
in the Class Diagrams. Because XML allows hierarchical data structures, we used arrows to
indicate this hierarchy.
Figure 7: The conceptual level. A QuakeML document can contain one or more events, while each event
can contain one or more locations.
A class is an abstraction or model of a real-world object. In the UML Class Diagram Model we show the
abstract states and relationships of real objects. It is a very close representation of the real
source code: the logical level of the three-level design approach. This level consists of three
steps according to Bird et al. (2000). In the first step we create simple data types. These data
types are used as building blocks. For example, a geographical coordinate could be such
a building block. In the next step we model more complex types, grouping major facts. These complex
data types may correspond to the use cases of the conceptual level. For example, a seismic location
and its parameters could be such a complex data type. In the last step of the logical level we
create elements based on the complex types and build the relations between these elements.
Figure 8: UML Class Diagram according to the logical level.
In the physical model we translate the logical model into an implementation language—in this
case XML Schema (Figure 9).
<!-- location -->
<xs:complexType name="type_location">
<xs:sequence>
<xs:element name="origin-date" type="type_date" minOccurs="0"/>
<xs:element name="latitude" type="types:type_LatLon" minOccurs="0"/>
<xs:element name="longitude" type="types:type_LatLon" minOccurs="0"/>
<xs:element name="depth" type="types:type_Depth" minOccurs="0"/>
<xs:element name="magnitude" type="types:type_Magnitude" minOccurs="0"/>
<xs:element name="region" type="xs:string" minOccurs="0"/>
<xs:element name="author" type="xs:string" minOccurs="0"/>
<xs:element name="methode" type="xs:string" minOccurs="0"/>
</xs:sequence>
<xs:attribute name="main" type="xs:boolean" default="true"/>
<xs:attribute name="unique_id" type="xs:ID"/>
<xs:attribute name="analysis-type" type="types:enu_LocType" use="optional" default="M"/>
</xs:complexType>
Figure 9: XML Schema representation of the logical model for event locations.
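The idea of binding a schema type to an object model can be illustrated by writing the counterpart of `type_location` by hand. The Python sketch below mirrors the optional elements and the attribute defaults of Figure 9 in a dataclass. It is not generated by a binding tool, the names `Location` and `location_from_xml` are assumptions, and the `origin-date` element is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class Location:
    """Hand-written counterpart of type_location (Figure 9)."""
    unique_id: Optional[str] = None
    main: bool = True                 # schema: default="true"
    analysis_type: str = "M"          # schema: default="M"
    latitude: Optional[float] = None  # every element has minOccurs="0"
    longitude: Optional[float] = None
    depth: Optional[float] = None
    magnitude: Optional[float] = None
    region: Optional[str] = None
    author: Optional[str] = None
    methode: Optional[str] = None     # spelled as in the schema


def _float(element, tag):
    """Return the child's text as a float, or None if the child is absent."""
    text = element.findtext(tag)
    return float(text) if text is not None else None


def location_from_xml(element):
    """Map a <location> element onto the Location dataclass."""
    return Location(
        unique_id=element.get("unique_id"),
        main=element.get("main", "true") in ("true", "1"),
        analysis_type=element.get("analysis-type", "M"),
        latitude=_float(element, "latitude"),
        longitude=_float(element, "longitude"),
        depth=_float(element, "depth"),
        magnitude=_float(element, "magnitude"),
        region=element.findtext("region"),
        author=element.findtext("author"),
        methode=element.findtext("methode"),
    )


loc = location_from_xml(ET.fromstring(
    '<location unique_id="LOC_01">'
    "<latitude>35.8</latitude><longitude>-120.4</longitude>"
    "<author>NEIC</author></location>"
))
print(loc)
```

A binding tool would derive such a class and its mapping code from the schema automatically, so the two cannot drift apart.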
Below we highlight some of the advantages of XML and QuakeML, respectively:
Historic catalogs
Missing data in catalogs of historic seismicity introduces problems when using fixed-column
formats. Often only the day but not the time of an historic event is known. In fixed-column
formats, this lack of information requires either special codes to reflect the missing data or
blank columns, which creates problems in the design of import filters. Additional
semantic rules are required. Likewise, errors in time and location can be large (years in
historic catalogs or even thousands of years in paleoseismic analyses), sometimes extending
beyond the originally anticipated fixed-column formatted data. In QuakeML, any information
except the year is optional and can be extended by error information of any length and
precision.
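Reading an origin date whose finer fields are missing can be sketched as follows in Python. The field names follow the `origin-date` element of Figure 3, while the helper `describe_origin` and the output format are assumptions made for illustration. The historic record carries only year, month, and day, as an example of an event whose time of day is not given.

```python
import xml.etree.ElementTree as ET

# Each field with the separator placed in front of it in the output string.
FIELDS = (("year", ""), ("month", "-"), ("day", "-"),
          ("hour", " "), ("minute", ":"), ("seconds", ":"))


def describe_origin(origin):
    """Format as much of the origin date as is known, stopping at the first gap."""
    if origin.findtext("year") is None:
        raise ValueError("the year is the only mandatory origin-date field")
    text = ""
    for tag, separator in FIELDS:
        value = origin.findtext(tag)
        if value is None:
            break  # finer fields are not known for this event
        text += separator + value
    return text


modern = ET.fromstring(
    "<origin-date><year>2004</year><month>09</month><day>28</day>"
    "<hour>17</hour><minute>15</minute><seconds>24.0</seconds></origin-date>"
)
historic = ET.fromstring(
    "<origin-date><year>1356</year><month>10</month><day>18</day></origin-date>"
)
print(describe_origin(modern))
print(describe_origin(historic))
```

No placeholder codes or blank columns are needed: a field that is unknown is simply absent from the file.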
An additional possible layer which we did not define so far could store macroseismic information,
thus extending layer 1 information for historic catalogs. For each event, all macroseismic
observations could be grouped together. Due to the flexibility of XML, this extension could also
be used for modern catalogs to add macroseismic information if available. Commonly the sizes of
historic earthquakes are described through epicentral intensities instead of magnitudes. Compiling
a catalog with only epicentral intensities per event instead of magnitudes would not compromise
the QuakeML format and programs capable of importing layer 1 catalogs are able to import this
catalog.
Different interpretations of the historic information can lead to totally different locations of
an event. The ability of QuakeML to store multiple locations per event encourages keeping all
available information in the catalog without losing the possibility of easily importing the data.
Combined catalogs, containing modern earthquake data as well as historic events, are virtually the
same from the technical point of view. So far, we do not know of any catalog format definition that
offers this flexibility.
XSLT
XSLT, short for eXtensible Stylesheet Language Transformations, rounds off the concept of separating data from its
presentation. While the data is stored in XML files, these files are not meant for presenting the
data. This task can be accomplished in a very convenient way using stylesheets (XSL) and XSLT. The
main advantage of this approach is the availability of fully developed XSLT engines. Only the
stylesheets (again XML files) need to be designed. With these stylesheets, the XSLT processor can
generate almost any desired target format:
- XML: It can again generate XML files, thereby performing queries on the
original XML file or simply reshaping the files. Sorting is also possible.
- HTML: A very common target format for presenting data on the web is HTML.
Because HTML is also a 'tag'-format, transformation from XML to HTML is easily realized.
- ASCII: For importing QuakeML data into existing software, which expects
ASCII-files of specific format, these ASCII-files can also be generated with XSLT. The respective
stylesheet can also contain sorting or querying capabilities. This output format can be considered
the interface between modern XML representation of data and the 'old' more restricted ASCII-files.
- SVG: The Scalable Vector Graphics (SVG) format is increasingly important for
geographic information systems as well as in web-based applications. It is, like XSL or XSD, again
an XML definition and describes vector images. Native SVG image editors are available and
web-browsers are currently introducing SVG support.
- RSS: News feeds (e.g., at the USGS) are implemented using RSS (again an XML
definition). With XSLT, any earthquake information can readily be transformed to RSS.
- PDF etc.: Even non-'tag' formats are possible with XSLT and associated toolkits.
Using XSL-FO (formatting objects), PDF output is possible. Also generation of JPG images is possible,
e.g., plotting a marker at a certain position in an image.
This incomplete list highlights some of the possibilities of XSLT. XSLT-processors are readily available
for many platforms (some under open-source licenses) and do not require development on the user's side.
With XSLT, any earthquake information in QuakeML can automatically be transformed into multiple formats,
for web presentation, news feeds, bulletins etc.
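To make the ASCII target format concrete, the following Python sketch produces a CSV listing with a simple magnitude query. It deliberately uses the standard `xml.etree.ElementTree` and `csv` modules in place of an XSLT processor, so it shows the kind of output a stylesheet would generate rather than XSLT itself. The function `quakeml_to_csv`, the column selection, and the threshold parameter are assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Two events in the layer 1 form of Figure 3, reduced to the fields used here.
QUAKEML = """<quakeml>
  <event unique_id="EV_01">
    <location main="true" unique_id="LOC_01">
      <latitude>35.8</latitude>
      <longitude>-120.4</longitude>
      <magnitude unit="M">6.0</magnitude>
    </location>
  </event>
  <event unique_id="EV_02">
    <location main="true" unique_id="LOC_02">
      <latitude>36.0</latitude>
      <longitude>-120.5</longitude>
      <magnitude unit="M">5.0</magnitude>
    </location>
  </event>
</quakeml>"""


def quakeml_to_csv(xml_text, min_magnitude=0.0):
    """Write one CSV row per location with magnitude >= min_magnitude."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["event", "latitude", "longitude", "magnitude"])
    for event in ET.fromstring(xml_text).findall("event"):
        for loc in event.findall("location"):
            magnitude = loc.findtext("magnitude")
            # Locations without a magnitude cannot satisfy the query.
            if magnitude is None or float(magnitude) < min_magnitude:
                continue
            writer.writerow([event.get("unique_id"),
                             loc.findtext("latitude", default=""),
                             loc.findtext("longitude", default=""),
                             magnitude])
    return out.getvalue()


csv_text = quakeml_to_csv(QUAKEML, min_magnitude=5.5)
print(csv_text)
```

With XSLT the same selection and layout would be expressed declaratively in a stylesheet, and no program of this kind would have to be written or maintained.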
Individual extensions of QuakeML
Any XML definition can be extended in two ways, by including additional, user specific fields (tags) as
described above or by being included in another XML definition. A QuakeML file can be included in any
given XML file, which is particularly useful for generating earthquake alarms or event notifications in XML.
In this case, an XML wrapper describing the alarm includes a full QuakeML description of one or more
events with their respective locations. This QuakeML part can then easily be extracted using XSLT.
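Extracting the embedded QuakeML part from such a wrapper is sketched below with Python's standard `xml.etree.ElementTree` module, used here instead of XSLT. The wrapper element `alarm`, its `level` attribute, and the `message` element are hypothetical; only the embedded `quakeml` part follows Figure 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical alarm wrapper around a QuakeML description.
ALARM = """<alarm level="orange">
  <message>Strong earthquake reported</message>
  <quakeml>
    <event unique_id="EV_01">
      <location main="true" unique_id="LOC_01">
        <magnitude unit="M">6.0</magnitude>
        <region>CENTRAL CALIFORNIA</region>
      </location>
    </event>
  </quakeml>
</alarm>"""


def extract_quakeml(wrapper_text):
    """Return the embedded <quakeml> element serialized as its own document."""
    wrapper = ET.fromstring(wrapper_text)
    embedded = wrapper.find(".//quakeml")
    if embedded is None:
        raise ValueError("no <quakeml> element found in the wrapper")
    embedded.tail = None  # drop whitespace that belonged to the wrapper
    return ET.tostring(embedded, encoding="unicode")


quakeml_text = extract_quakeml(ALARM)
extracted = ET.fromstring(quakeml_text)
print(extracted.find("event").get("unique_id"))
```

The extracted text is a complete QuakeML document on its own, so it can be passed unchanged to any program that reads layer 1 files.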
While the use of the internet and the web has quickly become indispensable for the seismological community,
the use of other modern techniques such as XML has thus far been largely ignored.
We believe that our community should make use of these new technologies and we hope that this article,
introducing the advantages of XML itself and its potential use in seismology, encourages seismologists to
consider XML representations of their data. At the Swiss Seismological Service, we are creating shakemaps
based on XML-data and we started to implement earthquake parameter data, which will be extended with a new
alarm system based on XML. Our QuakeML definition is meant to match the requirements of most seismological
networks. Because implementing extensions remains an easy task and does not compromise the format, all
networks could readily develop a QuakeML definition for their needs which is immediately readable for other
users.
The value of QuakeML increases proportionally to the number of users and observatories applying it. Our
vision and hope are that, because of its intrinsic advantages over previous seismological formats, QuakeML
will become universally used.
The History of XML
The eXtensible Markup Language (XML) is one piece in a long evolution that started in the 1970s
with Goldfarb (1973), E. Mosher, and R. Lorie at IBM. They worked on an integrated text processing system and
invented the Standard Generalized Markup Language (SGML) (Goldfarb, 1973), which was adopted by the ISO in 1986
(SGML, 1986). SGML is an extremely powerful and extensible tool for semantic markup which is particularly
useful for cataloging and indexing data. It can be used to create other markup languages, for example HTML or XML. The
shortcoming of SGML is its complexity, especially for the everyday uses of the web. In 1989, T. Berners-Lee at
CERN wrote an internal proposal about information management (Berners-Lee, 1989), in which he discusses the problem of
losing information in large-scale projects. HTML was therefore originally designed to provide a very simple
version of SGML. The same need for simplicity led to the development of XML. In 1996, discussions began which focused
on how to define a markup language with the power and extensibility of SGML but with the simplicity of HTML
(Bray and Sperberg-McQueen, 1996). Like HTML, XML spread like wildfire. To guide the evolution of these
technologies, the World Wide Web Consortium was founded in 1994 by T. Berners-Lee.
The world of XML can be divided into three parts. The first part comprises the XML Accessories, languages which are
intended for wide use to extend the capabilities specified in XML. An example of an XML accessory is the XML Schema
language, which extends the definition capability of the XML DTD. XML Transducers are languages which are intended for
transducing some input XML data into some output form. Examples of XML transducers are the stylesheet languages
CSS and XSL. XML Applications are languages which define constraints for a class of XML data for some special
application area. An example of an XML application is MathML, defined for mathematical data (Figure 10).
Figure 10: Diagram showing the development and dependencies of SGML related technologies. XML and HTML are
subsets of SGML.
XSL is an official recommendation of the World Wide Web Consortium (W3C). It provides a language to transform XML
documents into other formats. These can be an HTML document, another XML document, a PDF, a JPEG file, or anything
you want. The XSL stylesheet defines the transformation and an XSLT processor does the work. A first proposal
(Adler et al., 1997) was submitted to the W3C in 1997 and became a recommendation in 1999.
XML Schema was submitted to the W3C in 1999 by Malhotra and Maloney (1999) and is maintained by another working
group. It has become a recommendation (Fallside, 2000). Schemas serve for describing the structure and constraining
the contents of XML documents.
- Adler, S., et al., A proposal for XSL, Tech. rep., W3C, 1997.
- Berners-Lee, T., Information management: A proposal, Tech. rep., CERN, 1989.
- Bird, L., Goodchild, A., and Halpin, T., Object role modelling and XML-Schema,
Tech. rep., The University of Queensland, Australia, 2000.
- Bray, T., and Sperberg-McQueen, C. M., Extensible Markup Language (XML),
Working draft, W3C, 1996.
- Fallside, D. C., XML Schema Part 0: Primer, Working draft, W3C, 2000.
- Goldfarb, C. F., Design considerations for integrated text processing systems,
IBM Cambridge Scientific Center Technical Report, 320-2094, 1973.
- Malhotra, A., and Maloney, M., XML Schema requirements, Tech. rep., W3C, 1999.
- Routledge, N., Bird, L., and Goodchild, A., UML and XML Schema, Tech. rep.,
University of Queensland, Australia, 2002.
- Rumbaugh, J., Jacobson, I., and Booch, G., Unified Modeling Language Reference Manual,
Addison-Wesley, 1998.
- SGML, ISO 8879, 1986.
- Tsuboi, S., and Morino, S., Conversion of SEED format to XML representation for
a new standard of seismic waveform exchange, ORFEUS Electronic Newsletter,
in this volume, 2004.
- Wood, L., Apparao, V., Champion, M., Hors, A. L., Pixley, T., Robie, J., Sharpe, P.,
and Wilson, C., Document Object Model (DOM) Level 2 Specification Version 1.0,
Working draft, W3C, 1999.