Conversion of SEED format to XML representation for a new standard of
seismic waveform exchange
Seiji Tsuboi (1) and Shin'ya Morino (2)
(1) Institute for Frontier Research on Earth Evolution (IFREE), JAMSTEC,
Japan
(2) Hakusan Corporation, Japan
We represent the header structure of the Standard for the Exchange of
Earthquake Data (SEED), an international standard format for the exchange
of digital seismological data, in XML (eXtensible Markup Language). We
show that this representation allows the header content to be extended
without any modification to the existing mini-SEED waveform data
structure.
The Standard for the Exchange of
Earthquake Data (SEED) has been designed as an international standard
format for the exchange of digital seismological data (SEED Manual, 1993).
It is now widely used by the community that maintains the broadband
seismograph networks and it is recognized as a standard format for data
exchange. A SEED volume consists of header records and data records. The
format for data records is called mini-SEED and it is closely related to
the format recorded in data loggers. Each header is made up of a sequence
of blockettes. Because SEED blockettes are defined as collections of named
fields with fixed lengths, the data structures are difficult to extend.
And because a huge amount of waveform data is already saved in mini-SEED
format, fully revising the current SEED format to allow flexible future
extensions would be a major undertaking. Although the need to revise the
SEED format has been recognized, this difficulty has prevented any major
revision since the latest release, Ver. 2.3, in February 1993. Here we
propose an XML
representation of the SEED header structure and show that a flexible
design and robust validation in data models will be realized at the same
time. Technical difficulties for constructing a network-based system will
also be reduced by introducing XML to the SEED data description. We also
mention the extension of XML-SEED format to synthetic seismogram
databases.
SEED was adopted as a standard format
for international digital seismic data exchange in 1987 by the Federation of Digital
Seismograph Networks (FDSN), which was formed under the International Association of
Seismology and Physics of the Earth's Interior (IASPEI). Before the
SEED format was adopted, digital seismic data exchange was complicated
by the variety of data logger formats.
SEED was designed to comprehensively accommodate the differences in data format
that originate from the type of data logger. The SEED format consists of
one logical volume, which contains two format objects: (1) control headers
and (2) time series. The first one is formatted in ASCII and contains
auxiliary information about the volume. The second one contains raw binary
data, the digital seismogram. Control headers
are categorized as (1) volume index control headers, (2) abbreviation
dictionary control headers, (3) station control headers and (4) time span
control headers. These headers are used to provide information such as the
definition of abbreviations used in the control headers, operating
characteristics for a station and its channels, and the time span of the
data. Because the control headers describe the SEED volume so
comprehensively, the SEED format can be used to provide digital
seismograms recorded by almost any kind of data logger. Each control
header consists of a series of blockettes, each of which contains a
sequence of data fields specific to that blockette type. Because blockettes
are defined as collections of named fields with fixed lengths, the data
structures are difficult to extend. On the other hand, the
header structure is designed to be modular, much like XML. This
similarity motivated us to represent the SEED format structure in XML.
The data structures of an XML document are very flexible,
because the lengths of the fields are not fixed. Defining new fields and
blockettes only requires a new tag name and a hierarchy specification. To
describe data types, XML has its own schema language, called
XML Schema. This schema language is also used to validate XML
documents. By introducing XML into SEED, both a flexible
design and robust validation of the data models can be realized.
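The flexibility described above can be illustrated with a minimal Python sketch. The element and field names and their values are hypothetical, chosen only for illustration, and are not taken from the XML-SEED schema. The sketch shows that a reader which looks up fields by tag name is unaffected when a field value grows or when a new field is added later.

```python
import xml.etree.ElementTree as ET

# Hypothetical blockette; tag names and values are illustrative only.
ORIGINAL = """<station_identifier>
  <station_code>XXXX</station_code>
  <site_name>Example site</site_name>
</station_identifier>"""

# The same blockette after a later extension: one new field was added.
EXTENDED = """<station_identifier>
  <station_code>XXXX</station_code>
  <site_name>Example site</site_name>
  <logger_status>nominal</logger_status>
</station_identifier>"""


def read_site_name(xml_text):
    """Reader written against the original layout; it finds fields by name,
    not by byte position, so extra or longer fields do not disturb it."""
    root = ET.fromstring(xml_text)
    return root.findtext("site_name")


site_before = read_site_name(ORIGINAL)
site_after = read_site_name(EXTENDED)
print(site_before, "|", site_after)
```

With fixed-length fields, the same extension would shift every subsequent byte offset and break existing readers.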
We have set up the
following rules for converting the control headers of the SEED format into
an XML representation.
Document type
The document type is defined with the root node named
'xseed'. This root node has four child nodes, one for each of the four
control header categories (volume index, abbreviation dictionary, station
and time span). Each header node stores the XML representations of the
corresponding blockettes.
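As a rough sketch of this document type, the following Python snippet builds an empty 'xseed' document. Only the root name 'xseed' comes from the text; the four child tag names are assumptions derived from the control header categories and may differ from the names used in the actual XML-SEED schema.

```python
import xml.etree.ElementTree as ET

# Assumed child tag names, one per control header category (hypothetical).
HEADER_TAGS = [
    "volume_index_control_header",
    "abbreviation_dictionary_control_header",
    "station_control_header",
    "timespan_control_header",
]


def new_xseed_document():
    """Create the root node 'xseed' with one empty child node per header."""
    root = ET.Element("xseed")
    for tag in HEADER_TAGS:
        ET.SubElement(root, tag)
    return root


doc = new_xseed_document()
child_tags = [child.tag for child in doc]
print(ET.tostring(doc, encoding="unicode"))
```

Converted blockettes would then be appended under the header node to which they belong.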
Conversion of blockettes
All control header blockettes have a
'Blockette type' field and a 'Length' field that give their type and
size. Both the blockette type and the blockette name uniquely identify
a kind of blockette, although current SEED volumes use only the type.
In the XML representation, the blockette name serves as the identifier of
the blockette type, because it is more descriptive and conveys the role of
the blockette by itself. The blockette type is added as an attribute for the
convenience of people familiar with blockette type numbers. A blockette
with the name 'blockette_name' and the type '555' is therefore represented
by an element named blockette_name that carries 555 as an attribute.
The 'Length' field is not required in the XML representation. A
blockette begins with a start tag named blockette_name and ends with an
end tag of the same name. The data of every field
can have a variable length. Each field value is written as the content
between a start tag and an end tag:
<field_name>Value</field_name>
where 'field_name' is the name of the field and 'Value'
is the content of the field.
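The conversion rule for a single blockette can be sketched in Python as follows. This is a minimal illustration, not the authors' converter: the attribute name 'blockette_type' is an assumption (the text only says that the type is added as an attribute), and the input is assumed to be an already parsed blockette given as a name, a type number and a list of field name and value pairs.

```python
import xml.etree.ElementTree as ET


def blockette_to_xml(name, blockette_type, fields):
    """Convert one parsed blockette into an XML element.

    The element is named after the blockette, the type number is kept as an
    attribute (attribute name assumed), and the 'Length' field is dropped
    because XML fields have variable length.
    """
    elem = ET.Element(name, {"blockette_type": f"{blockette_type:03d}"})
    for field_name, value in fields:
        if field_name == "Length":
            continue  # not required in the XML representation
        ET.SubElement(elem, field_name).text = str(value)
    return elem


elem = blockette_to_xml(
    "blockette_name", 555, [("Length", 42), ("field_name", "Value")]
)
xml_text = ET.tostring(elem, encoding="unicode")
print(xml_text)
```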
References and sequence number
An important characteristic of the SEED format is that some fields of particular
blockettes refer to other blockettes. We chose the XPath language to
describe these references. Referenced blockettes always have an identifier
field, which is converted to a string identifier. Sequence numbers of
logical records are ignored in the 'volume index control header', because
XML documents are not stored in logical records. However, the 'time span
control header' describes the location of the waveform data by
sequence numbers, and these are kept and written as zero-filled six-digit
numbers.
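The following Python sketch illustrates both points under stated assumptions. All tag names and the identifier 'UNIT_M_PER_S' are hypothetical, and storing the XPath expression as the text of the referring field is only one possible reading of the rule above. The standard library's ElementTree supports only a subset of XPath, which is sufficient for this kind of lookup by string identifier.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a channel blockette refers to a units abbreviation.
XSEED = """<xseed>
  <abbreviation_dictionary_control_header>
    <units_abbreviations>
      <unit_lookup_code>UNIT_M_PER_S</unit_lookup_code>
      <unit_description>Velocity in meters per second</unit_description>
    </units_abbreviations>
  </abbreviation_dictionary_control_header>
  <station_control_header>
    <channel_identifier>
      <units_of_signal_response>.//units_abbreviations[unit_lookup_code='UNIT_M_PER_S']</units_of_signal_response>
    </channel_identifier>
  </station_control_header>
</xseed>"""


def resolve_reference(root, referring_field):
    """Follow the XPath expression stored in a referring field."""
    return root.find(referring_field.text.strip())


def format_sequence_number(number):
    """Write a record sequence number as six zero-filled digits."""
    return f"{number:06d}"


root = ET.fromstring(XSEED)
field = root.find(".//channel_identifier/units_of_signal_response")
target = resolve_reference(root, field)
print(target.findtext("unit_description"))
print(format_sequence_number(42))
```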
When we represent the SEED header structure
in XML we do not modify the format of the time series
data in any way. To include binary mini-SEED digital seismograms in
XML-represented SEED volumes we consider two scenarios. The first scenario
separates the header file from the data. The data can reside in separate
data files on data servers reached over a network. A stand-alone header
file then describes the event, the properties of the stations and the data
locations. This is the same concept as a dataless SEED volume. To retrieve
the complete seismic waveform data one combines the two: the data file is
fetched from the data server by following the description in the header
file.
The second scenario is
the same as the current full-SEED volume: the XML-SEED volume
includes both the SEED header represented in XML and the binary seismic
wave data. The header specifies the location of the data, which is stored in
the same file. One possible layout is the following.
The first line gives the length of the header and is
followed by a blank line. The header XML document starts at the third line.
The header part is plain ASCII and is not based on logical
records. The data part starts at the position specified in the first line.
The data are stored in logical records as in current SEED volumes.
With this scheme, reader programs can determine the location of the
seismic wave data from the value specified in the first line and the
sequence numbers in the time span control headers.
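A reader for such a combined volume might look like the Python sketch below. Several details are assumptions, because the text does not fix them: the first line is taken to hold the byte offset of the data part as a zero-filled ten-digit number, logical records are taken to be 4096 bytes long, sequence numbers are taken to start at 1 at the beginning of the data part, and the tag name 'data_start_sequence_number' is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

RECORD_LENGTH = 4096  # assumed logical record length in bytes
OFFSET_DIGITS = 10    # assumed fixed width of the number on the first line


def build_volume(header_xml, data):
    """Line 1: byte offset of the data part; line 2: blank; then header, data."""
    offset = OFFSET_DIGITS + 2 + len(header_xml)
    first_lines = f"{offset:0{OFFSET_DIGITS}d}\n\n".encode("ascii")
    return first_lines + header_xml + data


def read_volume(volume):
    """Return the parsed header and the byte offset of the data part."""
    stream = io.BytesIO(volume)
    data_offset = int(stream.readline().decode("ascii").strip())
    stream.readline()  # skip the blank second line
    header_bytes = stream.read(data_offset - stream.tell())
    return ET.fromstring(header_bytes), data_offset


def read_record(volume, data_offset, sequence_number):
    """Locate one logical record from its 1-based sequence number."""
    start = data_offset + (sequence_number - 1) * RECORD_LENGTH
    return volume[start:start + RECORD_LENGTH]


HEADER_XML = (
    b"<xseed><timespan_control_header>"
    b"<data_start_sequence_number>000002</data_start_sequence_number>"
    b"</timespan_control_header></xseed>"
)
records = b"A" * RECORD_LENGTH + b"B" * RECORD_LENGTH  # stand-in for mini-SEED
volume = build_volume(HEADER_XML, records)

header, data_offset = read_volume(volume)
sequence = int(header.findtext(".//data_start_sequence_number"))
record = read_record(volume, data_offset, sequence)
```

Because the header is located by the offset on the first line rather than by logical records, it can grow freely without moving any record boundary inside the data part.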
So far, we have not
changed the current SEED header structure. We now propose one possible
extension of a SEED volume by introducing a new tag that has no
corresponding blockette in the current SEED format: the
<data_record> tag. Each data record is split into two parts, the 'Fixed
Section of Data Header' (FSDH) and the Base64-encoded seismic wave data.
The field names and values of the FSDH are expanded in the same
style as the control headers and placed under the <data_header>
tag. The seismic wave data are Base64-encoded and placed under the
<chunk> tag. The 'data_record_length' attribute
of the <chunk> tag gives the byte length of
the decoded data. In this way, both header and data are represented in
one XML-SEED volume. Base64 encoding increases the total size of the
volume, by about one third for the encoded waveform data, but not by an
excessive amount.
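The proposed data record can be sketched in Python as follows. The tag names data_record, data_header and chunk and the attribute data_record_length come from the text; nesting the latter two directly under data_record, and the FSDH field names used here, are assumptions made for illustration.

```python
import base64
import xml.etree.ElementTree as ET


def make_data_record(fsdh_fields, waveform):
    """Build a <data_record> holding FSDH fields and Base64-encoded data."""
    record = ET.Element("data_record")
    header = ET.SubElement(record, "data_header")
    for name, value in fsdh_fields:
        ET.SubElement(header, name).text = str(value)
    chunk = ET.SubElement(
        record, "chunk", {"data_record_length": str(len(waveform))}
    )
    chunk.text = base64.b64encode(waveform).decode("ascii")
    return record


def read_waveform(record):
    """Decode the waveform and check it against data_record_length."""
    chunk = record.find("chunk")
    data = base64.b64decode(chunk.text)
    if len(data) != int(chunk.get("data_record_length")):
        raise ValueError("decoded length does not match data_record_length")
    return data


waveform = bytes(range(256)) * 3  # 768 bytes standing in for compressed samples
fsdh = [("station_code", "XXXX"), ("number_of_samples", 768)]  # hypothetical
xml_text = ET.tostring(make_data_record(fsdh, waveform), encoding="unicode")

parsed = ET.fromstring(xml_text)
decoded = read_waveform(parsed)
encoded_length = len(parsed.find("chunk").text)
```

In this example the 768 input bytes become 1024 Base64 characters, which is the 4/3 growth inherent to Base64.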
Programs that convert current full-SEED volumes to XML-SEED volumes and
that read XML-SEED volumes to extract seismograms are available. We currently
provide digital broadband seismograms from the Ocean Hemisphere Project
geophysical network in XML-SEED format through
the IFREE data center. A screenshot of the web page is shown in Figure 1.
Figure 1.
Web site for distribution of XML-SEED formatted broadband seismograms.
Recently, we have
demonstrated that we can calculate global theoretical seismograms for
realistic 3D Earth models based upon the combination of a precise
numerical technique (the spectral-element method) and a sufficiently fast
supercomputer (the Earth Simulator) (Tsuboi et al., 2003). It has now
become possible to routinely calculate synthetic seismograms for
earthquakes greater than a certain magnitude. Starting in 2003, we
selected earthquakes with magnitudes greater than 6.5 from the Harvard CMT
catalog and calculated theoretical seismograms for the stations in the
Global Seismographic Network. To distribute this synthetic seismogram
database to the seismological community we modified the XML-SEED format to
include metadata entries that are characteristic of the synthetic
seismogram database, such as the numerical technique used to generate the
synthetic seismograms (Tsuboi et al., 2004). We distribute these
theoretical seismograms through the web sites of
IFREE/JAMSTEC and Caltech
(select "synthetic seismograms").
The advantage of using XML for the exchange of both
observations and synthetics is illustrated in Figure 2. We are now
developing software that allows users to retrieve both synthetics and
observations at the same time through a single user interface based on
web services. For this software to work efficiently it is
important that both data and synthetics are in XML.
Figure 2.
Concept of web service based software to retrieve both data and synthetics
using the same user interface. The description of the user interface and
data transfer is summarized in Web Service Description Language (WSDL).
This figure was created by Takuya Arai of Fujitsu Corp., Japan.
We have shown that the current SEED format
can be directly translated to an XML representation without introducing
any modifications to the current format.
The advantages of using the XML representation of SEED are that
a) XML is a text-based language and easy to extend,
b) XML documents support hierarchical data structures,
c) XML is platform independent, and
d) XML suits network-based technologies.
It is straightforward to add any
necessary information at a later stage by defining new tag names and
including them in the schema. Although we have not modified the current SEED
control headers, there are various ways to extend SEED that take
full advantage of XML. One example is the status report of the data
logger. If the data logger reports its status or parameter settings in XML
format together with its digital
seismograms, this information can be incorporated directly into the
database directory at the data center. This should greatly simplify the data
quality checks done at the data center. Another example is data
distribution through web services. Because the data exchange protocols for
web services are based on XML, if SEED data are in XML format we may use
the control header content directly for data exchange and distribution. We
have developed a network data center system based on Java RMI (Takeuchi et
al., 2002). We may distribute our XML-SEED formatted digital seismograms
through this network data center system to fully utilize the XML
representation of the header structure.
- Federation of Digital Seismograph Networks, 1993. Standard for the
Exchange of Earthquake Data, Reference Manual, SEED Format Version 2.3,
Incorporated Research Institutions for Seismology.
- Takeuchi, N., Watada, S., Tsuboi, S., Fukao, Y., Kobayashi, M.,
Matsuzaki, Y., and Nakashima, T., 2002. Application of distributed
object technology to seismic waveform distribution, Seismological
Research Letters, 73(2), 166-172.
- Tsuboi, S., Komatitsch, D., Ji, C., and Tromp, J., 2003. Broadband
modeling of the 2002 Denali Fault earthquake on the Earth Simulator,
Physics of the Earth and Planetary Interiors, 139, 305-312.
- Tsuboi, S., Tromp, J., and Komatitsch, D., 2004. An XML-SEED Format for
the Exchange of Synthetic Seismograms, EOS Transactions of American
Geophysical Union, suppl., SF31B-03.