Get Started in 10 Minutes

Altay Sansal

Oct 20, 2025

9 min read

In this page we will be showing basic capabilities of MDIO.

For demonstration purposes, we will ingest the remote Teapot Dome open-source dataset. The dataset details and licensing can be found at the SEG Wiki.

We are using the 3D seismic stack dataset named filt_mig.sgy.

The full HTTP link for the dataset (hosted on AWS): http://s3.amazonaws.com/teapot/filt_mig.sgy

Warning

For plotting and remote ingestion the notebook requires Matplotlib and aiohttp as a dependency. Please install it before executing using pip install matplotlib aiohttp or conda install matplotlib aiohttp.

Defining the SEG-Y Dataset

Since MDIO 0.8 we can directly ingest remote SEG-Y files! The file is 386 MB in size. To make the header scan performant we can also set up an environment variable for MDIO. See here for when to use this: Buffered Reads in Ingestion.

The dataset is irregularly shaped, however it is padded to a rectangle with zeros (dead traces). We will see that later at the live mask plotting.

The following environment variables are essential here:

  • MDIO__IMPORT__CLOUD_NATIVE tells MDIO to do buffered reads for headers due to remote file.

  • MDIO__IMPORT__SAVE_SEGY_FILE_HEADER makes MDIO save the SEG-Y specific file headers (text, binary) which is not strictly necessary for consumption and is disabled by default.

import os

os.environ["MDIO__IMPORT__CLOUD_NATIVE"] = "true"
os.environ["MDIO__IMPORT__SAVE_SEGY_FILE_HEADER"] = "true"

input_url = "http://s3.amazonaws.com/teapot/filt_mig.sgy"

Ingesting to MDIO

To do this, we can use the convenient SEG-Y to MDIO converter.

The inline and crossline values are located at bytes 181 and 185. Note that this doesn’t match any SEG-Y standards.

MDIO uses TGSAI/segy to parse the SEG-Y; the field names conform to its canonical keys defined in SEGY Binary Header Keys and SEGY Trace Header Keys. Since MDIO v1 we also introduced templates for common seismic data types. For instance, we will be using the PostStack3DTime template here, which expects the same canonical keys.

We will also specify the units for the time domain. The spatial units will be automatically parsed from SEG-Y binary header. However, there may be a case where it is corrupt in the file, for that see the Fixing X/Y Units Issues section.

In summary, we will use the byte locations as defined for ingestion.

import matplotlib.pyplot as plt
from segy.schema import HeaderField
from segy.standards import get_segy_standard

from mdio import segy_to_mdio
from mdio.builder.schemas.v1.units import TimeUnitModel
from mdio.builder.template_registry import get_template

teapot_trace_headers = [
    HeaderField(name="inline", byte=181, format="int32"),
    HeaderField(name="crossline", byte=185, format="int32"),
    HeaderField(name="cdp_x", byte=189, format="int32"),
    HeaderField(name="cdp_y", byte=193, format="int32"),
]

rev0_segy_spec = get_segy_standard(0)
teapot_segy_spec = rev0_segy_spec.customize(trace_header_fields=teapot_trace_headers)

mdio_template = get_template("PostStack3DTime")
unit_ms = TimeUnitModel(time="ms")
mdio_template.add_units({"time": unit_ms})

segy_to_mdio(
    input_path=input_url,
    output_path="filt_mig.mdio",
    segy_spec=teapot_segy_spec,
    mdio_template=mdio_template,
    overwrite=True,
)

It only took a few seconds to ingest, since this is a very small file.

However, MDIO scales up to TB (that’s ~1,000 GB) sized volumes!

Opening the Ingested MDIO File

Let’s open the MDIO file with the open_mdio function. This will return a pretty xarray representation with our standardized format.

from mdio import open_mdio

dataset = open_mdio("filt_mig.mdio")
dataset
<xarray.Dataset> Size: 403MB
Dimensions:           (inline: 345, crossline: 188, time: 1501)
Coordinates:
  * inline            (inline) int32 1kB 1 2 3 4 5 6 ... 340 341 342 343 344 345
  * crossline         (crossline) int32 752B 1 2 3 4 5 6 ... 184 185 186 187 188
  * time              (time) int32 6kB 0 2 4 6 8 10 ... 2992 2994 2996 2998 3000
    cdp_y             (inline, crossline) float64 519kB ...
    cdp_x             (inline, crossline) float64 519kB ...
Data variables:
    trace_mask        (inline, crossline) bool 65kB ...
    segy_file_header  <U1 4B ...
    amplitude         (inline, crossline, time) float32 389MB ...
    headers           (inline, crossline) [('trace_seq_num_line', '<i4'), ('trace_seq_num_reel', '<i4'), ('orig_field_record_num', '<i4'), ('trace_num_orig_record', '<i4'), ('energy_source_point_num', '<i4'), ('ensemble_num', '<i4'), ('trace_num_ensemble', '<i4'), ('trace_id_code', '<i2'), ('vertically_summed_traces', '<i2'), ('horizontally_stacked_traces', '<i2'), ('data_use', '<i2'), ('source_to_receiver_distance', '<i4'), ('receiver_group_elevation', '<i4'), ('source_surface_elevation', '<i4'), ('source_depth_below_surface', '<i4'), ('receiver_datum_elevation', '<i4'), ('source_datum_elevation', '<i4'), ('source_water_depth', '<i4'), ('receiver_water_depth', '<i4'), ('elevation_depth_scalar', '<i2'), ('coordinate_scalar', '<i2'), ('source_coord_x', '<i4'), ('source_coord_y', '<i4'), ('group_coord_x', '<i4'), ('group_coord_y', '<i4'), ('coordinate_unit', '<i2'), ('weathering_velocity', '<i2'), ('subweathering_velocity', '<i2'), ('source_uphole_time', '<i2'), ('group_uphole_time', '<i2'), ('source_static_correction', '<i2'), ('receiver_static_correction', '<i2'), ('total_static_applied', '<i2'), ('lag_time_a', '<i2'), ('lag_time_b', '<i2'), ('delay_recording_time', '<i2'), ('mute_time_start', '<i2'), ('mute_time_end', '<i2'), ('samples_per_trace', '<i2'), ('sample_interval', '<i2'), ('instrument_gain_type', '<i2'), ('instrument_gain_const', '<i2'), ('instrument_gain_initial', '<i2'), ('correlated_data', '<i2'), ('sweep_freq_start', '<i2'), ('sweep_freq_end', '<i2'), ('sweep_length', '<i2'), ('sweep_type', '<i2'), ('sweep_taper_start', '<i2'), ('sweep_taper_end', '<i2'), ('taper_type', '<i2'), ('alias_filter_freq', '<i2'), ('alias_filter_slope', '<i2'), ('notch_filter_freq', '<i2'), ('notch_filter_slope', '<i2'), ('low_cut_freq', '<i2'), ('high_cut_freq', '<i2'), ('low_cut_slope', '<i2'), ('high_cut_slope', '<i2'), ('year_recorded', '<i2'), ('day_of_year', '<i2'), ('hour_of_day', '<i2'), ('minute_of_hour', '<i2'), ('second_of_minute', '<i2'), ('time_basis_code', '<i2'), ('trace_weighting_factor', '<i2'), ('group_num_roll_switch', '<i2'), ('group_num_first_trace', '<i2'), ('group_num_last_trace', '<i2'), ('gap_size', '<i2'), ('taper_overtravel', '<i2'), ('inline', '<i4'), ('crossline', '<i4'), ('cdp_x', '<i4'), ('cdp_y', '<i4')] 13MB ...
Attributes:
    apiVersion:  1.0.8
    createdOn:   2025-10-20 15:52:03.112832+00:00
    name:        PostStack3DTime
    attributes:  {'surveyType': '3D', 'gatherType': 'stacked', 'defaultVariab...

Querying Metadata

Now let’s look at the file text header saved in the segy_file_header metadata variable.

print(dataset["segy_file_header"].attrs["textHeader"])
C 1 CLIENT: ROCKY MOUNTAIN OILFIELD TESTING CENTER                              
C 2 PROJECT: NAVAL PETROLEUM RESERVE #3 (TEAPOT DOME); NATRONA COUNTY, WYOMING  
C 3 LINE: 3D                                                                    
C 4                                                                             
C 5 THIS IS THE FILTERED POST STACK MIGRATION                                   
C 6                                                                             
C 7 INLINE 1, XLINE 1:   X COORDINATE: 788937  Y COORDINATE: 938845             
C 8 INLINE 1, XLINE 188: X COORDINATE: 809501  Y COORDINATE: 939333             
C 9 INLINE 188, XLINE 1: X COORDINATE: 788039  Y COORDINATE: 976674             
C10 INLINE NUMBER:    MIN: 1  MAX: 345  TOTAL: 345                              
C11 CROSSLINE NUMBER: MIN: 1  MAX: 188  TOTAL: 188                              
C12 TOTAL NUMBER OF CDPS: 64860   BIN DIMENSION: 110' X 110'                    
C13                                                                             
C14                                                                             
C15                                                                             
C16                                                                             
C17                                                                             
C18                                                                             
C19 GENERAL SEGY INFORMATION                                                    
C20 RECORD LENGHT (MS): 3000                                                    
C21 SAMPLE RATE (MS): 2.0                                                       
C22 DATA FORMAT: 4 BYTE IBM FLOATING POINT                                      
C23 BYTES  13- 16: CROSSLINE NUMBER (TRACE)                                     
C24 BYTES  17- 20: INLINE NUMBER (LINE)                                         
C25 BYTES  81- 84: CDP_X COORD                                                  
C26 BYTES  85- 88: CDP_Y COORD                                                  
C27 BYTES 181-184: INLINE NUMBER (LINE)                                         
C28 BYTES 185-188: CROSSLINE NUMBER (TRACE)                                     
C29 BYTES 189-192: CDP_X COORD                                                  
C30 BYTES 193-196: CDP_Y COORD                                                  
C31                                                                             
C32                                                                             
C33                                                                             
C34                                                                             
C35                                                                             
C36 Processed by: Excel Geophysical Services, Inc.                              
C37               8301 East Prentice Ave. Ste. 402                              
C38               Englewood, Colorado 80111                                     
C39               (voice) 303.694.9629 (fax) 303.771.1646                       
C40 END EBCDIC                                                                  

Since we saved the binary header, we can look at that as well.

dataset["segy_file_header"].attrs["binaryHeader"]
{'job_id': 9999,
 'line_num': 9999,
 'reel_num': 1,
 'data_traces_per_ensemble': 188,
 'aux_traces_per_ensemble': 0,
 'sample_interval': 2000,
 'orig_sample_interval': 0,
 'samples_per_trace': 1501,
 'orig_samples_per_trace': 1501,
 'data_sample_format': 1,
 'ensemble_fold': 57,
 'trace_sorting_code': 4,
 'vertical_sum_code': 1,
 'sweep_freq_start': 0,
 'sweep_freq_end': 0,
 'sweep_length': 0,
 'sweep_type_code': 0,
 'sweep_trace_num': 0,
 'sweep_taper_start': 0,
 'sweep_taper_end': 0,
 'taper_type_code': 0,
 'correlated_data_code': 2,
 'binary_gain_code': 1,
 'amp_recovery_code': 4,
 'measurement_system_code': 2,
 'impulse_polarity_code': 1,
 'vibratory_polarity_code': 0,
 'segy_revision_major': 0,
 'segy_revision_minor': 0}

Fetching Data and Plotting

Now we will demonstrate getting an inline from MDIO.

Since MDIO v1 we are using Xarray under the hood, so we can use its convenient indexing. It also handles the plotting with proper dimension coordinate labels.

MDIO stores summary statistics. We can calculate the standard deviation (std) value of the dataset to adjust the gain.

from mdio.builder.schemas.v1.stats import SummaryStatistics

stats = SummaryStatistics.model_validate_json(dataset["amplitude"].attrs["statsV1"])
std = ((stats.sum_squares / stats.count) - (stats.sum / stats.count) ** 2) ** 0.5

il_dataset = dataset.sel(inline=278)
il_amp = il_dataset["amplitude"].T
il_amp.plot(vmin=-2 * std, vmax=2 * std, cmap="gray_r", yincrease=False);
../_images/de066e07fb2ff94a0593629290e42fde4de91322beaa1812099a822a03f9af7f.png

Let’s do the same with a time slice.

We will display two-way-time at 1,000 ms.

Note that since we parse the X/Y coordinates, we can plot time slice in real world coordinates.

twt_data = dataset["amplitude"].sel(time=1000)
twt_data.plot(vmin=-2 * std, vmax=2 * std, cmap="gray_r", x="cdp_x", y="cdp_y");
../_images/39cae028c7029c6f5336e0c89836b5a70b319960eb7dc530e44442ca743af867.png

We can also overlay live mask with the time slice. However, in this example, the dataset is zero-padded.

The live trace mask will always show True (yellow).

trace_mask = dataset.trace_mask[:]

twt_data.plot(vmin=-2 * std, vmax=2 * std, cmap="gray_r", x="cdp_x", y="cdp_y", alpha=0.5, figsize=(8, 5))
trace_mask.plot(vmin=0, vmax=1, x="cdp_x", y="cdp_y", alpha=0.5);
../_images/fa60859b29e6801ebf1b7b7aba62bc0c0f56e956225e7493dfdc09c53e2584e1.png

Query Headers

We can query headers for the whole dataset very quickly because they are separated from the seismic wavefield.

Let’s get all the headers for X and Y coordinates in this dataset.

Note that the header maps will still share the geometry/grid of the dataset!

The compute property fetches the lazily opened MDIO values.

dataset.headers["cdp_x"].compute()
<xarray.DataArray 'cdp_x' (inline: 345, crossline: 188)> Size: 519kB
array([[78893.7, 78904.7, 78915.7, ..., 80928.2, 80939.2, 80950.2],
       [78893.5, 78904.5, 78915.5, ..., 80927.9, 80938.9, 80949.9],
       [78893.2, 78904.2, 78915.2, ..., 80927.6, 80938.6, 80949.6],
       ...,
       [78804.4, 78815.4, 78826.4, ..., 80838.9, 80849.9, 80860.9],
       [78804.2, 78815.2, 78826.2, ..., 80838.6, 80849.6, 80860.6],
       [78803.9, 78814.9, 78825.9, ..., 80838.3, 80849.3, 80860.3]],
      shape=(345, 188))
Coordinates:
  * inline     (inline) int32 1kB 1 2 3 4 5 6 7 ... 339 340 341 342 343 344 345
  * crossline  (crossline) int32 752B 1 2 3 4 5 6 7 ... 183 184 185 186 187 188
    cdp_y      (inline, crossline) float64 519kB 9.388e+04 ... 9.772e+04
    cdp_x      (inline, crossline) float64 519kB 7.889e+04 ... 8.086e+04
Attributes:
    unitsV1:  {'length': 'ft'}
dataset.headers["cdp_y"].compute()
<xarray.DataArray 'cdp_y' (inline: 345, crossline: 188)> Size: 519kB
array([[93884.6, 93884.8, 93885.1, ..., 93932.9, 93933.1, 93933.4],
       [93895.6, 93895.8, 93896.1, ..., 93943.9, 93944.1, 93944.4],
       [93906.6, 93906.8, 93907.1, ..., 93954.9, 93955.1, 93955.4],
       ...,
       [97645.5, 97645.8, 97646. , ..., 97693.8, 97694.1, 97694.3],
       [97656.5, 97656.8, 97657. , ..., 97704.8, 97705.1, 97705.3],
       [97667.5, 97667.8, 97668. , ..., 97715.8, 97716.1, 97716.3]],
      shape=(345, 188))
Coordinates:
  * inline     (inline) int32 1kB 1 2 3 4 5 6 7 ... 339 340 341 342 343 344 345
  * crossline  (crossline) int32 752B 1 2 3 4 5 6 7 ... 183 184 185 186 187 188
    cdp_y      (inline, crossline) float64 519kB 9.388e+04 ... 9.772e+04
    cdp_x      (inline, crossline) float64 519kB 7.889e+04 ... 8.086e+04
Attributes:
    unitsV1:  {'length': 'ft'}

As we mentioned before, we can also get specific dataset slices of headers while fetching a slice.

Let’s fetch a crossline; we are still using some previous parameters.

Since the sliced dataset contains the headers as well, we can plot the headers on top of the image.

Full headers can be mapped and plotted as well, but we won’t demonstrate that here.

xl_dataset = dataset.sel(crossline=100)  # slices everything available in MDIO dataset!

cdp_x_header = xl_dataset["cdp_x"]
cdp_y_header = xl_dataset["cdp_y"]
image = xl_dataset["amplitude"].T

# Build plot from here
gs_kw = {"height_ratios": (1, 8)}
fig, (hdr_ax, img_ax) = plt.subplots(2, 1, figsize=(6, 6), gridspec_kw=gs_kw, sharex="all")
hdr_ax2 = hdr_ax.twinx()

cdp_x_header.plot(ax=hdr_ax, c="red")
cdp_y_header.plot(ax=hdr_ax2, c="black")
image.plot(ax=img_ax, vmin=-2 * std, vmax=2 * std, cmap="gray_r", yincrease=False, add_colorbar=False)
img_ax.set_title("")
hdr_ax.set_xlabel("")

plt.tight_layout()
../_images/dca69828d1c9392f0396557fc83d9064e85a9b500ed3d807c05c71e36cc0fbcb.png

MDIO to SEG-Y Conversion

Finally, let’s demonstrate going back to SEG-Y.

We will use the convenient mdio_to_segy function and write it out as a round-trip file.

The output spec can be modified if we want to write things to different byte locations, etc, but we will use the same one as before.

from mdio import mdio_to_segy

mdio_to_segy(
    input_path="filt_mig.mdio",
    output_path="filt_mig_roundtrip.sgy",
    segy_spec=teapot_segy_spec,
)

Validate Round-Trip SEG-Y File

We can validate if the round-trip SEG-Y file is matching the original using TGSAI/segy.

Step by step:

  • Open original file

  • Open round-trip file

  • Compare text headers

  • Compare binary headers

  • Compare 100 random headers and traces

import numpy as np
from segy import SegyFile

original_segy = SegyFile(input_url)
roundtrip_segy = SegyFile("filt_mig_roundtrip.sgy")

# Compare text header
assert original_segy.text_header == roundtrip_segy.text_header

# Compare bin header
assert original_segy.binary_header == roundtrip_segy.binary_header

# Compare 100 random trace headers and traces
rng = np.random.default_rng()
rand_indices = rng.integers(low=0, high=original_segy.num_traces, size=100)
for idx in rand_indices:
    np.testing.assert_equal(original_segy.trace[idx], roundtrip_segy.trace[idx])

print("Files identical!")
Files identical!