Handling Corrupt SEG-Y Files

Altay Sansal

Oct 20, 2025

8 min read

In this tutorial, we will demonstrate how to handle some of the most common SEG-Y file issues that can occur during ingestion. To illustrate these problems and their solutions, we’ll start by creating some intentionally malformed files using the TGSAI/segy library. Let’s begin by importing the modules we’ll be using throughout this tutorial.

from pathlib import Path

import numpy as np
from segy import SegyFactory
from segy.config import SegyHeaderOverrides
from segy.schema import HeaderField
from segy.standards import get_segy_standard

from mdio import open_mdio
from mdio import segy_to_mdio
from mdio.builder.template_registry import get_template

Fixing Coordinate Scalar Issues

One of the most common issues in SEG-Y files is an invalid or missing coordinate scalar value. Let’s start by creating a SEG-Y file with an intentionally incorrect coordinate scalar. We’ll create a simple toy 2D stack dataset that contains CDP (Common Depth Point) numbers and dummy CDP-X/Y coordinates in the trace headers.

To generate this example file, we will follow these steps:

  1. Create an empty SEG-Y factory with the appropriate specification.

  2. Populate the file headers (textual and binary headers).

  3. Generate 10 traces with headers and fill them with dummy sample values.

n_traces = 10

trace_header_fields = [
    HeaderField(name="cdp", byte=21, format="int32"),
    HeaderField(name="cdp_x", byte=181, format="int32"),
    HeaderField(name="cdp_y", byte=185, format="int32"),
]
spec = get_segy_standard(1.0).customize(trace_header_fields=trace_header_fields)
factory = SegyFactory(spec=spec, sample_interval=4000, samples_per_trace=1201)

txt_header = factory.create_textual_header()  # default text header
bin_header = factory.create_binary_header()  # default binary header

headers = factory.create_trace_header_template(n_traces)  # default all zero except n_samp and interval
samples = factory.create_trace_sample_template(n_traces)  # default all zero

rng = np.random.default_rng(seed=42)
headers["cdp"] = np.arange(n_traces)  # cdp numbers
headers["coordinate_scalar"] = 0  # intentionally invalid: zero is not allowed by the standard
headers["cdp_x"] = np.arange(n_traces) * 1000
headers["cdp_y"] = np.arange(n_traces) * 10000
samples[:] = rng.normal(size=samples.shape).astype("float16")

# encode traces to SEG-Y buffer and write
with Path("tmp.sgy").open(mode="wb") as fp:
    fp.write(txt_header)
    fp.write(bin_header)
    fp.write(factory.create_traces(headers, samples))

print("Wrote temporary SEG-Y file successfully.")
Wrote temporary SEG-Y file successfully.
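As a quick sanity check on what we just wrote, we can predict the file size from the SEG-Y Rev 1 layout: 3200 bytes of textual header plus 400 bytes of binary header (3600 bytes total), followed by a 240-byte trace header and the sample data for each trace. This arithmetic is a sketch assuming 4-byte samples (the default IBM float encoding for Rev 1):

```python
# Expected size of tmp.sgy: 3600 bytes of file headers, then for each of the
# 10 traces a 240-byte trace header plus 1201 samples at 4 bytes each.
n_traces, samples_per_trace, sample_size = 10, 1201, 4
expected_bytes = 3600 + n_traces * (240 + samples_per_trace * sample_size)
print(expected_bytes)  # 54040
```

Comparing this against `Path("tmp.sgy").stat().st_size` is a cheap way to catch truncated or padded files before ingestion.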

As mentioned earlier, this file has a zero value in the coordinate scalar field. According to the SEG-Y standard (both Revision 0 and Revision 1), a coordinate scalar of zero is invalid and should not be used.
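The scalar's sign determines how it is applied: a negative value means the raw coordinate is divided by its absolute value, while a positive value means it is multiplied. The rule can be sketched as follows (the helper name `apply_coordinate_scalar` is ours for illustration, not part of the segy or mdio APIs):

```python
# Minimal sketch of the SEG-Y coordinate scalar rule.
def apply_coordinate_scalar(raw: int, scalar: int) -> float:
    """Apply a SEG-Y trace-header coordinate scalar to a raw value."""
    if scalar == 0:
        raise ValueError("A coordinate scalar of 0 is invalid per the SEG-Y standard.")
    if scalar < 0:
        return raw / abs(scalar)  # negative scalar: divide
    return raw * scalar  # positive scalar: multiply

print(apply_coordinate_scalar(12345, -100))  # 123.45
```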

Starting with MDIO v1, we extract X/Y coordinates (such as CDP-X/Y, Shot-X/Y, etc.) as dedicated MDIO variables for easier access and manipulation. For these coordinates to be extracted correctly, the coordinate scalar must be valid. If we attempt to ingest the file with an invalid coordinate scalar, MDIO will raise an error. Let’s try to ingest the file and catch the resulting error to demonstrate this issue.

mdio_template = get_template("PostStack2DTime")

ingestion_kwargs = {
    "segy_spec": spec,
    "mdio_template": mdio_template,
    "input_path": "tmp.sgy",
    "output_path": "tmp.mdio",
    "overwrite": True,
}
try:
    segy_to_mdio(**ingestion_kwargs)
    print("Ingestion successful.")
except ValueError as e:
    print(f"Ingestion failed with error: {e}")
Ingestion failed with error: Invalid coordinate scalar: 0 for file revision SegyStandard.REV1.

Fixing the Coordinate Scalar

To read this file without issues, we can use the SegyHeaderOverrides option to override the existing value at runtime; the corrected value is also written to the final MDIO file. With a value of -100, we expect the coordinates to be divided by 100.

overrides = SegyHeaderOverrides(trace_header={"coordinate_scalar": -100})

segy_to_mdio(**ingestion_kwargs, segy_header_overrides=overrides)
print("Ingestion successful.")
Unexpected value in coordinate unit (measurement_system_code) header: 0. Can't extract coordinate unit and will ingest without coordinate units.
Ingestion successful.

Now that the ingestion has completed successfully, we can open the MDIO file and inspect its contents to verify that everything was processed correctly.

ds = open_mdio("tmp.mdio")
ds
<xarray.Dataset> Size: 55kB
Dimensions:     (cdp: 10, time: 1201)
Coordinates:
  * cdp         (cdp) int32 40B 0 1 2 3 4 5 6 7 8 9
  * time        (time) int32 5kB 0 4 8 12 16 20 ... 4784 4788 4792 4796 4800
    cdp_y       (cdp) float64 80B ...
    cdp_x       (cdp) float64 80B ...
Data variables:
    trace_mask  (cdp) bool 10B ...
    headers     (cdp) [('trace_seq_num_line', '<i4'), ('trace_seq_num_reel', '<i4'), ('orig_field_record_num', '<i4'), ('trace_num_orig_record', '<i4'), ('energy_source_point_num', '<i4'), ('trace_num_ensemble', '<i4'), ('trace_id_code', '<i2'), ('vertically_summed_traces', '<i2'), ('horizontally_stacked_traces', '<i2'), ('data_use', '<i2'), ('source_to_receiver_distance', '<i4'), ('receiver_group_elevation', '<i4'), ('source_surface_elevation', '<i4'), ('source_depth_below_surface', '<i4'), ('receiver_datum_elevation', '<i4'), ('source_datum_elevation', '<i4'), ('source_water_depth', '<i4'), ('receiver_water_depth', '<i4'), ('elevation_depth_scalar', '<i2'), ('coordinate_scalar', '<i2'), ('source_coord_x', '<i4'), ('source_coord_y', '<i4'), ('group_coord_x', '<i4'), ('group_coord_y', '<i4'), ('coordinate_unit', '<i2'), ('weathering_velocity', '<i2'), ('subweathering_velocity', '<i2'), ('source_uphole_time', '<i2'), ('group_uphole_time', '<i2'), ('source_static_correction', '<i2'), ('receiver_static_correction', '<i2'), ('total_static_applied', '<i2'), ('lag_time_a', '<i2'), ('lag_time_b', '<i2'), ('delay_recording_time', '<i2'), ('mute_time_start', '<i2'), ('mute_time_end', '<i2'), ('samples_per_trace', '<i2'), ('sample_interval', '<i2'), ('instrument_gain_type', '<i2'), ('instrument_gain_const', '<i2'), ('instrument_gain_initial', '<i2'), ('correlated_data', '<i2'), ('sweep_freq_start', '<i2'), ('sweep_freq_end', '<i2'), ('sweep_length', '<i2'), ('sweep_type', '<i2'), ('sweep_taper_start', '<i2'), ('sweep_taper_end', '<i2'), ('taper_type', '<i2'), ('alias_filter_freq', '<i2'), ('alias_filter_slope', '<i2'), ('notch_filter_freq', '<i2'), ('notch_filter_slope', '<i2'), ('low_cut_freq', '<i2'), ('high_cut_freq', '<i2'), ('low_cut_slope', '<i2'), ('high_cut_slope', '<i2'), ('year_recorded', '<i2'), ('day_of_year', '<i2'), ('hour_of_day', '<i2'), ('minute_of_hour', '<i2'), ('second_of_minute', '<i2'), ('time_basis_code', '<i2'), ('trace_weighting_factor', '<i2'), 
('group_num_roll_switch', '<i2'), ('group_num_first_trace', '<i2'), ('group_num_last_trace', '<i2'), ('gap_size', '<i2'), ('taper_overtravel', '<i2'), ('cdp_x', '<i4'), ('cdp_y', '<i4'), ('inline', '<i4'), ('crossline', '<i4'), ('shot_point', '<i4'), ('shot_point_scalar', '<i2'), ('trace_value_unit', '<i2'), ('transduction_const_mantissa', '<i4'), ('transduction_const_exponent', '<i2'), ('transduction_unit', '<i2'), ('device_trace_id', '<i2'), ('times_scalar', '<i2'), ('source_type_orientation', '<i2'), ('source_energy_dir_mantissa', '<i4'), ('source_energy_dir_exponent', '<i2'), ('source_measurement_mantissa', '<i4'), ('source_measurement_exponent', '<i2'), ('source_measurement_unit', '<i2'), ('cdp', '<i4')] 2kB ...
    amplitude   (cdp, time) float32 48kB ...
Attributes:
    apiVersion:  1.0.8
    createdOn:   2025-10-20 15:51:40.006617+00:00
    name:        PostStack2DTime
    attributes:  {'surveyType': '2D', 'gatherType': 'stacked', 'defaultVariab...

Verifying the Coordinate Scaling

Let’s verify that the CDP-X/Y coordinates have been correctly scaled according to the coordinate scalar we set. Since we used a coordinate scalar of -100, the coordinate values should be divided by 100, and as the output below shows, they are.

ds[["cdp_x", "cdp_y"]].compute()
<xarray.Dataset> Size: 200B
Dimensions:  (cdp: 10)
Coordinates:
  * cdp      (cdp) int32 40B 0 1 2 3 4 5 6 7 8 9
    cdp_x    (cdp) float64 80B 0.0 10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0
    cdp_y    (cdp) float64 80B 0.0 100.0 200.0 300.0 ... 600.0 700.0 800.0 900.0
Data variables:
    *empty*
Attributes:
    apiVersion:  1.0.8
    createdOn:   2025-10-20 15:51:40.006617+00:00
    name:        PostStack2DTime
    attributes:  {'surveyType': '2D', 'gatherType': 'stacked', 'defaultVariab...

We can also verify that the coordinate scalar was properly handled during ingestion by examining the first trace header. This confirms that MDIO has correctly processed and stored the coordinate scalar information.

ds.headers[0].values["coordinate_scalar"]
array(-100, dtype=int16)
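We can also cross-check these values independently of MDIO: we wrote `np.arange(10) * 1000` to the CDP-X header, and a -100 scalar means dividing by 100, so the scaled coordinates should be 0, 10, 20, and so on. A minimal sketch (variable names here are ours):

```python
import numpy as np

raw_cdp_x = np.arange(10) * 1000  # raw values we wrote to the trace headers
scalar = -100
scaled = raw_cdp_x / abs(scalar)  # negative scalar means divide by |scalar|
print(scaled)  # [ 0. 10. 20. 30. 40. 50. 60. 70. 80. 90.]
```

These match the `cdp_x` values reported by the dataset above.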

Fixing X/Y Units Issues

You may have noticed that the CDP-X/Y coordinate units were not properly ingested, and a warning was displayed during the ingestion process. This occurs because the measurement_system_code field in the binary header is set to 0, which is invalid according to the SEG-Y standard. Valid values are 1 for meters and 2 for feet.

Fortunately, we can also override the binary header values during ingestion to ensure the units are correctly interpreted and stored in the MDIO file. Let’s fix both the coordinate scalar and the measurement system code simultaneously.

overrides = SegyHeaderOverrides(
    binary_header={"measurement_system_code": 1},
    trace_header={"coordinate_scalar": -100},
)

segy_to_mdio(**ingestion_kwargs, segy_header_overrides=overrides)
print("Ingestion successful.")
Ingestion successful.

Verifying the Units

Now let’s verify that both the coordinate scaling and the measurement units have been correctly applied. We can inspect the units stored in the MDIO file’s variable attributes. Since we set the measurement_system_code to 1, the coordinates should now have their units properly identified as meters.

ds = open_mdio("tmp.mdio")
print(f"CDP-X/Y Units: {ds['cdp_x'].attrs['unitsV1']} / {ds['cdp_y'].attrs['unitsV1']}")
CDP-X/Y Units: {'length': 'm'} / {'length': 'm'}

Perfect! The coordinate units are now correctly identified as meters. By using the SegyHeaderOverrides configuration, we successfully corrected both the invalid coordinate scalar and the missing measurement system code, ensuring that the MDIO file contains accurate coordinate information with proper units.

Summary

In this tutorial, we demonstrated how to handle common SEG-Y file issues using MDIO’s header override functionality:

  1. Invalid Coordinate Scalar: We showed how to override incorrect or zero coordinate scalar values to ensure proper coordinate extraction and scaling.

  2. Missing Measurement Units: We demonstrated how to set the measurement system code to ensure coordinate units are correctly identified in the output MDIO file.

The SegyHeaderOverrides feature provides a flexible way to work with imperfect SEG-Y files without needing to modify the original files, making it easier to ingest real-world datasets that may not strictly follow the SEG-Y standard.