{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Get Started in 10 Minutes\n", "\n", "```{article-info}\n", ":author: Altay Sansal\n", ":date: \"{sub-ref}`today`\"\n", ":read-time: \"{sub-ref}`wordcount-minutes` min read\"\n", ":class-container: sd-p-0 sd-outline-muted sd-rounded-3 sd-font-weight-light\n", "```\n", "\n", "In this page we will be showing basic capabilities of MDIO.\n", "\n", "For demonstration purposes, we will ingest the **remote** Teapot Dome open-source dataset.\n", "The dataset details and licensing can be found at the [SEG Wiki](https://wiki.seg.org/wiki/Teapot_dome_3D_survey).\n", "\n", "We are using the 3D seismic stack dataset named `filt_mig.sgy`.\n", "\n", "The full HTTP link for the dataset (hosted on AWS): http://s3.amazonaws.com/teapot/filt_mig.sgy\n", "\n", "```{warning}\n", "For plotting and remote ingestion the notebook requires [Matplotlib](https://matplotlib.org/) and `aiohttp`\n", "as a dependency. Please install it before executing using `pip install matplotlib aiohttp` or\n", " `conda install matplotlib aiohttp`.\n", "```\n", "\n", "## Defining the SEG-Y Dataset\n", "\n", "Since MDIO 0.8 we can directly ingest remote SEG-Y files! The file is 386 MB in size. To make the header scan performant we can also\n", "set up an environment variable for MDIO. See here for when to use this:\n", "[Buffered Reads in Ingestion](https://mdio-python.readthedocs.io/en/stable/usage.html#buffered-reads-in-ingestion).\n", "\n", "The dataset is irregularly shaped, however it is padded to a rectangle with zeros (dead traces). We will see that later at the live mask plotting.\n", "\n", "The following environment variables are essential here:\n", "- `MDIO__IMPORT__CLOUD_NATIVE` tells MDIO to do buffered reads for headers due to remote file.\n", "- `MDIO__IMPORT__SAVE_SEGY_FILE_HEADER` makes MDIO save the SEG-Y specific file headers (text, binary) which is not strictly necessary for consumption and is disabled by default." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ[\"MDIO__IMPORT__CLOUD_NATIVE\"] = \"true\"\n", "os.environ[\"MDIO__IMPORT__SAVE_SEGY_FILE_HEADER\"] = \"true\"\n", "\n", "input_url = \"http://s3.amazonaws.com/teapot/filt_mig.sgy\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ingesting to MDIO\n", "\n", "To do this, we can use the convenient SEG-Y to MDIO converter.\n", "\n", "The `inline` and `crossline` values are located at bytes `181` and `185`. Note that this doesn't match any SEG-Y standards.\n", "\n", "MDIO uses [TGSAI/segy][segy-github] to parse the SEG-Y; the field names conform to its canonical keys defined in [SEGY Binary Header Keys][segy-binary-keys] and [SEGY Trace Header Keys][segy-trace-keys]. Since MDIO v1 we also introduced templates for common seismic data types. For instance, we will be using the `PostStack3DTime` template [here][mdio-template-poststack3dtime], which expects the same canonical keys.\n", "\n", "We will also specify the units for the time domain. The spatial units will be automatically parsed from SEG-Y binary header. 
However, the value may be corrupt in the file; for that case, see the [Fixing X/Y Units Issues](https://mdio-python.readthedocs.io/en/stable/tutorials/corrupt_files.html#fixing-x-y-units-issues) section.\n", "\n", "In summary, we will use the byte locations defined above for ingestion.\n", "\n", "[segy-github]: https://github.com/TGSAI/segy\n", "[segy-binary-keys]: https://github.com/TGSAI/segy/blob/main/src/segy/standards/fields/binary.py\n", "[segy-trace-keys]: https://github.com/TGSAI/segy/blob/main/src/segy/standards/fields/trace.py\n", "[mdio-template-poststack3dtime]: https://github.com/TGSAI/mdio-python/blob/main/src/mdio/builder/templates/seismic_3d_poststack.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from segy.schema import HeaderField\n", "from segy.standards import get_segy_standard\n", "\n", "from mdio import segy_to_mdio\n", "from mdio.builder.schemas.v1.units import TimeUnitModel\n", "from mdio.builder.template_registry import get_template\n", "\n", "teapot_trace_headers = [\n", "    HeaderField(name=\"inline\", byte=181, format=\"int32\"),\n", "    HeaderField(name=\"crossline\", byte=185, format=\"int32\"),\n", "    HeaderField(name=\"cdp_x\", byte=189, format=\"int32\"),\n", "    HeaderField(name=\"cdp_y\", byte=193, format=\"int32\"),\n", "]\n", "\n", "rev0_segy_spec = get_segy_standard(0)\n", "teapot_segy_spec = rev0_segy_spec.customize(trace_header_fields=teapot_trace_headers)\n", "\n", "mdio_template = get_template(\"PostStack3DTime\")\n", "unit_ms = TimeUnitModel(time=\"ms\")\n", "mdio_template.add_units({\"time\": unit_ms})\n", "\n", "segy_to_mdio(\n", "    input_path=input_url,\n", "    output_path=\"filt_mig.mdio\",\n", "    segy_spec=teapot_segy_spec,\n", "    mdio_template=mdio_template,\n", "    overwrite=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It only took a few seconds to ingest, since this is a very small file.\n", "\n", 
"However, MDIO scales up to TB (that's ~1,000 GB) sized volumes!\n", "\n", "## Opening the Ingested MDIO File\n", "\n", "Let's open the MDIO file with the `open_mdio` function. This will return a pretty [`xarray`][xarray] representation with our standardized format.\n", "\n", "[xarray]: https://docs.xarray.dev" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mdio import open_mdio\n", "\n", "dataset = open_mdio(\"filt_mig.mdio\")\n", "dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying Metadata\n", "\n", "Now let's look at the file text header saved in the `segy_file_header` metadata variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(dataset[\"segy_file_header\"].attrs[\"textHeader\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we saved the binary header, we can look at that as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset[\"segy_file_header\"].attrs[\"binaryHeader\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MDIO Grid, Dimensions, and Related Attributes\n", "\n", "MDIO has an abstraction for an N-dimensional grid.\n", "MDIO also has named dimensions, so we can see which dimension (axis) corresponds to which name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.sizes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.crossline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetching Data and Plotting\n", "\n", "Now we will demonstrate getting an inline from MDIO.\n", "\n", "Since MDIO v1 we are using Xarray under the hood, so we can use its convenient indexing. 
It also handles\n", "plotting with proper dimension coordinate labels.\n", "\n", "MDIO stores summary statistics. We can calculate the standard deviation (std) of the dataset from them, via the identity `std = sqrt(sum_squares/count - (sum/count)^2)`, to adjust the display gain." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mdio.builder.schemas.v1.stats import SummaryStatistics\n", "\n", "stats = SummaryStatistics.model_validate_json(dataset[\"amplitude\"].attrs[\"statsV1\"])\n", "std = ((stats.sum_squares / stats.count) - (stats.sum / stats.count) ** 2) ** 0.5\n", "\n", "il_dataset = dataset.sel(inline=278)\n", "il_amp = il_dataset[\"amplitude\"].T\n", "il_amp.plot(vmin=-2 * std, vmax=2 * std, cmap=\"gray_r\", yincrease=False);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do the same with a time slice.\n", "\n", "We will display the two-way time at 1,000 ms.\n", "\n", "Note that since we parse the X/Y coordinates, we can plot the time slice in real-world coordinates." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "twt_data = dataset[\"amplitude\"].sel(time=1000)\n", "twt_data.plot(vmin=-2 * std, vmax=2 * std, cmap=\"gray_r\", x=\"cdp_x\", y=\"cdp_y\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also overlay the live mask on the time slice. However, in this example, the dataset is zero-padded.\n", "\n", "As a result, the live trace mask will always show True (yellow)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trace_mask = dataset.trace_mask[:]\n", "\n", "twt_data.plot(vmin=-2 * std, vmax=2 * std, cmap=\"gray_r\", x=\"cdp_x\", y=\"cdp_y\", alpha=0.5, figsize=(8, 5))\n", "trace_mask.plot(vmin=0, vmax=1, x=\"cdp_x\", y=\"cdp_y\", alpha=0.5);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query Headers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can query headers for the whole dataset very quickly because they are separated from the seismic wavefield.\n", "\n", "Let's get all the headers for X and Y coordinates in this dataset.\n", "\n", "Note that the header maps will still share the geometry/grid of the dataset!\n", "\n", "The `compute` property fetches the lazily opened MDIO values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.headers[\"cdp_x\"].compute()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.headers[\"cdp_y\"].compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we mentioned before, we can also get specific dataset slices of headers while fetching a slice.\n", "\n", "Let's fetch a crossline; we are still using some previous parameters.\n", "\n", "Since the sliced dataset contains the headers as well, we can plot the headers on top of the image.\n", "\n", "Full headers can be mapped and plotted as well, but we won't demonstrate that here." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xl_dataset = dataset.sel(crossline=100) # slices everything available in MDIO dataset!\n", "\n", "cdp_x_header = xl_dataset[\"cdp_x\"]\n", "cdp_y_header = xl_dataset[\"cdp_y\"]\n", "image = xl_dataset[\"amplitude\"].T\n", "\n", "# Build plot from here\n", "gs_kw = {\"height_ratios\": (1, 8)}\n", "fig, (hdr_ax, img_ax) = plt.subplots(2, 1, figsize=(6, 6), gridspec_kw=gs_kw, sharex=\"all\")\n", "hdr_ax2 = hdr_ax.twinx()\n", "\n", "cdp_x_header.plot(ax=hdr_ax, c=\"red\")\n", "cdp_y_header.plot(ax=hdr_ax2, c=\"black\")\n", "image.plot(ax=img_ax, vmin=-2 * std, vmax=2 * std, cmap=\"gray_r\", yincrease=False, add_colorbar=False)\n", "img_ax.set_title(\"\")\n", "hdr_ax.set_xlabel(\"\")\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MDIO to SEG-Y Conversion\n", "\n", "Finally, let's demonstrate going back to SEG-Y.\n", "\n", "We will use the convenient `mdio_to_segy` function and write it out as a round-trip file.\n", "\n", "The output spec can be modified if we want to write things to different byte locations, etc, but we will use the same one as before." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mdio import mdio_to_segy\n", "\n", "mdio_to_segy(\n", " input_path=\"filt_mig.mdio\",\n", " output_path=\"filt_mig_roundtrip.sgy\",\n", " segy_spec=teapot_segy_spec,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validate Round-Trip SEG-Y File\n", "\n", "We can validate if the round-trip SEG-Y file is matching the original using [TGSAI/segy][segy-github].\n", "\n", "Step by step:\n", "* Open original file\n", "* Open round-trip file\n", "* Compare text headers\n", "* Compare binary headers\n", "* Compare 100 random headers and traces\n", "\n", "[segy-github]: https://github.com/TGSAI/segy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from segy import SegyFile\n", "\n", "original_segy = SegyFile(input_url)\n", "roundtrip_segy = SegyFile(\"filt_mig_roundtrip.sgy\")\n", "\n", "# Compare text header\n", "assert original_segy.text_header == roundtrip_segy.text_header\n", "\n", "# Compare bin header\n", "assert original_segy.binary_header == roundtrip_segy.binary_header\n", "\n", "# Compare 100 random trace headers and traces\n", "rng = np.random.default_rng()\n", "rand_indices = rng.integers(low=0, high=original_segy.num_traces, size=100)\n", "for idx in rand_indices:\n", " np.testing.assert_equal(original_segy.trace[idx], roundtrip_segy.trace[idx])\n", "\n", "print(\"Files identical!\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" }, "mystnb": { "execution_mode": "force" } }, "nbformat": 4, 
"nbformat_minor": 4 }