Source code for mojito.download

"""
Download data
=============

This module provides functions to download Mojito L1 data files from the CSC's
Nextcloud server (brick market) and retrieve source parameters, handling
authentication and caching.

.. attention::

    **Publications using Mojito data are currently not allowed!** Please keep in
    touch, as publication policies will soon be published.

.. note::

    Authentication using a personal Nextcloud access token is required to
    download Mojito data. Please refer to :ref:`authentication` for more
    information on how to generate an access token and authenticate.

Downloading bricks
------------------

The main function to download data files is :func:`download_brick`, which takes
the type of brick (mbhb, gb, noise, combined, etc.) and an optional source
identifier as arguments.

The function constructs the appropriate download URL based on the brick type and
source identifier, then downloads the file using authenticated access to the
Nextcloud server. The files are cached locally to avoid redundant downloads,
such that only a path to the cached file is returned on subsequent calls.

.. code-block:: python

    from mojito.download import download_brick

    # Download mbhb brick for source ID 12
    brick_path = download_brick("mbhb", source_id=12)

You can provide authentication credentials via keyword parameters ``username``
and ``token``. If not provided, the function will look for environment variables
``MOJITO_USERNAME`` and ``MOJITO_TOKEN``. If still not found, the user will be
prompted to enter them. Go to :ref:`authentication` for more details on how to
generate an ccess token and authenticate.

.. code-block:: python

    # Download mbhb brick for source ID 12 with explicit credentials
    brick_path = download_brick(
        "mbhb", source_id=12, username="my_username", token="my_token"
    )

By default, the latest version of the brick is downloaded. To download a
specific version, you can use the ``version`` parameter.

.. code-block:: python

    newest_brick_path = download_brick("mbhb", source_id=12, version="latest")
    older_brick_path = download_brick("mbhb", source_id=12, version=2)

By default, the full official L1 version of the brick is downloaded. To download
a reduced version (with downsampled data and potentially missing quantities,
such as LTTs and orbits), you can use the ``reduced`` parameter.

.. code-block:: python

    full_brick_path = download_brick("mbhb", source_id=12, reduced=False)
    reduced_brick_path = download_brick("mbhb", source_id=12, reduced=True)

.. warning::

    Reduced versions are not official L1 files and are not representative of
    real data, as sampling rates are not correct and some quantities might be
    missing. Use them only for testing and development purposes.

.. autofunction:: download_brick

Source parameters
-----------------

You can retrieve source parameters from the relevant catalog files using the
:func:`get_source_params` function. This function downloads the catalog if
needed and extracts the parameters for the specified source identifier.

.. code-block:: python

    from mojito.download import get_source_params

    # Get source parameters for mbhb brick, source ID 12
    params = get_source_params("mbhb", source_id=12)

You can also download the entire catalog file using the :func:`download_catalog`
function.

.. code-block:: python

    from mojito.download import download_catalog

    # Download mbhb catalog
    catalog_path = download_catalog("mbhb")

By default, the latest version of the catalog is downloaded. To download a
specific version, you can use the ``version`` parameter.

.. code-block:: python

    newest_catalog_path = download_catalog("mbhb", version="latest")
    older_catalog_path = download_catalog("mbhb", version=2)

.. autofunction:: download_catalog

.. autofunction:: get_source_params

.. _authentication:

Authentication
--------------

Mojito requires authentication to access the Nextcloud server hosting the data
files. This is done using a username and an access token.

Generate access tokens
~~~~~~~~~~~~~~~~~~~~~~

You first need to create an access token from the NextCloud Web Portal (see
:data:`NEXTCLOUD_BASE_URL`). Go to your account settings, navigate to the
"Security" section, and create a new application password (aka access token).

You can use any application name; we recommend using "Mojito".

*Remember to store this token securely, as it will not be shown again.* You can
save it as an environment variable; see below.

Authentication methods
~~~~~~~~~~~~~~~~~~~~~~

You can provide the username and token to Mojito in three ways:

1. **Environment Variables**: Set the environment variables ``MOJITO_USERNAME``
   and ``MOJITO_TOKEN`` with your credentials. Mojito will automatically use
   these values when downloading data.

2. **Function Parameters**: Pass the ``username`` and ``token`` parameters
   directly to the download functions like :func:`download_brick`,
   :func:`download_catalog`, or :func:`get_source_params`. This method overrides
   any environment variable settings.

3. **User Prompt**: If neither environment variables nor function parameters are
   provided, Mojito will prompt you to enter your username and token
   interactively when a download is attempted.

.. tip::

    You can save your username and token in your shell's configuration file
    to avoid entering them each time. For example, add the following lines to
    your ``.bashrc`` or ``.zshrc`` file:

    .. code-block:: bash

        # .bashrc or .zshrc
        # ...
        export MOJITO_USERNAME="your_username"
        export MOJITO_TOKEN="your_token"

.. autofunction:: get_credentials

Caching
-------

By default, files are cached in the directory following the `XDG Base Directory
Specification <https://specifications.freedesktop.org/basedir/latest/>`_. This
can be overridden by setting the environment variable ``MOJITO_CACHE_DIR`` or by
passing a custom path when calling :func:`download_brick` or
:func:`download_catalog`.

.. warning::

    Do not move or rename the cached files, as this will prevent Mojito from
    locating them in future calls.

.. note::

    We suggest to change the cache directory to a shared filesystem location if
    you are using Mojito on a computing cluster, to avoid downloading the same
    files multiple times for different compute nodes or users.

.. autofunction:: get_cache_dir

.. autofunction:: clear_cache

Downloading arbitrary files
---------------------------

The :func:`download_file` function can be used to download any file from the
Nextcloud server, given its URL. This can be useful for downloading files that
are not listed in the registry, but be cautious when using this method, as it
can bypass security checks. Always ensure that you are downloading from trusted
sources when using this function.

.. autofunction:: download_file

Other download methods
----------------------

In addition to the provided functions, a command-line interface is available. It
allows you to download bricks, catalogs, or clear the cache directly from the
terminal. Please consult :doc:`cli` for more information.

One can also download files manually using a web browser or command-line tools
like ``rclone``. The base URL for accessing the `globalstorage` directory on
Nextcloud is given by :data:`NEXTCLOUD_BASE_URL`.

.. autodata:: NEXTCLOUD_BASE_URL

"""

import getpass
import logging
import shutil
from os import environ
from os.path import expanduser
from typing import Any, Literal

import h5py
import platformdirs
import pooch

from .registry import FILE_REGISTRY

logger = logging.getLogger(__name__)


NEXTCLOUD_BASE_URL = "https://nextcloud-dcc-fi-csc-okd-globalstorage1.2.rahtiapp.fi"
"""Base URL for `globalstorage` on Nextcloud (rahtiapp.fi)."""


[docs] def get_cache_dir(cache_dir: str | None = None) -> str: """Get the cache directory for Mojito data files. If ``cache_dir`` is provided, it is used directly. Otherwise, first check the environment variable ``MOJITO_CACHE_DIR``. If not set, use the default location from :func:`platformdirs.user_cache_dir` with "mojito" as the appname. Parameters ---------- cache_dir The cache directory path, or None to use the environment variable or default location. Returns ------- The cache directory path. """ if cache_dir is not None: logger.debug("Using explicit cache directory: %s", cache_dir) return cache_dir cache_dir = environ.get("MOJITO_CACHE_DIR") if cache_dir is not None: logger.debug("Using cache directory from environment: %s", cache_dir) return cache_dir cache_dir = platformdirs.user_cache_dir("mojito") logger.debug("Using XDG cache directory: %s", cache_dir) return cache_dir
[docs] def clear_cache(cache_dir: str | None = None) -> None: """Delete all cached Mojito data files. .. warning:: This will remove all files in the Mojito cache directory. Large files might need to be re-downloaded afterwards. """ cache_dir = get_cache_dir(cache_dir) cache_dir = expanduser(cache_dir) logger.info("Deleting Mojito cache directory: %s", cache_dir) shutil.rmtree(cache_dir, ignore_errors=True)
[docs] def get_credentials( username: str | None = None, token: str | None = None, ) -> tuple[str, str]: """Get Nextcloud credentials for authentication. If ``username`` and ``token`` are provided, use them directly. If one of them is not provided, first check the environment variables ``MOJITO_USERNAME`` and ``MOJITO_TOKEN`` and return them if both are set. If still not set, prompt the user for input. Parameters ---------- username The Nextcloud username for authentication. token The Nextcloud token for authentication. Returns ------- username The Nextcloud username and token for authentication. token The Nextcloud token for authentication. """ # If credentials not provided, check environment variables if username is None or token is None: username = environ.get("MOJITO_USERNAME") token = environ.get("MOJITO_TOKEN") if username is not None and token is not None: logger.debug("Using Nextcloud credentials from environment variables") # If still not provided, prompt the user if username is None or token is None: username = input("Enter Nextcloud username: ") token = getpass.getpass("Enter Nextcloud token: ") if username is not None and token is not None: logger.debug("Using Nextcloud credentials provided by user input") return username, token
def _nextcloud_download_endpoint(username: str) -> str: """Get the Nextcloud download endpoint for a given username. Parameters ---------- username The Nextcloud username. Returns ------- The Nextcloud download endpoint URL for the specified username. """ return f"{NEXTCLOUD_BASE_URL}/remote.php/dav/files/{username}"
[docs] def download_file( download_url: str, cache_dir: str | None = None, username: str | None = None, token: str | None = None, *, progressbar: bool = True, unsafe: bool = False, ) -> str: """Download a Mojito data file from Nextcloud. .. warning:: Using ``unsafe=True`` bypasses the registry check and checksum verification, which can lead to security risks if downloading files from untrusted sources. Use with caution and only for trusted URLs. Parameters ---------- download_url The URL of the data file to download. cache_dir The cache directory path, or None to use the default XDG cache directory. username The Nextcloud username for authentication. If None, use the environment variable or prompt the user. See :func:`get_credentials`. token The Nextcloud token/password for authentication. If None, use the environment variable or prompt the user.. See :func:`get_credentials`. progressbar Whether to show a progress bar during download. unsafe Whether to use pooch's ``retrieve`` method instead of ``fetch``. This can be used to bypass the registry check and download files not listed in it, avoid checksum verification, and calculate and print the file's hash after download. Returns ------- Path to the downloaded data file. """ logger.info("Downloading %s...", download_url) # Create HTTP downloader with authentication username, token = get_credentials(username, token) downloader = pooch.HTTPDownloader( auth=(username, token), progressbar=progressbar, ) logger.debug("Using Nextcloud username '%s' for authentication", username) # Construct download endpoint endpoint = _nextcloud_download_endpoint(username) logger.debug("Using Nextcloud download endpoint: %s", endpoint) # Use 'retrieve' method if specified if unsafe: logger.info("Using 'retrieve' method to download file without registry check") cache_dir = get_cache_dir(cache_dir) url = f"{endpoint}/{download_url}" path = pooch.retrieve( url, downloader=downloader, known_hash=None, path=cache_dir ) return path # Create Pooch fetcher fetcher = pooch.create( path=get_cache_dir(cache_dir), base_url=endpoint, registry=FILE_REGISTRY.pooch_registry, ) # Fetch the file try: path = fetcher.fetch(download_url, downloader=downloader) except Exception as e: logger.error("Failed to download %s: %s", download_url, e) raise # Warn that publications cannot be made using Mojito data logger.warning( "WARNING! Publications using Mojito data are currently not allowed! " "Please keep in touch, as publication policies will soon be published." ) return path
SignalBrickType = Literal["emri", "mbhb", "sobhb", "vgb", "gb"] """Type of Mojito signal brick.""" NoiseBrickType = Literal["noise"] """Type of Mojito noise brick.""" CombinedBrickType = Literal["combined"] """Type of Mojito combined brick.""" BrickType = SignalBrickType | NoiseBrickType | CombinedBrickType """Type of Mojito data brick (signal, noise or combined)."""
[docs] def download_brick( brick: BrickType, source_id: int | None = None, *, version: int | Literal["latest"] = "latest", reduced: bool = False, cache_dir: str | None = None, username: str | None = None, token: str | None = None, ) -> str: """Download a Mojito data file for the specified brick type and source ID. Parameters ---------- brick The type of data brick to download. source_id The source identifier to download data for. Only applicable for source-specific bricks like "mbhb", "emri", or "gb". version The version of the brick to download. Can be an integer version number or "latest" to download the most recent version. reduced Whether to download the reduced version instead of the full, official brick. Reduced versions have downsampled data and may be missing some quantities. They are not official L1 files and should only be used for testing and development purposes. cache_dir The cache directory path, or None to use the default XDG cache directory. username The Nextcloud username for authentication. If None, use the environment variable or prompt the user. See :func:`get_credentials`. token The Nextcloud token/password for authentication. If None, use the environment variable or prompt the user. See :func:`get_credentials`. Returns ------- Path to the downloaded data file. """ if reduced: logger.warning( "WARNING! Reduced versions are not official L1 files and are not " "representative of real data, as sampling rates are not correct " "and some quantities might be missing. Use them only for testing " "and development purposes." ) download_url = FILE_REGISTRY.get_brick_path( brick, source_id, version=version, reduced=reduced ) path = download_file(download_url, cache_dir, username, token) return path
[docs] def download_catalog( brick: SignalBrickType, *, version: int | Literal["latest"] = "latest", cache_dir: str | None = None, username: str | None = None, token: str | None = None, ) -> str: """Download a catalog for the specified brick type. Parameters ---------- brick The type of catalog to download. version The version of the catalog to download. Can be an integer version number or "latest" to download the most recent version. cache_dir The cache directory path, or None to use the default XDG cache directory. username The Nextcloud username for authentication. If None, use the environment variable or prompt the user. See :func:`get_credentials`. token The Nextcloud token/password for authentication. If None, use the environment variable or prompt the user. See :func:`get_credentials`. Returns ------- Path to the downloaded catalog file. """ download_url = FILE_REGISTRY.get_catalog_path(brick, version=version) path = download_file(download_url, cache_dir, username, token) return path
[docs] def get_source_params( brick: SignalBrickType, source_id: int, *, version: int | Literal["latest"] = "latest", cache_dir: str | None = None, username: str | None = None, token: str | None = None, ) -> dict[str, Any]: """Get source parameters from relevant catalog. If needed, this function downloads the source parameter catalog for the specified brick type; then retrieves parameters for the given source ID. Parameters ---------- brick The type of catalog to retrieve parameters from. source_id The source identifier to retreive parameters for. version The version of the catalog to download. Can be an integer version number or "latest" to download the most recent version. cache_dir The cache directory path, or None to use the default XDG cache directory. username The Nextcloud username for authentication. If None, use the environment variable or prompt the user. See :func:`get_credentials`. token The Nextcloud token/password for authentication. If None, use the environment variable or prompt the user. See :func:`get_credentials`. Returns ------- Source parameters for the specified source ID. """ params = {} catalog_path = download_catalog( brick, version=version, cache_dir=cache_dir, username=username, token=token, ) with h5py.File(catalog_path, "r") as f: binaries_group = f["Binaries"] assert isinstance(binaries_group, h5py.Group) for item in binaries_group: dataset = binaries_group[item] assert isinstance(dataset, h5py.Dataset) if dataset.size < source_id: raise ValueError( "Cannot retrieve params for '{brick}' (source {source_id})" ) params[item] = dataset[source_id] return params