"""
Download data
=============

This module provides functions to download Mojito L1 data files from the CSC's
Nextcloud server (brick market) and retrieve source parameters, handling
authentication and caching.

.. attention::

    **Publications using Mojito data are currently not allowed!** Please keep in
    touch, as publication policies will soon be published.

.. note::

    Authentication using a personal Nextcloud access token is required to
    download Mojito data. Please refer to :ref:`authentication` for more
    information on how to generate an access token and authenticate.

Downloading bricks
------------------

The main function to download data files is :func:`download_brick`, which takes
the type of brick (mbhb, gb, noise, combined, etc.) and an optional source
identifier as arguments.

The function constructs the appropriate download URL based on the brick type and
source identifier, then downloads the file using authenticated access to the
Nextcloud server. The files are cached locally to avoid redundant downloads,
such that only a path to the cached file is returned on subsequent calls.

.. code-block:: python

    from mojito.download import download_brick

    # Download mbhb brick for source ID 12
    brick_path = download_brick("mbhb", source_id=12)

You can provide authentication credentials via the keyword parameters
``username`` and ``token``. If they are not provided, the function looks for the
environment variables ``MOJITO_USERNAME`` and ``MOJITO_TOKEN``. If these are not
set either, you are prompted to enter your credentials. See
:ref:`authentication` for more details on how to generate an access token and
authenticate.

.. code-block:: python

    # Download mbhb brick for source ID 12 with explicit credentials
    brick_path = download_brick(
        "mbhb", source_id=12, username="my_username", token="my_token"
    )

By default, the latest version of the brick is downloaded. To download a
specific version, you can use the ``version`` parameter.

.. code-block:: python

    newest_brick_path = download_brick("mbhb", source_id=12, version="latest")
    older_brick_path = download_brick("mbhb", source_id=12, version=2)

By default, the full official L1 version of the brick is downloaded. To download
a reduced version (with downsampled data and potentially missing quantities,
such as LTTs and orbits), you can use the ``reduced`` parameter.

.. code-block:: python

    full_brick_path = download_brick("mbhb", source_id=12, reduced=False)
    reduced_brick_path = download_brick("mbhb", source_id=12, reduced=True)

.. warning::

    Reduced versions are not official L1 files and are not representative of
    real data, as sampling rates are not correct and some quantities might be
    missing. Use them only for testing and development purposes.

.. autofunction:: download_brick

Source parameters
-----------------

You can retrieve source parameters from the relevant catalog files using the
:func:`get_source_params` function. This function downloads the catalog if
needed and extracts the parameters for the specified source identifier.

.. code-block:: python

    from mojito.download import get_source_params

    # Get source parameters for mbhb brick, source ID 12
    params = get_source_params("mbhb", source_id=12)

You can also download the entire catalog file using the :func:`download_catalog`
function.

.. code-block:: python

    from mojito.download import download_catalog

    # Download mbhb catalog
    catalog_path = download_catalog("mbhb")

By default, the latest version of the catalog is downloaded. To download a
specific version, you can use the ``version`` parameter.

.. code-block:: python

    newest_catalog_path = download_catalog("mbhb", version="latest")
    older_catalog_path = download_catalog("mbhb", version=2)

.. autofunction:: download_catalog

.. autofunction:: get_source_params

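
The per-source lookup done by :func:`get_source_params` can be pictured with
plain Python containers. The following is only a sketch: ``extract_params`` and
the parameter names are hypothetical, and lists stand in for the datasets of the
catalog's ``Binaries`` group, assuming ``source_id`` is used as a zero-based
index into each dataset.

```python
def extract_params(columns, source_id):
    # ``columns`` maps parameter names to equal-length sequences; it is a
    # stand-in for the datasets stored in the catalog file.
    params = {}
    for name, values in columns.items():
        # Valid indices run from 0 to len(values) - 1.
        if not 0 <= source_id < len(values):
            raise ValueError(f"No entry for source {source_id} in '{name}'")
        params[name] = values[source_id]
    return params


# Hypothetical two-source catalog
catalog = {"Mass1": [1.0e6, 2.0e6], "Mass2": [3.0e5, 4.0e5]}
print(extract_params(catalog, 1))
```

The result is one value per parameter name, which matches the dictionary
returned by :func:`get_source_params`.
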
.. _authentication:

Authentication
--------------

Mojito requires authentication to access the Nextcloud server hosting the data
files. This is done using a username and an access token.

Generate access tokens
~~~~~~~~~~~~~~~~~~~~~~

You first need to create an access token from the Nextcloud web portal (see
:data:`NEXTCLOUD_BASE_URL`). Go to your account settings, navigate to the
"Security" section, and create a new application password (also known as an
access token). You can use any application name; we recommend using "Mojito".

*Remember to store this token securely, as it will not be shown again.* You can
save it as an environment variable; see below.

Authentication methods
~~~~~~~~~~~~~~~~~~~~~~

You can provide the username and token to Mojito in three ways:

1. **Environment Variables**: Set the environment variables ``MOJITO_USERNAME``
   and ``MOJITO_TOKEN`` with your credentials. Mojito will automatically use
   these values when downloading data.

2. **Function Parameters**: Pass the ``username`` and ``token`` parameters
   directly to the download functions like :func:`download_brick`,
   :func:`download_catalog`, or :func:`get_source_params`. When both are given,
   they take precedence over the environment variables.

3. **User Prompt**: If neither environment variables nor function parameters are
   provided, Mojito will prompt you to enter your username and token
   interactively when a download is attempted.

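
The lookup order above can be sketched as a small pure function. This is an
illustration, not the code used by Mojito: ``resolve_credentials`` and its
``env`` argument are hypothetical, and the third return value only reports which
source was selected instead of actually prompting.

```python
import os


def resolve_credentials(username=None, token=None, env=None):
    # ``env`` defaults to the real environment; a dict can be passed for testing.
    env = os.environ if env is None else env

    # 1. Explicit function parameters win when both are given.
    if username is not None and token is not None:
        return username, token, "parameters"

    # 2. Otherwise, both values are taken from the environment variables.
    env_username = env.get("MOJITO_USERNAME")
    env_token = env.get("MOJITO_TOKEN")
    if env_username is not None and env_token is not None:
        return env_username, env_token, "environment"

    # 3. Nothing usable was found: the caller would have to prompt the user.
    return None, None, "prompt"


fake_env = {"MOJITO_USERNAME": "your_username", "MOJITO_TOKEN": "your_token"}
print(resolve_credentials(env=fake_env)[2])
print(resolve_credentials("alice", "secret", env=fake_env)[2])
```

Note that passing only one of ``username`` or ``token`` is treated here as if
neither had been passed, which is how the description of
:func:`get_credentials` reads.
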
.. tip::

    You can save your username and token in your shell's configuration file
    to avoid entering them each time. For example, add the following lines to
    your ``.bashrc`` or ``.zshrc`` file:

    .. code-block:: bash

        # .bashrc or .zshrc
        # ...
        export MOJITO_USERNAME="your_username"
        export MOJITO_TOKEN="your_token"

.. autofunction:: get_credentials

Caching
-------

By default, files are cached in the directory following the `XDG Base Directory
Specification <https://specifications.freedesktop.org/basedir/latest/>`_. This
can be overridden by setting the environment variable ``MOJITO_CACHE_DIR`` or by
passing a custom path as ``cache_dir`` when calling :func:`download_brick` or
:func:`download_catalog`.

.. warning::

    Do not move or rename the cached files, as this will prevent Mojito from
    locating them in future calls.

.. note::

    If you are using Mojito on a computing cluster, we suggest changing the
    cache directory to a shared filesystem location, to avoid downloading the
    same files multiple times for different compute nodes or users.

.. autofunction:: get_cache_dir

.. autofunction:: clear_cache

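
The precedence between the explicit path, ``MOJITO_CACHE_DIR`` and the default
location can be sketched as follows. ``resolve_cache_dir``, its ``env`` argument
and the ``default`` placeholder are hypothetical; the real default is
platform-dependent, so the value used here is only an assumption for
illustration.

```python
import os


def resolve_cache_dir(cache_dir=None, env=None, default="~/.cache/mojito"):
    # ``default`` is a placeholder, not necessarily the path Mojito uses.
    env = os.environ if env is None else env

    # 1. An explicit path always wins.
    if cache_dir is not None:
        return cache_dir

    # 2. Then the environment variable, if set.
    from_env = env.get("MOJITO_CACHE_DIR")
    if from_env is not None:
        return from_env

    # 3. Finally the default location.
    return default


# On a cluster, a shared location could be selected once per job
# (the path below is made up):
shared_env = {"MOJITO_CACHE_DIR": "/shared/project/mojito-cache"}
print(resolve_cache_dir(env=shared_env))
```
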
Downloading arbitrary files
---------------------------

The :func:`download_file` function can be used to download any file from the
Nextcloud server, given its URL. With ``unsafe=True``, it can also download
files that are not listed in the registry. Be cautious when doing so, as this
bypasses the registry check and checksum verification. Always ensure that you
are downloading from trusted sources when using this option.

.. autofunction:: download_file

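
When a file is fetched without checksum verification, its hash can still be
checked by hand against a value obtained from a trusted source. The sketch below
uses only the standard library; ``sha256_of`` and ``matches_known_hash`` are
hypothetical helpers, and the choice of SHA-256 is an assumption about how the
reference hash was computed.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    # Read the file in chunks so that large bricks need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_known_hash(path, known_hash):
    # Accept both "sha256:<hex>" and bare "<hex>" spellings, case-insensitively.
    expected = known_hash.split(":", 1)[-1].lower()
    return sha256_of(path) == expected
```

A mismatch means that the cached file should not be trusted and is best deleted
and downloaded again.
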
Other download methods
----------------------

In addition to the provided functions, a command-line interface is available. It
allows you to download bricks and catalogs, or clear the cache, directly from
the terminal. Please consult :doc:`cli` for more information.

You can also download files manually using a web browser or command-line tools
like ``rclone``. The base URL for accessing the `globalstorage` directory on
Nextcloud is given by :data:`NEXTCLOUD_BASE_URL`.

.. autodata:: NEXTCLOUD_BASE_URL
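
For manual downloads with WebDAV-capable tools, the per-user endpoint follows
the pattern ``<base URL>/remote.php/dav/files/<username>``, as used internally
by this module. The sketch below builds such a URL; ``webdav_url`` is a
hypothetical helper, the file path is made up, and the percent-encoding step is
an addition that this module does not perform itself.

```python
from urllib.parse import quote

NEXTCLOUD_BASE_URL = "https://nextcloud-dcc-fi-csc-okd-globalstorage1.2.rahtiapp.fi"


def webdav_url(username, relative_path):
    # Per-user WebDAV endpoint on the Nextcloud server.
    endpoint = f"{NEXTCLOUD_BASE_URL}/remote.php/dav/files/{quote(username)}"
    # Percent-encode the path, keeping "/" as the separator.
    return f"{endpoint}/{quote(relative_path.lstrip('/'))}"


# Hypothetical username and file path
print(webdav_url("your_username", "some/folder/file.h5"))
```
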
"""
import getpass
import logging
import shutil
from os import environ
from os.path import expanduser
from typing import Any, Literal
import h5py
import platformdirs
import pooch
from .registry import FILE_REGISTRY
logger = logging.getLogger(__name__)
NEXTCLOUD_BASE_URL = "https://nextcloud-dcc-fi-csc-okd-globalstorage1.2.rahtiapp.fi"
"""Base URL for `globalstorage` on Nextcloud (rahtiapp.fi)."""
def get_cache_dir(cache_dir: str | None = None) -> str:
    """Get the cache directory for Mojito data files.

    If ``cache_dir`` is provided, it is used directly. Otherwise, first check
    the environment variable ``MOJITO_CACHE_DIR``. If not set, use the default
    location from :func:`platformdirs.user_cache_dir` with "mojito" as the
    appname.

    Parameters
    ----------
    cache_dir
        The cache directory path, or None to use the environment variable or
        default location.

    Returns
    -------
    The cache directory path.
    """
    if cache_dir is not None:
        logger.debug("Using explicit cache directory: %s", cache_dir)
        return cache_dir

    cache_dir = environ.get("MOJITO_CACHE_DIR")
    if cache_dir is not None:
        logger.debug("Using cache directory from environment: %s", cache_dir)
        return cache_dir

    cache_dir = platformdirs.user_cache_dir("mojito")
    logger.debug("Using XDG cache directory: %s", cache_dir)
    return cache_dir

def clear_cache(cache_dir: str | None = None) -> None:
    """Delete all cached Mojito data files.

    .. warning::

        This will remove all files in the Mojito cache directory. Large files
        might need to be re-downloaded afterwards.

    Parameters
    ----------
    cache_dir
        The cache directory path, or None to use the environment variable or
        default location. See :func:`get_cache_dir`.
    """
    cache_dir = get_cache_dir(cache_dir)
    cache_dir = expanduser(cache_dir)
    logger.info("Deleting Mojito cache directory: %s", cache_dir)
    shutil.rmtree(cache_dir, ignore_errors=True)

def get_credentials(
    username: str | None = None,
    token: str | None = None,
) -> tuple[str, str]:
    """Get Nextcloud credentials for authentication.

    If ``username`` and ``token`` are both provided, use them directly. If one
    of them is not provided, first check the environment variables
    ``MOJITO_USERNAME`` and ``MOJITO_TOKEN`` and return them if both are set. If
    still not set, prompt the user for input.

    Parameters
    ----------
    username
        The Nextcloud username for authentication.
    token
        The Nextcloud token for authentication.

    Returns
    -------
    username
        The Nextcloud username for authentication.
    token
        The Nextcloud token for authentication.
    """
    # If credentials not provided, check environment variables
    if username is None or token is None:
        username = environ.get("MOJITO_USERNAME")
        token = environ.get("MOJITO_TOKEN")
        if username is not None and token is not None:
            logger.debug("Using Nextcloud credentials from environment variables")

    # If still not provided, prompt the user
    if username is None or token is None:
        username = input("Enter Nextcloud username: ")
        token = getpass.getpass("Enter Nextcloud token: ")
        if username is not None and token is not None:
            logger.debug("Using Nextcloud credentials provided by user input")

    return username, token

def _nextcloud_download_endpoint(username: str) -> str:
    """Get the Nextcloud download endpoint for a given username.

    Parameters
    ----------
    username
        The Nextcloud username.

    Returns
    -------
    The Nextcloud download endpoint URL for the specified username.
    """
    return f"{NEXTCLOUD_BASE_URL}/remote.php/dav/files/{username}"

def download_file(
    download_url: str,
    cache_dir: str | None = None,
    username: str | None = None,
    token: str | None = None,
    *,
    progressbar: bool = True,
    unsafe: bool = False,
) -> str:
    """Download a Mojito data file from Nextcloud.

    .. warning::

        Using ``unsafe=True`` bypasses the registry check and checksum
        verification, which can lead to security risks if downloading files from
        untrusted sources. Use with caution and only for trusted URLs.

    Parameters
    ----------
    download_url
        The URL of the data file to download, relative to the user's Nextcloud
        download endpoint.
    cache_dir
        The cache directory path, or None to use the environment variable or
        default location. See :func:`get_cache_dir`.
    username
        The Nextcloud username for authentication. If None, use the environment
        variable or prompt the user. See :func:`get_credentials`.
    token
        The Nextcloud token/password for authentication. If None, use the
        environment variable or prompt the user. See :func:`get_credentials`.
    progressbar
        Whether to show a progress bar during download.
    unsafe
        Whether to use pooch's ``retrieve`` function instead of ``fetch``. This
        bypasses the registry check, so that files not listed in the registry
        can be downloaded, and skips checksum verification.

    Returns
    -------
    Path to the downloaded data file.
    """
    logger.info("Downloading %s...", download_url)

    # Create HTTP downloader with authentication
    username, token = get_credentials(username, token)
    downloader = pooch.HTTPDownloader(
        auth=(username, token),
        progressbar=progressbar,
    )
    logger.debug("Using Nextcloud username '%s' for authentication", username)

    # Construct download endpoint
    endpoint = _nextcloud_download_endpoint(username)
    logger.debug("Using Nextcloud download endpoint: %s", endpoint)

    # Use 'retrieve' method if specified
    if unsafe:
        logger.info("Using 'retrieve' method to download file without registry check")
        cache_dir = get_cache_dir(cache_dir)
        url = f"{endpoint}/{download_url}"
        path = pooch.retrieve(
            url, downloader=downloader, known_hash=None, path=cache_dir
        )
        return path

    # Create Pooch fetcher
    fetcher = pooch.create(
        path=get_cache_dir(cache_dir),
        base_url=endpoint,
        registry=FILE_REGISTRY.pooch_registry,
    )

    # Fetch the file
    try:
        path = fetcher.fetch(download_url, downloader=downloader)
    except Exception as e:
        logger.error("Failed to download %s: %s", download_url, e)
        raise

    # Warn that publications cannot be made using Mojito data
    logger.warning(
        "WARNING! Publications using Mojito data are currently not allowed! "
        "Please keep in touch, as publication policies will soon be published."
    )

    return path

SignalBrickType = Literal["emri", "mbhb", "sobhb", "vgb", "gb"]
"""Type of Mojito signal brick."""
NoiseBrickType = Literal["noise"]
"""Type of Mojito noise brick."""
CombinedBrickType = Literal["combined"]
"""Type of Mojito combined brick."""
BrickType = SignalBrickType | NoiseBrickType | CombinedBrickType
"""Type of Mojito data brick (signal, noise or combined)."""
def download_brick(
    brick: BrickType,
    source_id: int | None = None,
    *,
    version: int | Literal["latest"] = "latest",
    reduced: bool = False,
    cache_dir: str | None = None,
    username: str | None = None,
    token: str | None = None,
) -> str:
    """Download a Mojito data file for the specified brick type and source ID.

    Parameters
    ----------
    brick
        The type of data brick to download.
    source_id
        The source identifier to download data for. Only applicable for
        source-specific bricks like "mbhb", "emri", or "gb".
    version
        The version of the brick to download. Can be an integer version number
        or "latest" to download the most recent version.
    reduced
        Whether to download the reduced version instead of the full, official
        brick. Reduced versions have downsampled data and may be missing some
        quantities. They are not official L1 files and should only be used for
        testing and development purposes.
    cache_dir
        The cache directory path, or None to use the environment variable or
        default location. See :func:`get_cache_dir`.
    username
        The Nextcloud username for authentication. If None, use the environment
        variable or prompt the user. See :func:`get_credentials`.
    token
        The Nextcloud token/password for authentication. If None, use the
        environment variable or prompt the user. See :func:`get_credentials`.

    Returns
    -------
    Path to the downloaded data file.
    """
    if reduced:
        logger.warning(
            "WARNING! Reduced versions are not official L1 files and are not "
            "representative of real data, as sampling rates are not correct "
            "and some quantities might be missing. Use them only for testing "
            "and development purposes."
        )

    download_url = FILE_REGISTRY.get_brick_path(
        brick, source_id, version=version, reduced=reduced
    )
    path = download_file(download_url, cache_dir, username, token)
    return path

def download_catalog(
    brick: SignalBrickType,
    *,
    version: int | Literal["latest"] = "latest",
    cache_dir: str | None = None,
    username: str | None = None,
    token: str | None = None,
) -> str:
    """Download a catalog for the specified brick type.

    Parameters
    ----------
    brick
        The type of catalog to download.
    version
        The version of the catalog to download. Can be an integer version number
        or "latest" to download the most recent version.
    cache_dir
        The cache directory path, or None to use the environment variable or
        default location. See :func:`get_cache_dir`.
    username
        The Nextcloud username for authentication. If None, use the environment
        variable or prompt the user. See :func:`get_credentials`.
    token
        The Nextcloud token/password for authentication. If None, use the
        environment variable or prompt the user. See :func:`get_credentials`.

    Returns
    -------
    Path to the downloaded catalog file.
    """
    download_url = FILE_REGISTRY.get_catalog_path(brick, version=version)
    path = download_file(download_url, cache_dir, username, token)
    return path

def get_source_params(
    brick: SignalBrickType,
    source_id: int,
    *,
    version: int | Literal["latest"] = "latest",
    cache_dir: str | None = None,
    username: str | None = None,
    token: str | None = None,
) -> dict[str, Any]:
    """Get source parameters from relevant catalog.

    If needed, this function downloads the source parameter catalog for the
    specified brick type; it then retrieves parameters for the given source ID.

    Parameters
    ----------
    brick
        The type of catalog to retrieve parameters from.
    source_id
        The source identifier to retrieve parameters for.
    version
        The version of the catalog to download. Can be an integer version number
        or "latest" to download the most recent version.
    cache_dir
        The cache directory path, or None to use the environment variable or
        default location. See :func:`get_cache_dir`.
    username
        The Nextcloud username for authentication. If None, use the environment
        variable or prompt the user. See :func:`get_credentials`.
    token
        The Nextcloud token/password for authentication. If None, use the
        environment variable or prompt the user. See :func:`get_credentials`.

    Returns
    -------
    Source parameters for the specified source ID.
    """
    params = {}

    catalog_path = download_catalog(
        brick,
        version=version,
        cache_dir=cache_dir,
        username=username,
        token=token,
    )

    with h5py.File(catalog_path, "r") as f:
        binaries_group = f["Binaries"]
        assert isinstance(binaries_group, h5py.Group)

        for item in binaries_group:
            dataset = binaries_group[item]
            assert isinstance(dataset, h5py.Dataset)
            # source_id is used as an index, so it must be smaller than the
            # dataset size
            if dataset.size <= source_id:
                raise ValueError(
                    f"Cannot retrieve params for '{brick}' (source {source_id})"
                )
            params[item] = dataset[source_id]

    return params