intake sqlite

An Intake driver that provides access to SQLite databases identified by either local or remote URLs.

REPO IS ARCHIVED

After experimenting for a while, we ultimately decided not to use Intake catalogs to
distribute our data, so this repository is no longer maintained.

See PUDL Data Access <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_access.html>__
for more information on how to access the data we publish.

SQLite Driver for Intake Data Catalogs

… readme-intro

… image:: https://github.com/catalyst-cooperative/intake-sqlite/workflows/tox-pytest/badge.svg
:target: https://github.com/catalyst-cooperative/intake-sqlite/actions?query=workflow%3Atox-pytest
:alt: Tox-PyTest Status

… image:: https://img.shields.io/codecov/c/github/catalyst-cooperative/intake-sqlite?style=flat&logo=codecov
:target: https://codecov.io/gh/catalyst-cooperative/intake-sqlite
:alt: Codecov Test Coverage

… image:: https://img.shields.io/readthedocs/intake-sqlite?style=flat&logo=readthedocs
:target: https://intake-sqlite.readthedocs.io/en/latest/
:alt: Read the Docs Build Status

… image:: https://img.shields.io/pypi/v/intake-sqlite?style=flat&logo=python
:target: https://pypi.org/project/intake-sqlite
:alt: PyPI Latest Version

… image:: https://img.shields.io/conda/vn/conda-forge/intake-sqlite?style=flat&logo=condaforge
:target: https://anaconda.org/conda-forge/intake-sqlite
:alt: conda-forge Version

… image:: https://img.shields.io/pypi/pyversions/intake-sqlite?style=flat&logo=python
:target: https://pypi.org/project/intake-sqlite
:alt: Supported Python Versions

… image:: https://img.shields.io/badge/code style-black-000000.svg
:target: https://github.com/psf/black>
:alt: Any color you want, so long as it’s black.

This package provides a (very) thin wrapper around the more general intake-sql <https://github.com/intake/intake-sql>__ driver, which can be used to generate Intake data catalogs <https://github.com/intake/intake>__ from SQL databases.

The intake-sql driver takes an SQL Alchemy database URL <https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls>__ and uses it to
connect to and extract data from the database. This works with just fine with
SQLite databases <https://www.sqlite.org/index.html>__, but only when the database
file is stored locally and can be referenced with a simple path.

For example this path::

/home/zane/code/catalyst/pudl-work/sqlite/pudl.sqlite

would correspond to this SQL Alchemy database URL::

sqlite:///home/zane/code/catalyst/pudl-work/sqlite/pudl.sqlite

But you can’t access a remote SQLite DB this way.

Why access a remote SQLite DB?

  • SQLite databases are great standalone, standardized containers for relational data,
    that can be accessed using a huge variety of tools on almost any computer platform.
    They are even accepted as an archival format by the US Library of Congress!
  • Data evolves over time, and it’s often useful to have easy access to several
    different versions of it, and to know exactly which version you’re working with.
  • Cloud object storage is extremely cheap and convenient, and makes it easy to
    publish and store historical file-based data releases.
  • Managing your own bespoke local filesystem hierarchy filled with data – and
    coordinating with colleagues so that everyone is using the same filesystem
    organizational scheme – is a pain.
  • Intake catalogs can provide easy access to metadata and let you manage data versions
    just like software versions. Installing a new version of the data catalog points you
    at the new version of the data.
  • The overhead and cost associated with setting up and maintaining a database that uses
    a client-server model is relatively large compared to distributing a few files that
    change infrequently, are essentially read-only resources, and only take up a few
    gigabytes of space.

How does it work?

Rather than using an SQL Alchemy database URL to reference the SQLite DB, this intake
driver takes a local path or a remote URL, like:

  • ../pudl-work/sqlite/pudl.sqlite
  • https://global-power-plants.datasettes.com/global-power-plants.db
  • s3://cloudy-mc-cloudface-databucket/v1.2.3/mydata.db

For local paths, it resolves the path and prepends sqlite:// before handing it off
to intake-sql to do all the hard work.

For remote URLs it uses fsspec <https://filesystem-spec.readthedocs.io/en/latest/>__
to cache a local copy <https://filesystem-spec.readthedocs.io/en/latest/features.html?highlight=simplecache#caching-files-locally>__
of the database, and then gives intake-sql a database URL that points to the cached
copy.

… code:: python

import intake_sqlite

gpp_cat = intake_sqlite.SQLiteCatalog(
urlpath=“https://global-power-plants.datasettes.com/global-power-plants.db”,
storage_options={“simplecache”: {“cache_storage”: “/home/zane/.cache/intake”}},
)

list(gpp_cat)

[‘global-power-plants’,

‘global-power-plants_fts’,

‘global-power-plants_fts_config’,

‘global-power-plants_fts_data’,

‘global-power-plants_fts_docsize’,

‘global-power-plants_fts_idx’]

About Catalyst Cooperative

Catalyst Cooperative <https://catalyst.coop>__ is a small group of data
wranglers and policy wonks organized as a worker-owned cooperative consultancy.
Our goal is a more just, livable, and sustainable world. We integrate public
data and perform custom analyses to inform public policy (Hire us! <https://catalyst.coop/hire-catalyst>__). Our focus is primarily on mitigating
climate change and improving electric utility regulation in the United States.

Contact Us

  • For general support, questions, or other conversations around the project
    that might be of interest to others, check out the
    GitHub Discussions <https://github.com/catalyst-cooperative/pudl/discussions>__
  • If you’d like to get occasional updates about our projects
    sign up for our email list <https://catalyst.coop/updates/>__.
  • Want to schedule a time to chat with us one-on-one? Join us for
    Office Hours <https://calend.ly/catalyst-cooperative/pudl-office-hours>__
  • Follow us on Twitter: @CatalystCoop <https://twitter.com/CatalystCoop>__
  • More info on our website: https://catalyst.coop
  • For private communication about the project or to hire us to provide customized data
    extraction and analysis, you can email the maintainers:
    [email protected] <mailto:[email protected]>__