Usage

This plugin is intended to help data analysts more easily share and archive project data, using AWS S3 as a centralized data store.

Setup

Integrating a project with S3 involves a few steps:

  • Initialize the project
  • Check S3 to ensure a new project won’t overwrite a pre-existing project on S3 [1]
  • Update AWS configurations, as needed
  • Exclude the project’s data/ directory from version control (see Version control and data).

Initialize

To initialize:

$ cd /path/to/my-project
$ datakit data init

The data init command creates:

  • data/ - a directory where data files should be placed. This directory will be synced to the S3 bucket and path specified in the project configuration file (see below).

  • config/datakit-data.json - a JSON file with the below settings:

    {
      "aws_user_profile": "default",
      "s3_bucket": "",
      "s3_path": "my-project"
    }
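
After running data init, the project should contain roughly the following layout (a sketch showing only the files the command creates):

my-project/
├── config/
│   └── datakit-data.json
└── data/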
    

Note

datakit-data does not currently provide safeguards against accidental overwrites of previously created projects (on S3) with an identical name. Users should always double-check the target S3 bucket to ensure that a project path has not already been used.
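
For example, you can double-check whether a path is already in use by listing it with the AWS CLI (assuming the CLI is installed; my-data-projects-bucket and my-project are placeholder names):

$ aws s3 ls s3://my-data-projects-bucket/my-project/

If the command returns no objects, the path has not been used.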

Configure

Project-level settings for S3 integration must be updated before data can be pushed to S3.

These configurations can be found in config/datakit-data.json:

aws_user_profile
The user profile configured in ~/.aws/credentials. datakit-data assumes the default profile, but you can change this value if you have multiple profiles.
s3_bucket
Name of S3 bucket where project data should be stored. By default this is an empty string.
s3_path
The S3 bucket path to which the local data/ directory should be mapped. By default, datakit-data maps the local data/ directory to a folder named after the project’s root folder.
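
For example, a completed project configuration that pushes data to a hypothetical bucket named my-data-projects-bucket using the default AWS profile might look like this:

# config/datakit-data.json
{
  "aws_user_profile": "default",
  "s3_bucket": "my-data-projects-bucket",
  "s3_path": "my-project"
}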

Default configurations

As a convenience, datakit-data provides the ability to pre-configure default settings for AWS integration. This feature helps speed up S3 integration for new projects.

Default values for the aws_user_profile and s3_bucket settings mentioned in Configure can be placed in ~/.datakit/plugins/datakit-data/config.json. These configurations will then be applied to all projects when datakit data init is run.

Example

Below is an example showing pre-configured values for the S3 bucket name and an alternative AWS user profile:

# ~/.datakit/plugins/datakit-data/config.json
{
  "aws_user_profile": "other_profile",
  "s3_bucket": "my-data-projects-bucket"
}

Custom S3 paths

datakit-data provides two additional settings, only available at the global config level, to help customize the generation of the S3 path across projects.

These settings are only applied during S3 initialization. They can be overridden at any point by manually editing config/datakit-data.json for a given project.

s3_path_prefix
one or more directory levels to be prepended to a project config’s S3 path
s3_path_suffix
one or more directory levels to be appended to a project config’s S3 path

The prefix/suffix settings are useful when project data must be stored somewhere other than a project directory at the root of an S3 bucket.

For example, to store data in an S3 bucket at the following path:

projects/2017/my-project

…you would set s3_path_prefix to projects/2017. This path would then be prepended to the project’s name in the s3_path configuration whenever a new project is initialized.
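
In the global configuration, that would look like the following sketch (other settings omitted; projects/2017 is just an example value):

# ~/.datakit/plugins/datakit-data/config.json
{
  "s3_path_prefix": "projects/2017"
}

A project named my-project initialized afterward would then have "s3_path": "projects/2017/my-project" in its config/datakit-data.json.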

Similarly, you can segregate data assets inside of a project directory on S3 by using the s3_path_suffix. For example, to store data at the below path:

my-project/data

…you would set s3_path_suffix to data/.

And of course, you can use both of these settings in tandem:

projects/2017/my-project/data
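
As a sketch, a global configuration combining the example values above (other settings omitted):

# ~/.datakit/plugins/datakit-data/config.json
{
  "s3_path_prefix": "projects/2017",
  "s3_path_suffix": "data/"
}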

Data push/pull

Note

The below commands must be run from a directory initialized and configured for use with S3 (see Initialize for details).

Pushing and pulling data between your local machine and the S3 data store requires two commands:

$ datakit data push
$ datakit data pull

The above commands provide a human-friendly interface to the AWS S3 sync command line utility.

The sync utility writes all files in a project’s local data/ directory (and its subdirectories) to the S3 bucket and path specified in config/datakit-data.json, or vice versa.
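
Put another way, push and pull are roughly equivalent to the following direct AWS CLI calls (a sketch, assuming the example configuration shown in Configure):

# datakit data push is roughly:
$ aws s3 sync data/ s3://my-data-projects-bucket/my-project/ --profile default

# datakit data pull is roughly:
$ aws s3 sync s3://my-data-projects-bucket/my-project/ data/ --profile default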

By default, this command does not delete previously written files in a target location if they have been removed in the source location.

This functionality is available, however, via the --delete flag of the underlying AWS S3 sync utility. datakit-data provides access to the --delete flag and a limited set of other options supported by the sync command (see Extra flags).

Extra flags

While datakit-data is intended to simplify and standardize working with S3 as a data store, it also exposes a subset of more advanced options for the underlying AWS S3 sync utility.

Users can pass any boolean flag supported by S3 sync to the plugin’s push or pull commands.

Boolean flags are those that do not accept values (i.e. passing them simply toggles a behavior on or off).

The flags must be passed to datakit as additional parameters without leading dashes [2].

For example, to delete files on S3 that are not present locally:

$ datakit data push delete

To view which files will be affected before pushing data to S3:

$ datakit data push dryrun

or

$ datakit data push delete dryrun
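
For reference, that last command corresponds roughly to the following direct AWS CLI call (again assuming the example configuration from Configure):

$ aws s3 sync data/ s3://my-data-projects-bucket/my-project/ --delete --dryrun --profile default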

Please refer to the AWS S3 sync documentation for details on other boolean flags.

Version control and data

This plugin expects data files associated with a project to live in a data/ directory at the root of a project folder. This is typically the root of a code repository.

While code to acquire, clean and analyze data should be placed under version control, the data/ directory itself should be excluded from version control.

Note

Version control systems have different mechanisms to prevent files from being “tracked”. Git users, for instance, should add the data/ directory to a project’s .gitignore file.
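
For example, a minimal .gitignore entry that excludes the data directory:

# .gitignore
data/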

[1]datakit-data does not currently guard against overwrites of pre-existing projects of the same name.
[2]Leading dashes must be dropped to enable datakit to differentiate between its own flags and those intended for pass-through to the underlying AWS S3 sync utility.