Data Resource
Author(s) | Paul Walsh, Rufus Pollock |
---|---|
Profile | data-resource.json |
A simple format to describe and package a single data resource such as an individual table or file.
Language
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.
Introduction
The Data Resource format describes a data resource such as an individual file or table. The essence of a Data Resource is a locator for the data it describes. A range of other properties can be declared to provide a richer set of metadata.
Examples
A minimal Data Resource looks as follows:
With data accessible via the local filesystem.
With data accessible via http.
A minimal Data Resource pointing to some inline data looks as follows.
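As a rough sketch of the three minimal variants above (local path, http URL, inline data), the descriptors could be built as Python dictionaries and serialized to JSON. The resource name, file path, URL, and data values are hypothetical placeholders, not values taken from the specification.

```python
import json

# Hypothetical minimal resources; every name and value here is a placeholder.
local_resource = {"name": "solar-system", "path": "data/solar-system.csv"}
remote_resource = {"name": "solar-system", "path": "http://example.com/solar-system.csv"}
inline_resource = {
    "name": "solar-system",
    "data": [{"planet": "Mercury", "moons": 0}, {"planet": "Earth", "moons": 1}],
}


def to_descriptor(resource: dict) -> str:
    """Serialize a resource to the JSON text that would be stored as its descriptor."""
    return json.dumps(resource, indent=2)


for resource in (local_resource, remote_resource, inline_resource):
    print(to_descriptor(resource))
```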
A comprehensive Data Resource example with all required, recommended and optional properties looks as follows.
Descriptor
A Data Resource descriptor MUST be a valid JSON object. (JSON is defined in RFC 4627.)
Key properties of the descriptor are described below. A descriptor MAY include any number of properties in addition to those described below as required and optional properties.
Data Location
A resource MUST contain a property describing the location of the data associated with the resource. The location of resource data MUST be specified by the presence of one (and only one) of these two properties:
- `path`: for data in files located online or locally on disk.
- `data`: for data inline in the descriptor itself.
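The "one and only one" rule can be checked mechanically. The following is a minimal sketch, assuming the descriptor has already been parsed into a Python dictionary; the function name `data_location` is hypothetical.

```python
def data_location(descriptor: dict) -> str:
    """Return "path" or "data", whichever single location property is present."""
    present = [key for key in ("path", "data") if key in descriptor]
    if len(present) != 1:
        # Either both properties or neither were supplied.
        raise ValueError(
            "descriptor must contain exactly one of 'path' or 'data', "
            f"found: {present or 'neither'}"
        )
    return present[0]
```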
path
Data in Files
`path` MUST be a string, or an array of strings (see “Data in Multiple Files”). Each string MUST be a “url-or-path” as defined in the next section.
URL or Path
A “url-or-path” is a string with the following additional constraints:
- It MUST be either a URL or a POSIX path.
- URLs MUST be fully qualified and MUST use either the http or https scheme. (A string without a scheme MUST be a POSIX path.)
- POSIX paths (unix-style, with `/` as separator) are supported for referencing local files, with the security restriction that they MUST be relative siblings or children of the descriptor. Absolute paths (`/`), relative parent paths (`../`), and hidden folders starting with a dot (`.hidden`) MUST NOT be used.
Examples:
`/` (absolute path) and `../` (relative parent path) are forbidden to avoid security vulnerabilities when implementing data package software. These limitations on resource `path` ensure that resource paths only point to files within the data package directory and its subdirectories. This prevents data package software being exploited by a malicious user to gain unintended access to sensitive information.
For example, suppose a data package hosting service stores packages on disk and allows access via an API. A malicious user uploads a data package with a resource path like `/etc/passwd`. The user then requests the data for that resource and the server naively opens `/etc/passwd` and returns that data to the caller.
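The url-or-path constraints can be turned into a validation step that runs before any file is opened. The sketch below is illustrative only: the function name `is_url_or_path` is hypothetical, and it makes the slightly stricter assumption that any path segment starting with a dot (not only a folder) is rejected.

```python
from urllib.parse import urlparse


def is_url_or_path(value: str) -> bool:
    """Check a string against the url-or-path constraints (illustrative only)."""
    if not isinstance(value, str) or not value:
        return False
    parsed = urlparse(value)
    if parsed.scheme:
        # A scheme is present, so this must be a fully qualified http(s) URL.
        return parsed.scheme in ("http", "https") and bool(parsed.netloc)
    # No scheme: treat the value as a POSIX path relative to the descriptor.
    if value.startswith("/"):
        return False  # absolute path
    segments = value.split("/")
    if ".." in segments:
        return False  # relative parent path
    # Assumption: reject every segment that starts with a dot, except "." itself.
    return not any(seg.startswith(".") and seg != "." for seg in segments)
```

With a check like this in place, the hosting service in the scenario above would reject `/etc/passwd` before attempting to read it.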
Prior to release 1.0.0-beta.18 (Nov 17 2016) there was a `url` property distinct from `path`. In order to support backwards compatibility, implementors MAY want to automatically convert a `url` property to a `path` property and issue a warning.
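One way an implementation might perform this conversion is sketched below. The function name `upgrade_url_property` is hypothetical, and the choice to leave a descriptor untouched when it already has `path` or `data` is an assumption, since the text does not say how to handle that case.

```python
import warnings


def upgrade_url_property(descriptor: dict) -> dict:
    """Return a copy of the descriptor with a legacy "url" property moved to "path"."""
    # Assumption: only convert when no current location property is present.
    if "url" not in descriptor or "path" in descriptor or "data" in descriptor:
        return dict(descriptor)
    upgraded = {key: value for key, value in descriptor.items() if key != "url"}
    upgraded["path"] = descriptor["url"]
    warnings.warn(
        "the 'url' property is deprecated; it was converted to 'path'",
        DeprecationWarning,
        stacklevel=2,
    )
    return upgraded
```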
Data in Multiple Files
Usually, a resource will have only a single file associated to it. However, sometimes it can be convenient to have a single resource whose data is split across multiple files — perhaps the data is large and having it in one file would be inconvenient.
To support this use case the `path` property MAY be an array of strings rather than a single string:
It is NOT permitted to mix fully qualified URLs and relative paths in a `path` array: strings MUST either all be relative paths or all URLs.
NOTE: All files in the array MUST be similar in terms of structure, format etc. Implementors MUST be able to concatenate together the files in the simplest way and treat the result as one large file. For tabular data there is the issue of header rows. See the Tabular Data Package spec for more on this.
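A minimal sketch of both rules follows: rejecting a mixed array, and concatenating local files byte for byte in array order. The function names and the `base_dir` parameter are hypothetical, and the plain concatenation deliberately ignores the header-row issue mentioned above.

```python
from pathlib import Path
from typing import List, Union


def normalize_paths(path: Union[str, List[str]]) -> List[str]:
    """Return the path value as a list, rejecting a mix of URLs and relative paths."""
    paths = [path] if isinstance(path, str) else list(path)
    is_url = [p.startswith(("http://", "https://")) for p in paths]
    if any(is_url) and not all(is_url):
        raise ValueError("a path array must not mix URLs and relative paths")
    return paths


def read_concatenated(base_dir: str, paths: List[str]) -> bytes:
    """Concatenate local files in array order and return them as one byte string."""
    # Assumption: paths are relative to base_dir, the directory of the descriptor.
    return b"".join((Path(base_dir) / p).read_bytes() for p in paths)
```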
data
Inline Data
Rather than being stored in external files, resource data can be shipped inline on a Resource using the `data` property.
The value of the data property can be any type of data. However, restrictions of JSON require that the value be a string, so binary data needs to be encoded (e.g. to Base64). Information on the type and encoding of the value of the data property SHOULD be provided by the format (or mediatype) property and the encoding property.
Specifically, the value of the data property MUST be:
- EITHER: a JSON array or object; the data is then assumed to be JSON data and SHOULD be processed as such
- OR: a JSON string; in this case the format or mediatype properties MUST be provided.
Thus, a consumer of a resource object MAY assume, if no format or mediatype property is provided, that the data is JSON and attempt to process it as such.
Example 1 - inline JSON:
Example 2 - inline CSV:
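To illustrate both cases, here is a hedged sketch of how a consumer might interpret inline data: arrays and objects are taken as JSON, while strings are parsed according to `format` or `mediatype`. The function name `read_inline_data` and the sample values are hypothetical, and only CSV and JSON strings are handled.

```python
import csv
import io
import json


def read_inline_data(descriptor: dict):
    """Interpret the inline "data" property of a parsed descriptor."""
    data = descriptor["data"]
    if isinstance(data, (list, dict)):
        return data  # JSON array or object: already usable as JSON data
    if not isinstance(data, str):
        raise ValueError("inline data must be a JSON array, object, or string")
    fmt = descriptor.get("format")
    mediatype = descriptor.get("mediatype")
    if fmt is None and mediatype is None:
        raise ValueError("string data requires a 'format' or 'mediatype' property")
    if fmt == "csv" or mediatype == "text/csv":
        return list(csv.DictReader(io.StringIO(data)))
    if fmt == "json" or mediatype == "application/json":
        return json.loads(data)
    return data  # other formats are returned as the raw string in this sketch


inline_json = {"name": "points", "data": [{"x": 1, "y": 2}]}
inline_csv = {"name": "points", "format": "csv", "data": "x,y\n1,2\n"}
```

Note that values read from the CSV string stay as text (`"1"`, not `1`); type conversion would be the job of a schema.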
Metadata Properties
Required Properties
A descriptor MUST contain the following properties:
name
A resource MUST contain a `name` property. The name is a simple name or identifier to be used for this resource.
- It MUST be unique amongst all resources in this data package.
- It SHOULD be human-readable and consist only of lowercase alphanumeric characters plus “.”, “-” and “_”.
- It would be usual for the name to correspond to the file name (minus the extension) of the data file the resource describes.
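The character rule and the uniqueness rule for `name` can be sketched as two small helpers. The names `is_recommended_name` and `duplicate_names` are hypothetical; the first checks the SHOULD-level character set, the second reports names that break the MUST-level uniqueness rule within a package.

```python
import re
from typing import Iterable, List

# Lowercase alphanumerics plus ".", "-" and "_".
NAME_PATTERN = re.compile(r"[a-z0-9._-]+")


def is_recommended_name(name: str) -> bool:
    """True if the name uses only the recommended characters."""
    return isinstance(name, str) and NAME_PATTERN.fullmatch(name) is not None


def duplicate_names(resources: Iterable[dict]) -> List[str]:
    """Return the sorted names that occur more than once in a list of resources."""
    seen, duplicates = set(), set()
    for resource in resources:
        name = resource.get("name")
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    return sorted(duplicates)
```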
Recommended Properties
profile
A string identifying the profile of this descriptor as per the profiles specification.
Examples:
Optional Properties
A descriptor MAY contain any number of additional properties. Common properties include:
- `title`: a title or label for the resource.
- `description`: a description of the resource.
- `format`: ‘csv’, ‘xls’, ‘json’ etc. Would be expected to be the standard file extension for this type of resource.
- `mediatype`: the mediatype/mimetype of the resource e.g. “text/csv”, or “application/vnd.ms-excel”. Mediatypes are maintained by the Internet Assigned Numbers Authority (IANA) in a media type registry.
- `encoding`: the character encoding of the resource’s data file (only applicable for textual files). The value SHOULD be one of the “Preferred MIME Names” for a character encoding registered with IANA. If no value for this property is specified then the encoding SHOULD be detected on the implementation level. It is RECOMMENDED to use UTF-8 (without BOM) as a default encoding for textual files.
- `bytes`: size of the file in bytes.
- `hash`: the MD5 hash for this resource. Other algorithms can be indicated by prefixing the hash’s value with the algorithm name in lower-case. For example:
- `sources`: as for Data Package metadata.
- `licenses`: as for Data Package metadata. If not specified the resource inherits from the data package.
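As an illustration of the `bytes` and `hash` properties, the sketch below computes both from a file's content. The function name `describe_bytes` is hypothetical, and the colon used to separate the algorithm name from the digest is an assumption, since the text above only says that the algorithm name is used as a prefix.

```python
import hashlib


def describe_bytes(content: bytes, algorithm: str = "md5") -> dict:
    """Return "bytes" and "hash" values for the given file content."""
    digest = hashlib.new(algorithm, content).hexdigest()
    # MD5 is the default and is written bare; other algorithms get a prefix.
    # Assumption: the prefix is joined to the digest with a colon.
    value = digest if algorithm == "md5" else f"{algorithm}:{digest}"
    return {"bytes": len(content), "hash": value}
```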
Resource Schemas
A Data Resource MAY have a `schema` property to describe the schema of the resource data.
The value for the `schema` property on a resource MUST be an object representing the schema OR a string that identifies the location of the schema.
If a string, it must be a url-or-path as defined above, that is, a fully qualified http or https URL or a relative POSIX path. The file at the location specified by this url-or-path string MUST be a JSON document containing the schema.
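The object-or-string rule for `schema` could be handled along these lines. This is a sketch under stated assumptions: the function name `resolve_schema` and the `base_dir` parameter are hypothetical, relative paths are resolved against the directory holding the descriptor, and fetching a schema from a URL is left out.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_schema(descriptor: dict, base_dir: str = ".") -> Optional[dict]:
    """Return the schema object for a resource, loading it from a local path if needed."""
    schema = descriptor.get("schema")
    if schema is None or isinstance(schema, dict):
        return schema  # absent, or already an inline schema object
    if not isinstance(schema, str):
        raise ValueError("'schema' must be an object or a string")
    if schema.startswith(("http://", "https://")):
        raise NotImplementedError("fetching remote schemas is left out of this sketch")
    # Relative POSIX path: the file must be a JSON document containing the schema.
    with open(Path(base_dir) / schema, encoding="utf-8") as handle:
        return json.load(handle)
```

A real implementation would also validate the string as a url-or-path before opening anything.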
NOTE: the Data Package specification places no restrictions on the form of the schema Object. This flexibility enables specific communities to define schemas appropriate for the data they manage. As an example, the Tabular Data Package specification requires the schema to conform to Table Schema.