File transfer
Two different concepts are involved in storing files in InvenioRDM. One is the backend, meaning the actual technology used to store a file, for example the local file system or S3. The other is the origin, also known as the transfer method used to transport the files. Four such methods are defined:
- Local, which represents the files that are managed by the InvenioRDM instance, independently of the backend.
- Fetch, these are files that are not immediately managed by the instance as they need to be downloaded first. This means that they will eventually become local files.
- Multipart, these are files that are uploaded in parts. Users can upload parts in parallel or can retransmit each part if the upload fails, for example due to network errors. After upload, the parts are assembled into a single file and the file becomes a local file.
- Remote, these are represented by a reference to an external storage system. Since the files are not managed by the instance there is no possible way to guarantee their availability or integrity.
These transfer mechanisms are stored in the transfer.type attribute of the file model and are represented by a one-character encoding:
Type | Representation |
---|---|
Local | L |
Fetch | F |
Multipart | M |
Remote | R |
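As a small illustration of the encoding in the table above, the codes can be mirrored in client code with an enumeration. This is only a sketch: `TransferType` and `describe` are hypothetical names chosen for this example and are not part of InvenioRDM.

```python
from enum import Enum


class TransferType(str, Enum):
    """One-character transfer codes, as listed in the table above.

    Hypothetical client-side helper, not an InvenioRDM class.
    """

    LOCAL = "L"
    FETCH = "F"
    MULTIPART = "M"
    REMOTE = "R"


def describe(code: str) -> str:
    """Return the readable name for a transfer.type code.

    Raises ValueError for a code that is not in the table.
    """
    return TransferType(code).name.capitalize()
```

For example, `describe("F")` gives the readable name of the fetch transfer, and an unknown code raises `ValueError`.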
Example of selecting transfer type on file creation:
POST /api/records/{id}/draft/files
Content-Type: application/json
[{
  "key": "dataset.zip",
  "transfer": {
    "type": "F",
    "url": "https://example.org/files/dataset.zip?token=<auth token>"
  },
  "metadata": {...}
}]
Local files (L)
Local files are managed as defined in the records and drafts reference section.
Files fetching (F)
During initialization, fetched files are created using the same protocol as local files.
Additionally, you need to provide a transfer object with type and url fields.
Parameters
Name | Type | Location | Description |
---|---|---|---|
type | string | body | "F" |
url | string | body | URL to fetch the file from |
The url must be a URL that is accessible from the server's network and resolves to a file that can be fetched. No authentication mechanism (e.g. an Authorization header) is supported for the fetch request, so any authentication has to be part of the URL itself (e.g. a token passed in the query string).
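A minimal sketch of building such a request body on the client side follows. It assumes the credential is passed as a `token` query parameter, as in the example URL used in this document; `fetch_entry` is a hypothetical helper name, and the token value is a placeholder.

```python
import json
from typing import Optional
from urllib.parse import urlencode, urlsplit, urlunsplit


def fetch_entry(key: str, url: str, token: Optional[str] = None) -> dict:
    """Build one file entry that requests a fetch ("F") transfer.

    If a token is given, it is appended to the query string, because
    the fetch request cannot carry an Authorization header.
    The parameter name "token" is an assumption for this sketch.
    """
    if token is not None:
        parts = urlsplit(url)
        extra = urlencode({"token": token})
        # Preserve any query parameters already present in the URL.
        query = f"{parts.query}&{extra}" if parts.query else extra
        url = urlunsplit(parts._replace(query=query))
    return {"key": key, "transfer": {"type": "F", "url": url}}


# JSON body for POST /api/records/{id}/draft/files (a list of entries).
body = json.dumps(
    [fetch_entry("dataset.zip", "https://example.org/files/dataset.zip", token="placeholder")]
)
```

The resulting `body` string can be sent with any HTTP client, using the `Content-Type: application/json` header.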
Request
POST /api/records/{id}/draft/files HTTP/1.1
Content-Type: application/json
[
  {
    "key": "dataset.zip",
    "transfer": {
      "type": "F",
      "url": "https://example.org/files/dataset.zip?token=<auth token>"
    }
  },
  ...
]
Response
HTTP/1.1 201 CREATED
Content-Type: application/json
{
  "enabled": true,
  "default_preview": null,
  "order": [],
  "entries": [
    {
      "key": "dataset.zip",
      "updated": "2020-11-27 11:17:11.002624",
      "created": "2020-11-27 11:17:10.998919",
      "metadata": null,
      "status": "pending",
      "transfer": {
        "type": "F"
      },
      "links": {
        "content": "/api/records/{id}/draft/files/dataset.zip/content",
        "self": "/api/records/{id}/draft/files/dataset.zip",
        "commit": "/api/records/{id}/draft/files/dataset.zip/commit"
      }
    }
  ],
  "links": {
    "self": "/api/records/{id}/draft/files"
  }
}
Note: The response does not contain the URL of the fetched file. This is intentional as the URL might contain sensitive information (e.g. a token) that should not be exposed to users.
At this point an asynchronous task is launched and the file is transported into the InvenioRDM instance. Once the file transfer has finished, the status field changes to completed and the transfer.type of the file changes to L. The status can be checked using the files URL (/api/records/{id}/draft/files).
Note that the record cannot be published until all files have been transferred (i.e. their status is completed).
Moreover, while files are being transferred, requests to the content and commit endpoints are not allowed (disabled).
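The status check described above can be sketched as a polling loop. This is a hedged illustration: `wait_for_transfers` is a hypothetical helper, and the HTTP call itself is left to the caller, who passes in a `get_listing` callable that returns the parsed JSON of `GET /api/records/{id}/draft/files`. The polling interval and timeout values are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_for_transfers(
    get_listing: Callable[[], dict],
    interval: float = 5.0,
    timeout: float = 600.0,
) -> dict:
    """Poll the files listing until no entry has status "pending".

    Returns the last listing, so the caller can inspect which entries
    ended up "completed" and which did not.
    Raises TimeoutError if entries are still pending after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        listing = get_listing()
        statuses = [entry.get("status") for entry in listing.get("entries", [])]
        if "pending" not in statuses:
            return listing
        if time.monotonic() >= deadline:
            raise TimeoutError("file transfers are still pending")
        time.sleep(interval)
```

Injecting the listing function keeps the sketch independent of any particular HTTP client and makes it easy to exercise with canned responses.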
Error handling
If fetching the file fails, the status of the file is set to failed and the error message is stored in the transfer.error field.
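As a sketch of how a client might surface these failures, the helper below collects the error message of every failed entry from a parsed files listing. `failed_transfers` is a hypothetical name, and the error text in the usage example is invented purely for illustration.

```python
def failed_transfers(listing: dict) -> dict:
    """Map file key to error message for entries whose status is "failed".

    `listing` is the parsed JSON of the draft files listing. An entry
    without a transfer.error field maps to an empty string.
    """
    return {
        entry["key"]: entry.get("transfer", {}).get("error", "")
        for entry in listing.get("entries", [])
        if entry.get("status") == "failed"
    }


# Usage with a hand-written listing (the error text is made up).
example = {
    "entries": [
        {"key": "ok.zip", "status": "completed", "transfer": {"type": "L"}},
        {"key": "bad.zip", "status": "failed", "transfer": {"type": "F", "error": "example error"}},
    ]
}
errors = failed_transfers(example)
```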
Security
By default, file fetching is refused. Files can only be fetched from a list of trusted domains, which is configured in the invenio.cfg file:
RECORDS_RESOURCES_FILES_ALLOWED_DOMAINS = [
    "example.org",
    "mystoragehosting.com",
]
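To illustrate the idea of a trusted-domains check, here is a minimal sketch that compares a URL's hostname against such a list. It is not InvenioRDM's implementation: `is_allowed` is a hypothetical function, and exact hostname matching (no subdomain or wildcard handling) is an assumption made for this example.

```python
from urllib.parse import urlsplit

# Same values as in the configuration example above.
ALLOWED_DOMAINS = ["example.org", "mystoragehosting.com"]


def is_allowed(url: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Return True if the URL's hostname exactly matches an allowed domain.

    Assumption: exact match only. The real matching rules are not
    described in this document.
    """
    host = urlsplit(url).hostname  # lowercased; None if the URL has no host
    return host is not None and host in allowed
```

Comparing the parsed hostname, rather than searching the URL string, avoids accepting look-alike URLs such as `https://example.org.evil.net/`.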
Because fetching large files from external sources can take a long time and may deplete the pool of workers, this type of file upload is restricted to trusted users. By default, only users with superuser access can add files of this type.
You can change this behavior in your invenio.cfg file:
from invenio_administration.generators import Administration
from invenio_rdm_records.services.permissions import RDMRecordPermissionPolicy
from invenio_records_resources.services.files.generators import IfTransferType
from invenio_records_resources.services.files.transfer import FETCH_TRANSFER_TYPE

class MyRepositoryPermissionPolicy(RDMRecordPermissionPolicy):
    # Extend the default policy with an administration permission
    # that applies to fetch (F) transfers.
    can_draft_create_files = RDMRecordPermissionPolicy.can_draft_transfer_files + [
        IfTransferType(FETCH_TRANSFER_TYPE, Administration())
    ]

RDM_PERMISSION_POLICY = MyRepositoryPermissionPolicy
Remote files (R)
To link to a remote file, the transfer section must contain the type and url fields, with type set to R.
Request
POST /api/records/{id}/draft/files HTTP/1.1
Content-Type: application/json
[
  {
    "key": "dataset.zip",
    "size": 1234567,
    "checksum": "md5:1234567890abcdef1234567890abcdef",
    "transfer": {
      "type": "R",
      "url": "https://mystoragehosting.org/files/dataset.zip"
    }
  },
  ...
]
Note: The size and checksum fields are optional, but they are recommended so that users can verify the integrity of the downloaded file.
There is no need to call the commit endpoint for remote files. The file is considered committed as soon as it is created.
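The optional size and checksum fields can be computed on the client before the request is sent. The sketch below assumes the `md5:<hex digest>` format shown in the request example above; `remote_entry` is a hypothetical helper, and reading the whole content into memory is a simplification that would not suit very large files.

```python
import hashlib


def remote_entry(key: str, url: str, content: bytes) -> dict:
    """Build a file entry for a remote ("R") transfer.

    The size and checksum describe `content`, which the caller asserts
    is the same data that `url` serves.
    """
    digest = hashlib.md5(content).hexdigest()
    return {
        "key": key,
        "size": len(content),
        "checksum": f"md5:{digest}",
        "transfer": {"type": "R", "url": url},
    }
```

For large files, the digest would more typically be computed by feeding the file to `hashlib.md5()` in chunks with `update()`.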
Accessing remote files
Later on, when a user tries to access the file, a 302 redirect to the url provided in the request is returned.
Request
GET /api/records/{id}/draft/files/dataset.zip/content HTTP/1.1
Response
HTTP/1.1 302 FOUND
Location: https://mystoragehosting.org/files/dataset.zip
Security
When a 302 redirect is sent to the user, they will retrieve the file directly by following the returned URL. Therefore, you must ensure:
- Network Access: The file’s URL is reachable from the user’s network.
- No Sensitive Data: The URL does not include any sensitive information (such as tokens).
By default, InvenioRDM refuses references to external files. Files can only be referenced from a “trusted domains” list, which you can configure in your invenio.cfg file:
RECORDS_RESOURCES_FILES_ALLOWED_REMOTE_DOMAINS = [
    "mystoragehosting.org",
]
Since the repository cannot guarantee a remote file’s availability or integrity, adding remote files is also restricted to trusted users. By default, only users with superuser access can add remote files.
You can change this behavior in your invenio.cfg file:
from invenio_administration.generators import Administration
from invenio_rdm_records.services.permissions import RDMRecordPermissionPolicy
from invenio_records_resources.services.files.generators import IfTransferType
from invenio_records_resources.services.files.transfer import REMOTE_TRANSFER_TYPE

class MyRepositoryPermissionPolicy(RDMRecordPermissionPolicy):
    # Extend the default policy with an administration permission
    # that applies to remote (R) transfers.
    can_draft_create_files = RDMRecordPermissionPolicy.can_draft_transfer_files + [
        IfTransferType(REMOTE_TRANSFER_TYPE, Administration())
    ]

RDM_PERMISSION_POLICY = MyRepositoryPermissionPolicy