System fields¶
Overview¶
System fields are a powerful abstraction layer in InvenioRDM that provide managed access to record data through Python descriptors. They bridge the gap between the raw JSON dictionary structure of records and object-oriented Python interfaces, enabling sophisticated data management, validation, and integration with related objects.
At its core, System fields transform simple attribute access (record.title
) into complex operations that can involve:
- Data validation and transformation
- Integration with external services and databases
- Caching and performance optimization
- Record lifecycle event handling
- Denormalization and data synchronization
Architecture and core concepts¶
Every system field inherits from SystemField
, which implements Python's descriptor protocol (__get__
, __set__
, __set_name__
). This allows fields to intercept attribute access and customize behavior:
class MyRecord(Record, SystemFieldsMixin):
title = ConstantField('metadata.title', 'Default Title')
# record.title calls ConstantField.__get__
# record.title = "New Title" calls ConstantField.__set__
Mixin and Metaclass¶
The SystemFieldsMixin
uses the SystemFieldsMeta
metaclass to automatically:
- Discover all SystemField instances on the class
- Respect inheritance (fields from parent classes are included)
- Handle lifecycle events and set field names
Record Lifecycle Integration¶
SystemFields hook into the record lifecycle through the extension system:
pre_init
/post_init
: Record creation and initializationpre_dump
/post_dump
: Serialization to JSONpre_load
/post_load
: Deserialization from JSONpre_create
/post_create
: Database creationpre_commit
/post_commit
: Database transaction commitpre_delete
/post_delete
: Record deletionpre_revert
/post_revert
: Record version reversion
When to use¶
You can explore implementing a SystemField
when you need to:
- Integrate with external systems: connect record fields to databases, APIs, or services.
- Add validation logic: implement complex validation that goes beyond JSON schema.
- Manage related objects: handle relationships between records and other entities.
- Optimize performance: implement caching, denormalization, or lazy loading.
- Enforce business rules: add custom logic that must run during record operations.
- Provide clean APIs: hide complexity behind simple attribute access.
When NOT to use¶
Avoid implementing a SystemField
when:
- Simple data storage: for basic key-value storage, use the record dictionary directly.
- Performance critical paths: the descriptor overhead may impact performance.
- External processing: When data transformation happens outside the record lifecycle.
Advantages and trade-offs¶
Advantages:
- Transparent integration: fields look like normal Python attributes.
- Lifecycle hooks: automatic integration with record operations.
- Caching built-in: automatic caching mechanisms available.
Trade-offs:
- Complexity: adds abstraction layers.
- Performance overhead: descriptor calls have overhead.
- Debugging difficulty: magic behavior can be hard to trace.
- Learning curve: requires understanding of descriptors and metaclasses.
Building your own System field¶
Basic custom field example¶
from invenio_records.api import Record
from invenio_records.systemfields import SystemField, SystemFieldsMixin
class UppercaseField(SystemField):
"""A field that automatically converts values to uppercase."""
def __get__(self, record, owner=None):
if record is None:
return self # Class-level access
# instance-level access
# Get value from record dictionary
value = self.get_dictkey(record)
return value.upper() if value else None
def __set__(self, record, value):
if value is not None:
# Store lowercase in record, display uppercase
self.set_dictkey(record, value.lower(), create_if_missing=True)
# Usage in the API class
class MyRecord(Record, SystemFieldsMixin):
title = UppercaseField('metadata.title')
record = MyRecord({'metadata': {'title': 'hello world'}})
print(MyRecord.title) # <__main__.UppercaseField object at 0x7b9a3ec49fd0>
print(record.title) # "HELLO WORLD"
record.title = "New Title"
print(record['metadata']['title']) # "new title"
Advanced field with lifecycle hooks¶
from invenio_records.api import Record
from invenio_records.systemfields import SystemField, SystemFieldsMixin
class TimestampField(SystemField):
"""A field that tracks creation and modification times."""
def __init__(self, creation_key='created', modified_key='modified'):
super().__init__()
self.creation_key = creation_key
self.modified_key = modified_key
def __get__(self, record, owner=None):
if record is None:
return self
return {
'created': record.get(self.creation_key),
'modified': record.get(self.modified_key)
}
def __set__(self, record, value):
# Don't allow direct setting
raise AttributeError("Timestamps are managed automatically")
def post_init(self, record, data, model=None, field_data=None):
"""Set creation timestamp on new records."""
from datetime import datetime
now = datetime.utcnow().isoformat()
if self.creation_key not in record:
record[self.creation_key] = now
record[self.modified_key] = now
def pre_commit(self, record):
"""Update modification timestamp before saving."""
from datetime import datetime
record[self.modified_key] = datetime.utcnow().isoformat()
# Usage
class MyRecord(Record, SystemFieldsMixin):
timestamps = TimestampField()
record = MyRecord({}) # Automatically sets created/modified
# ... later modifications will update modified timestamp
Relations¶
Relations are specialized system fields that manage connections between records and other entities (other records, database models, APIs, etc.).
Basic relation concepts¶
Relations consist of several components:
- RelationsField: the main system field that manages multiple relations.
- Relation Classes: define how to resolve and validate specific relation types.
- Result Classes: handle the returned values when accessing relations.
- Mapping Class: provides the interface for managing relations on a record.
Primary key relations (PKRelation)¶
from invenio_records.systemfields.relations import RelationsField, PKRelation
class ArticleRecord(Record, SystemFieldsMixin):
relations = RelationsField(
# Connect to other ArticleRecords by ID
parent=PKRelation(
'metadata.parent',
record_cls=ArticleRecord
),
# Connect to User records
creator=PKRelation(
'metadata.creator_id',
record_cls=User
)
)
# Usage
article = ArticleRecord({'metadata': {'creator_id': 'user123'}})
creator = article.relations.creator() # Returns User record instance
article.relations.parent = parent_article # Set relation
List relations for multiple connections¶
from invenio_records.systemfields.relations import PKListRelation
class ArticleRecord(Record, SystemFieldsMixin):
relations = RelationsField(
# Multiple authors
authors=PKListRelation(
'metadata.authors',
record_cls=User
),
# Multiple subjects/tags
subjects=PKListRelation(
'metadata.subjects',
record_cls=Subject
)
)
# Usage
article.relations.authors = [user1, user2, user3] # Set multiple
for author in article.relations.authors(): # Iterate resolved records
print(author.name)
Custom relations¶
You can create custom relation types for specific integration needs:
from invenio_records.systemfields.relations import RelationBase
class APIRelation(RelationBase):
"""Relation that fetches data from an external API."""
def __init__(self, key, api_endpoint, **kwargs):
super().__init__(key, **kwargs)
self.api_endpoint = api_endpoint
def resolve(self, id_):
"""Fetch object from external API."""
if id_ in self.cache:
return self.cache[id_]
import requests
response = requests.get(f"{self.api_endpoint}/{id_}")
if response.status_code == 200:
obj = response.json()
self.cache[id_] = obj
return obj
return None
def exists(self, id_):
"""Check if the external resource exists."""
import requests
response = requests.head(f"{self.api_endpoint}/{id_}")
return response.status_code == 200
# Usage
class ArticleRecord(Record, SystemFieldsMixin):
relations = RelationsField(
external_author=APIRelation(
'metadata.external_author_id',
api_endpoint='https://api.example.com/authors'
)
)
Performance optimization strategies¶
Caching¶
System fields provide built-in caching mechanisms to avoid repeated expensive operations:
Field-level caching¶
class ExpensiveField(SystemField):
def __get__(self, record, owner=None):
if record is None:
return self
# Check if value is already cached
cached = self._get_cache(record)
if cached is not None:
return cached
# Perform expensive operation
value = self.expensive_computation(record)
# Cache the result
self._set_cache(record, value)
return value
def expensive_computation(self, record):
# Simulate expensive operation
import time
time.sleep(1)
return f"Computed value for {record.id}"
Relation caching¶
Relations automatically cache resolved objects to avoid repeated database queries:
# First access hits the database
author = article.relations.creator()
# Subsequent accesses use cached version
same_author = article.relations.creator() # No database query
Denormalization for performance¶
Denormalization stores computed or related data directly in the record for fast access:
class DenormalizedRelation(SystemField):
"""Stores related data directly in the record."""
def dereference(self, record):
"""Copy data from related records into this record."""
creator_id = record['metadata']['creator_id']
creator = User.get_record(creator_id)
# Store creator data directly in record
record.setdefault('metadata', {})['creator'] = {
'id': creator.id,
'name': creator['name'],
'email': creator['email'],
'@v': f"{creator.id}::{creator.revision_id}" # Version for staleness detection
}
def clean(self, record):
"""Remove denormalized data, keeping only the ID."""
creator_data = record.get('metadata', {}).get('creator', {})
creator_id = creator_data.get('id')
if creator_id:
record['metadata']['creator'] = {'id': creator_id}
Handling stale data¶
When using denormalization, you need strategies to handle stale data:
from elasticsearch_dsl import Search
def reindex_stale_records():
"""Find and reindex records with stale denormalized data."""
# Find records with outdated version stamps
search = Search(index='records')
# Query for records where the version doesn't match current
for hit in search.scan():
record_data = hit.to_dict()
creator_data = record_data.get('metadata', {}).get('creator', {})
version_stamp = creator_data.get('@v', '')
if version_stamp:
creator_id, revision_id = version_stamp.split('::')
current_creator = User.get_record(creator_id)
if str(current_creator.revision_id) != revision_id:
# Record has stale data, reindex it
record = MyRecord.get_record(hit.meta.id)
record.relations.creator.dereference()
record.commit()
JSON vs Python: data format considerations¶
JSON storage format¶
System fields can store data in the record's JSON dictionary when the pre_commit
hook is correctly implemented. In such case, consider how your field's data will be serialized:
class DateField(SystemField):
"""Field that handles date objects."""
def __get__(self, record, owner=None):
if record is None:
return self
date_str = self.get_dictkey(record)
if date_str:
from datetime import datetime
return datetime.fromisoformat(date_str)
return None
def __set__(self, record, value):
if value is not None:
from datetime import datetime
if isinstance(value, datetime):
# Store as ISO string in JSON
self.set_dictkey(record, value.isoformat(), create_if_missing=True)
elif isinstance(value, str):
# Validate string format
datetime.fromisoformat(value) # Raises if invalid
self.set_dictkey(record, value, create_if_missing=True)
else:
raise ValueError("Date must be datetime object or ISO string")
Handling complex objects¶
For complex objects that don't serialize naturally to JSON:
class GeolocationField(SystemField):
"""Field for geographic coordinates."""
class GeoPoint:
def __init__(self, lat, lon):
self.lat = lat
self.lon = lon
def to_dict(self):
return {'lat': self.lat, 'lon': self.lon}
@classmethod
def from_dict(cls, data):
return cls(data['lat'], data['lon'])
def __get__(self, record, owner=None):
if record is None:
return self
data = self.get_dictkey(record)
if data:
return self.GeoPoint.from_dict(data)
return None
def __set__(self, record, value):
if value is not None:
if isinstance(value, self.GeoPoint):
self.set_dictkey(record, value.to_dict(), create_if_missing=True)
else:
raise ValueError("Value must be GeoPoint instance")
Idempotence and data consistency¶
System fields are designed to be idempotent - running the same operation multiple times should produce the same result:
class IdempotentComputedField(SystemField):
"""Field that computes values idempotently."""
def __get__(self, record, owner=None):
if record is None:
return self
# Check if value is already computed and valid
computed_data = record.get('_computed', {})
if self.attr_name in computed_data:
stored_hash = computed_data[self.attr_name]['hash']
current_hash = self._compute_input_hash(record)
if stored_hash == current_hash:
# Data hasn't changed, return cached result
return computed_data[self.attr_name]['value']
# Compute new value
value = self._compute_value(record)
# Store with hash for future idempotence checks
record.setdefault('_computed', {})[self.attr_name] = {
'value': value,
'hash': self._compute_input_hash(record)
}
return value
def _compute_input_hash(self, record):
"""Hash the input data used for computation."""
# Hash relevant fields that affect the computation
relevant_data = {
'title': record.get('metadata', {}).get('title'),
'authors': record.get('metadata', {}).get('authors', [])
}
data_str = json.dumps(relevant_data, sort_keys=True)
return hashlib.md5(data_str.encode()).hexdigest()
def _compute_value(self, record):
"""Perform the actual computation."""
# Example: compute a search-friendly version of the title
title = record.get('metadata', {}).get('title', '')
return title.lower().replace(' ', '_')
Record lifecycle hooks deep dive¶
Understanding when each lifecycle hook is called is crucial for proper system field implementation:
Initialization Hooks¶
class InitializationField(SystemField):
def pre_init(self, record, data, model=None, **kwargs):
"""Called before record.__init__."""
print(f"Pre-init: {data}")
def post_init(self, record, data, model=None, field_data=None, **kwargs):
"""Called after record.__init__."""
print(f"Post-init: record={record}, field_data={field_data}")
# field_data contains any data passed for this specific field
if field_data is not None:
self.__set__(record, field_data)
Serialization hooks¶
class SerializationField(SystemField):
def pre_dump(self, record, data, dumper=None):
"""Called before record is serialized to JSON."""
# Add computed fields to the output
data['computed_title'] = record.get('title', '').upper()
def post_dump(self, record, data, dumper=None):
"""Called after record is serialized to JSON."""
# Clean up sensitive data from output
data.pop('internal_notes', None)
def pre_load(self, data, loader=None):
"""Called before data is loaded into a record."""
# Normalize incoming data
if 'title' in data:
data['title'] = data['title'].strip()
def post_load(self, record, data, loader=None):
"""Called after data is loaded into a record."""
# Validate loaded data
if not record.get('title'):
raise ValueError("Title is required")
Database hooks¶
class CustomModelField(SystemField):
def pre_create(self, record):
"""Called before record is inserted into database."""
# Set creation metadata
record['created_by'] = current_user.id
def post_create(self, record):
"""Called after record is inserted into database."""
# Trigger external notifications
self.notify_creation(record)
def pre_commit(self, record):
"""Called before database transaction is committed."""
# Final validations
self.validate_before_save(record)
# Clean denormalized data
self.clean_relations(record)
def post_commit(self, record):
"""Called after database transaction is committed."""
# Index in search engine
self.index_record(record)
# Send notifications
self.send_notifications(record)
def pre_delete(self, record):
"""Called before record is deleted."""
# Clean up related data
self.cleanup_relations(record)
def post_delete(self, record):
"""Called after record is deleted."""
# Remove from search index
self.remove_from_index(record)
Advanced patterns¶
Field composition¶
Combine multiple fields for complex behavior:
class CompositeMetadataField(SystemField):
"""Field that manages multiple related metadata fields."""
def __init__(self):
super().__init__()
self.title_field = UppercaseField('metadata.title')
self.date_field = DateField('metadata.date')
self.tags_field = PKListRelation('metadata.tags', record_cls=Tag)
def __get__(self, record, owner=None):
if record is None:
return self
return {
'title': self.title_field.__get__(record, owner),
'date': self.date_field.__get__(record, owner),
'tags': self.tags_field.get_value(record)
}
def __set__(self, record, value):
if isinstance(value, dict):
if 'title' in value:
self.title_field.__set__(record, value['title'])
if 'date' in value:
self.date_field.__set__(record, value['date'])
# Handle tags through relations field...
Validation integration¶
Integrate system fields with record validation:
class ValidatedField(SystemField):
"""Field with built-in validation."""
def __init__(self, key, validators=None):
super().__init__(key)
self.validators = validators or []
def __set__(self, record, value):
# Run validation before setting
for validator in self.validators:
validator(value)
self.set_dictkey(record, value, create_if_missing=True)
def pre_commit(self, record):
"""Validate before database commit."""
value = self.get_dictkey(record)
if value is not None:
for validator in self.validators:
validator(value)
# Usage with custom validators
def validate_email(value):
if not re.match(r'^[^@]+@[^@]+\.[^@]+$', value):
raise ValueError("Invalid email address")
class UserRecord(Record, SystemFieldsMixin):
email = ValidatedField('metadata.email', validators=[validate_email])
Error handling¶
Implement robust error handling in your fields:
import logging
class RobustField(SystemField):
"""Field with comprehensive error handling."""
def __get__(self, record, owner=None):
if record is None:
return self
try:
return self.get_dictkey(record)
except KeyError:
# Log the error but don't crash
logging.warning(f"Field {self.attr_name} not found in record {record.id}")
return None
except Exception as e:
# Handle unexpected errors gracefully
import logging
logging.error(f"Error accessing field {self.attr_name}: {e}")
return None
def __set__(self, record, value):
try:
self.set_dictkey(record, value, create_if_missing=True)
except Exception as e:
logging.error(f"Error setting field {self.attr_name}: {e}")
raise ValueError(f"Could not set {self.attr_name}: {e}")
Conclusion¶
System fields provide a powerful abstraction for managing complex record behavior in InvenioRDM. They enable:
- Clean separation between data storage and business logic
- Automatic integration with record lifecycle events
- Performance optimization through caching and denormalization
- Type safety and validation
- Integration with external systems and services
When designing system fields, consider:
- The JSON serialization format
- Performance implications of your operations
- Idempotence and data consistency
- Error handling and debugging
- Testing strategies
By following these patterns and best practices, you can create robust, maintainable system fields that enhance InvenioRDM's capabilities while maintaining clean, readable code.