Example schema and dataset files

Address book of the CEOs of technology companies

The following illustrates a schema for an address book of people, the companies that they work for, and their addresses, and an example address book of the CEOs of several technology companies.

Address books are represented using three classes to represent people (Person), the companies that they work for (Company), and their addresses (Address).

The many-to-one relationship between companies and their employees (Person.company) and the one-to-one relationships between company and people and their addresses (Company.address, Person.address) are represented by ManyToOne and OneToOne attributes. The required argument related_name defines the name of the reverse relationship from the related class (Company.employees, Address.company, Address.person). This makes each relationship bidirectional.

Address books are encoded into two worksheets for companies and their employees, with groups of columns that represent the address of each company and employee. Encoding companies and their addresses and people and their addresses into the same worksheet, rather than encoding addresses into a separate worksheet, makes address books easier for human to read.

The ObjTables software provides an object-relational mapping system (ORM) that can automatically map between the object-oriented and spreadsheet representations. This makes it easy for authors to create datasets with high-level data structures and export them to human-readable spreadsheets, and this makes it easy for other investigators to parse human-readable spreadsheets into convenient data structures for comparison and re-analysis. Similar to other ORMs, ObjTables makes it easy to define constraints on the value of each attribute (e.g., an attribute that represents molecular weights must be positive), ObjTables makes it easy to implement custom validations of each object and entire datasets, and ObjTables automatically generates APIs for constructing objects and querying their relationships. See below for more information.

Tabular-formatted schema

A more complete example is available in TSV format from GitHub .

Worksheet: !!_Schema
!!ObjTables type='Schema' tableFormat='row' description='Table/model and column/attribute definitions' date='2020-03-10 21:34:50' objTablesVersion='0.0.8'
!Name!Type!Parent!Format!Verbose name!Verbose name plural!Description
 
CompanyClassrowCompanyCompanies
nameAttributeCompanyString(primary=True, unique=True)Name
urlAttributeCompanyUrlURL
addressAttributeCompanyOneToOne('Address', related_name='company')Address
 
PersonClassrowPersonPeople
nameAttributePersonString(primary=True, unique=True)Name
typeAttributePersonEnum(['family', 'friend', 'business'])Type
companyAttributePersonManyToOne('Company', related_name='employees')Company
email_addressAttributePersonEmailEmail address
phone_numberAttributePersonStringPhone number
addressAttributePersonOneToOne('Address', related_name='person')Address
 
AddressClassmultiple_cellsAddressAddresses
streetAttributeAddressString(primary=True, unique=True)Street
cityAttributeAddressStringCity
stateAttributeAddressStringState
zip_codeAttributeAddressStringZip code
countryAttributeAddressStringCountry

Schema defined as a Python module

A more complete example is available in Python format from GitHub .

import obj_tables

class Address(obj_tables.Model):
    street = obj_tables.StringAttribute(
        primary=True,
        unique=True,
        verbose_name='Street')
    city = obj_tables.StringAttribute(
        verbose_name='City')
    state = obj_tables.StringAttribute(
        verbose_name='State')
    zip_code = obj_tables.StringAttribute(
        verbose_name='Zip code')
    country = obj_tables.StringAttribute(
        verbose_name='Country')

    class Meta(obj_tables.Model.Meta):
        table_format = \
            obj_tables.TableFormat \
                .multiple_cells
        attribute_order = (
            'street',
            'city',
            'state',
            'zip_code',
            'country',
        )
        verbose_name = 'Address'
        verbose_name_plural = 'Addresses'

class PersonType(enum.Enum):
    family = 'family'
    friend = 'friend'
    business = 'business'

class Person(obj_tables.Model):
    name = obj_tables.StringAttribute(
        primary=True,
        unique=True,
        verbose_name='Name')
    type = obj_tables.EnumAttribute(
        PersonType,
        verbose_name='Type')
    company = obj_tables.ManyToOneAttribute(
        'Company',
        related_name='employees',
        verbose_name='Company')
    email_address = obj_tables.EmailAttribute(
        verbose_name='Email address')
    phone_number = obj_tables.StringAttribute(
        verbose_name='Phone number')
    address = obj_tables.OneToOneAttribute(
        'Address',
        related_name='person',
        verbose_name='Address')

    class Meta(obj_tables.Model.Meta):
        table_format = \
            obj_tables.TableFormat.row
        attribute_order = (
            'name',
            'type',
            'company',
            'email_address',
            'phone_number',
            'address',
        )
        verbose_name = 'Person'
        verbose_name_plural = 'People'


class Company(obj_tables.Model):
    name = obj_tables.StringAttribute(
        primary=True,
        unique=True,
        verbose_name='Name')
    url = obj_tables.UrlAttribute(
        verbose_name='URL')
    address = obj_tables.OneToOneAttribute(
        'Address',
        related_name='company',
        verbose_name='Address')

    class Meta(obj_tables.Model.Meta):
        table_format = \
            obj_tables.TableFormat.row
        attribute_order = (
            'name',
            'url',
            'address',
        )
        verbose_name = 'Company'
        verbose_name_plural = 'Companies'

Tabular-formatted dataset

The following example illustrates an address book of the CEOs of several major technology companies, the companies that they lead, and their addresses. A more complete example is available in merged-TSV format from GitHub .

Worksheet: !!_Table of contents
!!!ObjTables objTablesVersion='0.0.8' date='2020-03-14 13:19:04'
!!ObjTables type='TableOfContents'
!Table !Description !Number of objects
Schema Table/model and column/attribute definitions
Companies 4
People 4

Worksheet: !!Company
!!ObjTables type='Data' tableFormat='row' class='Company' name='Companies' date='2020-03-14 13:19:04' objTablesVersion='0.0.8'
!Address
!Name!URL!Street!City!State!Zip code!Country
Apple https://www.apple.com/ 10600 N Tantau Ave Cupertino CA 95014 US
Facebook https://www.facebook.com/ 1 Hacker Way #15 Menlo Park CA 94025 US
Google https://www.google.com/ 1600 Amphitheatre Pkwy Mountain View CA 94043 US
Netflix https://www.netflix.com/ 100 Winchester Cir Los Gatos CA 95032 US

Worksheet: !!People
!!ObjTables type='Data' tableFormat='row' class='Person' name='People' date='2020-03-14 13:19:04' objTablesVersion='0.0.8'
!Address
!Name!Type!Company!Email address!Phone number!Street!City!State!Zip code!Country
Mark ZuckerbergfamilyFacebookzuck@fb.com650-543-48001 Hacker Way #15Menlo ParkCA94025US
Reed HastingsbusinessNetflixreed.hastings@netflix.com408-540-3700100 Winchester CirLos GatosCA95032US
Sundar PichaibusinessGooglesundar@google.com650-253-00001600 Amphitheatre PkwyMountain ViewCA94043US
Tim CookbusinessAppletcook@apple.com408-996-101010600 N Tantau AveCupertinoCA95014US

Download additional examples

Below are several examples of ObjTables schemas and datasets in all of the supported formats.

Description Schema UML diagram Dataset Schema & dataset
Address book (more complete example with a transposed table) Schema: tsv, xlsx, py UML diagram: svg Dataset: tsv, multi.tsv, xlsx, json, yml Schema & dataset: multi.tsv, xlsx
Finance: Financial transactions Schema: tsv, xlsx, py UML diagram: svg Dataset: tsv, multi.tsv, xlsx, json, yml Schema & dataset: multi.tsv, xlsx
Genomics: Genes and transcript variants Schema: tsv, xlsx, py UML diagram: svg Dataset: tsv, multi.tsv, xlsx, json, yml Schema & dataset: multi.tsv, xlsx
Models of E. coli metabolism discussed in the supplement to the ObjTables paper
Khodayari & Maranas kinetic model Schema: tsv, xlsx, py UML diagram: svg Dataset: tsv, multi.tsv, xlsx, json, yml Schema & dataset: multi.tsv, xlsx
Gerosa thermodynamic model Schema: tsv, xlsx, py UML diagram: svg Dataset: tsv, multi.tsv, xlsx, json, yml Schema & dataset: multi.tsv, xlsx
Code for merging the models
Systems biology: Network of metabolites and biochemical reactions Schema: tsv, xlsx, py UML diagram: svg Dataset: tsv, multi.tsv, xlsx, json, yml Schema & dataset: multi.tsv, xlsx
Systems biology: SBtab – systems biology data and kinetic models Schema: tsv, xlsx, py UML diagram: svg
Hynne model of yeast glycolysis Dataset: tsv, multi.tsv, xlsx, json, yml
Jiang et al. model of pancreatic beta-cell insulin secretion Dataset: tsv, multi.tsv, xlsx, json, yml
Related data Dataset: tsv, multi.tsv, xlsx, json, yml
Noor et al. model of Escherichia coli metabolism Dataset: tsv, multi.tsv, xlsx, json, yml
Related data Dataset: tsv, multi.tsv, xlsx, json, yml
Sigurdsson et al. model of mouse metabolism Dataset: tsv, multi.tsv, xlsx, json, yml
Teusink model of yeast glycolysis Dataset: tsv, multi.tsv, xlsx, json, yml
Related data Dataset: tsv, multi.tsv, xlsx, json, yml
Wortel et al. model of Escherichia coli metabolism Dataset: tsv, multi.tsv, xlsx, json, yml
Related data Dataset: tsv, multi.tsv, xlsx, json, yml
Standard free energies of reactions calculated by eQuilibrator Dataset: tsv, multi.tsv, xlsx, json, yml
Yeast transcriptional regulatory network inferred by Chang et al. Dataset: tsv, multi.tsv, xlsx, json, yml
Systems biology: WC-KB: format for datasets for eukaryotic whole-cell models Schema: py UML diagram: svg
Example knowledge base Dataset: xlsx
Systems biology: WC-Lang: language for composite, hybrid whole-cell models Schema: py UML diagram: svg
Example model Dataset: xlsx

Formats for schemas

ObjTables schemas can either be defined using the tabular format described below or using the ObjTables Python package. The tabular format is easy to use, and requires no programming. The tabular format supports most of the features of ObjTables, including a wide range of data types, relationships, transposed and embedded tables, single inheritance, and basic validation.

The ObjTables Python package is more flexible. In addition to the features of the tabular schema format, the Python package supports abstract classes and methods, custom data types, custom validation, custom XSLX formatting, and more. The Python package can also be used to encapsulate methods for manipulating data inside schemas.

We recommend that developers begin with the tabular format, and transition to the Python format when more capabilities are needed. When more capabilities are required, we recommend that developers use the ObjTables software tools to convert their initial tabular-formatted schema into a Python module. This workflow provides a quick path to developing a custom schema.

Tabular format for schemas

Tabular-formatted schemas should begin with a single header row, which indicates that the schema is encoded in ObjTables format (!!ObjTables type="Schema" ...).

Optionally, the header can use the name attribute to define the name of the schema (e.g., !!ObjTables type="Schema" name="<schema-name>" ...). If worksheets also define the schema that they belong to, ObjTables will check that these values are equal.

After the header row, the schema should contain a table with the columns below that defines the classes/models/tables and their attributes/columns. Each row in the table should define a single class or attribute.

  • !Name: Name of the component (class or attribute).
    • Classes: A string that begins with a letter or underscore, and consists of letters, numbers, and underscores.
    • Attributes: A string that begins with a letter or underscore, and consists of letters, numbers, underscores, colons, forward carets, dots, dashes, square brackets, and spaces.
  • !Type: Type of the component (Class or Attribute).
  • !Parent:
    • Classes: Empty, or the name of the parent class. Because this is implemented using Python class inheritance, this must specify an acyclic inheritance graph.
    • Attributes: Name of the class that the attribute belongs to.
  • !Format:
    • Classes:
      • row: Encode the instances of the class as rows.
      • column: Encode the instances of the class as columns (i.e., transposed table).
      • multiple_cells: Encode the instances of the class within a group of columns in the tables of the one-to-many and one-to-one related classes.
      • cell: Encode the instances of the class within columns of the related classes, optionally, using a grammar. See the Python documentation for more information about working with grammars.
    • Attributes: One of the data types listed below (e.g., String, Float). Arguments for the data types should be described in parentheses (e.g., String(min_length=5). These arguments enable users to customize how data types function and are validated. See the Python documentation for more information about the attribute types and their arguments.
  • !Verbose name (Optional): Verbose name of the component.
  • !Verbose name plural (Optional): Plural verbose name of the component.
  • !Description (Optional): Description of the component.

Tabular-formatted schemas can be saved in comma-separated (.csv), tab-separated (.tsv), or XSLX (.xlsx) format.

Implementing schemas in Python using the Python package

Schemas can also be implemented using the ObjTables Python package. The API enables developers to define schemas with a similar syntax to object-relational mapping tools such as Django , Ruby on Rails , and SQLAlchemy .

  • The name of the schema is given by the name of the Python module which encapsulates the schema. Typically, this the filename of the module.
  • Each class should be implemented as a subclass of obj_tables.Model.
  • Each attribute of each class should be implemented as a class attribute whose value is an instance of a subclass of obj_tables.Attribute. Numerous subclasses of obj_tables.Attribute, which support a wide range of data types, are available. Please see below and the Python documentation for more information, including the arguments of these classes.
    • Each relationship between two classes should be implemented using the obj_tables.OneToOneAttribute, obj_tables.OneToManyAttribute, obj_tables.ManyToOneAttribute, or obj_tables.ManyToManyAttribute classes.
  • The meta-information about each class, such as its verbose name, should be implemented as a class attribute with the name Meta that is a subclass of obj_tables.Model.Meta. This class can have the following class attributes:
    • unique_together: Tuple of tuples of the names of attributes whose values must be unique.
    • verbose_name: Verbose name of the class.
    • verbose_name_plural: Plural verbose name of the class.
    • description: Description of the class.
    • table_format: Indicates how the class should be formatted: obj_tables.TableFormat.row (normal table), obj_tables.TableFormat.column (transposed table), obj_tables.TableFormat.multiple_cells (inline as a set of columns/rows within the tables of related classes), or obj_tables.TableFormat.cell (inline encoded into a column or row of the tables of the related classes).
    • attribute_order: List of the names of the attributes of the class in the order they should be printed in its table.
    • ordering: Defines the default sorting for the instances of the class. This should be a list of strings of the names of attributes, in the order in which the values of the attributes should be sorted. Optionally, the prefix - can be used to indicate that instances should be sorted in reverse order of the values of the attribute.
    • frozen_columns: Number of columns to freeze when the class is printed to an XLSX worksheet.
    • merge: Indicates how semantically-equivalent instances of the class should be merged when two datasets are merged: obj_tables.ModelMerge.join (concatenate their *-to-many relationships) or obj_tables.ModelMerge.append (throw an error if datasets contain semantically equivalent instances).

Please see the Python documentation for more information.

Encoding multiple classes into a single table

To encode complex datasets into a minimal number of tables, related classes can be encoded into a column or a series of columns of the table that represents their (row-formatted) parent classes. (For column-formatted parent classes, related classes can be encoded into a row or a series of rows.)

  • Classes related by *-to-one attributes:
    • For row-formatted classes, classes related by *-to-one attributes can be encoded into a series of consecutive columns by setting the formats of the related classes to multiple_cells in the schema. When this feature is used, tables must include an additional row of column headings which indicate the groups of columns.
    • For column-formatted classes, classes related by *-to-one attributes can be encoded into a series of consecutive rows by setting the formats of the related classes to multiple_cells in the schema. When this feature is used, tables must include an additional column of row headings which indicate the groups of columns.
  • Classes related by a ToManyGrammar attribute:
    • For row-formatted classes, such related classes can be encoded into a column by setting the formats of the related classes to cell in the schema.
    • For column-formatted classes, such related classes can be encoded into a row by setting the formats of the related classes to cell in the schema.

Data types

ObjTables supports numerous data types, and it is easy to implement additional types.

This includes powerful data types from several mathematics and science packages:

  • BioPython : sequence informatics
  • lark : grammars for describing complex relationships
  • numpy : numeric arrays
  • pandas : data tables
  • pint : units
  • sympy : symbolic math
  • uncertainties : values and their uncertainty

The following table lists the attribute types currently available in ObjTables. The first column indicates the names that should be used in conjunction with the tabular schema format. The second column indicates the Python class which implements each type, and which should be used to define schemas with the Python package. The third column indicates the data type used to encode each attribute into an spreadsheet cell. The fourth column indicates the Python data type used to represent each attribute in the ObjTables software tools.

Format Python class XLSX type Python type
Fundamental data types
Boolean Python class: obj_tables.BooleanAttribute XLSX type: Boolean Python type: bool
Boolean
Enum Python class: obj_tables.EnumAttribute XLSX type: String Python type: enum.Enum
Enumeration
Integer Python class: obj_tables.IntegerAttribute XLSX type: Number Python type: int
Integer
PositiveInteger Python class: obj_tables.PositiveIntegerAttribute XLSX type: Number Python type: int
Positive integer
Float Python class: obj_tables.FloatAttribute XLSX type: Number Python type: float
Float
PositiveFloat Python class: obj_tables.PositiveFloatAttribute XLSX type: Number Python type: float
Positive float
String Python class: obj_tables.StringAttribute XLSX type: String Python type: str
Short string
LongString Python class: obj_tables.LongStringAttribute XLSX type: String Python type: str
Long string
Regex Python class: obj_tables.RegexAttribute XLSX type: String Python type: str
String that matches a regular expression
Email Python class: obj_tables.EmailAttribute XLSX type: String Python type: str
Email address
LocalPath Python class: obj_tables.LocalPathAttribute XLSX type: String Python type: pathlib.Path
Path to a local file or directory
Url Python class: obj_tables.UrlAttribute XLSX type: String Python type: str
Uniform resource locater (URL)
Date Python class: obj_tables.DateAttribute XLSX type: String Python type: datetime.date
Date
DateTime Python class: obj_tables.DateTimeAttribute XLSX type: String Python type: datetime.datetime
Date and time
Time Python class: obj_tables.TimeAttribute XLSX type: String Python type: datetime.time
Time
Range Python class: obj_tables.RangeAttribute XLSX type: String Python type: obj_tables.Range
Range of values (e.g., 10-100)
List Python class: obj_tables.ListAttribute XLSX type: String Python type: list
List of values
Relationships
OneToOne Python class: obj_tables.OneToOneAttribute XLSX type: String Python type: obj_tables.Model
One-to-one relationship
OneToMany Python class: obj_tables.OneToManyAttribute XLSX type: String Python type: list of obj_tables.Model
One-to-many relationship
ManyToOne Python class: obj_tables.ManyToOneAttribute XLSX type: String Python type: obj_tables.Model
Many-to-one relationship
ManyToMany Python class: obj_tables.ManyToManyAttribute XLSX type: String Python type: list of obj_tables.Model
Many-to-many relationship
Grammars for domain-specific languages for encoding data and relationships into an individual cell in a table
grammar.ToManyGrammar Python class: obj_tables.grammar.ToManyGrammarAttribute XLSX type: String Python type: list of obj_tables.Model
*-to-many relationship serialized/deserialized using a custom grammar.
Mathematics
math.Array Python class: obj_tables.math.ArrayAttribute XLSX type: String Python type: numpy.ndarray
Numerical array or matrix
math.Table Python class: obj_tables.math.TableAttribute XLSX type: String Python type: pandas.DataFrame
Data table, optionally with row and column labels
math.ManyToOneExpression Python class: obj_tables.math.ManyToOneExpressionAttribute XLSX type: String Python type: obj_tables.math.expression.Expression
Numerical array or matrix
math.OneToOneExpression Python class: obj_tables.math.OneToOneExpressionAttribute XLSX type: String Python type: obj_tables.math.expression.Expression
Numerical array or matrix
math.SymbolicExpr Python class: obj_tables.math.SymbolicExprAttribute XLSX type: String Python type: sympy.Expr
Symbolic expression
math.SymbolicSymbol Python class: obj_tables.math.SymbolicSymbolAttribute XLSX type: String Python type: sympy.Symbol
Symbolic symbol
Science
sci.Unit Python class: obj_tables.sci.UnitAttribute XLSX type: String Python type: pint.unit._Unit
Units
sci.Quantity Python class: obj_tables.sci.QuantityAttribute XLSX type: String Python type: pint.quantity._Quantity
Magnitude and units
sci.UncertainFloat Python class: obj_tables.sci.UncertainFloatAttribute XLSX type: String Python type: uncertainties.core.Variable
Float and its uncertainty
sci.Doi Python class: obj_tables.sci.DoiAttribute XLSX type: String Python type: str
Digital Object Identifier
sci.Dois Python class: obj_tables.sci.DoisAttribute XLSX type: String Python type: list of str
List of Digital Object Identifiers
sci.Identifier Python class: obj_tables.sci.IdentifierAttribute XLSX type: String Python type: obj_tables.sci.Identifer
An identifier in a namespace registered with Identifiers.org
sci.Identifiers Python class: obj_tables.sci.IdentifiersAttribute XLSX type: String Python type: list of obj_tables.sci.Identifer
List of identifiers in namespaces registered with Identifiers.org
sci.OntoTerm Python class: obj_tables.sci.OntoTermAttribute XLSX type: String Python type: pronto.Term
Term in an ontology
sci.PubMedId Python class: obj_tables.sci.PubMedIdAttribute XLSX type: Number Python type: int
PubMed identifier
sci.PubMedIds Python class: obj_tables.sci.PubMedIdsAttribute XLSX type: String Python type: list of int
List of PubMed identifiers
Chemistry
chem.ChemicalStructure Python class: obj_tables.chem.ChemicalStructureAttribute XLSX type: String Python type: wc_utils.utils.chem.Structure
Chemical structure (openbabel.OBMol , bpforms.BpForm , bcforms.BcForm )
chem.ChemicalFormula Python class: obj_tables.chem.ChemicalFormulaAttribute XLSX type: String Python type: wc_utils.utils.chem.EmpiricalFormula
Chemical formula
chem.ReactionEquation Python class: obj_tables.chem.ReactionEquationAttribute XLSX type: String Python type: obj_tables.chem.ReactionEquation
Reaction equation
Biology
bio.DnaSeq Python class: obj_tables.bio.DnaSeqAttribute XLSX type: String Python type: Bio.Seq.Seq
DNA sequence
bio.ProteinSeq Python class: obj_tables.bio.ProteinSeqAttribute XLSX type: String Python type: Bio.Seq.Seq
Protein sequence
bio.RnaSeq Python class: obj_tables.bio.RnaSeqAttribute XLSX type: String Python type: Bio.Seq.Seq
RNA sequence
bio.Seq Python class: obj_tables.bio.SeqAttribute XLSX type: String Python type: Bio.Seq.Seq
Sequence
bio.FeatureLoc Python class: obj_tables.bio.FeatureLocAttribute XLSX type: String Python type: Bio.SeqFeature.FeatureLocation
Location of a sequence feature
bio.FreqPosMatrix Python class: obj_tables.bio.FreqPosMatrixAttribute XLSX type: String Python type: Bio.motif.matrix.FrequencyPositionMatrix
Frequency position matrix

Required arguments

A few of the attributes have required arguments. Please see the Python documentation for more information.

  • Enum attribute: The first argument must be a subclass of enum.Enum or a list of the enumerated values. For example, an enumerated attribute with two potential values 'A' and 'B' can be specified in the tabular schema format using the syntax Enum(['A', 'B']).
  • *To* attributes: The first argument must be a class or the name of a class. The constructor must also receive a keyword argument with the key related_name whose value is a string that represents the name of a virtual property that should be added to the related class to represent pointers from the related objects back to the primary objects. For example, an attribute with the name parents which represents a many-to-many relationship from a class Child that represents children to a class Parent that represents their parents and contains a virtual attribute children that represents the inverse many-to-many relationship can be specified in the tabular schema format using the syntax ManyToMany('Parent', related_name='children').
  • ToManyGrammar attributes: The first argument must be a class or the name of a class. The constructor must also receive a keyword argument with the key grammar whose argument is a string that defines the grammar or is the path to a file that defines the grammar. The grammar must be defined using Lark's EBNF syntax . In addition, Python developers can customize how data encoded into a grammar is transformed into class instances and attributes by (1) creating a subclass of obj_tables.ToManyGrammar and (2) setting its Transformer attribute to a subclass of obj_tables.ToManyGrammarTransformer. Please see the Python documentation for more information.

Optional arguments

Most of the attributes also have optional arguments that can be used to control how values of the attribute are validated. For example, the Integer and Float attributes have optional min and max arguments which can be used to indicate the minimum and maximum valid values. The String and LongString attributes have optional min_length and max_length attributes which can be used to indicate the minimum and maximum valid lengths of each value.

Specifying required and optional attributes

The none argument controls whether an attribute is required (its value must be defined) or optional (its value can be undefined – null in XLSX, null in JSON and YAML, None in Python):

  • none=False means that the value of an attribute must be defined. This makes an attribute required.
  • none=True means that the value of an attribute can be undefined. This makes an attribute optional.

Specifying that an attribute can function as a foreign key for encoding relationships with instances of other classes into tables

The unique and primary arguments can be used to configure attributes to function as primary keys for encoding relationships between instances of classes into tables:

  • unique: Set this argument to True to indicate that ObjTables should constrain the values of the attribute to be unique across all instances of its parent class.
  • primary: Set this argument to True to indicate that the attribute is the primary key for its parent class. Each class can only have one primary attribute, and primary attributes should also have unique = True.

More information about the arguments

Please see the Python documentation for detailed information about these required and optional arguments.

File formats for datasets

ObjTables can encode and decode datasets into and from the file formats outlined below. We recommend using programs such as Microsoft Excel or LibreOffice Calc for viewing and editing datasets. We recommend using CSV or TSV to store datasets because these formats are conducive to revisioning with version control systems such as Git. We recommend using JSON for working with ObjTables in other programming languages beyond Python.

  • Tabular formats: Collections of tables in comma-separated (CSV), tab-separated (TSV), or XLSX format.
    • Set of CSV or TSV files (.csv, .tsv)
      • Each tabular file represents the instances of a class (and possibly related instances of other classes).
      • The names of the files should follow the pattern *.{csv,tsv} (e.g., Name.csv).
      • As described below, the table in each file should be proceeded by a line that begins with !!ObjTables id="TableName" ... that declares that ObjTables should process the table in the file. This declaration can also contain metadata about the following table.
      • ObjTables ignores files that do not contain this declaration. This can be used to store additional metadata alongside a dataset.
      • Sets of CSV and TSV files can be uploaded to the web application as a zip archive.
      • ObjTables uses the Excel dialect of the CSV and TSV formats .
    • Single text file which contains multiple CSV or TSV-formatted tables (.multi.csv, .multi.tsv)
      • This format is similar to a set of CSV or TSV files, except that the tables are concatenated into a single file.
      • The beginning of each table (and the end of the previous table) is indicated by the !!ObjTables id="TableName" ... declarations.
    • XLSX workbook (.xlsx)
      • Open standard: ECMA-376 , ISO/IEC 29500 .
      • This format is also similar to a set of CSV or TSV files, except that each table is contained in a worksheet in a workbook rather than in a separate text file.
      • The title of each worksheet should be !! followed by the name of the class in the schema (e.g., !!Name) encoded into the worksheet.
      • Optionally, workbooks can include a table of contents worksheet that summarizes and provides hyperlinks to the data tables. This should have the title !!_Table of contents.
      • Optionally, workbooks can include a worksheet that describes the schema of the dataset. This should have the title !!_Schema.
      • The ObjTables software tools can generate these optional table of contents and schema worksheets.
      • Workbooks can contain additional worksheets whose names do not begin with !!. These worksheets will be ignored by ObjTables. This can be used to store additional metadata alongside a dataset.
  • Data serialization formats: Serialization of ObjTables' internal data structures. See below for more information about using these formats in other languages.
    • JavaScript Object Notation (.json)
    • YAML Ain't Markup Language (.yml)

Tabular format for datasets

Dataset/document declaration: declaring that a collection of tables is encoded in ObjTables' tabular format

Each collection of tables must declare that the tables are encoded in ObjTables's tabular format by including the line !!!ObjTables ... before one of the tables.

Dataset/document-level metadata

Optionally, document declarations (!!!ObjTables ...) can capture key-value pairs of document-level metadata. Each key and value should be a string. At a minimum, we recommend using the keys below. The ObjTables software automatically generates these keys when datasets are exported to files.

  • id: Use this key to provide a unique identifier for the dataset.
  • schema: Use this key to annotate the name of the schema of the dataset. If this key is set, ObjTables will check that its value matches the name of the schema used to interpret the dataset. The name of tabular-formatted schemas can be declared using the name meta attribute (e.g., !!ObjTables type="Schema" name="<schema-name>" ...). The name of Python-defined schemas is the name of the module which implements the schema, which is typically the filename of the module.
  • date: Use this key to annotate the date that the dataset was created or updated.
  • objTablesVersion: Use this key to indicate the version of the ObjTables tabular format used to encode the dataset.

Class/table declaration: declaring the class represented by each table

Each table must declare the class that it represents by including a line that begins with !!ObjTables type='Data' class='<class_name>' ...). The value of class should be the name of a class in the schema.

Class/table-level metadata

Optionally, table declarations can capture key-value pairs of class/table-level metadata. Each key and value should be a string. At a minimum, we recommend using the keys below. The ObjTables software automatically generates these keys when datasets are exported to files.

  • document: Use this key to annotate the id of the parent document.
  • schema: Use this key to annotate the name of the schema of the table. If this key is set, ObjTables will check that its value matches the name of the schema used to interpret the table. The name of tabular-formatted schemas can be declared using the name meta attribute (e.g., !!ObjTables type="Schema" name="<schema-name>" ...). The name of Python-defined schemas is the name of the module which implements the schema, which is typically the filename of the module.
  • tableFormat: Use this key to indicate the format (row, column, multiple_cells, or cell) used to encode the class into the table. See above for more information about these formats. If this key is set, ObjTables will check that its value matches the format for the class defined in the schema.
  • name: Use this key to briefly describe the class.
  • description: Use this key to capture an extended description of the class.
  • date: Use this key to annotate the date that the table was created or updated.
  • objTablesVersion: Use this key to indicate the version of the ObjTables tabular format used to encode the class.

Class attribute declaration: declaring the attribute represented by each entry of each table

  • row-formatted classes: After the class/table declaration, the cells in the first row of the table should declare the attributes represented by each column. Each cell should begin with ! followed by the name of the attribute as defined in the schema (e.g., !Id).
  • column-formatted classes: Similarly, after the class declaration, the cells in the first column of the table should declare the attributes represented by each row.
  • multiple_cells-formatted classes: After the declaration of the table of the parent class, the cells in the second row of the table should declare the attributes represented by each range of columns/rows.

Class instances

Class instances can be encoded into tables as follows:

  • row-formatted classes: Each instance is encoded into a row of the table.
  • column-formatted classes: Each instance is encoded into a column of the table. This can be used to encode a classes into a transposed table.
  • multiple_cells-formatted classes: Each instance is encoded into a range of cells within the row of its parent instance.
  • cell-formatted classes: Each instance is encoded into a cell in the row of its parent instance.

Attributes of class instances

The value of each attribute of each instance should be encoded into cells as follows:

  • Non-relational attributes: Each value should be serialized into a Boolean, number, or string that can be deserialized by the attribute's deserialize method. For example, dates can be serialized into the format YYYY-MM-DD, and times can be serialized into the format hh:mm:ss. Please see the Python documentation for more information about how each attribute should be serialized.
  • Relational (*-to-one) attributes: Each value should be represented by the serialized value of the primary attribute of the related instance. For example, a value of an attribute, which is an instance of a class CreditCard, which has a primary attribute number, should be serialized to the number of the credit card. See above for more information about how to set the primary attribute of a class.
  • Relational (*-to-many) attributes: Each value should be represented as a comma-separated list of the serialized value of the primary attributes of the related instances. For example, a value of an attribute which is a list of instances of CreditCard should be serialized to a comma-separated list of the numbers of the credit cards. Primary keys which contain commas or double quotation marks should be enclosed in double quotation marks, similar to CSV files. For example, a list of a, b, "c", d, e should be represented as a, "b, \"c\", d", e. See above for more information about how to set the primary attribute of a class.

Comments about instances of classes

Comments about individual class instances can be encoded as follows:

  • row-formatted classes: Comments can be encoded as rows with a single cell in the first column which begins with %/ and ends with /%. Comments should be placed above the instance that the comment applies to.
  • column-formatted classes: Comments can similarly be encoded as columns with a single cell that begins with %/ and ends with /%.

Additional tables, rows, and columns not in the schema

Sets of tables that encode datasets can include additional tables, rows, and columns. This can be used to encapsulate additional metadata alongside datasets.

  • Additional tables: ObjTables ignores tabular files or worksheets which do not contain class/table declarations. ObjTables also ignores worksheets whose names do not begin with !!.
  • Additional columns (row-formatted tables) / rows (column-formatted tables): For row-formatted tables, ObjTables ignores columns whose headings do not begin with !. For column-formatted tables, ObjTables ignores rows whose headings do not begin with !.

JSON/YAML format for datasets

To make it easy to use ObjTables with other programming languages, ObjTables can encode datasets into JSON or YAML documents, which can easily be parsed by a wide range of languages. ObjTables encodes datasets into JSON and YAML as described and illustrated below.

  • Classes and metadata: Datasets are encoded into dictionaries. These dictionaries contain the following key-value pairs:
    • Document metadata
      • Key: _documentMetadata.
      • Value: Dictionary of key-value pairs that represent the document-level metadata of the dataset.
    • Class metadata
      • Key: _classMetadata.
      • Value: Dictionary whose keys are the names of the classes in the schema for the dataset and whose values are dictionaries of key-value pairs that represent the metadata for each class.
    • Instances of classes
      • Keys: Names of the classes in the schema.
      • Values: Lists of dictionaries that represent each instance of each class in the dataset.
  • Instances of classes: Each object in a dataset is encoded into a dictionary that has the following keys and values:
    • Type
      • Key: __type
      • Value: Name of the class of the object.
    • Unique identifier (used for serializing and deserializing relationships)
      • Key: __id
      • Value: Unique integer-valued id for each object. The ObjTables software automatically generates these ids when it exports a dataset.
    • Attributes
      • Keys: Names of the attributes of the class of the object.
      • Values: Values of the attributes (which are encoded into Booleans, numbers, strings, dictionaries, and lists as appropriate for each type of ObjTables attribute).
  • Relationships between instances: Attributes which represent relationships between objects are encoding using the unique integer-value ids generated for each object:
    • Each attribute which represents a *-to-one relationship is encoded into the unique integer id of the related object (or null if there is no related object).
    • Each attribute which represents a *-to-many relationship is encoded into a list of the unique integer ids of the related objects.
{
      # dictionary of metadata about
      # the document
      "_documentMetadata":
      {
    "objTablesVersion": "<ObjTables version>",
    "date": "<Date>",
    ... # additional document metadata
  },

  # dictionary which maps the name of each class
  # to a dictionary with metadata about each class
  "_classMetadata":
  {
    "<ClassName>":
    {
      "id": "<ClassName>",
      "name": "<Class verbose name>",
      ... # additional metadata for the class
    },
    ... # dictionaries with metadata
        # for more classes
  },

  # for each class, lists of dictionaries
  # which represent the instances of the classes
  "<ClassName>":
  [
    {
      "__type": "<ClassName>",
      "__id": <unique integer id>,
      "<attribute name>": <value of attribute>,
      "<*to-one attribute name>":
        <integer id of related object>,
      "<*to-many attribute name>":
        [<integer id of related object>, ...],
      ... # additional attributes of the
          # instance of the class
    },
    ... # additional instances of the class
  ],
  ... # lists of dictionaries which represent the
      # instances of additional classes
}

This Python code illustrates how to decode datasets encoded into this JSON/YAML format. We recommend following this example to create methods for other languages.

Validation of ObjTables datasets

One of the major goals of ObjTables is to make it easy to rigorously validate the syntax and semantics of complex datasets. ObjTables achieves this by making it easy to utilize the six levels of validation summarized below. The validations marked with ○ can optionally be disabled.

  1. Syntactic validation: The ObjTables software performs the following checks:
    • A dataset is encoded in the one of the tabular, JSON, or YAML formats.
    • The dataset contains all of the classes/tables and attributes/columns defined in the schema.
    • The dataset contains no extra classes/tables or attributes/columns.
    • Each class/table is oriented (normal, transposed) as defined in the schema.
    • The document and class metadata are syntactically valid lists of key-value pairs.
  2. Attribute validation
    • ObjTables checks that the value of each attribute is consistent with its type. For example, ObjTables checks that the value of each Integer attribute can be decoded to an integer.
    • ObjTables checks that the value of each attribute is consistent with any constraints defined in the schema. For example, if an Integer attribute is defined with min and max arguments, ObjTables checks that each value is in between these quantities.
    • For *-to-* relationship attributes, ObjTables checks that each related object is defined within the same dataset.
    • Python schema developers can implement additional validation by creating subclasses of obj_tables.Attribute and overriding their validate methods. For example, this can be used to check that an attribute that represents the participants of a biochemical reaction represents an element-balanced reaction.
  3. Object validation: Python schema developers can validate the semantic meaning of each object by overriding the validate method of each class. For example, this can be used to check that the values of two attributes that represent that the chemical formula and molecular weight of a compound are consistent.
  4. Uniqueness validation:
    • For each attribute that is defined with the optional argument unique=True, ObjTables checks that the values of the attribute are unique across the dataset.
    • For each unique_together constraint of each class, ObjTables checks that the combinations of the values of the attributes defined in the constraint are unique across the dataset. For example, this can be used to check that objects that represent linear mathematical expressions (set of other objects) represent unique expressions.
  5. Class validation: Python schema developers can validate the semantic meaning of all of the objects of a class by overriding the validate_unique method of each class. For example, this can be used to check that a class that represents the nodes of a graph and their edges does not contain cycles.
  6. Dataset validation: Python schema developers can validate the semantic meaning of an entire dataset by creating a subclass of obj_tables.Validator and overriding its run method. For example, this could be used to validate that a dataset that represents a model of the biochemistry of a cell captures the self-replicating behavior of that cell.

Software tools

ObjTables provides several tools for working with schemas and datasets. The tools marked with ● are available through all of the user interfaces (the web application, web service, command-line program, and Python package). The tools marked with ○ are only available through the Python package. Please see below and the Python documentation for more information about the tools .

  • Generate a Python module that implements a tabular-formatted schema.
  • Generate a UML diagram for a schema.
  • Generate a template CSV, TSV, or XLSX file(s) for a schema.
  • Programmatically construct or edit a dataset.
  • Merge multiple datasets into a single dataset.
  • Partition a dataset into multiple datasets.
  • Migrate a dataset between versions of a schema. See the Python documentation for more information.
  • Use Git to revision a dataset.
  • Validate that a dataset adheres to a schema and report any errors.
  • Normalize a dataset into a deterministically reproducible ordering.
  • Sort a dataset into a random order.
  • Pretty print a dataset.
  • Use a schema to determine if two datasets are semantically equivalent.
  • Use a schema to evaluate the difference in the semantic meaning of two datasets.
  • Parse a dataset into a Python data structure.
  • Use a schema to convert a dataset to an alternate format.
  • Use a schema to convert a dataset to a dictionary of pandas data frames .

Interfaces

ObjTables include four interfaces: a web application, a web service, a command-line program, and a Python package. The web application, web service, and command-line program provide the same features. In addition to the features of the web application, web service, and command-line program, the Python package can programmatically query, edit, merge, split, revision, and migrate datasets. The Python package is also more flexible. For example, the Python package can support additional datatypes and custom validation.

We recommend beginning with the web application, web service, or command-line program. We recommend using the Python package when more flexibility is required, such as a custom data type, or to analyze datasets with Python tools such as NumPy, Pandas, scikit-learn, or SciPy.

Web application

The web application is available at objtables.org/app.

Command-line program

The following example illustrates how to use the command-line program to generate a template spreadsheet for a dataset, validate a dataset, and compare two datasets.

obj-tables gen-template /path/to/schema.xlsx /path/to/template.xlsx
obj-tables validate /path/to/schema.xlsx /path/to/dataset.xlsx
obj-tables diff /path/to/schema.xlsx /path/to/dataset-1.xlsx /path/to/dataset-2.xlsx

Installation instructions for the command-line program are available at docs.karrlab.org. Documentation for the command-line program is available inline by running obj-tables --help.

Web service

Documentation for the web service is available at objtables.org/api.

Python package

As described above, the Python package has significantly more capabilities than the web application, web service, and command-line program.

The following example briefly introduces the API by illustrating how to use the API to programmatically create, manipulate, analyze, and export the same address book of tech CEOs described above .

Additional and more extensive tutorials are available as interactive Jupyter notebooks at sandbox.karrlab.org. Detailed documentation is available at docs.karrlab.org. Documentation for each module, class, and function can also be obtained by executing help(<module/class/function>) (e.g., help(obj_tables.io.validate)).

Define schema

Please see the example above .

Create companies

apple = Company(
    name='Apple',
    url='https://apple.com/',
    address=Address(
        street='10600 N Tantau Ave',
        city='Cupertino',
        state='CA',
        zip_code='95014',
        country='US'))
facebook = Company(
    name='Facebook',
    url='https://facebook.com/',
    address=Address(
        street='1 Hacker Way #15',
        city='Menlo Park', state='CA',
        zip_code='94025',
        country='US'))
google = Company(
    name='Google',
    url='https://google.com/',
    address=Address(
        street='1600 Amphitheatre Pkwy',
        city='Mountain View',
        state='CA',
        zip_code='94043',
        country='US'))
netflix = Company(
    name='Netflix',
    url='https://netflix.com/',
    address=Address(
        street='100 Winchester Cir',
        city='Los Gatos',
        state='CA',
        zip_code='95032',
        country='US'))
companies = [
    apple,
    facebook,
    google,
    netflix,
]

Create CEOs

cook = Person(
    name='Tim Cook',
    type=PersonType.business,
    company=apple,
    email_address='tcook@apple.com',
    phone_number='408-996-1010',
    address=apple.address)
hastings = Person(
    name='Reed Hastings',
    type=PersonType.business,
    company=netflix,
    email_address='reed.hastings@netflix.com',
    phone_number='408-540-3700',
    address=netflix.address)
pichai = Person(
    name='Sundar Pichai',
    type=PersonType.business,
    company=google,
    email_address='sundar@google.com',
    phone_number='650-253-0000',
    address=google.address)
zuckerberg = Person(
    name='Mark Zuckerberg',
    type=PersonType.family,
    company=facebook,
    email_address='zuck@fb.com',
    phone_number='650-543-4800',
    address=facebook.address)

ceos = [
  cook,
  hastings,
  pichai,
  zuckerberg,
]

Get a property of a company

assert facebook.url == 'https://facebook.com/'

Edit a property of a company

facebook.url = 'https://about.fb.com/'

Validate address book

import obj_tables
errors = obj_tables.Validator().run(
    companies + ceos)
assert errors is None

Export address book to a file

import obj_tables.io
import os
import tempfile
dirname = tempfile.mkdtemp()
filename_xlsx = os.path.join(
    dirname,
    'address_book.xlsx'
)
obj_tables.io.Writer().run(
    filename_xlsx,
    companies + ceos,
    models=[
        Company,
        Person,
    ]
)

Import address book from a file

objects = obj_tables.io.Reader().run(
    filename_xlsx,
    models=[
        Company,
        Person,
    ],
    group_objects_by_model=False,
    ignore_sheet_order=True
)

Check if two CEOs are semantically equivalent

zuckerberg_copy = next(
    el for el in objects 
         if isinstance(el, Person) \
      and el.name == 'Mark Zuckerberg')
assert zuckerberg_copy.is_equal(
    zuckerberg)
assert zuckerberg_copy.difference(
    zuckerberg) == ''

Working with ObjTables datasets with other programming languages

We recommend that developers use the web service to work with ObjTables in other programming languages:

  • Visualizing a schema: Use the viz-schema endpoint to generate a UML diagram of a schema.
  • Generating a template for a dataset of a schema: Use the gen-template endpoint to generate a template for a dataset in CSV, TSV, merged CSV, merged TSV, or XLSX format.
  • Validating that a dataset is consistent with a schema: Use the validate endpoint to check that a dataset adheres to a schema and identify any errors.
  • Parsing a dataset into a native data structure:
    1. Use the convert endpoint to encode a dataset into a JSON document. See above for more information about how ObjTables encodes datasets into JSON.
    2. Parse the JSON-encoded document into a native data structure. For example, use Python's json.loads method to parse a JSON-encoded document into a combination of lists, dictionaries, and scalars.
    3. Implement a method for decoding a dataset from this native data structure. This Python module illustrates how to decode datasets from native Python data structures. We recommend following this example to create methods for other languages.
    4. Use this method to decode the dataset from the native data structure that represent the JSON-encoded document.
  • Differencing datasets: Use the diff endpoint to check if two datasets are semantically equivalent and identify any differences.
  • Pretty printing a dataset: Use the normalize endpoint to pretty print a dataset in XLSX format. This will print the classes and attributes in their canonical orders, highlight and freeze the headings of each table, embed help information about each column into comments, setup XLSX validation for each cell, and protect each worksheet. Optionally, pretty printed datasets can include an additional worksheet with a table of contents that summarizes and provides hyperlinks to each class, as well as an additional worksheet that summarizes the schema. We recommend embedding schemas with datasets when sharing datasets with others or publishing datasets.
  • Converting a dataset to other formats: Use the convert endpoint to convert a dataset among CSV, TSV, merged CSV, merged TSV, XLSX, JSON, and YAML formats.

The documentation for the web service contains detailed information about the inputs and outputs of each endpoint.

Useful related resources for working with ObjTables

Below are several resources that we recommend for working with ObjTables-encoded datasets:

  • Graphical workbook editors
    • Microsoft Excel : Leading workbook editor
    • Libre Office : Open-source workbook editor for Linux, Mac, and Windows
    • WPS Office : Free workbook editor for Linux, Mac, and Windows
  • Python : Language for programmatically interacting with ObjTables-encoded datasets.

Comparison with other data modeling tools

While ObjTables has many similarities to other toolkits, ObjTables's unique combination of features provides some advantages for some use cases. Below, we outline the advantages and disadvantages of ObjTables of over several other types of tools.

  • Workbook editors with validation such as Microsoft Excel or LibreOffice Calc: Workbook editors are user-friendly tools for quickly viewing and editing datasets, including relatively large datasets. However, workbook editors have limited data types, limited support for relationships, and limited support for validation. In addition, workbook editors do not separate workbooks from schemas such that schemas can be applied to multiple workbooks. As a result, tools for programmatically reading workbooks such as openpxyl cannot link related records across multiple worksheets into a connected object graph. ObjTables leverages the user-friendliness of workbook editors and adds the abilities to parse and validate workbooks with schemas that support many data types and rich validation at multiple levels. Consequently, ObjTables makes it easy to interact with datasets both as workbooks (with tools such as Excel) and as data structures (with languages such as Python).
  • Alternative spreadsheet schemas such as Table Schema : Table Schema also combines the ease of use of spreadsheets with the rigor of schemas. One advantage of Table Schema is that is more formally defined than ObjTables. However, Table Schema provides limited support for scientific data types such as chemical formulae and reaction equations; Table Schema doesn't support common spreadsheet design patterns such as encoding relationships into groups of columns and grammars; Table Schema doesn't support advanced validations such as checking that each reaction is mass balanced or that a network is acyclic; Table Schema doesn't provide an ORM for manipulating datasets with high-level data structures; and Table Schema doesn't leverage XLSX formatting to provide inline help and validation.
  • Low and no-code databases such as Airtable , Kinetone , and QuickBase : Low and no-code databases also combine the ease of use of spreadsheets with the rigor of schemas. Furthermore, many of these tools provide web interfaces which faciltiate collaboration. However, these tools provided limited support for scientific data types such as chemical formulae and reaction equations; these tools provide limited capabilities for advanced validations such as checking that each reaction is mass balanced or that a network is acyclic; these tools do not provide users high-level data structures for manipulating datasets; their graphical interfaces for viewing and editing datasets do not have all of the capabilities of spreadsheet programs such as Microsoft Excel; and these tools have limited capabilities to export datasets to the XLSX format, which is critical for archiving datasets as supplementary materials to journal articles. These tools are also difficult to incorporate into many computational workflows where investigators need to collaborate asynchronously by making local changes to their own copy of a dataset and then using a revisioning system such as Git to merge their changes with the community.
  • Web frameworks with object-relational mapping (ORM) tools such as Django , Ruby on Rails Active Record , and SQLAlchemy : ORMs make it easy to define schemas and provide extensive support for validation. Together with web frameworks, ORMs can be used to build graphical web-based interfaces for viewing and editing datasets. However, these tools are focused on viewing and editing individual records rather than on viewing and editing entire datasets at once. As a result, it is difficult to use these tools to construct interfaces that enable users to quickly view and edit datasets with the ease of workbook tools such as Excel. In contrast, ObjTables leverages workbooks and workbook editors as an interface for quickly viewing and editing entire datasets. As a result, ObjTables makes it easier to create copies of datasets, revision datasets with version control systems such as Git, share datasets with others, and publish datasets as supplementary materials of journal articles. Furthermore, ObjTables minimizes the number of tables required to encode datasets by making it easy to nest tables. In addition, ObjTables leverages the fact that many more people are familiar with workbooks than web frameworks and ORMs, which enables nearly anyone to view and edit ObjTables datasets. However, unlike web frameworks and ORMs, ObjTables does not support separate models from views, and ObjTables is not suitable for very large datasets.
  • Schemas for data serialization formats such as JSON Schema and XML schemas : Schema systems for data serialization formats such as JSON and XML make it easy to validate complex datasets that are encoded with formats that can easily be copied, shared, and published. However, viewers and editors for data serialization formats focus on viewing and editing individual objects, which can be cumbersome for large datasets. ObjTables improves over these systems by making it possible to convert data encoded in JSON and YAML into workbooks, which are easier to view and edit with tools such as Excel. In addition, ObjTables leverages the fact that many more people are familiar with workbooks than data serialization formats. However, ObjTables is not as mature as these systems, and ObjTables provides limited support for languages beyond Python.
  • Relational databases and querying tools such as MySQL Workbench for MySQL: Relational databases make it easy to define schemas and conduct arbitrary queries of datasets. However, querying tools such as MySQL Workbench provide cumbersome interfaces for viewing and editing large datasets, relational databases support limited data types, and it is difficult to validate relational databases without ORMs. Furthermore, using a relational database typically requires significant knowledge of its schema. ObjTables improves over these tools by leveraging workbooks and workbook editors for viewing and editing datasets, by supporting more data types, by using nested tables and grammars to simplify the representation of relationships, and by supporting validation.

Design choices

We designed ObjTables as a complement to the Open Office Spreadsheet XML format (XLSX; ECMA-376 ; ISO/IEC 29500 ) for several reasons.

Expand the capabilities of an easily readable and popular open file format

To make ObjTables an appealing option for supplementary datasets to journal articles, we wanted to base ObjTables on a file format that is easy for many researchers to read, and which is already a popular format for supplementary data. Furthermore, we wanted to base ObjTables on an open-source format. This narrowed our choices to the XLSX format or text-based formats such as CSV or TSV. While data serialization formats such as JSON, XML, and YAML provide some of the benefits of ObjTables, we did not want to base ObjTables on these formats because the GUI tools for these files do not make datasets as easy to read as spreadsheets programs such as Excel.

Although databases and ORMs also provide some of the benefits of ObjTables, we did not want to base ObjTables on a traditional relational database or a low-code database such as Airtable because we wanted journals to be able to archive datasets as a file, and we wanted readers to be able to use simple tools to read these files. Relational databases are too complex for many researchers, and commercial low-code tools are not designed to export and import datasets to and from files, which is critical for journal archiving. Another reason why we chose to base ObjTables on a file format rather than a database is that we wanted ObjTables to be compatible with asynchronous collaboration workflows with version control systems such as Git.

As a result, we chose to base ObjTables on top of the XLSX format, a popular and open format for spreadsheets.

Develop software tools that complement existing spreadsheet programs

Due to the popularity and significant capabilities of existing spreadsheet programs such as Excel, we wanted to enable users to continue to use their favorite programs on their favorite operating systems as much as possible. In particular, we wanted peer reviewers, editors, and readers of journal articles to be able to read ObjTables-encoded spreadsheets with their existing spreadsheet programs rather than have to install an additional program. We also wanted authors to be able to continue to do the majority of their work (e.g., combine data from multiple sources, munge each source, perform calculations, make charts) with their favorite spreadsheet program and only briefly have to use a new program for advanced features such as validation that are not available in programs such as Excel.

We did not want to develop an extension for a spreadsheet program such as Excel or Google Sheets because this would only benefit users of that one program as there is no universal format for extensions. In addition, an extension would only benefit users of specific versions of that program on specific operating systems because the interfaces for extensions change over time. We also did not want to build the ObjTables software as an extension of Excel or Google Sheets because we wanted researchers to be able to use ObjTables with free, open-source tools. We did not want to build the ObjTables software as an extension of an open-source spreadsheet tool such as LibreOffice Calc because these tools are much less popular than Excel.

As a result, we chose to develop the ObjTables software as a separate, OS-independent program from spreadsheet programs such as Excel. Furthermore, we tried to make the software as accessible as possible by providing it as a web application, command-line program, web service, and Python package.

Known limitations

ObjTables is under active development. Below are several open issues that we intend to address going forward. Please see the GitHub issue list for more information.

  • Currently, ObjTables has limited capabilities to convert schemas implemented with the Python package to the tabular format.
  • Currently, ObjTables has limited support for multiple inheritance.
  • Currently, ObjTables only supports one view per class.
  • Currently, ObjTables serializes Array and Table to individual cells. We are contemplating enabling these attributes to be serialized to separate files/worksheets.
  • Currently, ObjTables datasets can be exported to JSON and YAML. To facilitate use with other tools, we aim to also convert datasets to an SQL format such as SQLite.
  • Due to the limitations of the pint package, datasets that use Unit attributes cannot be pickled.
  • ObjTables is still somewhat memory and CPU-inefficient. In particular, the importing and exporting of datasets is inefficient. Going forward, we aim to improve the performance of ObjTables.

Future directions

Going forward, we hope that the community will adopt ObjTables as a standard for supplementary spreadsheets. Toward a community standard, we invite feedback and contributions. Please contact the developers to share feedback or get involved.

Guided by community input, potential directions include:

  • Develop additional data types to better support specific fields of science. Examples include data types for genetic variants, alleles, and geographic and spatial coordinates.
  • Support additional layout conventions such as multi-level headings.
  • Develop libraries for additional languages such as Java, MATLAB, and R.
  • Develop a registry of schemas to make it easier for communities to share schemas.
  • Develop a graphical user interface for editing schemas and structured spreadsheets. Several possibilities include
    • Web application for designing schemas.
    • Plugins for Microsoft Excel, Apple Numbers, and LibreOffice Calc.
    • Extension for Google Sheets.
  • Encourage data registries such as Dryad, FigShare, GitHub, and Zenodo to support ObjTables.
  • Lobby journals to encourage authors to publish reusable supplementary tables and to encourage reviewers to verify the reusability of supplementary tables.

FAQ

  • Why are ObjTables worksheets protected?

    ObjTables worksheets are protected to help users use programs such as Microsoft Excel and LibreOffice Calc as graphical user interfaces for editing datasets. Specifically, ObjTables protects the elements of spreadsheets which encode the schema (worksheets, column headings, etc.) to help researchers focus on editing their data without unintentionally changing the schema. This is designed to help researchers separate the development of schemas from the development of datasets, and adhere to schemas as they edit their datasets.

  • How can I un-protect a worksheet?

    Microsoft Excel users can un-protect a worksheet by following these steps:

    • Navigate to the worksheet
    • Open the 'Review' tab
    • Click the 'Unprotect Sheet' button

    More detailed instructions are available here .

Further help

Please contact the developers with any questions.

By using this site you agree to use cookies to collect limited personal information to help us improve ObjTables as outlined in our Privacy Policy.