Step4: CS Merging

To have a compact schema, we further reduce the number of tables in the emergent relational schema by merging CS's, using either semantic or structural information.

Semantic Merging. We can merge two CS's on semantic grounds when the class labels found for both were based on ontology information. Two CS's whose labels were created from the same ontology class URI represent the same concept and can therefore be merged. If the labels stem from different ontology classes, we inspect the subclass hierarchy and identify the common concept/class shared by both CS's, if any (e.g., <Athlete> is a common class for <BasketballPlayer> and <BaseballPlayer>), and then judge whether these CS's are similar based on the “generality” of that concept. Here the “generality” score of a concept is the percentage of instances covered by it and its subclasses among all the instances covered by that ontology. Two CS's whose labels share a non-general ancestor in an ontology class hierarchy can be merged.
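The generality test can be illustrated with a minimal Python sketch. It is not the system's implementation: it assumes a single-inheritance hierarchy given as a child-to-parent map, per-class counts of direct instances, and a hypothetical cut-off `max_generality` (the text does not state the threshold used).

```python
def ancestors(cls, parent):
    """Return cls followed by its ancestors, nearest first (single inheritance assumed)."""
    chain = []
    while cls is not None and cls not in chain:
        chain.append(cls)
        cls = parent.get(cls)
    return chain


def generality(cls, parent, direct_instances):
    """Share of all ontology instances covered by cls and its subclasses."""
    total = sum(direct_instances.values())
    if total == 0:
        return 0.0
    covered = sum(n for c, n in direct_instances.items()
                  if cls in ancestors(c, parent))
    return covered / total


def can_merge_semantically(label_a, label_b, parent, direct_instances,
                           max_generality=0.2):  # assumed threshold
    """True if both labels share a sufficiently specific (non-general) ancestor."""
    chain_b = set(ancestors(label_b, parent))
    for candidate in ancestors(label_a, parent):  # nearest ancestor first
        if candidate in chain_b:
            return generality(candidate, parent, direct_instances) <= max_generality
    return False  # no common class in the hierarchy


# Hypothetical toy ontology for illustration only.
parent = {"BasketballPlayer": "Athlete", "BaseballPlayer": "Athlete",
          "Athlete": "Person", "Person": "Thing",
          "City": "Place", "Place": "Thing"}
direct_instances = {"Thing": 0, "Person": 40, "Athlete": 2,
                    "BasketballPlayer": 4, "BaseballPlayer": 4,
                    "Place": 30, "City": 20}
```

In this toy data <Athlete> covers 10 of 100 instances, so the two player classes would be merged, whereas <BasketballPlayer> and <City> only share the fully general root class and would not.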

Structural Merging. The structural similarity between two CS's can be assessed using the set of properties in each CS and the relationships found between them and other CS's. As an original class can be identified by its “discriminating” properties (based on TF/IDF scoring), we merge two CS's if their property sets have a high TF/IDF similarity score. Additionally, as a subject typically refers to only one specific entity via a property, we also merge two CS's that are both referred to from the same CS via the same property.
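One common way to realize a TF/IDF similarity over property sets is a cosine similarity in which each property is weighted by its inverse document frequency across all CS's. The Python sketch below follows that formulation; the exact weighting used by the system is not given in the text, so the smoothed IDF `log(1 + N/df)` and the function names are assumptions.

```python
import math


def idf_weights(property_sets):
    """Weight each property by how few CS's contain it (smoothed IDF, an assumption)."""
    n = len(property_sets)
    df = {}
    for props in property_sets:
        for p in props:
            df[p] = df.get(p, 0) + 1
    return {p: math.log(1 + n / d) for p, d in df.items()}


def tfidf_similarity(a, b, idf):
    """Cosine similarity of two property sets; term frequency is 1 for present properties."""
    dot = sum(idf[p] ** 2 for p in a & b)
    norm_a = math.sqrt(sum(idf[p] ** 2 for p in a))
    norm_b = math.sqrt(sum(idf[p] ** 2 for p in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical property sets for three CS's.
cs1 = {"name", "birthDate", "team"}
cs2 = {"name", "birthDate", "team", "height"}
cs3 = {"name", "population"}
idf = idf_weights([cs1, cs2, cs3])
```

Here `cs1` and `cs2` share their discriminating properties and score much higher than `cs1` and `cs3`, which only share the ubiquitous `name`.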

Step5: Schema and Instance Filtering

We now perform final post-processing to clean up and optimize both the schema and the data instances in it. As part of this phase, all RDF triples are visited again and are either stored in relational tables (typically >90 % of the triples, which we consider regular) or, for the remainder, stored separately in a PSO table. Hence, our final result is a set of relational tables with foreign keys between them, and a single triple table in PSO format.

Filtering small tables. After the merging process, most of the merged classes (i.e., surviving merged CS's) cover a large number of triples. However, some classes may still cover only a limited number of RDF subjects (e.g., less than 0.1 % of all data). As removing these classes reduces coverage only marginally, we remove them from the schema (except classes that were recognized as dimension tables with the described PageRank method). All triples of subjects belonging to the removed classes are moved to the separate PSO table.
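A minimal Python sketch of this filter is shown below. It measures class size by subject count, takes the cut-off as an explicit `min_fraction` argument, and receives the dimension tables as a precomputed set; all three choices, and the names used, are assumptions made for illustration.

```python
def split_small_classes(subject_counts, min_fraction, dimension_tables=frozenset()):
    """Split classes into (kept, removed) by their share of all subjects.

    Classes listed in dimension_tables are always kept, however small.
    """
    total = sum(subject_counts.values())
    kept, removed = {}, {}
    for cls, n in subject_counts.items():
        if cls in dimension_tables or n >= min_fraction * total:
            kept[cls] = n
        else:
            removed[cls] = n  # its triples would go to the PSO table
    return kept, removed


# Hypothetical counts: 10,000 subjects in total.
counts = {"Person": 9000, "City": 995, "Oddity": 3, "Country": 2}
kept, removed = split_small_classes(counts, min_fraction=0.001,
                                    dimension_tables={"Country"})
```

With these numbers only `Oddity` is dropped; `Country` is equally small but survives because it is treated as a dimension table.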

Maximizing type homogeneity. Literal object values corresponding to each attribute in a class can have several different types, e.g., number, string, dateTime, etc. The relational model can only store a single type in each column, so in case of type diversity multiple columns are used for a single property. As the number of columns can become large just because a few triples have the wrong type (dirty data), we minimize this number by filtering out all infrequent literal types (types that appear in less than 5 % of all object values) for each property. The triples with infrequent literal types are moved to the separate PSO table.
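The per-property type filter can be sketched in Python as follows, assuming the input is one `(property, literal_type)` pair per triple of a class; the function name and input shape are hypothetical.

```python
from collections import Counter, defaultdict


def frequent_types(typed_values, min_share=0.05):
    """Return, per property, the literal types covering at least min_share of its values."""
    counts = defaultdict(Counter)
    for prop, literal_type in typed_values:
        counts[prop][literal_type] += 1
    kept = {}
    for prop, type_counts in counts.items():
        total = sum(type_counts.values())
        kept[prop] = {t for t, n in type_counts.items() if n / total >= min_share}
    return kept


# Hypothetical data: 3 of 100 height values are mistyped strings.
typed_values = ([("height", "float")] * 97 + [("height", "string")] * 3 +
                [("born", "date")] * 60 + [("born", "string")] * 40)
kept_types = frequent_types(typed_values)
```

In this example `height` keeps a single `float` column, while `born` genuinely needs two columns because both of its types are frequent.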

Minimizing the number of infrequent columns. Infrequent columns are those that have many NULL values. If the coverage of a property is less than a certain threshold value (e.g., 5 %), that property is infrequent and all the RDF triples of that property are treated as irregular data and moved to the separate PSO table.
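As a small Python sketch, and assuming that the coverage of a property is the fraction of a class's subjects that have at least one value for it, the infrequent properties of one class could be identified like this (names are hypothetical):

```python
def infrequent_properties(num_subjects, subjects_with_property, min_coverage=0.05):
    """Return the properties whose coverage within the class is below min_coverage."""
    if num_subjects == 0:
        return set()
    return {p for p, n in subjects_with_property.items()
            if n / num_subjects < min_coverage}


# Hypothetical class with 1000 subjects.
sparse = infrequent_properties(1000, {"name": 990, "email": 400, "nickname": 12})
```

Only `nickname` (1.2 % coverage) would be moved out of the table; a column for it would be almost entirely NULL.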

Filtering the relationships. We further filter out infrequent or “dirty” relationships between classes. A relationship between csi and csj is infrequent if the number of references from csi to csj is much smaller than the frequency of csi (e.g., less than 1 % of the CS's frequency). A relationship is considered dirty if most but not all of the object values of the referring class (e.g., csi) refer to instances of the referred class (csj). In the former case, we simply remove the relationship information between the two classes. In the latter case, the triples in csi that do not refer to csj are filtered out (placed in the separate PSO table).
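The two cases can be told apart with a few counts, as in the Python sketch below. The 1 % ratio comes from the example above; the `majority_ratio` that decides what counts as "most" is not specified in the text and is an assumed parameter, as are the function and label names.

```python
def classify_relationship(refs_to_target, source_frequency, total_object_values,
                          infrequent_ratio=0.01, majority_ratio=0.9):
    """Label a candidate relationship csi -> csj as 'infrequent', 'dirty' or 'keep'.

    refs_to_target:      object values of csi's property that point into csj
    source_frequency:    number of subjects in csi
    total_object_values: all object values of that property in csi
    """
    if total_object_values == 0 or refs_to_target < infrequent_ratio * source_frequency:
        return "infrequent"  # drop the relationship information
    if (refs_to_target < total_object_values
            and refs_to_target >= majority_ratio * total_object_values):
        return "dirty"       # move the non-conforming triples to the PSO table
    return "keep"
```

For instance, 5 references from a CS with 1000 subjects would be infrequent, while 950 of 1000 object values pointing to the same class would be dirty, leaving 50 triples to filter out.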

Multi-valued attributes. The same subject may have 0, 1 or even multiple triples with the same property, which in our schema leads to an attribute with cardinality >1. While this is allowed in UML, direct storage of such values is not possible in relational databases. Practitioners handle this by creating a separate table that contains the primary key (subject oid) and the value (which given literal type diversity may be multiple columns). The MonetDB/RDF system does this, but only creates such separate storage if really necessary. That is, we analyze the mean number of object values (meanp) per property. If the meanp of a property p is not much greater than 1 (e.g., less than 1.1), we consider p as a single-valued property and only keep the first value of that property while moving all the triples with other object values of this property to the nonstructural part of the RDF dataset. Otherwise, we will add a table for storing all the object values of each multi-valued property.
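The decision rule can be sketched in Python as follows. The sketch assumes that meanp is the number of triples with property p divided by the number of distinct subjects that have p; the function names and the toy triples are hypothetical.

```python
from collections import defaultdict


def mean_values_per_subject(triples):
    """Map each property to its mean number of object values per subject having it."""
    counts = defaultdict(lambda: defaultdict(int))
    for subject, prop, _obj in triples:
        counts[prop][subject] += 1
    return {p: sum(per_subject.values()) / len(per_subject)
            for p, per_subject in counts.items()}


def multi_valued_properties(triples, threshold=1.1):
    """Properties whose mean reaches the threshold get their own table."""
    return {p for p, mean in mean_values_per_subject(triples).items()
            if mean >= threshold}


# Hypothetical class: 20 subjects with one name each, one of them with an alias,
# and two subjects with e-mail addresses (three and one).
triples = ([(f"s{i}", "name", f"n{i}") for i in range(20)] +
           [("s0", "name", "alias")] +
           [("s0", "email", "a"), ("s0", "email", "b"),
            ("s0", "email", "c"), ("s1", "email", "d")])
means = mean_values_per_subject(triples)
```

Here `name` averages 1.05 values and stays a single-valued column (the alias triple is treated as irregular), whereas `email` averages 2.0 and is stored in a separate table.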
