Newsletter Sep21 contribution – IBM

TDF is a technology for fabricating synthetic data. Its goal is to populate databases with data that is coherent and appears realistic, despite being entirely fabricated. TDF is used either when real data does not exist or cannot be used due to privacy rules, or to fabricate cases that do not occur in the real world. An external observer, whether human or machine, should not be able to distinguish fabricated data from real data. To this end, TDF is rule-based: the fabrication is constrained by rules that guide the engine toward realistic data.
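
A minimal sketch of rule-constrained fabrication, assuming a hypothetical rule format (not TDF's actual API): each field has a generator, rules are predicates over the whole record, and the engine redraws until every rule holds.

    import random

    # Hypothetical illustration, not TDF's actual API: keep drawing
    # candidate records until all rules are satisfied.
    def fabricate(generators, rules, max_tries=1000):
        for _ in range(max_tries):
            record = {name: gen() for name, gen in generators.items()}
            if all(rule(record) for rule in rules):
                return record
        raise RuntimeError("no record satisfied the rules")

    generators = {
        "age":    lambda: random.randint(0, 120),
        "salary": lambda: random.uniform(10_000, 200_000),
    }
    rules = [
        lambda r: r["age"] >= 18,                  # adults only
        lambda r: r["salary"] < r["age"] * 5_000,  # coherent age/salary
    ]
    print(fabricate(generators, rules))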

Two main enhancements have recently been added to the TDF tool: the Rule Editor and the parallel fabrication engine.

The fabrication rules are written in a declarative language. The language offers many types of functions that cover a wide range of modeling requirements, such as value distributions, arithmetic operators, ratios between the numbers of records in different tables, import of knowledge-based data, logic functions, data value uniqueness, and more. To ease the rule modeling phase, the new Rule Editor was developed. This form-based editor enables users to write rules without knowing the modeling language syntax. In addition, the Rule Editor has a mechanism that predicts the user's intention: it suggests default values that might be needed, saving modeling time. The Rule Editor also enables the creation of complex rules by combining low-level functions. The depth of this tree-like rule hierarchy is unlimited, and the Rule Editor can navigate it. Finally, the data sampling feature lets users view a random sample of the data that a rule creates.
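
The following sketch illustrates the idea of a tree-like rule hierarchy with data sampling; the class and function names are illustrative assumptions, not the actual TDF modeling language. Low-level rules draw values, a combinator applies arithmetic to sub-rules, and sample() plays the role of the editor's sampling preview.

    import random

    # Illustrative assumption, not the Rule Editor's syntax: rules
    # compose into a tree of arbitrary depth.
    class Rule:
        def sample(self):
            raise NotImplementedError

    class Uniform(Rule):                    # low-level: value distribution
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi
        def sample(self):
            return random.uniform(self.lo, self.hi)

    class Choice(Rule):                     # low-level: weighted categories
        def __init__(self, values, weights):
            self.values, self.weights = values, weights
        def sample(self):
            return random.choices(self.values, weights=self.weights)[0]

    class Apply(Rule):                      # combinator: arithmetic on sub-rules
        def __init__(self, fn, *children):
            self.fn, self.children = fn, children
        def sample(self):
            return self.fn(*(c.sample() for c in self.children))

    # A two-level rule: net salary = gross salary * (1 - tax rate).
    gross = Uniform(30_000, 90_000)
    rate  = Choice([0.2, 0.3], weights=[0.6, 0.4])
    net   = Apply(lambda g, r: g * (1 - r), gross, rate)
    print(net.sample())   # preview one fabricated value, as in the editor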

The second new feature added to the TDF tool is parallelism. Parallel data fabrication was added to the engine to reduce fabrication time. The main idea is to execute several constraint solvers simultaneously, each responsible for fabricating approximately one record. The challenge is to create coherent data for inter-record rules that affect several records, such as a uniqueness requirement on a primary key field. Inter-record rules create dependencies between the working solvers. The solution we chose is to reuse data from other solvers: the solvers are ordered, and a solver can request data from a previous one, waiting if that data is not yet ready. Because the basic solving engine uses a backtracking mechanism, a calculation might revert to a previous state, so a complex multi-solver mechanism has been implemented to handle all possible data dependency cases.
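
The ordering idea can be sketched under simplifying assumptions (this is not TDF's engine, and a simple redraw stands in for its backtracking): each solver waits for the keys fabricated by all previous solvers before choosing its own, which keeps a primary key unique across parallel workers.

    import concurrent.futures as cf
    import random

    # Hypothetical sketch: solvers are ordered; waiting on earlier
    # futures models "ask data from a previous solver".
    def solve(earlier):
        used = {f.result() for f in earlier}   # blocks until data is ready
        key = random.randint(1, 10)
        while key in used:                     # collision: redraw the value
            key = random.randint(1, 10)
        return key

    with cf.ThreadPoolExecutor(max_workers=5) as pool:
        futures = []
        for _ in range(5):
            futures.append(pool.submit(solve, list(futures)))
        print([f.result() for f in futures])   # five distinct keys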