The Power of Data-Centric Technologies

Are you familiar with the term data-centric technology (DCT)? If not, I am glad you are reading this post. I am interested in several aspects of data that are so important nowadays and yet often overlooked. It is hard to imagine that anyone would deny the critical role of data in every application, but is our overall understanding of data genuinely representative?

The Merriam-Webster dictionary defines data as:

  1. Factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.
  2. Information in digital form that can be transmitted or processed.
  3. Information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful.

In this definition, we can identify a pattern: "…a basis for reasoning, discussion, or calculation…", "…can be transmitted or processed…", "…information output…". That is, we find data in the input and/or output of some meaningful process.

Retiring the “Data Is” vs “Data Are” Debate

Before we continue, it is worth mentioning that Merriam-Webster also describes data as a noun, plural in form but singular or plural in construction. I like that. It serves to settle the (often unnecessary) discussion: Should I say data is or data are? Should I say datum when referring to a single piece of data? I don’t know about you, but saying or writing datum doesn’t feel natural (although it is not wrong). Therefore, in this post, I will always refer to data as a construction, which is either singular or plural.

The Data Rank

You have probably heard about the DIKW pyramid, a reference to an evolutionary relationship between data, information, knowledge, and wisdom. This is another notion that places data at the lower end of a hierarchical system. It makes sense in the context of information management and related areas, but… could data do more than be raw?

From the pragmatic perspective of computation and, more specifically, software development, the famous saying applies: data is data. We don’t compute over information, knowledge, or wisdom. We can infer these things from various processing mechanisms, but in all these computational stages, data is a construction! And as soon as we see and understand data as a construction, we may realize virtually endless technological opportunities. Let us discuss some of them next.

DCTs for Decentralization

What makes blockchain “decentralized”? One could focus on the nature of the processes in a blockchain setting, such as decision-making that is delegated to a distributed network instead of being conducted by a single entity. How is this even possible? A central component of blockchain technologies is a distributed ledger, which is immutable and designed to be operated in a peer-to-peer, trustless fashion. In other words, it is how data is structured that defines how data is handled, which opens up possibilities for establishing a new technology. A public blockchain is often described as a data structure, somewhat similar to a conventional database, with the differentiators of being decentralized, distributed, and (historically) immutable; that is, once data is added to the ledger, it cannot be altered or deleted.
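To make the idea that "how data is structured defines how data is handled" concrete, here is a minimal sketch of an append-only, hash-linked ledger. It is only an illustration under simplifying assumptions: the field names (index, prev, payload, hash) and the helper functions are hypothetical, and there is no network, consensus, or signature scheme, all of which a real blockchain requires.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, prev_hash, payload):
    """Hash a block's contents together with the previous block's hash."""
    body = json.dumps(
        {"index": index, "prev": prev_hash, "payload": payload},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a new block that commits to the current tip of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    index = len(chain)
    chain.append({
        "index": index,
        "prev": prev_hash,
        "payload": payload,
        "hash": block_hash(index, prev_hash, payload),
    })


def is_valid(chain):
    """Recompute every hash and check that each block links to its predecessor."""
    prev_hash = GENESIS_PREV
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(i, block["prev"], block["payload"]):
            return False
        prev_hash = block["hash"]
    return True


ledger = []
append_block(ledger, {"from": "alice", "to": "bob", "amount": 5})
append_block(ledger, {"from": "bob", "to": "carol", "amount": 2})
valid_before = is_valid(ledger)

# Tampering with an earlier entry invalidates its stored hash.
ledger[0]["payload"]["amount"] = 500
valid_after = is_valid(ledger)
```

Because every block commits to the hash of the one before it, rewriting history means recomputing every later block. In a distributed setting, that is what makes silent alteration impractical.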

Blockchain is a perfect example of a DCT: data is structured in such a way that an entire technological ecosystem is enabled. Think about public and private blockchains, cryptocurrencies, smart contracts, non-fungible tokens (NFTs), and so much more. At the center of all these and many other applications, there is a particular structure and treatment of data that speaks to the nature and benefits of DCTs.

DCTs: From Local to Global

Dr. Carlos Araujo, the founder of Symetrix and Algemetric, invented a ferroelectric RAM (FeRAM) memory [1] in 1986 that is now used in more than one billion devices worldwide. Among other features, Carlos’ FeRAM equipped smart cards and smart tags with an internal memory that was far more durable (supporting an exponentially larger number of reads and writes and withstanding much higher temperatures), performed better, and offered more storage. I remember when I first had the opportunity to read the specifications of a Panasonic smart card powered by Carlos’ memory in 2009. Panasonic took advantage of a more resilient memory and more available chip space to add a co-processor with low-level functions that, if well explored, could turn the smart card into a filesystem and the co-processor into some form of micro-operating system.

My first thought was that the smart card could enable the development of a DCT-based application involving integrated services among public transportation, universities, shopping malls, and others, all of which would have a common interface with the smart card. At that time, my development team and I created a software development kit (SDK) with a complete API for managing data in the smart card for general-purpose applications. We developed demos for physical access control, product inventory, shopping, tax collection, and authentication, among other applications. The API was high-level and straightforward enough that any existing system could easily communicate with the smart card. An entirely new opportunity became available by creating a special data structure that could be exploited across multiple platforms.

One of the most common questions I got back then was, “Why not use the card just to read an ID and then search a database for its related data?” I always thought that this question lacked imagination. It typically implies that interfacing with the smart card is unnecessary overhead when reading an ID and working with a database would do the trick. However, working with data in the smart card allows a much more straightforward integration among independent parties. The simplest form of integration consists of adding a component that communicates with the smart card API to read and write data. Unless intended or required, there is no need to store data locally, as the smart card data is available across several other applications for a wide variety of purposes.

I didn’t know that what I thought was a good idea was already a reality all over Japan. Suica is a reloadable integrated circuit (IC) card for train and bus travel in Tokyo that incorporates Carlos’ FeRAM memory. It saves time and money and addresses several other issues, such as losing tickets. In Japan, you are not required to plan your itinerary as you do when purchasing tickets in advance. You can just buy the Suica card, load some credit, and use it as you go. You can use Suica on trains, buses, and taxis, and the same card can be used as a debit card at most shops in Tokyo.

Carlos’ FeRAM serves to show the extent of DCTs. A robust FeRAM memory serves as reliable data storage for a constrained compute environment where one can construct a data structure and a set of associated functions that persist only in the smart card and where end-to-end transactions can be completed without access to external resources. Having all the resources in a local setting has taken the smart card to a global stage.
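As a rough illustration of a data structure and its associated functions living entirely on the card, consider the sketch below. Everything in it is hypothetical: the class name StoredValueCard, its fields, and its methods are invented for this post and do not reflect the Panasonic card, the SDK mentioned earlier, or Suica. The point is only that a payment can be completed using nothing but the data carried by the card.

```python
from dataclasses import dataclass, field


@dataclass
class StoredValueCard:
    """Hypothetical card-resident record: balance and history live on the card."""
    card_id: str
    balance: int = 0                         # smallest currency unit
    log: list = field(default_factory=list)  # (kind, party, amount) entries

    def load(self, amount):
        """Add credit to the card."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.balance += amount
        self.log.append(("load", "kiosk", amount))

    def pay(self, merchant, amount):
        """Complete a payment using only card-resident data (no backend lookup)."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        if amount > self.balance:
            return False  # declined locally, nothing is written
        self.balance -= amount
        self.log.append(("pay", merchant, amount))
        return True


card = StoredValueCard("demo-001")
card.load(2000)
train_ok = card.pay("train gate", 180)
shop_ok = card.pay("shop", 5000)  # exceeds the balance, so it is declined
```

Any party that can talk to the card (a train gate, a shop terminal, a campus door) sees the same balance and history without sharing a database with the others.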

DCTs as an Enabler of Parallel Computing

As one example of DCTs as technology enablers, I highlight that there are ways in which data can be organized so it can be processed in a parallel fashion. This is not new in cryptography. The RSA cryptosystem [2] was initially designed to be computed via sequential computing. However, if one combines RSA and the Chinese Remainder Theorem (CRT), variant schemes of RSA [3] [4] can be accelerated via parallel processing. In fact, CRT is commonly used to enable parallel processing in many engineering applications [5] [6].
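The sketch below shows the textbook way CRT speeds up RSA decryption. It is not the specific schemes of [3] or [4]. The toy primes and the function name rsa_crt_decrypt are assumptions chosen for illustration, and real deployments use much larger primes plus padding. The two half-size exponentiations do not depend on each other, so they could run on separate workers.

```python
def rsa_crt_decrypt(c, p, q, d):
    """Decrypt c with private exponent d using the CRT (Garner's recombination)."""
    # Reduced exponents for each prime factor.
    dp = d % (p - 1)
    dq = d % (q - 1)

    # Two independent exponentiations: candidates for parallel execution.
    mp = pow(c, dp, p)
    mq = pow(c, dq, q)

    # Recombine the two partial results into the unique value modulo p*q.
    q_inv = pow(q, -1, p)  # modular inverse (Python 3.8+)
    h = (q_inv * (mp - mq)) % p
    return mq + h * q


# Toy parameters, far too small for real use.
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))

message = 65
ciphertext = pow(message, e, n)
recovered = rsa_crt_decrypt(ciphertext, p, q, d)
```

The result agrees with the direct computation pow(ciphertext, d, n), but each exponentiation works with numbers about half the size.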

Since I want to keep the discussion focused on data and avoid math-speak in this post, allow me to oversimplify and say that with CRT, an input number can be expressed as simultaneous linear congruences with coprime moduli, that is, moduli that do not share any common divisor other than one. Computing over the simultaneous linear congruences is equivalent to computing over that input number. The word “simultaneous” implies that these linear congruences are independent so that they can be computed simultaneously or, equivalently, in parallel.
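A small worked example may help here. The moduli (7, 11, 13) and the helper names to_residues and from_residues are arbitrary choices for illustration. The sketch splits two numbers into residues, computes on each residue channel independently, and reconstructs a result that matches the ordinary computation, as long as the true result is smaller than the product of the moduli.

```python
from math import prod

MODULI = (7, 11, 13)  # pairwise coprime; product is 1001


def to_residues(x):
    """Represent x by its remainders modulo each modulus."""
    return tuple(x % m for m in MODULI)


def from_residues(residues):
    """Reconstruct the unique value modulo prod(MODULI) from its residues."""
    M = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return total % M


a, b = 23, 31
ra, rb = to_residues(a), to_residues(b)

# Each channel uses only its own residues and modulus, so the three
# computations below are independent and could run in parallel.
rc = tuple((x * y + x) % m for x, y, m in zip(ra, rb, MODULI))

result = from_residues(rc)
expected = (a * b + a) % prod(MODULI)
```

Here result equals expected, even though no channel ever saw the full numbers a and b.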

Although very useful, CRT is not the only mathematical tool that can facilitate parallelism. In Geometric Algebra (GA) [7] [8], the main mathematical object of interest is the multivector. One interesting feature of GA for parallelism is that an n-dimensional GA operates over multivectors with 2^n independently computed coefficients. That means that one can use an n-dimensional GA for enabling parallel computation by converting data to multivectors (in a structure-preserving way) and then computing over 2^n pieces simultaneously. Among many other benefits, one compelling advantage of GA over CRT for parallel computing is that when using CRT for performing computation over n threads, one needs n distinct coprime moduli. Managing computation over arbitrarily large values of n therefore implies managing an arbitrarily large number of coprime moduli. In GA, a modulus is not even required. And if one wants to compute over finite sets, a single modulus (say, a single prime) is enough.

In this particular context, one can derive GA-based DCTs for parallel computing via data constructions allowing simultaneous computation over 2^n threads for arbitrarily large values of n.
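The following sketch illustrates why the coefficients of a multivector lend themselves to parallel computation. It makes several assumptions that are not part of this post: a Euclidean signature (every basis vector squares to +1), a bitmask encoding of basis blades (bit j set means the blade contains e_(j+1)), and hypothetical function names. Each output coefficient of the geometric product is computed by its own call that reads the inputs but shares no intermediate state with the others.

```python
def reorder_sign(a, b):
    """Sign from sorting the basis vectors of blade a followed by blade b."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps & 1 else 1


def product_coefficient(A, B, k, modulus=None):
    """Coefficient of basis blade k in the geometric product A * B.

    Depends only on A, B and k, so all 2^n coefficients can be
    computed independently of each other.
    """
    value = sum(reorder_sign(i, i ^ k) * A[i] * B[i ^ k] for i in range(len(A)))
    return value % modulus if modulus is not None else value


def geometric_product(A, B, modulus=None):
    """Sequential driver; each iteration could be dispatched to its own worker."""
    return [product_coefficient(A, B, k, modulus) for k in range(len(A))]


n = 3
size = 2 ** n  # 8 coefficients: 1, e1, e2, e12, e3, e13, e23, e123


def basis(index):
    """Multivector with a single unit coefficient at the given blade index."""
    return [1 if i == index else 0 for i in range(size)]


e1, e2 = basis(0b001), basis(0b010)
e12 = geometric_product(e1, e2)      # e1 * e2
e21 = geometric_product(e2, e1)      # e2 * e1 = -(e1 * e2)
v = [x + y for x, y in zip(e1, e2)]  # the vector e1 + e2
v_squared = geometric_product(v, v)  # scalar: squared length of v
```

Passing a single prime as modulus keeps all coefficients in a finite set, in line with the remark above that one modulus is enough.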

DCTs and The Shape of Data

Gunnar Carlsson [9] remarks that although Big Data has become a buzzword, its importance is not all about the "Big." Size is certainly interesting, but it does not capture the essence of the problem. Instead, the issue lies in the complexity of both the structure and the format of data. According to Carlsson, without an organizing principle, data can be difficult to model. The organizing principle proposed by Carlsson is that data has a shape, and the shape matters. This leads to an approach to data modeling based on topology, known as topological data analysis (TDA). The range of successful applications enabled by TDA is quite remarkable, including image processing, signal processing, sensors, viral evolution, bacteria classification, and feature selection, among many others.

TDA-based solutions are one of the most interesting instances of DCTs. Carlsson remarks that the shape of data should be the first thing one should analyze when working with data, and many times it will be the only thing that matters at all in analyzing data.

Perhaps one way of describing Carlsson's contribution is as a type of framework in which one can analyze special data constructions and process information, make decisions, and provide answers also in the form of special data constructions. In contrast with prior data analysis techniques, the output of TDA is no longer a set of equations; instead, it is a topological network.
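To give a flavor of what "shape" can mean computationally, the sketch below counts the connected components of a point cloud at a chosen scale, one of the simplest topological summaries. This is not Carlsson's method or a full TDA pipeline. The function name connected_components, the sample points, and the scale parameter eps are assumptions made for illustration.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def connected_components(points, eps):
    """Count clusters when points closer than eps are considered linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)  # merge the two groups

    return len({find(i) for i in range(len(points))})


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]

fine = connected_components(points, 0.5)    # every point is isolated
medium = connected_components(points, 1.5)  # two well-separated groups
coarse = connected_components(points, 20)   # everything merges
```

The same data yields different answers at different scales, and tracking how such features appear and disappear as the scale changes is one of the ideas that TDA builds on.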

The Nature of DCTs

There are many famous examples of DCTs, many more than I can cover in this post. But what is the nature of DCTs? There are many layers, flavors, facets, and nuances to it, and any attempt to provide a short definition could impose an inaccurate and undesired representation of the matter. Instead, we can try a simple exercise to capture the notion of DCTs.

Think about a function f that performs a well-defined computation: f takes x as input and outputs y. Obviously, if x is never provided, then neither will y, and the intended result of f won’t be realized. However, in this simple example, f is defined, and its functionality lies in f even if x does not occur. Now think about a function f where the functionality depends on

  • how x is constructed (e.g., as vector points, curves, clusters, distances, waves, grammar, words of a human language, commands of a programming language, etc.),
  • how x is processed (e.g., by plotting, compressing, inverting, reading, translating, executing, etc.),
  • how x is transmitted (e.g., in parallel, serial, partially, gradually, securely, teleporting, etc.), and
  • how x is stored (e.g., in digital databases, mobile devices, cloud, DNA, appended to external files, encrypted, using steganography, etc.).

In this second case, it is clear that f (the technology) is centered on x (the data). In other words, the decisions that led to how data is organized, represented, visualized, communicated, among other aspects, are the bases that define a technology, which then qualifies as a DCT.

Conclusion

There are many compelling reasons to justify careful, continuous, and perhaps audacious investigations into how to work with data. The way data is interpreted, formatted, displayed, transmitted, and stored, among other aspects at the core of data-centric technologies, can define entirely new architectures, technologies, and even markets. An immutable distributed ledger plays a central role in blockchain technologies; storing data in smart cards leads to a broader integration involving multiple applications and services; expressing data as independent components allows computation to be parallelized; and exploring topological features provides a general approach to data analysis. In the same way, data-centric technologies have the potential to significantly expand the universe of computational possibilities across a wide range of perspectives, from computational to economic efficiency.

References

[1] C. Araujo et al., "The future of ferroelectric memories," in 2000 IEEE International Solid-State Circuits Conference. Digest of Technical Papers (Cat. No.00CH37056), San Francisco, 2002.

[2] R. Rivest, A. Shamir and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120-126, 1978.

[3] Q. Liu, F. Ma, D. Tong and X. Cheng, "A regular parallel RSA processor," in The 2004 47th Midwest Symposium on Circuits and Systems, 2004. MWSCAS '04, Hiroshima, 2004.

[4] H. Nozaki, M. Motoyama, A. Shimbo and S. Kawamura, "Implementation of RSA Algorithm Based on RNS Montgomery Multiplication," Lecture Notes in Computer Science, vol. 2162, pp. 364-376, 2001.

[5] O. T. von Ramm, S. W. Smith and H. G. Pavy, "High-speed ultrasound volumetric imaging system. II. Parallel processing and image display," IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 38, no. 2, pp. 109-115, 1991.

[6] K. Hemming, R. Lilford and J. A. Girling, "Stepped-wedge cluster randomised controlled trials: a generic framework including parallel and multiple-level designs," Statistics in Medicine, vol. 34, pp. 181-196, 2014.

[7] D. Hestenes, Space-Time Algebra, Birkhäuser, 2015.

[8] D. Hildenbrand, Foundations of Geometric Algebra computing, vol. 1479, Springer, 2013, pp. 27-30.

[9] G. Carlsson, "The shape of data," MATHEMATICAL SOCIETY LECTURE NOTE SERIES, vol. 403, no. 1, pp. 16-44, 2013.

David Silva

Senior Research Scientist at Algemetric.
