Data exists in a time and a place. We are using an increasing number of applications that run on data that is both time-stamped and location-stamped. With the rise of Internet of Things (IoT) devices now being deployed, both dimensions of data come to the fore.
If we’re logging into a dashboard console that allows us to look at the data recorded by an occasionally connected wind turbine, or some other piece of industrial equipment, then knowing the time that any particular piece of data was created is important.
If our data estate spans a large number of turbines (or bridge sensors, or traffic monitors, or ruggedized safety devices carried by humans, etc.) spread out over a wide geographic area, then location-aware data is of particular importance, because where a piece of data originates is itself a factor in how significant that data is.
… and then came cloud computing
These core facts are compounded if we consider the way cloud computing is developing on a global scale, with ‘instances’ of cloud functionality placed in different datacenters around the planet. Once again we’re faced with even more time- and location-dependent data separation, all of which creates latency, i.e. a time lag between us asking for data (or, more often, our applications and databases making that request) and when we’re actually able to get it.
So what do modern cloud-native software application development and data science professionals do about this challenge? The problem is that, as these software engineering professionals grapple with manual workarounds to extend their applications into new geographies, they often create performance issues along the way.
A key technique used by cloud computing data architects to handle modern data access and management predicaments is data partitioning. The core proposition here is that partitioning data by location allows global organizations to tackle the latency issues caused by distributed data.
Jim Walker is VP of product marketing at database management systems company Cockroach Labs. Reminding us that IT latency is directly linked to the ‘experience’ an end user has with a product or service, Walker says that organizations today must be able to ingest, analyze and act on data in real time to provide the optimal user experience.
“The 100ms (milliseconds) Rule, coined by Paul Buchheit, creator of Gmail, refers to the human threshold of latency where interactions feel instantaneous. Above 100ms we humans start to sense a time lag. To put that into perspective, information traveling from one side of the world to the other adds about 250ms of latency and that’s only if it moved on the most direct path. Unfortunately, data doesn’t travel in a straight line, so the distance between your servers and web users matters,” said Walker.
Data hops, skips & jumps around the planet
But distance isn’t the only challenge. Light can travel from New York to San Francisco in about 14ms in a vacuum, but data doesn’t travel in a vacuum. It travels through multiple different network devices, and these ‘hops and skips’ also add latency: information that travels 100 miles but makes five hops can have more latency than a request that travels 2,500 miles with only two hops. This means the route taken is just as important as raw distance when optimizing how data travels.
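To make the arithmetic above concrete, here is a back-of-the-envelope latency model: propagation delay at vacuum light speed plus a fixed per-hop processing cost. The 5ms per-hop figure is purely an illustrative assumption (real per-hop delays vary enormously with queueing and load), chosen to show how a short path with many hops can lose to a long path with few hops.

```python
# Toy latency model: propagation delay plus a fixed per-hop cost.
# The per-hop cost of 5 ms is an illustrative assumption, not a
# measured value; in practice it varies with queueing and load.

SPEED_OF_LIGHT_KM_PER_MS = 299.792  # km per millisecond, in a vacuum
PER_HOP_COST_MS = 5.0               # assumed processing/queueing delay per hop

def one_way_latency_ms(distance_km: float, hops: int) -> float:
    """Propagation delay at vacuum light speed plus per-hop overhead."""
    return distance_km / SPEED_OF_LIGHT_KM_PER_MS + hops * PER_HOP_COST_MS

# New York -> San Francisco is roughly 4,130 km great-circle:
print(f"NY->SF, no hops: {one_way_latency_ms(4130, 0):.1f} ms")  # ~13.8 ms

# Under the assumed per-hop cost, 100 miles with five hops is slower
# than 2,500 miles with two hops:
short_path = one_way_latency_ms(161, 5)    # ~100 miles, five hops
long_path = one_way_latency_ms(4023, 2)    # ~2,500 miles, two hops
print(f"short path: {short_path:.1f} ms, long path: {long_path:.1f} ms")
```

Note that real fiber carries light at roughly two-thirds of its vacuum speed, so even the propagation term here is optimistic; the point is only that hop count and distance both matter.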
Due to these truisms, Cockroach Labs’ Walker insists that location needs to be the new driving and deciding vector by which we think about databases for modern applications and developers.
“As we move towards a more digitally savvy and instantaneous world, we need to shift away from a logical data model mindset to one that also recognizes the importance of the physical component – where you want to operate and where users will be. This becomes even more important when you factor in challenges of data privacy. Data needs to be closer to the user so that we can get it to them faster and meet the 100ms rule,” said Walker.
The ability to attach distributed data at the row-level to a geographical location, known as geo-partitioning, was developed by Google to meet latency requirements in globally dispersed environments. Walker says that this delivers a level of automation that allows the data team to decide where the data should physically reside while giving the administrator options to modify these requirements in production.
He notes that latency can be reduced by minimizing the distance between where queries are issued and where the data to satisfy those queries resides. We ‘simply’ alter a configuration and the database physically moves the data to where it needs to be. This means as an organization expands its business into new geographies, it need not necessarily incur downtime.
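The mechanics Walker describes can be sketched in miniature: rows carry a location attribute, a placement map decides which datacenter each row lives in, and ‘altering the configuration’ means editing that map, after which affected rows migrate. All names here (datacenters, the placement map) are hypothetical; real systems such as CockroachDB express this through database configuration rather than application code.

```python
# Minimal sketch of row-level geo-partitioning, assuming a simple
# region -> datacenter placement map. All datacenter names are
# hypothetical illustrations.

from collections import defaultdict

# The 'configuration' an administrator would alter in production:
placement = {"eu": "frankfurt-dc", "us": "virginia-dc", "apac": "singapore-dc"}

def partition_rows(rows):
    """Group rows by the datacenter their 'region' column maps to."""
    shards = defaultdict(list)
    for row in rows:
        shards[placement[row["region"]]].append(row)
    return dict(shards)

users = [
    {"id": 1, "region": "eu"},
    {"id": 2, "region": "us"},
    {"id": 3, "region": "eu"},
]
print(partition_rows(users))  # EU rows land in frankfurt-dc

# 'Simply altering the configuration': re-map the EU partition and the
# system would migrate the affected rows to the new location online.
placement["eu"] = "paris-dc"
print(partition_rows(users))  # EU rows now land in paris-dc
```

The design point is that data placement becomes declarative: the application keeps issuing the same queries while the database moves rows to satisfy the new policy, which is why expansion into a new geography need not mean downtime.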
“Data needs to automatically adapt to traffic patterns to reduce latency and needs to have high availability so that if one datacenter goes offline, there would be no lag in service as the data is stored in a secondary nearby datacenter that can quickly respond. Oftentimes, the distances involved in global deployments mean developers must always make a tradeoff between availability and latency. However, partitioning by location in the database enables developers to build highly available and low-latency applications. It seems like a luxury now, but when applications are expected to perform and automate at light speed, this will become a necessity,” reinforced Cockroach Labs’ Walker.
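The failover behavior Walker describes — serving from a nearby secondary when the primary datacenter goes offline — can be sketched as latency-aware routing over an ordered preference list. The datacenter names and the per-region proximity ordering below are hypothetical assumptions for illustration.

```python
# Sketch of latency-aware failover: each user region lists its
# datacenters in order of proximity, and requests go to the closest
# one that is currently online. Names are hypothetical.

NEAREST = {
    "eu": ["frankfurt-dc", "paris-dc", "virginia-dc"],
    "us": ["virginia-dc", "frankfurt-dc", "paris-dc"],
}

def route(region: str, online: set) -> str:
    """Return the closest online datacenter for a user region."""
    for dc in NEAREST[region]:
        if dc in online:
            return dc
    raise RuntimeError(f"no datacenter available for region {region}")

# Normal operation: EU traffic goes to its nearest datacenter.
print(route("eu", {"frankfurt-dc", "paris-dc", "virginia-dc"}))  # frankfurt-dc

# frankfurt-dc goes offline: traffic fails over to the nearby secondary
# rather than crossing the ocean, limiting the latency penalty.
print(route("eu", {"paris-dc", "virginia-dc"}))  # paris-dc
```

In this framing the availability/latency tradeoff becomes a ranking problem: failover always succeeds as long as any replica is online, and the latency cost of a failure is bounded by how close the next replica in the list sits.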
Latency responsibility for the 5G road ahead
It is important to note that providing experiences at light speed isn’t just about the speed of automation; it’s about being able to manage and control automation.
With the rise of 5G speeds, our notion of what real-time computing means could shift to a far more light-speed version of system performance. The speed at which our data can access the application layer, and the speed at which our application layer can access data, is going to become an even more pressing issue than it already is.
When it comes to hotels, houses and homesteads, it’s definitely still location, location, location — but in the world of carefully configured data architectures able to span globally dispersed deployment requirements, it’s more a question of location, location, partition.