Reliable storage in the age of big data

Most people might think of computer storage, like an external hard drive or flash drive, as something static that you can leave sitting somewhere indefinitely without issue. In reality, that’s not the case. For instance, data is stored in flash drives by capturing electrons in cells, and over time those electrons can escape for a variety of reasons, creating errors in the data or corrupting the data altogether.

Dr. Anxiao (Andrew) Jiang, associate professor in the Department of Computer Science and Engineering at Texas A&M University, said even the ambient room temperature can affect data storage.

“For example, for flash memory, every read or write of data will disturb many memory cells, changing the amount of electrons in them,” Jiang said. “Even if we do nothing and just leave the flash memory alone, as time goes on, the electrons in memory cells will still leak away which is a big problem for long-term storage of data. If the temperature becomes high, such as under sunshine in the summer, the electrons will leak even faster. So, errors are very common in stored data; and the longer data are stored, the more errors there will be.”

This makes reliable and secure long-term data storage difficult, especially in an era when individuals and institutions are creating massive amounts of data all the time. To solve this problem, Jiang is combining coding theory with advances in deep learning, a fast-developing area of artificial intelligence, to develop a new way of securely storing data.

This is possible because text, images, video and audio files found in data are related to each other in complex ways.

“Such complex relations also mean the data has a lot of structures, which is mathematically equivalent to redundancy and can be used to correct errors,” Jiang said.

In coding theory, redundant bits are added to the original data so that the stored bits are not fully independent of one another. The simplest form of redundancy is repetition. For example, if someone repeats the same message three times, those three copies are identical. If any one of those three copies has errors, the remaining two copies will show what the correct message should be.
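The three-copy example can be sketched in a few lines of Python. This is only an illustration of the repetition idea described above, not the scheme used in the research; the function names `encode_repetition` and `decode_majority` are made up for this sketch.

```python
def encode_repetition(bits, copies=3):
    """Store the same message several times (illustrative helper)."""
    return [list(bits) for _ in range(copies)]


def decode_majority(copies):
    """Recover each bit by majority vote across the stored copies."""
    return [1 if sum(column) * 2 > len(copies) else 0 for column in zip(*copies)]


message = [1, 0, 1, 1, 0]
stored = encode_repetition(message)

# Simulate errors in one copy only: flip two of its bits.
stored[1][0] ^= 1
stored[1][3] ^= 1

recovered = decode_majority(stored)
print(recovered == message)
```

Because the other two copies still agree on every bit, the vote restores the original message. The price is storing three times as much data.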

“In real data the redundancy can take on more complex forms, such as a text that talks about "raining" may also talk about "umbrella" or other related things, or the bits in data may satisfy some mathematical equation, but the principle is the same – once we know the bits in data are dependent on each other in some way, we can use that knowledge to correct errors,” Jiang said.
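A toy sketch can make this idea concrete. It assumes that the stored text only uses words from a small known vocabulary, and it repairs a damaged word by picking the vocabulary word that differs from it in the fewest characters. The vocabulary, the function names and the matching rule are all assumptions made for illustration; the research described here relies on deep learning rather than a fixed word list.

```python
# Hypothetical vocabulary standing in for "knowledge about the data".
VOCABULARY = ["raining", "umbrella", "weather", "forecast"]


def count_differences(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def correct_word(word, vocabulary=VOCABULARY):
    """Replace a possibly corrupted word with the closest known word."""
    candidates = [v for v in vocabulary if len(v) == len(word)]
    if not candidates:
        return word  # nothing comparable is known, so leave the word as-is
    return min(candidates, key=lambda v: count_differences(word, v))


print(correct_word("umbrelka"))
print(correct_word("rainibg"))
```

No extra bits were stored to make this correction possible. The redundancy comes from the fact that most strings of letters are not valid words.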

Jiang believes integrating coding theory with artificial intelligence can lead to the development of new error correction algorithms which can be implemented in storage systems.

Previously, researchers believed the key to safely storing data was to add redundancy using a mathematical tool called error-correcting codes, which can correct up to a certain number of errors. Unfortunately, this solution has limitations.

A problem arises when errors accumulate in the long term and the number of errors in a file exceeds what the error correcting code can correct. At that point, the loss of the file becomes inevitable.
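This limit can be seen with a small, well-known code. The sketch below uses the Hamming(7,4) code, chosen here purely as an illustration since the article does not name a specific code. It stores 4 data bits in 7 bits and can correct any single flipped bit, but a second flipped bit makes the decoder return the wrong data.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word):
    """Correct at most one flipped bit, then return the 4 data bits."""
    b = list(word)
    s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
    s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
    s4 = b[3] ^ b[4] ^ b[5] ^ b[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based index of the suspected error
    if position:
        b[position - 1] ^= 1
    return [b[2], b[4], b[5], b[6]]


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

one_error = list(codeword)
one_error[5] ^= 1  # a single flipped bit is within the code's capacity

two_errors = list(codeword)
two_errors[0] ^= 1  # two flipped bits exceed it
two_errors[1] ^= 1

print(hamming74_decode(one_error) == data)
print(hamming74_decode(two_errors) == data)
```

The first check succeeds and the second fails: with two errors the decoder "corrects" the wrong bit and silently returns different data.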

Although adding more redundancy and periodically attempting to remove errors from files are still options, they are challenging and costly.

Jiang and his collaborators, Dr. Krishna Narayanan, Eric D. Rubin ’06 Professor in the Department of Electrical and Computer Engineering at Texas A&M, and Dr. Jehoshua Bruck, Gordon and Betty Moore Professor at the California Institute of Technology, have found a new source of redundancy in the stored files themselves that can be used to correct errors and store data soundly.

Why it’s important

Ensuring that all data is stored reliably in computer systems affects everyone. This study aims to create a new tool for that grand goal. With new advances in artificial intelligence, especially deep learning, Jiang believes they may have the right tools to take on this challenge.

The primary goal of this project is to reveal how much redundancy exists in every type of data and how much of it is usable for error correction.

The second goal is to use the natural redundancy found in the data itself to realize highly reliable long-term data storage. Storage systems can be made more reliable by building them with the ability to understand the data they store.

The results of their work, titled “Error Correction with Natural Redundancy,” can be used by various storage systems, such as cloud storage built with nonvolatile memories. These memories are very fast but suffer from various types of errors.

The research team was recently awarded a grant from the National Science Foundation for this project on natural redundancy.