Hard drives and NAND flash memory can store far more data than they could just a few years ago, but they still have nothing on DNA. The genetic material in nearly every cell of your body has a vastly higher storage density than any hard drive, and it could potentially last for hundreds of thousands of years. The problem has been efficiently encoding data in DNA. Now a pair of researchers from Columbia University and the New York Genome Center has developed a process for storing 214 petabytes of data per gram of DNA.
The DNA in our cells contains the instructions for building all the proteins that keep us running. DNA is made up of sequences of four nucleotide bases: adenine, guanine, cytosine, and thymine (A, G, C, and T), which pair up across the double helix to form base pairs. Each sequence of three bases translates to an amino acid, one of the building blocks of proteins. It's data storage, just like what we do with hard drives, but with much higher potential density.
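The analogy can be made concrete with a toy mapping (an illustration only, not the researchers' actual scheme): since there are four bases, each base can carry two bits of data.

```python
# Toy illustration: storing binary data as DNA bases, 2 bits per base.
# The mapping 00->A, 01->C, 10->G, 11->T is an arbitrary choice for this sketch.
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_FOR_BASE = {base: bits for bits, base in BASE_FOR_BITS.items()}

def bytes_to_dna(data: bytes) -> str:
    # Flatten the bytes into a bit string, then map each 2-bit chunk to a base.
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE_FOR_BITS[bits[i:i+2]] for i in range(0, len(bits), 2))

def dna_to_bytes(seq: str) -> bytes:
    # Reverse the mapping: bases back to bits, bits back to bytes.
    bits = "".join(BITS_FOR_BASE[base] for base in seq)
    return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

print(bytes_to_dna(b"Hi"))  # two bytes -> 16 bits -> 8 bases: CAGACGGC
```

At 2 bits per base this naive mapping already beats magnetic media on density; the hard part, as the article explains next, is doing it in a way that survives real-world synthesis and sequencing errors.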
The first time scientists wrote and read digital data in DNA, they managed an effective capacity of 1.28 petabytes per gram. That's nice and all, but Yaniv Erlich and Dina Zielinski improved on it more than a hundredfold. They successfully encoded a full computer operating system, the 1895 French film "Arrival of a Train at La Ciotat," a $50 Amazon gift card, a computer virus, a Pioneer plaque, and a 1948 study by information theorist Claude Shannon. The key was not in producing the DNA, but in how the data was split up and encoded in the first place.
Erlich and Zielinski call their process a "DNA Fountain." First, all the files were compressed into a single master archive. An algorithm then split the archive's binary code into short strings of digits. When translating the binary into base sequences, the algorithm drops nucleotide sequences that are more likely to cause read errors and swaps in alternatives. Each bundle of strings is called a droplet, and each droplet carries a barcode in its sequence that tells the researchers where it fits when the file is reassembled.
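The droplet idea can be sketched in miniature. The toy Python below is not the authors' code: the segment size, two-byte seed, uniform degree distribution, and homopolymer-run threshold are all simplifications (the real DNA Fountain draws from a robust soliton distribution and also screens for extreme GC content). It shows the core loop: XOR a seeded random subset of data segments into a droplet, tag it with the seed as a barcode, convert to bases, and discard any strand with an error-prone run of repeated bases.

```python
import random

SEG_SIZE = 4   # bytes per segment (toy value; real segments are longer)
BASES = "ACGT"

def to_dna(data: bytes) -> str:
    # 2 bits per base: 00->A, 01->C, 10->G, 11->T
    bits = "".join(f"{b:08b}" for b in data)
    return "".join(BASES[int(bits[i:i+2], 2)] for i in range(0, len(bits), 2))

def has_long_run(dna: str, run: int = 5) -> bool:
    # Long single-base runs (homopolymers) are harder to synthesize and
    # sequence accurately, so strands containing them get screened out.
    return any(dna[i:i+run] == dna[i] * run for i in range(len(dna) - run + 1))

def make_droplet(segments, seed: int) -> bytes:
    # The seed doubles as the droplet's barcode: replaying the same RNG on
    # the decoding side recovers which segments were XORed together.
    rng = random.Random(seed)
    chosen = rng.sample(range(len(segments)), rng.randint(1, len(segments)))
    payload = bytearray(SEG_SIZE)
    for i in chosen:
        for j, byte in enumerate(segments[i]):
            payload[j] ^= byte
    return seed.to_bytes(2, "big") + bytes(payload)

data = b"hello world, DNA"  # 16 bytes -> 4 segments
segments = [data[i:i+SEG_SIZE] for i in range(0, len(data), SEG_SIZE)]

droplets, seed = [], 0
while len(droplets) < 8:        # oversample: emit more droplets than segments
    d = make_droplet(segments, seed)
    seed += 1
    if not has_long_run(to_dna(d)):   # drop error-prone strands, try a new seed
        droplets.append(d)

print(len(droplets), "droplets kept")
```

Because any fresh seed yields another valid droplet, rejected strands cost nothing: the encoder simply keeps drawing until it has enough clean droplets to reconstruct every segment.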
The researchers ended up with 72,000 DNA strands containing the encoded data. The sequences were sent to Twist Bioscience, a San Francisco company that generates synthetic DNA from a provided sequence. A few weeks later, Erlich and Zielinski received a vial containing the molecules they had encoded. To read the data back, they used standard DNA sequencing technology, then software that reverses the encoding process. They were left with the original files, all perfectly intact.
DNA data storage at this density could have a number of benefits. As mentioned above, DNA can last a long time, and we'll still be able to read it in a hundred years; compare that with how hard it already is to pull data off an antique 5.25-inch floppy disk. Erlich jokes that if we ever lose the ability to read DNA, we'll have bigger problems than data retrieval. Cost is still an issue, though. The team spent $7,000 creating the DNA archive and another $2,000 reading it, and they needed a lot of advanced equipment. Still, DNA storage might soon make sense as a method of cold storage for large volumes of vital information.