Generic JDBC

This module provides a BinaryStore using JDBC. It is tested with H2, PostgreSQL and MariaDB only, but other database systems will probably work as well. The module doesn’t include any JDBC driver; you need to add the one you want to use to your project.

The approach is as follows: the byte stream is split into chunks (the size can be configured) and each chunk is stored in a blob column, associated with the same binary id.

Table Structure

Since the characteristics of the blob datatype differ between databases, you might want to define the table yourself, or use the provided setup for PostgreSQL or MariaDB. See CreateDataTable for the table definitions. Here is the definition for PostgreSQL:

CREATE TABLE IF NOT EXISTS "file_chunk" (
  "file_id" varchar(254) not null,
  "chunk_nr" int not null,
  "chunk_len" int not null,
  "chunk_data" bytea not null,
  primary key ("file_id", "chunk_nr")
)

This makes range requests efficient, since it is possible to calculate at which chunk to start (and, if applicable, at which to end).
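For illustration, the chunk indices covering a byte range can be computed with simple integer division. This is a sketch only; the names below are illustrative and not part of the binny API.

```scala
// Given the configured chunk size and a requested byte range,
// compute which chunks must be fetched. Names are illustrative,
// not part of the binny API.
def chunksForRange(chunkSize: Int, offset: Long, length: Long): (Long, Long) = {
  val firstChunk = offset / chunkSize                // chunk containing the first byte
  val lastChunk = (offset + length - 1) / chunkSize  // chunk containing the last byte
  (firstChunk, lastChunk)
}

// With 512K chunks, bytes 1000000 to 1500000 live in chunks 1 and 2,
// so only those two rows need to be read.
chunksForRange(512 * 1024, 1000000L, 500001L)
```

Only the rows with `chunk_nr` in that range are selected, which is why the composite primary key on (`file_id`, `chunk_nr`) matters.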

The table names are just examples; they can be specified when creating a store.

Chunk size when storing and streaming

A caveat here is that the chunk size used to store a file also determines the amount of memory used when reading it. It is not possible to store a file in 512K chunks and then load it in 10K chunks, for example. The reason is that many JDBC drivers (at least those I know) don’t support streaming from a blob: you get the whole blob in a byte array anyway. So the chunk size cannot be changed after a file has been stored. When streaming, the blob of each row is loaded into memory, one at a time; its size defines the amount of memory used to stream a file.

If you’re using PostgreSQL, consider the pglo module which doesn’t have this restriction.

Usage

You need to provide a javax.sql.DataSource; how to create one is out of scope for this project. A JdbcStoreConfig is also required, which defines settings such as the table name and the chunk size to use.

For the examples here, an in-memory database (H2) is used.

import binny._
import binny.util.Logger
import binny.jdbc._
import binny.ExampleData._
import cats.effect.IO
import cats.effect.unsafe.implicits._

val dataSource = ConnectionConfig.h2Memory("docs").dataSource
// dataSource: javax.sql.DataSource = ds0: url=jdbc:h2:mem:docs;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;DB_CLOSE_DELAY=-1 user=sa
implicit val logger = Logger.silent[IO]
// logger: Logger[IO] = binny.util.Logger$$anon$2@214738c1
val store = GenericJdbcStore.default[IO](dataSource, Logger.silent[IO])
// store: GenericJdbcStore[IO] = binny.jdbc.GenericJdbcStore$Impl@623dc959

// creates the schema, the table name is same as in the default config
DatabaseSetup.runData[IO](Dbms.H2, dataSource, "file_chunk").unsafeRunSync()
// res0: Int = 0

val someData = ExampleData.file2M
// someData: Binary[IO] = Stream(..)
val id = someData.through(store.insert)
  .compile.lastOrError.unsafeRunSync()
// id: BinaryId = BinaryId(8kAic6mb33RzWps9LAtHceyuhv74qmTonjdsVZEDpvK3)

// get the file out
store.findBinary(id, ByteRange.All).getOrElse(sys.error("not found"))
  .flatMap(binary => binary.readUtf8String)
  .unsafeRunSync()
  .take(50)
// res1: String = """hello world 1
// hello world 2
// hello world 3
// hello wo"""

JdbcBinaryStore

As seen above, the store is an instance of the trait JdbcBinaryStore. It extends BinaryStore and adds a findBinaryStateful method. The “state” refers to the database connection.

The default findBinary method uses one connection per chunk. This frees resources each time the stream is pulled; otherwise timeouts could occur, for example if the stream is not pulled for a while. When a network client consumes the stream over a slow connection, reading one chunk takes a while, which could lead to the connection being closed (by the pool or the server).

However, if you know that only small files are processed, or that the data is consumed quickly, it is possible to stream the whole file using a single connection, which is faster. This is provided by findBinaryStateful.
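Assuming findBinaryStateful mirrors the findBinary signature used earlier (an assumption; check the JdbcBinaryStore scaladoc for the exact signature), using it looks the same, continuing with the store and id from the example above:

```scala
// Same usage as findBinary above, but the whole file is read
// over a single database connection. The signature is assumed
// to mirror findBinary(id, range).
store.findBinaryStateful(id, ByteRange.All)
  .getOrElse(sys.error("not found"))
  .flatMap(binary => binary.readUtf8String)
  .unsafeRunSync()
  .take(50)
```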

ChunkedBinaryStore

The GenericJdbcStore also implements ChunkedBinaryStore, which allows storing chunks independently. This is useful if chunks are received in random order and the whole file is not available as a complete stream.

However, in order to use this, the complete size of the file must be known up front. This is required to determine when the last chunk has been received.
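The arithmetic behind this is straightforward: the total size and the chunk size together determine how many chunks exist, and hence which chunk number is the last one. A sketch (names are illustrative, not part of the binny API):

```scala
// When the total file size is known up front, the number of chunks,
// and hence the index of the last chunk, can be derived.
// Names are illustrative, not part of the binny API.
def totalChunks(fileSize: Long, chunkSize: Int): Long =
  (fileSize + chunkSize - 1) / chunkSize // ceiling division

def isLastChunk(chunkNr: Long, fileSize: Long, chunkSize: Int): Boolean =
  chunkNr == totalChunks(fileSize, chunkSize) - 1

// A 2M file stored in 512K chunks consists of 4 chunks,
// so chunk_nr 3 (zero-based) is the last one.
totalChunks(2L * 1024 * 1024, 512 * 1024)
```

Without the total size, the store could not tell whether a chunk arriving out of order completes the file or whether more chunks are still pending.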