Generic JDBC
This module provides a BinaryStore using JDBC. It is tested with H2,
PostgreSQL and MariaDB only, but other database systems will probably
work as well. The module doesn’t include any JDBC driver; you need to
add the one you want to use to your project.
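For example, the PostgreSQL driver can be added like this in sbt (the version shown is just an example; pick a current one):

// any JDBC driver works; the PostgreSQL one is shown as an example
libraryDependencies += "org.postgresql" % "postgresql" % "42.7.3"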
The approach is as follows: the byte stream is split into chunks (the size can be configured) and each chunk is stored in a blob column, associated with the same binary id.
Table Structure
Since blob datatype characteristics differ between databases, you
might want to define the table yourself or use the provided setup for
PostgreSQL or MariaDB. See CreateDataTable for the table definitions.
Here is the one for PostgreSQL:
CREATE TABLE IF NOT EXISTS "file_chunk" (
  "file_id" varchar(254) not null,
  "chunk_nr" int not null,
  "chunk_len" int not null,
  "chunk_data" bytea not null,
  primary key ("file_id", "chunk_nr")
)
This makes range requests efficient, since it is possible to calculate at which chunk to start (and, if applicable, at which to end).
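To illustrate (this is not binny’s actual code, just the arithmetic it relies on): given the chunk size used at write time, the first and last row of a byte range can be computed directly.

// illustration only: the arithmetic behind seeking to a byte range
val chunkSize = 512 * 1024    // chunk size used when the file was stored
val offset = 3L * 1000 * 1000 // first requested byte
val count = 800 * 1024        // number of requested bytes

val firstChunk = (offset / chunkSize).toInt    // chunk_nr of the first row
val skipInFirst = (offset % chunkSize).toInt   // bytes to drop from that row
val lastChunk = ((offset + count - 1) / chunkSize).toInt // chunk_nr of the last row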
The table name is just an example; it can be specified when creating a store.
Chunk size when storing and streaming
A caveat here is that the chunk size used to store a file also determines the amount of memory used when reading it. It is not possible to store a file in 512K chunks and then load it in 10K chunks, for example. The reason is that many JDBC drivers (at least those I know of) don’t support streaming from a blob; you get the whole blob in a byte array anyway. So the chunk size cannot be changed after a file has been stored. When streaming, the blob of each row is loaded into memory, one at a time, and its size defines the amount of memory used to stream a file.
If you’re using PostgreSQL, consider the pglo module, which doesn’t have this restriction.
Usage
You need to provide a javax.sql.DataSource; how to create one is out
of scope for this project. A JdbcStoreConfig is also required, which
defines settings like the table name and the chunk size to use.
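For instance, a config naming the table from above with a 256K chunk size might look like this; the constructor arguments are assumptions based on the description above, so check JdbcStoreConfig for the actual fields:

// field names are assumptions; see JdbcStoreConfig for the actual API
val config = JdbcStoreConfig("file_chunk", chunkSize = 256 * 1024)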
For the examples here, an in-memory database (H2) is used.
import binny._
import binny.util.Logger
import binny.jdbc._
import binny.ExampleData._
import cats.effect.IO
import cats.effect.unsafe.implicits._
val dataSource = ConnectionConfig.h2Memory("docs").dataSource
// dataSource: javax.sql.DataSource = ds0: url=jdbc:h2:mem:docs;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;DB_CLOSE_DELAY=-1 user=sa
implicit val logger = Logger.silent[IO]
// logger: Logger[IO] = binny.util.Logger$$anon$2@214738c1
val store = GenericJdbcStore.default[IO](dataSource, Logger.silent[IO])
// store: GenericJdbcStore[IO] = binny.jdbc.GenericJdbcStore$Impl@623dc959
// creates the schema; the table name is the same as in the default config
DatabaseSetup.runData[IO](Dbms.H2, dataSource, "file_chunk").unsafeRunSync()
// res0: Int = 0
val someData = ExampleData.file2M
// someData: Binary[IO] = Stream(..)
val id = someData.through(store.insert)
.compile.lastOrError.unsafeRunSync()
// id: BinaryId = BinaryId(8kAic6mb33RzWps9LAtHceyuhv74qmTonjdsVZEDpvK3)
// get the file out
store.findBinary(id, ByteRange.All).getOrElse(sys.error("not found"))
.flatMap(binary => binary.readUtf8String)
.unsafeRunSync()
.take(50)
// res1: String = """hello world 1
// hello world 2
// hello world 3
// hello wo"""
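Besides ByteRange.All, a sub-range of the file can be requested. Assuming the ByteRange.Chunk(offset, length) variant, reading a slice from the middle of the file looks the same:

// read 20 bytes starting at offset 14 (a sketch; ByteRange.Chunk is
// assumed to take the starting offset and the number of bytes)
store.findBinary(id, ByteRange.Chunk(14, 20))
  .getOrElse(sys.error("not found"))
  .flatMap(_.readUtf8String)
  .unsafeRunSync()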
JdbcBinaryStore
As seen above, the store is an instance of the trait JdbcBinaryStore.
It extends BinaryStore to add a findBinaryStateful method. The “state”
relates to the connection to the database.
The default findBinary method uses one connection per chunk. This
allows resources to be freed each time the stream is pulled; otherwise
timeouts could occur, for example if the stream is not pulled for a
while. When a network client consumes the stream over a slow
connection, reading one chunk takes a while, which could lead to the
connection being closed (by the pool or the server).
However, if you know that only small files are processed, or that the
data is consumed quickly, it is possible to stream the whole file over
a single connection, which is faster. This is provided by
findBinaryStateful.
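Assuming findBinaryStateful mirrors the findBinary signature, the earlier lookup becomes:

// same lookup as above, but the whole file is streamed over one connection
store.findBinaryStateful(id, ByteRange.All)
  .getOrElse(sys.error("not found"))
  .flatMap(_.readUtf8String)
  .unsafeRunSync()
  .take(50)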
ChunkedBinaryStore
GenericJdbcStore also implements ChunkedBinaryStore, which allows
storing chunks independently. This is useful when chunks are received
in random order and the whole file is not available as one complete
stream. However, the complete size of the file must be known up front;
this is needed to detect when the last chunk has been received.
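As a rough sketch, storing a chunk as it arrives could look like the following. Treat the exact names and signatures (insertChunk, ChunkDef.fromTotal, Hint.none) as assumptions and consult the ChunkedBinaryStore trait for the actual API:

import scodec.bits.ByteVector

// sketch only: method and parameter names are assumptions, see the trait
def storeChunk(
    chunked: ChunkedBinaryStore[IO],
    id: BinaryId,
    index: Int,       // position of this chunk within the file
    totalChunks: Int, // known up front, derived from the complete file size
    data: ByteVector
) =
  chunked.insertChunk(id, ChunkDef.fromTotal(index, totalChunks), Hint.none, data)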