API reference
This chapter contains detailed API documentation for HappyBase. It is suggested to read the user guide first to get a general idea about how HappyBase works.
The HappyBase API is organised as follows:
- Connection: The Connection class is the main entry point for application developers. It connects to the HBase Thrift server and provides methods for table management.
- Table: The Table class is the main class for interacting with data in tables. This class offers methods for data retrieval and data manipulation. Instances of this class can be obtained using the Connection.table() method.
- Batch: The Batch class implements the batch API for data manipulation, and is available through the Table.batch() method.
- ConnectionPool: The ConnectionPool class implements a thread-safe connection pool that allows an application to (re)use multiple connections.
Connection
class happybase.Connection(host='localhost', port=9090, timeout=None, autoconnect=True, table_prefix=None, table_prefix_separator='_', compat='0.96', transport='buffered', protocol='binary')
Connection to an HBase Thrift server.
The host and port arguments specify the host name and TCP port of the HBase Thrift server to connect to. If omitted or None, a connection to the default port on localhost is made. If specified, the timeout argument specifies the socket timeout in milliseconds.
If autoconnect is True (the default), the connection is opened immediately; otherwise Connection.open() must be called explicitly before first use.
The optional table_prefix and table_prefix_separator arguments specify a prefix and a separator string to be prepended to all table names, e.g. when Connection.table() is invoked. For example, if table_prefix is myproject, all tables will have names like myproject_XYZ.
The optional compat argument sets the compatibility level for this connection. Older HBase versions have slightly different Thrift interfaces, and using the wrong protocol can lead to crashes caused by communication errors, so make sure to use the correct one. This value must be one of the strings 0.90, 0.92, 0.94, or 0.96 (the default).
The optional transport argument specifies the Thrift transport mode to use. Supported values for this argument are buffered (the default) and framed. Make sure to choose the right one, since otherwise you might see non-obvious connection errors or program hangs when making a connection. HBase versions before 0.94 always use the buffered transport. Starting with HBase 0.94, the Thrift server optionally uses a framed transport, depending on the argument passed to the hbase-daemon.sh start thrift command. The default -threadpool mode uses the buffered transport; the -hsha, -nonblocking, and -threadedselector modes use the framed transport.
The optional protocol argument specifies the Thrift transport protocol to use. Supported values for this argument are binary (the default) and compact. Make sure to choose the right one, since otherwise you might see non-obvious connection errors or program hangs when making a connection. TCompactProtocol is a more compact binary format that is typically more efficient to process as well. TBinaryAccelerated is the default protocol that happybase uses.
New in version 0.9: protocol argument
New in version 0.5: timeout argument
New in version 0.4: table_prefix_separator argument
New in version 0.4: support for framed Thrift transports
Parameters: - host (str) – The host to connect to
- port (int) – The port to connect to
- timeout (int) – The socket timeout in milliseconds (optional)
- autoconnect (bool) – Whether the connection should be opened directly
- table_prefix (str) – Prefix used to construct table names (optional)
- table_prefix_separator (str) – Separator used for table_prefix
- compat (str) – Compatibility mode (optional)
- transport (str) – Thrift transport mode (optional)
- protocol (str) – Thrift transport protocol (optional)
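The relationship between the Thrift server mode and the transport argument described above can be captured in a small helper. This is only a sketch: the function name connection_kwargs and the host name are made up for illustration, and HappyBase itself is not imported so that the snippet runs without a server.

```python
# Hypothetical helper (not part of HappyBase): pick the ``transport``
# value that matches the mode the Thrift server was started in.
FRAMED_MODES = {"hsha", "nonblocking", "threadedselector"}


def connection_kwargs(host, server_mode="threadpool", table_prefix=None):
    """Build keyword arguments intended for happybase.Connection()."""
    mode = server_mode.lstrip("-")
    if mode == "threadpool":
        transport = "buffered"
    elif mode in FRAMED_MODES:
        transport = "framed"
    else:
        raise ValueError("unknown Thrift server mode: %r" % server_mode)
    kwargs = {"host": host, "port": 9090, "transport": transport}
    if table_prefix is not None:
        kwargs["table_prefix"] = table_prefix
    return kwargs


# "hbase.example.org" is a placeholder host name.
kwargs = connection_kwargs("hbase.example.org", "-hsha", table_prefix="myproject")
print(kwargs)
```

With HappyBase installed and a Thrift server running, the result could be passed on as happybase.Connection(**kwargs).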
close()
Close the underlying transport to the HBase instance.
This method closes the underlying Thrift transport (TCP connection).
compact_table(name, major=False)
Compact the specified table.
Parameters: - name (str) – The table name
- major (bool) – Whether to perform a major compaction.
create_table(name, families)
Create a table.
Parameters: - name (str) – The table name
- families (dict) – The name and options for each column family
The families argument is a dictionary mapping column family names to a dictionary containing the options for this column family, e.g.
    families = {
        'cf1': dict(max_versions=10),
        'cf2': dict(max_versions=1, block_cache_enabled=False),
        'cf3': dict(),  # use defaults
    }
    connection.create_table('mytable', families)
These options correspond to the ColumnDescriptor structure in the Thrift API, but note that the names should be provided in Python style, not in camel case notation, e.g. time_to_live, not timeToLive. The following options are supported:
- max_versions (int)
- compression (str)
- in_memory (bool)
- bloom_filter_type (str)
- bloom_filter_vector_size (int)
- bloom_filter_nb_hashes (int)
- block_cache_enabled (bool)
- time_to_live (int)
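A common pattern is to create a table only when it does not exist yet. The sketch below combines Connection.tables() and Connection.create_table() for that purpose. The helper name ensure_table is hypothetical, and the bytes handling is an assumption about the installed HappyBase and Python versions, which may return table names as either bytes or str.

```python
def ensure_table(connection, name, families):
    """Create ``name`` unless it already exists; return True if created."""
    # Assumption: tables() may yield bytes or str depending on the version,
    # so normalise to str before comparing.
    existing = {t.decode() if isinstance(t, bytes) else t
                for t in connection.tables()}
    if name in existing:
        return False
    connection.create_table(name, families)
    return True


# Option names use Python style (max_versions), not camel case.
families = {
    "cf1": dict(max_versions=10),
    "cf2": dict(),  # use defaults
}
```

Given an open connection, ensure_table(connection, 'mytable', families) would then be safe to call repeatedly.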
delete_table(name, disable=False)
Delete the specified table.
New in version 0.5: disable argument
In HBase, a table always needs to be disabled before it can be deleted. If the disable argument is True, this method first disables the table if it wasn’t already and then deletes it.
Parameters: - name (str) – The table name
- disable (bool) – Whether to first disable the table if needed
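The two ways of dropping a table implied above can be sketched as follows. Both helper names are hypothetical; the second spells out what the disable argument saves you from writing, using only methods documented in this chapter.

```python
def drop_table(connection, name):
    """Delete ``name`` whether or not it is currently enabled."""
    # A single call: the table is disabled first if that is still needed.
    connection.delete_table(name, disable=True)


def drop_table_explicit(connection, name):
    """Same intent, written out with the individual methods."""
    if connection.is_table_enabled(name):
        connection.disable_table(name)
    connection.delete_table(name)
```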
disable_table(name)
Disable the specified table.
Parameters: name (str) – The table name
enable_table(name)
Enable the specified table.
Parameters: name (str) – The table name
is_table_enabled(name)
Return whether the specified table is enabled.
Parameters: name (str) – The table name
Returns: whether the table is enabled
Return type: bool
open()
Open the underlying transport to the HBase instance.
This method opens the underlying Thrift transport (TCP connection).
table(name, use_prefix=True)
Return a table object.
Returns a happybase.Table instance for the table named name. This does not result in a round-trip to the server, and the table is not checked for existence.
The optional use_prefix argument specifies whether the table prefix (if any) is prepended to the specified name. Set this to False if you want to use a table that resides in another ‘prefix namespace’, e.g. a table from a ‘friendly’ application co-hosted on the same HBase instance. See the table_prefix argument to the Connection constructor for more information.
Parameters: - name (str) – the name of the table
- use_prefix (bool) – whether to use the table prefix (if any)
Returns: Table instance
Return type: happybase.Table
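The naming rule behind table_prefix, table_prefix_separator and use_prefix can be modelled in a few lines. This is a stand-alone illustration of the documented behaviour, not HappyBase's own code; the function name full_table_name is made up.

```python
def full_table_name(name, table_prefix=None, table_prefix_separator="_",
                    use_prefix=True):
    """Model of the documented rule for building the real table name."""
    if table_prefix is None or not use_prefix:
        return name
    return table_prefix + table_prefix_separator + name


# Table inside the application's own prefix namespace.
own = full_table_name("XYZ", table_prefix="myproject")
# Table of a co-hosted application, addressed without the prefix.
shared = full_table_name("XYZ", table_prefix="myproject", use_prefix=False)
print(own, shared)
```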
tables()
Return a list of table names available in this HBase instance.
If a table_prefix was set for this Connection, only tables that have the specified prefix will be listed.
Returns: The table names
Return type: List of strings
Table
class happybase.Table(name, connection)
HBase table abstraction class.
This class cannot be instantiated directly; use Connection.table() instead.
batch(timestamp=None, batch_size=None, transaction=False, wal=True)
Create a new batch operation for this table.
This method returns a new Batch instance that can be used for mass data manipulation. The timestamp argument applies to all puts and deletes on the batch.
If given, the batch_size argument specifies the maximum batch size after which the batch should send the mutations to the server. By default this is unbounded.
The transaction argument specifies whether the returned Batch instance should act in a transaction-like manner when used as a context manager in a with block of code. The transaction flag cannot be used in combination with batch_size.
The wal argument determines whether mutations should be written to the HBase Write Ahead Log (WAL). This flag can only be used with recent HBase versions. If specified, it provides a default for all the put and delete operations on this batch. This default value can be overridden for individual operations using the wal argument to Batch.put() and Batch.delete().
New in version 0.7: wal argument
Parameters: - transaction (bool) – whether this batch should behave like a transaction (only useful when used as a context manager)
- batch_size (int) – batch size (optional)
- timestamp (int) – timestamp (optional)
- wal (bool) – whether to write to the WAL (optional)
Returns: Batch instance
Return type: happybase.Batch
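A typical use of batch() is as a context manager with a bounded batch size. The sketch below assumes a table object obtained from Connection.table(); the helper name store_rows, the row keys and the column names are invented for illustration, and byte strings are used on the assumption that a Python 3 release of HappyBase is installed.

```python
def store_rows(table, rows, batch_size=1000):
    """Write an iterable of (row_key, data) pairs through one batch."""
    # Mutations are flushed every ``batch_size`` puts; leaving the
    # ``with`` block sends whatever is still pending.
    with table.batch(batch_size=batch_size) as batch:
        for row_key, data in rows:
            batch.put(row_key, data)


# Hypothetical sample data: column names are family:qualifier.
rows = [
    (b"row-1", {b"cf1:col1": b"value1"}),
    (b"row-2", {b"cf1:col1": b"value2"}),
]
```

Calling store_rows(table, rows) keeps the number of round-trips low without holding an unbounded number of mutations in memory.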
cells(row, column, versions=None, timestamp=None, include_timestamp=False)
Retrieve multiple versions of a single cell from the table.
This method retrieves multiple versions of a cell (if any).
The versions argument defines how many cell versions to retrieve at most.
The timestamp and include_timestamp arguments behave exactly the same as for row().
Parameters: - row (str) – the row key
- column (str) – the column name
- versions (int) – the maximum number of versions to retrieve
- timestamp (int) – timestamp (optional)
- include_timestamp (bool) – whether timestamps are returned
Returns: cell values
Return type: list of values
counter_dec(row, column, value=1)
Atomically decrement (or increment) a counter column.
This method is a shortcut for calling Table.counter_inc() with the value negated.
Returns: counter value after decrementing
Return type: int
counter_get(row, column)
Retrieve the current value of a counter column.
This method retrieves the current value of a counter column. If the counter column does not exist, this function initialises it to 0.
Note that application code should never store an incremented or decremented counter value directly; use the atomic Table.counter_inc() and Table.counter_dec() methods for that.
Parameters: - row (str) – the row key
- column (str) – the column name
Returns: counter value
Return type: int
counter_inc(row, column, value=1)
Atomically increment (or decrement) a counter column.
This method atomically increments or decrements a counter column in the row specified by row. The value argument specifies how much the counter should be incremented (for positive values) or decremented (for negative values). If the counter column did not exist, it is automatically initialised to 0 before incrementing it.
Parameters: - row (str) – the row key
- column (str) – the column name
- value (int) – the amount to increment or decrement by (optional)
Returns: counter value after incrementing
Return type: int
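The atomic counter methods are meant to replace any read-modify-write sequence in application code. A minimal sketch, assuming a table object from Connection.table(); the helper names and the stats:visits column are hypothetical.

```python
VISITS = b"stats:visits"  # hypothetical counter column


def record_visit(table, page):
    """Atomically add one visit for ``page``; return the new total."""
    return table.counter_inc(page, VISITS)


def undo_visit(table, page):
    """Atomically take one visit back; return the new total."""
    # Same effect as table.counter_inc(page, VISITS, value=-1).
    return table.counter_dec(page, VISITS)
```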
counter_set(row, column, value=0)
Set a counter column to a specific value.
This method stores a 64-bit signed integer value in the specified column.
Note that application code should never store an incremented or decremented counter value directly; use the atomic Table.counter_inc() and Table.counter_dec() methods for that.
Parameters: - row (str) – the row key
- column (str) – the column name
- value (int) – the counter value to set
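To make the "64-bit signed integer" storage format concrete, the following sketch packs and unpacks such a value with the standard struct module. The big-endian byte order is an assumption based on how HBase usually serialises long values; the function names are made up and nothing here calls HappyBase.

```python
import struct


def encode_counter(value):
    """Pack an int as an 8-byte signed integer (assumed big-endian)."""
    return struct.pack(">q", value)


def decode_counter(raw):
    """Inverse of encode_counter()."""
    return struct.unpack(">q", raw)[0]


encoded = encode_counter(1)
print(len(encoded), decode_counter(encoded))
```

A raw cell read with Table.row() from a counter column could be interpreted along these lines, though Table.counter_get() is the documented way to read a counter.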
delete(row, columns=None, timestamp=None, wal=True)
Delete data from the table.
This method deletes all columns for the row specified by row, or only some columns if the columns argument is specified.
Note that, in many situations, batch() is a more appropriate method to manipulate data.
New in version 0.7: wal argument
Parameters: - row (str) – the row key
- columns (list_or_tuple) – list of columns (optional)
- timestamp (int) – timestamp (optional)
- wal (bool) – whether to write to the WAL (optional)
families()
Retrieve the column families for this table.
Returns: Mapping from column family name to settings dict
Return type: dict
put(row, data, timestamp=None, wal=True)
Store data in the table.
This method stores the data in the data argument for the row specified by row. The data argument is a dictionary that maps columns to values. Column names must include a family and qualifier part, e.g. cf:col, though the qualifier part may be the empty string, e.g. cf:.
Note that, in many situations, batch() is a more appropriate method to manipulate data.
New in version 0.7: wal argument
Parameters: - row (str) – the row key
- data (dict) – the data to store
- timestamp (int) – timestamp (optional)
- wal (bool) – whether to write to the WAL (optional)
regions()
Retrieve the regions for this table.
Returns: regions for this table
Return type: list of dicts
row(row, columns=None, timestamp=None, include_timestamp=False)
Retrieve a single row of data.
This method retrieves the row with the row key specified in the row argument and returns the columns and values for this row as a dictionary.
The row argument is the row key of the row. If the columns argument is specified, only the values for these columns will be returned instead of all available columns. The columns argument should be a list or tuple containing strings. Each name can be a column family, such as cf1 or cf1: (the trailing colon is not required), or a column family with a qualifier, such as cf1:col1.
If specified, the timestamp argument specifies the maximum version that results may have. The include_timestamp argument specifies whether cells are returned as single values or as (value, timestamp) tuples.
Parameters: - row (str) – the row key
- columns (list_or_tuple) – list of columns (optional)
- timestamp (int) – timestamp (optional)
- include_timestamp (bool) – whether timestamps are returned
Returns: Mapping of columns (both qualifier and family) to values
Return type: dict
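The effect of columns and include_timestamp can be illustrated with a small wrapper. This is a sketch under assumptions: the helper name newest_cell is invented, the table object comes from Connection.table(), and a missing row or column is assumed to simply be absent from the returned dictionary.

```python
def newest_cell(table, row_key, column):
    """Return (value, timestamp) for one column, or None if it is absent."""
    # Restricting ``columns`` avoids transferring the rest of the row;
    # include_timestamp=True turns each value into a (value, timestamp) tuple.
    data = table.row(row_key, columns=[column], include_timestamp=True)
    return data.get(column)
```

For example, newest_cell(table, b'row-1', b'cf1:col1') would yield a (value, timestamp) tuple when the cell exists.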
rows(rows, columns=None, timestamp=None, include_timestamp=False)
Retrieve multiple rows of data.
This method retrieves the rows with the row keys specified in the rows argument, which should be a list (or tuple) of row keys. The return value is a list of (row_key, row_dict) tuples.
The columns, timestamp and include_timestamp arguments behave exactly the same as for row().
Parameters: - rows (list) – list of row keys
- columns (list_or_tuple) – list of columns (optional)
- timestamp (int) – timestamp (optional)
- include_timestamp (bool) – whether timestamps are returned
Returns: List of (row_key, row_dict) tuples, where each row_dict maps columns to values
Return type: list of tuples
scan(row_start=None, row_stop=None, row_prefix=None, columns=None, filter=None, timestamp=None, include_timestamp=False, batch_size=1000, scan_batching=None, limit=None, sorted_columns=False)
Create a scanner for data in the table.
This method returns an iterable that can be used for looping over the matching rows. Scanners can be created in two ways:
The row_start and row_stop arguments specify the row keys where the scanner should start and stop. It does not matter whether the table contains any rows with the specified keys: the first row after row_start will be the first result, and the last row before row_stop will be the last result. Note that the start of the range is inclusive, while the end is exclusive.
Both row_start and row_stop can be None to specify the start and the end of the table respectively. If both are omitted, a full table scan is done. Note that this usually results in severe performance problems.
Alternatively, if row_prefix is specified, only rows with row keys matching the prefix will be returned. If given, row_start and row_stop cannot be used.
The columns, timestamp and include_timestamp arguments behave exactly the same as for row().
The filter argument may be a filter string that will be applied at the server by the region servers.
If limit is given, at most limit results will be returned.
The batch_size argument specifies how many results should be retrieved per batch when retrieving results from the scanner. Only set this to a low value (or even 1) if your data is large, since a low batch size results in added round-trips to the server.
The optional scan_batching is for advanced usage only; it translates to Scan.setBatching() at the Java side (inside the Thrift server). By setting this value rows may be split into partial rows, so result rows may be incomplete, and the number of results returned by the scanner may no longer correspond to the number of rows matched by the scan.
If sorted_columns is True, the columns in the rows returned by this scanner will be retrieved in sorted order, and the data will be stored in OrderedDict instances.
Compatibility notes:
- The filter argument is only available when using HBase 0.92 (or up). In HBase 0.90 compatibility mode, specifying a filter raises an exception.
- The sorted_columns argument is only available when using HBase 0.96 (or up).
New in version 0.8: sorted_columns argument
New in version 0.8: scan_batching argument
Parameters: - row_start (str) – the row key to start at (inclusive)
- row_stop (str) – the row key to stop at (exclusive)
- row_prefix (str) – a prefix of the row key that must match
- columns (list_or_tuple) – list of columns (optional)
- filter (str) – a filter string (optional)
- timestamp (int) – timestamp (optional)
- include_timestamp (bool) – whether timestamps are returned
- batch_size (int) – batch size for retrieving results
- scan_batching (int) – server-side scan batching (optional)
- limit (int) – max number of rows to return
- sorted_columns (bool) – whether to return sorted columns
Returns: generator yielding the rows matching the scan
Return type: iterable of (row_key, row_data) tuples
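The two ways of creating a scanner are closely related: a prefix scan can be expressed as a range with an inclusive start and an exclusive stop. The sketch below computes such a range for a byte-string prefix and should select the same rows as passing row_prefix, assuming row keys are compared bytewise. The helper names are hypothetical, and the code is written for Python 3.

```python
def prefix_range(prefix):
    """Return (row_start, row_stop) covering every key starting with prefix."""
    # Trailing 0xff bytes cannot be incremented, so drop them first.
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        return prefix, None  # no upper bound: scan to the end of the table
    # Incrementing the last byte gives the first key after the prefix range.
    row_stop = stripped[:-1] + bytes([stripped[-1] + 1])
    return prefix, row_stop


def scan_prefix(table, prefix, limit=None):
    """Scan all rows whose key starts with ``prefix`` using a key range."""
    row_start, row_stop = prefix_range(prefix)
    return table.scan(row_start=row_start, row_stop=row_stop, limit=limit)


print(prefix_range(b"user-"))
```

In practice row_prefix is the simpler choice; the explicit range is mainly useful when the start or stop key needs further adjustment.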
Batch
class happybase.Batch(table, timestamp=None, batch_size=None, transaction=False, wal=True)
Batch mutation class.
This class cannot be instantiated directly; use Table.batch() instead.
delete(row, columns=None, wal=None)
Delete data from the table.
See Table.delete() for a description of the row, columns, and wal arguments. The wal argument should normally not be used; its only use is to override the batch-wide value passed to Table.batch().
put(row, data, wal=None)
Store data in the table.
See Table.put() for a description of the row, data, and wal arguments. The wal argument should normally not be used; its only use is to override the batch-wide value passed to Table.batch().
send()
Send the batch to the server.
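When a batch is not used as a context manager, send() has to be called explicitly. A minimal sketch, assuming a table object from Connection.table(); the helper name delete_rows is made up.

```python
def delete_rows(table, row_keys, columns=None):
    """Queue a delete for every key, then send them in one go."""
    batch = table.batch()
    for row_key in row_keys:
        # columns=None deletes the whole row, as with Table.delete().
        batch.delete(row_key, columns=columns)
    # Nothing reaches the server until send() is called.
    batch.send()
    return len(row_keys)
```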
Connection pool
class happybase.ConnectionPool(size, **kwargs)
Thread-safe connection pool.
New in version 0.5.
The size argument specifies how many connections this pool manages. Additional keyword arguments are passed unmodified to the happybase.Connection constructor, with the exception of the autoconnect argument, since maintaining connections is the task of the pool.
Parameters: - size (int) – the maximum number of concurrently open connections
- kwargs – keyword arguments passed to happybase.Connection
connection(*args, **kwds)
Obtain a connection from the pool.
This method must be used as a context manager, i.e. with Python’s with block. Example:
    with pool.connection() as connection:
        pass  # do something with the connection
If timeout is specified, this is the number of seconds to wait for a connection to become available before NoConnectionsAvailable is raised. If omitted, this method waits forever for a connection to become available.
Parameters: timeout (int) – number of seconds to wait (optional)
Returns: active connection from the pool
Return type: happybase.Connection
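A slightly fuller sketch of borrowing a connection with a timeout is shown below. The helper name table_names is hypothetical, and the pool is passed in rather than created here so that the snippet does not need a running server.

```python
def table_names(pool, timeout=None):
    """Borrow a connection, list the tables, and hand the connection back."""
    # The connection returns to the pool when the ``with`` block ends,
    # even if an exception is raised inside it.
    with pool.connection(timeout=timeout) as connection:
        return connection.tables()
```

With HappyBase installed, pool could be created as happybase.ConnectionPool(size=3, host='localhost'), and callers passing a timeout should be prepared to catch happybase.NoConnectionsAvailable.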
class happybase.NoConnectionsAvailable
Exception raised when no connections are available.
This happens if a timeout was specified when obtaining a connection, and no connection became available within the specified timeout.
New in version 0.5.