ChooJun

K. HBase Installation and Configuration

  1. HBase is a type of NoSQL database. NoSQL is a general term meaning that the database isn’t an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB (https://en.wikipedia.org/wiki/Berkeley_DB) is an example of a local NoSQL database, whereas HBase is very much a distributed database.
  2. HBase has many features which support both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles in terms of both storage and processing capacity. An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best performance it requires specialized hardware and storage devices.
  3. Read more about HBase at https://hbase.apache.org/book.html

K1. Install HBase

  1. Log in as hduser. Ensure the following services are started, in this order, and check them with the jps command (see the example after this list)
    • SSH
    • HDFS
    • YARN
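
    For example, assuming Hadoop is installed under ~/hadoop3 as in the earlier practicals (adjust the paths to your setup), the services can be started and verified as follows

    $ sudo service ssh start
    $ ~/hadoop3/sbin/start-dfs.sh
    $ ~/hadoop3/sbin/start-yarn.sh
    $ jps
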
  2. Check the installed version of Hadoop
    $ cd ~
    $ hadoop version
    

    Reinstall Hadoop if it is not version 3.3.6. Remember to delete the existing hadoop3 directory before beginning your setup, as sketched below.
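
    For example, assuming the previous installation lives in ~/hadoop3

    $ rm -rf ~/hadoop3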

  3. Check the installed version of PySpark
    $ pyspark --version
    

    Reinstall Spark if PySpark is not version 3.5.0 (for Hadoop 3.3.6 and later). Remember to delete the existing spark directory before beginning your setup.

  4. Check the installed version of Scala
    $ scala -version
    $ ll ~/kafka/libs | grep kafka
    

    Reinstall Kafka if its Scala version does not match your installed Scala version (e.g. 2.13.x), as shown in the file name kafka_(Scala-version)-(Kafka-version).*. Remember to delete the existing kafka directory before beginning your setup.

  5. Download HBase
    $ cd ~
    $ wget https://archive.apache.org/dist/hbase/2.5.7/hbase-2.5.7-bin.tar.gz
    $ tar -xvzf hbase-2.5.7-bin.tar.gz
    $ mv hbase-2.5.7 hbase
    

    Find the current stable release of HBase that is compatible with your version of Hadoop here (https://hbase.apache.org/book.html#hadoop); a list of releases is available on the Apache HBase download page (https://hbase.apache.org/downloads.html). For example, you may adopt HBase 2.5.x for the installed Hadoop 3.3.6.
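
    Optionally, verify the integrity of the downloaded archive. A minimal sketch, assuming the matching .sha512 checksum file is published alongside the release: compute the checksum locally and compare it with the published value by eye

    $ wget https://archive.apache.org/dist/hbase/2.5.7/hbase-2.5.7-bin.tar.gz.sha512
    $ sha512sum hbase-2.5.7-bin.tar.gz
    $ cat hbase-2.5.7-bin.tar.gz.sha512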

K2. Configure HBase

  1. Stop (if running) the Zookeeper and Kafka services
    $ cd ~/kafka
    $ bin/kafka-server-stop.sh
    $ bin/zookeeper-server-stop.sh
    $ jps
    

    Attention! Please wait at least 30 seconds after issuing each command. Responses might be slow; use the jps command to observe the termination of the services, i.e. QuorumPeerMain (ZooKeeper) and Kafka.

  2. Edit the Bash profile of hduser by adding the following lines, then source the profile (see the command after this block)
    export HBASE_HOME=/home/hduser/hbase
    export PATH=$HBASE_HOME/bin:$PATH
    
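    Assuming the profile edited above is ~/.bashrc (adjust if your profile lives elsewhere, e.g. ~/.profile), reload it so the changes take effect in the current shell

    $ source ~/.bashrc
    $ echo $HBASE_HOME
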
  3. Set the environment variables in ~/hbase/conf/hbase-env.sh by uncommenting and editing the following lines
      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
      export HBASE_HOME=/home/hduser/hbase
      export HBASE_CLASSPATH=${HBASE_HOME}/lib
      export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
      export HBASE_MANAGES_ZK=false
    

    If you plan to run the Kafka service (which uses its own ZooKeeper), change HBASE_MANAGES_ZK from true to false in hbase-env.sh so that HBase connects to that external ZooKeeper instead of managing its own. Make sure the ZooKeeper and Kafka servers (step 5) are running before you start HBase (step 6).

  4. Edit ~/hbase/conf/hbase-site.xml and add the following properties
     <configuration>
       <property>
         <name>hbase.cluster.distributed</name>
         <value>true</value>
       </property>
       <property>
         <name>hbase.rootdir</name>
         <value>hdfs://localhost:9000/hbase</value>
       </property>
       <property>
         <name>hbase.wal.provider</name>
         <value>filesystem</value>
       </property>
     </configuration>
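
    Note that hbase.rootdir must point at the same NameNode address as fs.defaultFS in Hadoop's core-site.xml (hdfs://localhost:9000 here). A quick check, assuming Hadoop is installed under ~/hadoop3

    $ grep -A1 'fs.defaultFS' ~/hadoop3/etc/hadoop/core-site.xml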
    
  5. Start the Zookeeper and Kafka services
    $ cd ~/kafka
    $ bin/zookeeper-server-start.sh config/zookeeper.properties &
    $ bin/kafka-server-start.sh config/server.properties &
    

    Attention! Please wait at least 30 seconds after issuing each command. Responses may be slow to start following your recent configuration. Please verify that Zookeeper and Kafka are running with the jps command, i.e. you should see the QuorumPeerMain and Kafka processes, before proceeding to the next step.

  6. Start HBase
    $ cd ~
    $ ~/hbase/bin/start-hbase.sh
    $ jps
    
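    Besides jps, you may also check the HBase Master web UI in a browser at http://localhost:16010 (16010 is the default master info port; adjust if you have changed hbase.master.info.port)

    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:16010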

    Verify that HBase is running - you should see the HMaster and HRegionServer processes in the jps output. You may ignore the SLF4J errors during HBase startup, as logging is still written to plain-text files on the local file system. When you want to stop HBase, type the following commands

    $ cd ~
    $ ~/hbase/bin/stop-hbase.sh
    $ jps
    

    You may choose to clear all the HBase data, after stopping both HMaster and HRegionServer, using the following commands

    $ hdfs dfs -ls /
    $ hdfs dfs -rm -r /hbase
    $ hdfs dfs -ls /
    

    Delete the log4j-slf4j-impl-2.17.2.jar file (optional)

    $ rm ~/hbase/lib/log4j-slf4j-impl-2.17.2.jar
    

    We delete the file log4j-slf4j-impl-2.17.2.jar because a similar SLF4J binding is also present in the Hadoop directory, and the duplicate occasionally causes errors

K3. Using HBase Shell

  1. Log in as hduser (in another terminal session, for example) after HBase has started
    $ cd ~
    $ ~/hbase/bin/hbase shell
    
  2. Get a listing of commands
    hbase(main):001:0> help
    
  3. Check the status of the HBase cluster
    hbase(main):002:0> status
    
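    The status command also accepts an optional level of detail: 'simple', 'summary' (the default), or 'detailed', e.g.

    hbase> status 'detailed'
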
  4. To view all tables
    hbase(main):003:0> list
    
  5. To exit the HBase Shell
    hbase(main):004:0> exit
    

K4. Running HBase Commands

  1. Create a table named linkshare in the default namespace with one column-family called link
    hbase> create 'linkshare', 'link'
    
  2. Add a new column-family named statistics to the table.
    hbase> disable 'linkshare'
    hbase> alter 'linkshare', 'statistics'
    hbase> enable 'linkshare'
    

    Note that to alter a table (e.g. change a column-family, add a column-family, etc.) after it has been created, you need to first disable the table to prevent clients from accessing it during the alter operation.

  3. Verify that the new column-family has been added to the table
    hbase> describe 'linkshare'
    
  4. To insert a value into a cell at the specified table/row/column coordinates, use the put command. In the following examples, we use unique reversed URLs as the row keys
    hbase> put 'linkshare', 'org.hbase.www', 'link:title', 'Apache HBase'
    hbase> put 'linkshare', 'org.hadoop.www', 'link:title', 'Apache Hadoop'
    hbase> put 'linkshare', 'com.oreilly.www', 'link:title', 'O\'Reilly.com'
    
  5. To increment frequency counters, use the incr command. In the examples below, each counter is incremented by 1
    hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
    hbase> incr 'linkshare', 'org.hbase.www', 'statistics:like', 1
    hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
    
  6. To access a counter’s current value, use the get_counter command, specifying the table name, row key, and column
    hbase> get_counter 'linkshare', 'org.hbase.www', 'statistics:share'
    
  7. To perform lookups by row key to retrieve attributes for a specific row, use the get command
    hbase> get 'linkshare', 'org.hbase.www'
    

    The get command also accepts an optional dictionary of parameters to specify the column(s), timestamp, timerange, and version of the cell values to be retrieved. e.g.

    hbase> get 'linkshare', 'org.hbase.www', 'link:title'
    hbase> get 'linkshare', 'org.hbase.www', 'link:title', 'statistics:share'
    hbase> get 'linkshare', 'org.hbase.www', ['link:title', 'statistics:share']
    hbase> get 'linkshare', 'org.hbase.www', {TIMERANGE => [1399887705673, 1400133976734]}
    hbase> get 'linkshare', 'org.hbase.www', {COLUMN => 'statistics:share', VERSIONS => 2}
    
  8. Display all the contents of a table
    hbase> scan 'linkshare'
    
  9. Limit scan to the rows starting with a specific row key
    hbase> scan 'linkshare', {COLUMNS => ['link:title'], STARTROW => 'org.hbase.www'}
    hbase> scan 'linkshare', {COLUMNS => ['link:title'], STARTROW => 'org'}
    
  10. Import the necessary classes
    hbase> import org.apache.hadoop.hbase.util.Bytes
    hbase> import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
    hbase> import org.apache.hadoop.hbase.filter.BinaryComparator
    hbase> import org.apache.hadoop.hbase.filter.CompareFilter
    

    Create a filter that limits the results to rows where the statistics:like counter column value is greater than or equal to 10

    hbase> likeFilter = SingleColumnValueFilter.new(Bytes.toBytes('statistics'),
             Bytes.toBytes('like'),
             CompareFilter::CompareOp.valueOf('GREATER_OR_EQUAL'),
             BinaryComparator.new(Bytes.toBytes(10)))
    

    Set a flag for the filter to skip any rows without a value in this column

    hbase> likeFilter.setFilterIfMissing(true)
    

    Run a scan with the configured filter

    hbase> scan 'linkshare', { FILTER => likeFilter }
    

K5. Data Definition Language (DDL) Commands

  1. Create a table named ‘t1’ with a column family ‘cf1’
    hbase> create 't1', 'cf1'
    
  2. Create a table named ‘emp’, with two column families: ‘personal data’ and ‘professional data’
    hbase> create 'emp', 'personal data', 'professional data'
    hbase> list
    
  3. The describe command
    hbase> describe 'emp'
    
  4. Create a namespace ‘ns1’
    hbase> create_namespace 'ns1'
    
  5. Create a table named ‘t1’ in the namespace ‘ns1’ with a column family ‘cf1’ and a maximum of 5 versions for all columns in the column family ‘cf1’
    hbase> create 'ns1:t1', {NAME=>'cf1', VERSIONS=>5}
    hbase> list
    

    Create a table named ‘t2’ in the namespace ‘ns1’ with two column families ‘cf1’ and ‘cf2’

    hbase> create 'ns1:t2', 'cf1', 'cf2'
    hbase> list
    hbase> describe 'ns1:t2'
    
  6. The alter command. Add additional column families ‘cf3’, ‘cf4’, and ‘cf5’ to the table ‘ns1:t2’
    hbase> alter 'ns1:t2', 'cf3', 'cf4', 'cf5'
    hbase> describe 'ns1:t2'
    
  7. Delete the column family ‘cf3’ of the table ‘ns1:t2’
    hbase> alter 'ns1:t2', NAME=>'cf3', METHOD=>'delete'
    

    Delete the column family ‘cf4’ of the table ‘ns1:t2’

    hbase> alter 'ns1:t2', 'delete'=>'cf4'
    hbase> describe 'ns1:t2'
    
  8. Change the maximum number of versions of the columns in a column family
    hbase> alter 'emp', {NAME=>'personal data', VERSIONS=>5}
    
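    As in K4, you can verify the change with describe; the ‘personal data’ column family should now show VERSIONS => '5'

    hbase> describe 'emp'
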

K6. Data Manipulation Language (DML) Commands

  1. Inserting values into tables. Put a cell value at the specified table/row/column (and optionally timestamp) coordinates. Note that HBase does not support insertion of multiple columns in a single put statement. Put values into table ‘ns1:t2’
    hbase> put 'ns1:t2', 'key1', 'cf1:name', 'John'
    hbase> put 'ns1:t2', 'key1', 'cf1:id', 19191919  
    hbase> put 'ns1:t2', 'key1', 'cf2:city', 'London' 
    hbase> put 'ns1:t2', 'key1', 'cf2:country', 'UK'  
    hbase> scan 'ns1:t2'
    
  2. Put values into table ‘emp’
    hbase> put 'emp', '1001', 'personal data:name', 'Thor'
    hbase> put 'emp', '1001', 'personal data:city', 'Kuala Lumpur'
    hbase> put 'emp', '1001', 'professional data:designation', 'manager'
    hbase> put 'emp', '1001', 'professional data:email', 'thor@mail.abc.com'
    hbase> scan 'emp'
    
  3. Using table references. Get a table reference to the table ‘ns1:t2’. The commands to be performed on that table can now be invoked directly on the table reference
    hbase> t = get_table 'ns1:t2'
    hbase> t.scan
    
  4. Get a table reference to the table ‘ns1:t1’, and insert values into the table ‘ns1:t1’ using the table reference
    hbase> t = get_table 'ns1:t1'
    hbase> t.put 'key2', 'cf1:city', 'KL'
    hbase> t.put 'key2', 'cf1:id', 87654321
    hbase> t.put 'key2', 'cf1:name', 'Minnie'
    hbase> scan 'ns1:t1'
    
  5. Updating table values with put ‘<table_name>’, ‘<rowkey>’, ‘<column_family>:<column>’, ‘<new_value>’. Update the value of ‘cf1:name’ to ‘Jack’
    hbase> put 'ns1:t2', 'key1', 'cf1:name', 'Jack'
    
  6. Update the value of ‘cf1:city’ using the table reference t
    hbase> t.put 'key1', 'cf1:city', 'Manchester'
    hbase> t.scan
    
  7. Reading row data / filtering data. Get a single row’s data
    hbase> get 'ns1:t1', 'key1'
    
  8. Get a single column’s data of a row
    hbase> get 'ns1:t1', 'key2', {COLUMN=>'cf1:city'}
    
  9. Filtering only certain columns
    hbase> scan 'ns1:t1', {COLUMNS => ['cf1:name', 'cf1:city']}
    
  10. Count the number of rows of a table
    hbase> t.scan
    hbase> t.count
    
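    Note that count scans the entire table, so it can be slow for large tables. The shell's count command also accepts INTERVAL (how often to report progress) and CACHE (scanner caching) parameters, e.g.

    hbase> count 'ns1:t1', INTERVAL => 1000, CACHE => 1000
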
  11. Deleting cells in a table with syntax delete ‘<table_name>’, ‘<rowkey>’, ‘<column_family>:<column>’. Delete a specific cell in a table
    hbase> delete 'ns1:t1', 'key1', 'cf1:city'
    hbase> scan 'ns1:t1'
    
  12. Delete all cells for a specific row
    hbase> deleteall 'ns1:t1', 'key1'
    hbase> scan 'ns1:t1'
    
  13. Drop a table. Note that a table must be disabled before it can be dropped, so the first drop command below fails
    hbase> list
    hbase> drop 't1'
    hbase> disable 't1'
    hbase> drop 't1'
    hbase> list
    
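    If you wish to clean up everything created in this practical, a possible sequence (assuming the tables above still exist; note that a namespace must be empty before it can be dropped) is

    hbase> disable 'emp'
    hbase> drop 'emp'
    hbase> disable 'ns1:t1'
    hbase> drop 'ns1:t1'
    hbase> disable 'ns1:t2'
    hbase> drop 'ns1:t2'
    hbase> drop_namespace 'ns1'
    hbase> list_namespace
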

K7. Attention: At the beginning of all future practicals

  1. Log in as hduser, or switch to the hduser account
    $ su - hduser
    
  2. Start the SSH service
    $ sudo service ssh start
    
  3. Start the HDFS service
    $ cd ~/hadoop3
    $ sbin/start-dfs.sh
    $ jps
    

    You should observe at least four (4) services, including both NameNode and DataNode, as stated in step 12 of G7. Otherwise, you may need to reformat the HDFS NameNode (note that reformatting erases all existing HDFS data).

    $ cd ~/hadoop3
    $ bin/hdfs namenode -format
    $ sbin/start-dfs.sh
    $ hdfs dfs -mkdir /user
    $ hdfs dfs -mkdir /user/hduser
    
  4. Start the YARN service
    $ sbin/start-yarn.sh
    $ jps
    

    You should observe at least six (6) services in total; refer to steps 12 and 13 of G7.

  5. Restart the Zookeeper and Kafka services. The stop commands below ensure a clean restart if the services are already running
    $ cd ~/kafka
    $ bin/kafka-server-stop.sh
    $ bin/zookeeper-server-stop.sh
    $ bin/zookeeper-server-start.sh config/zookeeper.properties &
    $ bin/kafka-server-start.sh config/server.properties &
    

    Please wait at least 30 seconds after issuing each command. Note that responses may be slow. You should observe at least eight (8) services in total, including both QuorumPeerMain and Kafka, as stated in step 5 of K2.

  6. Start HBase
    $ cd ~
    $ $HBASE_HOME/bin/start-hbase.sh
    $ jps
    

    You should observe at least ten (10) services in total, including both HMaster and HRegionServer, as stated in step 6 of K2.

K8. Attention: At the end of all future practicals

  1. Stop the HBase service
    $ cd ~
    $ $HBASE_HOME/bin/stop-hbase.sh
    $ jps
    

    Two services should terminate, i.e. HMaster and HRegionServer.

  2. Stop (if running) the Zookeeper and Kafka services
    $ cd ~/kafka
    $ bin/kafka-server-stop.sh
    $ bin/zookeeper-server-stop.sh
    

    Attention! Please wait at least 30 seconds after issuing each command. Responses may be slow. Two services should terminate, i.e. QuorumPeerMain and Kafka.

  3. Stop the YARN and HDFS services
    $ cd ~/hadoop3
    $ sbin/stop-yarn.sh
    $ sbin/stop-dfs.sh
    $ top
    

    Press Ctrl-C (or q) to exit top

  4. Exit from your user account(s)
    $ exit