How to Install Apache Tika on Ubuntu 22.04|20.04|18.04

How can I install Apache Tika on Ubuntu 22.04|20.04|18.04? Apache Tika is an open-source toolkit that detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). Tika is useful for search engine indexing, content analysis, translation, and similar tasks.

What is new in Apache Tika 2.2.x

  • Add support for OneNote files downloaded from O365
  • Improve extraction of embedded files from MSOffice files created by non-Microsoft tools
  • Added back ability to ignore load errors in TikaConfig
  • Fix logic bug in PipesServer that prevented concatenation of content from attachments
  • Fix default logging in tika-app in batch mode
  • Fix race condition when starting multiple forked servers on multiple ports
  • Add metadata item for whether or not a PDF has a collection/is a Portfolio PDF
  • Add detection of JPEG XL, MARC, ICC profiles, NES-ROM file types
  • Add optional fetch ranges to FetchEmitTuple to allow range fetching from, e.g., HTTP or S3

In this post, we will discuss the installation of Apache Tika on Ubuntu 22.04|20.04|18.04 LTS.

Apache Tika dependencies

To build and install Apache Tika on Ubuntu 22.04|20.04|18.04 LTS, you need:

  • Java Development Kit (JDK)
  • Apache Maven

We will install these dependencies first, then download and build Tika on the Ubuntu 22.04|20.04|18.04 system.

Step 1: Install required dependencies

Start by refreshing the package index and installing a few basic tools on your Ubuntu desktop or server.

sudo apt update
sudo apt -y install wget curl vim unzip

Step 2: Install Java on Ubuntu 22.04|20.04|18.04

As of Tika 1.19, building with Java 11 is supported. You can install Java on Ubuntu using the following command:

sudo apt install -y default-jdk

Confirm the installed version of Java:

$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
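
If you want to script this check, the major version can be pulled out of the first line of the java -version output. The helper below is a minimal sketch: the function name java_major is our own and not part of any Java tooling, and it assumes the usual version "x.y.z" format shown above.

```shell
# Extract the major Java version from a line of `java -version` output.
# java_major is a hypothetical helper name used only in this guide.
java_major() {
  # $1: a line such as: openjdk version "11.0.13" 2021-10-19
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$ver" in
    1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;  # legacy scheme: 1.8.0_312 is Java 8
    *)   printf '%s\n' "$ver" | cut -d. -f1 ;;  # modern scheme: 11.0.13 is Java 11
  esac
}

# Check the installed JVM only if one is on PATH.
# Note: java -version writes to stderr, hence the 2>&1.
if command -v java >/dev/null 2>&1; then
  java_major "$(java -version 2>&1 | head -n 1)"
fi
```

A build script could compare the printed number against the minimum version it expects before calling Maven.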

Step 3: Install Apache Maven

Install Apache Maven by following our guide.
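
Before moving on, it can help to confirm that the tools used in the remaining steps are on your PATH. The following is a small sketch: missing_tools is a hypothetical helper name, and the tool list simply mirrors the commands used in this post. On Ubuntu, Maven is also packaged as maven in the distribution repositories, though that version may be older than the current upstream release.

```shell
# Print the name of every tool from the argument list that is not on PATH.
# missing_tools is a hypothetical helper name used only in this guide.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools used in this guide: Java, Maven, and the download/unpack utilities.
missing=$(missing_tools java mvn wget unzip)
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names are joined on one line.
  echo "Missing tools:" $missing
else
  echo "All build prerequisites found"
fi
```

If mvn is reported as missing, install it before continuing, since the build in Step 4 depends on it.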

Step 4: Download and Install Apache Tika

Download the latest Apache Tika release from the Downloads page.

export VER="2.2.1"
wget https://archive.apache.org/dist/tika/$VER/tika-$VER-src.zip
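
Optionally, you can verify the download before unpacking it. Apache releases are normally published with a .sha512 file next to the archive; we assume that holds for this release as well, and since the layout of that file can vary, the sketch below takes the expected digest as a plain string that you copy by hand. The function name sha512_matches is our own.

```shell
VER="${VER:-2.2.1}"

# Compare a file's SHA-512 digest with an expected hex string (case-insensitive).
# sha512_matches is a hypothetical helper name used only in this guide.
sha512_matches() {
  # $1: path to the file, $2: expected digest in hex
  actual=$(sha512sum "$1" | cut -d' ' -f1)
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  [ "$actual" = "$expected" ]
}

# Example call; replace the placeholder with the digest from the published checksum file:
# sha512_matches "tika-$VER-src.zip" "<expected-digest>" && echo "checksum OK"
```

The function returns a non-zero status on a mismatch, so it can gate the unzip step in a script.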

Unzip the downloaded file.

unzip tika-$VER-src.zip

Change to the new folder and run mvn install:

cd tika-$VER
mvn install

Sample installation output:

[Screenshot: sample Apache Tika installation output on Ubuntu 18.04]

Wait for the installation to finish, then test Tika from within its base directory.
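
As a rough sketch of such a test, the commands below run the Tika command-line application from the build output. The jar path is an assumption based on the usual Maven layout of the tika-app module; if your build places it elsewhere, adjust APP_JAR accordingly. The document path in the comments is a placeholder.

```shell
VER="${VER:-2.2.1}"
# Assumed location of the runnable jar produced by the build.
APP_JAR="tika-app/target/tika-app-$VER.jar"

if [ -f "$APP_JAR" ]; then
  # Show the available command-line options.
  java -jar "$APP_JAR" --help
  # Extract plain text or metadata from a document of your choice:
  # java -jar "$APP_JAR" --text /path/to/document.pdf
  # java -jar "$APP_JAR" --metadata /path/to/document.pdf
else
  echo "Jar not found at $APP_JAR; check the end of the Maven output for the actual path"
fi
```

Run this from the tika-$VER directory so the relative path resolves.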

Reference: http://tika.apache.org/2.2.1/gettingstarted.html
