What's new in Apache Tika 2.0 -- we mean it this time!

Search

Apache Tika is used in big data document processing pipelines to extract text and metadata from numerous file formats. Text extraction is a critical component of search systems. While work on 2.0 has been ongoing for years, the Tika team released 2.0.0-ALPHA in January and will release 2.0.0 before Buzzwords 2021. In addition to dramatically increased modularization, there are new components to improve scaling, integration, and robustness. This talk will offer an overview of the changes in Tika 2.0, with a deep dive into the new tika-pipes module, which enables synchronous and asynchronous fetching from numerous data sources (JDBC, fileshare, S3), parsing, and then emitting to other endpoints (fileshare, S3, Solr, Elasticsearch, etc.).

Video

Slide

bbuzz21_what's new in Apache Tika 2.0.pdf

Kesselhaus

16.06.2021 18:50 – 19:20

Talk

Intermediate

Speakers

Tim Allison
Data Scientist
Jet Propulsion Laboratory, California Institute of Technology
