Apache Tika is used in big data document processing pipelines to extract text and metadata from numerous file formats, and text extraction is a critical component of search systems. While work on 2.0 has been ongoing for years, the Tika team released 2.0.0-ALPHA in January and will release 2.0.0 before Buzzwords 2021. In addition to dramatically increased modularization, 2.0 adds new components to improve scaling, integration, and robustness. This talk will offer an overview of the changes in Tika 2.0, with a deep dive into the new tika-pipes module, which enables synchronous and asynchronous fetching from numerous data sources (JDBC, file share, S3), parsing, and then emitting to other endpoints (file share, S3, Solr, Elasticsearch, etc.).

Talk
Intermediate