Besides the Azure portal, you can also perform document intelligence and index creation in Azure ML studio. The index creation process consists of several steps: crack_and_chunk, generate_embeddings, update_index, and register_index. In Azure ML studio you can create or reuse a component for each of these steps and stitch them together into a pipeline.
Section 1. What is it?
Usually, an ML pipeline component does its job serially; for example, it runs crack_and_chunk on each input file (e.g., each PDF file) one by one. With a couple of thousand files, crack_and_chunk alone can take several hours, and generate_embeddings several more, for a total of a dozen hours for the entire index creation job. With hundreds of thousands or millions of files, the entire index creation process could take weeks.
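To make the scaling concrete, here is a back-of-the-envelope estimate of serial processing time. The per-file timings are illustrative assumptions, not measurements from the experiment:

```python
# Rough estimate of serial index-creation time.
# The per-file seconds below are illustrative assumptions, not measured values.

def serial_hours(n_files: int,
                 secs_per_file_crack: float = 10.0,
                 secs_per_file_embed: float = 8.0) -> float:
    """Total hours if crack_and_chunk and generate_embeddings
    each process one file at a time."""
    return n_files * (secs_per_file_crack + secs_per_file_embed) / 3600

print(serial_hours(2_000))          # a couple of thousand files -> 10.0 hours
print(serial_hours(1_000_000) / 24) # a million files -> hundreds of days
```

Under these assumed rates, 2,000 files already cost about ten hours serially, which is why the serial approach does not scale to very large document sets.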
Parallel processing is therefore critical for speeding up index creation, and the two most time-consuming components are crack_and_chunk and generate_embeddings.
The figure below shows the two components that apply parallel processing for index creation: crack_and_chunk_with_doc_intel_parallel and generate_embeddings_parallel.
Section 2. How is the parallelism achieved?
Taking the crack_and_chunk_with_doc_intel_parallel component as an example, the parallel processing logic works like this: the ML job runs on a compute cluster of multiple nodes, each with multiple processors. All files in the input folder are distributed into mini_batches, and each processor handles some of the mini_batches, so all processors execute the crack_and_chunk job in parallel. Compared with a serial pipeline, this parallel processing significantly improves the processing speed.
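The fan-out described above can be sketched in plain Python. This is an illustration of the mini_batch idea, not the Azure ML scheduler itself; the crack_and_chunk function here is a stand-in placeholder, and threads stand in for the cluster's processors:

```python
from concurrent.futures import ThreadPoolExecutor

def make_mini_batches(files, batch_size):
    """Split the input file list into fixed-size mini_batches."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

def crack_and_chunk(mini_batch):
    """Placeholder for the real crack_and_chunk step: one result per file."""
    return [f"chunked:{name}" for name in mini_batch]

files = [f"doc_{i}.pdf" for i in range(10)]
batches = make_mini_batches(files, batch_size=3)

# Workers pull mini_batches and process them concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(crack_and_chunk, batches))

print(len(batches))                  # 4 mini_batches for 10 files
print(sum(len(r) for r in results))  # all 10 files processed
```

In the real component, Azure ML performs this distribution across cluster nodes and processes; the key knobs are the mini_batch size and the number of concurrent workers per node.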
The experiment below creates an index from about 120 PDF files and compares the time spent on each step of index creation. Parallel processing improved the speed considerably, and running on a GPU cluster was even faster than running on a CPU cluster. One note: parallel processing incurs overhead at the beginning of the job for scheduling tasks to each processor, so for a small number of input files the time saving over serial processing may not be significant; the larger the number of input files, the more significant the saving.
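The overhead trade-off can be captured in a simple idealized model (assumed numbers, for illustration only): total parallel time is a fixed scheduling overhead plus the serial work divided evenly across workers.

```python
def parallel_time(serial_time: float, workers: int, overhead: float) -> float:
    """Idealized model: perfect split across workers plus a fixed
    scheduling overhead (all numbers here are illustrative)."""
    return overhead + serial_time / workers

# Small job: 60 s of work, 16 workers, 120 s overhead -> slower than serial.
small = parallel_time(serial_time=60.0, workers=16, overhead=120.0)
print(small)   # 123.75 s vs 60 s serial

# Large job: 10 hours of work, same cluster -> overhead is amortized.
large = parallel_time(serial_time=36_000.0, workers=16, overhead=120.0)
print(large)   # 2370 s vs 36,000 s serial
```

This matches the observation from the experiment: for small inputs the scheduling overhead can outweigh the parallel speedup, while for large inputs it becomes negligible.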
How is the parallelism implemented in Azure ML? See this article:
How to use parallel job in pipeline - Azure Machine Learning | Microsoft Learn
The entry script implements several functions: init(), run(), and shutdown(); the shutdown() function is optional.
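Per the article above, Azure ML calls init() once per worker process, run(mini_batch) once for each mini_batch of inputs, and shutdown() once at the end. A minimal sketch of that contract, with the actual cracking logic replaced by a placeholder:

```python
# Minimal shape of an Azure ML parallel-job entry script.
# The processing inside run() is a placeholder, not the real
# crack_and_chunk implementation.

def init():
    """Called once per worker process: set up any clients or models
    that run() will need."""
    global doc_client
    doc_client = object()  # placeholder for e.g. a Document Intelligence client

def run(mini_batch):
    """Called once per mini_batch; mini_batch is a list of input file paths.
    Must return one result per input item."""
    results = []
    for file_path in mini_batch:
        # ...crack and chunk file_path here, using doc_client...
        results.append(f"processed {file_path}")
    return results

def shutdown():
    """Optional: called once per worker at the end, for cleanup."""
    pass

# Local smoke test of the contract (in production, Azure ML drives these calls):
init()
out = run(["a.pdf", "b.pdf"])
shutdown()
print(out)
```

Returning one result per input item matters: Azure ML uses the count of returned items against the mini_batch size to track progress and error thresholds.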
Section 3. Code example
Please see the code in the azure-example GitHub repo as an example:
This repo creates the parallel-run component crack_and_chunk_doc_intel_component_parallel and stitches it together with other Azure built-in components to create an ML pipeline. The file crack_and_chunk_with_doc_intel/crack_and_chunk_parallel.py implements the parallelism logic. Several ways of providing .pdf inputs are covered in the .ipynb files in this repo.
This implementation supports some especially important features:
Be sure to check out this article for guidance on setting optimal parameters for parallel processing:
ParallelRunStep Performance Tuning Guide · Azure ML-Ops (Accelerator) (microsoft.github.io)
Section 4. Benefits of using Azure ML
Although there are other ways of creating AI Search indexes, creating indexes in Azure ML has its own benefits.
For this parallel processing feature, a header is added to the API calls to indicate that they come from the crack_and_chunk_parallel processing.
Some other capabilities can be built on top of this parallel processing ML pipeline:
Section 5. Future enhancements
Some future enhancements are under consideration, for example re-indexing, which detects changes in the input files and updates the index with only those changes. We will experiment with that and publish Part 2 of the solution in the future.
Acknowledgement:
Thanks to the reviewers for providing feedback, taking part in the discussion, reviewing the code, and sharing their experience with Azure ML parallel processing:
Alex Zeltov, Vincent Houdebine, Randy Thurman, Lu Zhang, Shu Peng, Jingyi Zhu, Long Chen, Alain Li, Yi Zhou.