Hemant Vishwakarma: Finding Similar Document

Wednesday, 26 December 2018

Finding Similar Document

I am working on a project where I have processed and stored single-page medical report documents with labelled categories. The user will input one document, and I have to classify which category it belongs to.

I have converted all documents to grayscale images and stored them for comparison purposes.

I have a dataset of images with the following columns:

- image_path: This column has the path to the image.
- histogram_value: This column has the histogram of the image, calculated using the cv2.calcHist function.
- np_avg: This column has the average value of all pixels of the image, calculated using np.average.
- category: This column has the category of the image.
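As a rough illustration of how the histogram_value and np_avg columns could be computed, here is a minimal sketch. The helper name `describe_image` and the synthetic 4x4 image are assumptions made for the example, and `np.histogram` stands in for `cv2.calcHist` so that the snippet runs without OpenCV installed.

```python
import numpy as np

def describe_image(gray):
    """Return (histogram_value, np_avg) for a 2-D uint8 grayscale image.

    Hypothetical helper, not part of the original dataset code.
    """
    # 256-bin intensity histogram, one bin per gray level.
    # cv2.calcHist([gray], [0], None, [256], [0, 256]) produces the same
    # counts, shaped (256, 1) instead of (256,).
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    return hist.astype(np.float32), float(np.average(gray))

# Synthetic 4x4 "page": left half black, right half white.
page = np.zeros((4, 4), dtype=np.uint8)
page[:, 2:] = 255

histogram_value, np_avg = describe_image(page)
print(histogram_value[0], histogram_value[255], np_avg)
```

Note that np_avg collapses the whole page into a single number, so two pages with very different layouts can end up with the same value.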

I am planning to use these two methods together, as follows:

- Calculate histogram_value of the input image and find the nearest 10 matching images.
- Calculate np_avg of the input image and find the nearest 10 matching images.
- Take the intersection of both result sets.
- If more than one image is found, do template matching to find the best fit.
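The first three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the L1 distance between normalized histograms, the 4-bin histograms, and the synthetic values are all made up for the example (a real run would use the stored 256-bin histograms and k=10). The template-matching step is not covered here.

```python
import numpy as np

def nearest_k(distances, k=10):
    """Indices of the k smallest distances, as a set."""
    return set(np.argsort(distances)[:k].tolist())

def candidate_matches(query_hist, query_avg, hists, avgs, k=10):
    """Intersect the k nearest stored images by histogram and by average."""
    # Step 1: nearest k by histogram. Normalizing each histogram to sum to 1
    # makes images of different sizes comparable (assumed distance: L1).
    q = query_hist / query_hist.sum()
    h = hists / hists.sum(axis=1, keepdims=True)
    by_hist = nearest_k(np.abs(h - q).sum(axis=1), k)
    # Step 2: nearest k by average pixel value.
    by_avg = nearest_k(np.abs(avgs - query_avg), k)
    # Step 3: keep only the images present in both result sets.
    return sorted(by_hist & by_avg)

# Synthetic stored features: one row per stored image.
hists = np.array([
    [8, 0, 0, 8],    # image 0: half dark, half bright
    [16, 0, 0, 0],   # image 1: all dark
    [0, 0, 0, 16],   # image 2: all bright
    [0, 8, 8, 0],    # image 3: all mid-gray, yet a similar average to image 0
    [5, 3, 0, 8],    # image 4: close to image 0 by histogram
], dtype=np.float64)
avgs = np.array([127.5, 0.0, 255.0, 127.0, 140.0])

query_hist = np.array([7, 1, 0, 8], dtype=np.float64)
query_avg = 128.0

candidates = candidate_matches(query_hist, query_avg, hists, avgs, k=2)
print(candidates)
```

In this toy data, image 3 is among the nearest by average but far away by histogram, so the intersection drops it. The intersection can also come back empty, which the steps above do not address.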

I have very little knowledge of the image processing domain. Is the above mechanism reliable for my purpose?

I checked SO and found a few questions on the same topic, but they deal with very different problems and desired outcomes. This question looks similar to my situation, but it is very generic and I am not sure it will work in my scenario.

from Finding Similar Document
