Description
% This function takes a camera image of a page of Thai text
% in document layout and processes it into a clean document image.
% The camera image may:
% – be an RGB image
% – contain noise
% – contain regions that are not text (e.g. background that’s not on the page)
% – be rotated
% – have varying lighting
% First convert the image into a grayscale image
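% The grayscale step can be sketched as below. This is an illustration
% only: the function name and the luminance weights (the common
% 0.299/0.587/0.114 weighting) are assumptions, not taken from the
% original function.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples into rows of 0-255 gray levels.

    Assumption: a weighted luminance sum is used; the original function
    may convert differently.
    """
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]
```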
% Use region labelling in 1D to find the number of characters
% and the horizontal locations of each character.
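% A minimal sketch of 1D region labelling, assuming the image is already
% binary (text = 1) and that characters are separated by empty columns.
% The name label_runs_1d is hypothetical.

```python
def label_runs_1d(binary_image):
    """Return (first_column, last_column) for each run of columns with ink.

    The number of runs approximates the number of characters, and each
    pair gives a character's horizontal location.
    """
    height = len(binary_image)
    width = len(binary_image[0]) if height else 0
    column_has_ink = [
        any(binary_image[y][x] for y in range(height)) for x in range(width)
    ]

    runs = []
    start = None
    for x, ink in enumerate(column_has_ink):
        if ink and start is None:
            start = x                      # a new run begins here
        elif not ink and start is not None:
            runs.append((start, x - 1))    # the run ended on the previous column
            start = None
    if start is not None:                  # run touching the right edge
        runs.append((start, width - 1))
    return runs
```

% len(label_runs_1d(img)) then gives the character count for one text line.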
% Threshold the image using locally adaptive thresholding
% Invert the binary image so that the text becomes foreground
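% One way to sketch the thresholding and inversion together is a local-mean
% threshold. The window radius, the offset, and the use of a plain local
% mean are all assumptions; the original may use a different local statistic.

```python
def adaptive_threshold_inverted(gray, radius=1, offset=10):
    """Mark a pixel as text (1) when it is darker than its local mean.

    gray is a list of rows of 0-255 values. Comparing against the local
    mean instead of one global value tolerates lighting that changes
    across the page. Returning 1 for dark pixels performs the inversion,
    so text becomes foreground.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Clip the window at the image border.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            if gray[y][x] < local_mean - offset:
                result[y][x] = 1
    return result
```

% This direct version recomputes every window and is slow on large images;
% it is meant to show the idea, not to be efficient.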
% Remove unwanted background that’s not text
% Do this by region labelling. Remove regions with sizes larger
% than a certain threshold (assume they are not text)
% Remove any labels with size smaller than a certain threshold
% (assume these are noise)
% The thresholds are set at +/- one standard deviation of the region area
% The resulting images are combined with a logical AND to remove
% unwanted artifacts
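% The size-based filtering can be sketched as follows. Assumptions:
% 4-connected regions, and thresholds read as "mean area plus or minus one
% population standard deviation". Both function names are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def label_regions(binary):
    """Return one list of (row, col) pixels per 4-connected region."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions


def keep_typical_regions(binary):
    """Keep only regions whose area lies within one std of the mean area."""
    regions = label_regions(binary)
    cleaned = [[0] * len(row) for row in binary]
    if not regions:
        return cleaned
    areas = [len(pixels) for pixels in regions]
    centre, spread = mean(areas), pstdev(areas)
    for pixels in regions:
        if centre - spread <= len(pixels) <= centre + spread:
            for y, x in pixels:
                cleaned[y][x] = 1
    return cleaned
```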
% Rotate the image to the correct orientation
% Use Hough transform to find angle of rotation
% Only keep lines that are long enough to be considered:
% more than half the length of the longest line.
% This removes lines that correspond to small details of
% the character structure of Thai script, which produce
% unwanted angles (e.g. 45 degrees and -45 degrees appear often
% even with a perfectly aligned/rotated document).
% Find the mean, mode, and median of the angles for reference.
% Use the mode value for rotation (concluded from running script on
% many samples)
% The rotation angle must be modified to make sure it rotates
% correctly.
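% The line filtering and angle statistics can be sketched as below. The
% Hough transform itself is not shown: the sketch assumes its detected
% lines are already available as (length, angle in degrees) pairs.
% Rounding angles to whole degrees before taking the mode is an
% assumption, and pick_rotation_angle is a hypothetical name.

```python
from collections import Counter
from statistics import mean, median


def pick_rotation_angle(lines):
    """Summarise the angles of the sufficiently long lines.

    lines: non-empty list of (length, angle_degrees) pairs.
    Lines no longer than half the longest line are ignored, which drops
    the short strokes inside characters.
    """
    longest = max(length for length, _ in lines)
    angles = [round(angle) for length, angle in lines if length > longest / 2]
    mode_angle = Counter(angles).most_common(1)[0][0]
    return {"mean": mean(angles), "median": median(angles), "mode": mode_angle}
```

% The "mode" entry is the value the description says to use for rotation;
% the mean and median are kept for reference only. Any sign or offset
% adjustment of the angle before rotating is not covered here.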
% Find the areas where the sentences are and clean up noise
% First remove any regions whose area is more than 1 std above
% the mean and whose extent is more than 1 std above the mean.
% Next find the bounding box of the text, assuming the text is written
% in a document style with margins around the text box.
% Use interpolation on the cumulative pixel count
% to find the edges of the bounding box and remove any noise outside
% the box.
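% The exact interpolation used is not described, so the sketch below swaps
% in a simpler stand-in: it takes the rows and columns where the cumulative
% pixel count crosses a low and a high fraction of the total. The 2% and
% 98% cut-offs and both function names are assumptions.

```python
from itertools import accumulate


def profile_bounds(counts, lo=0.02, hi=0.98):
    """Return (first, last) indices where the running sum crosses lo and hi."""
    total = sum(counts)
    if total == 0:
        return None
    running = list(accumulate(counts))
    first = next(i for i, c in enumerate(running) if c >= lo * total)
    last = next(i for i, c in enumerate(running) if c >= hi * total)
    return first, last


def text_bounding_box(binary, lo=0.02, hi=0.98):
    """Return (top, bottom, left, right) of the text block, or None if empty."""
    row_counts = [sum(row) for row in binary]
    col_counts = [sum(col) for col in zip(*binary)]
    rows = profile_bounds(row_counts, lo, hi)
    cols = profile_bounds(col_counts, lo, hi)
    if rows is None or cols is None:
        return None
    return rows[0], rows[1], cols[0], cols[1]
```

% Stray pixels in the margins add very little to the cumulative count, so
% they fall outside the returned box and can be cleared.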
% Then resize the image to the original size
% Do a final noise clean-up and smoothing of the text by image erosion
% followed by dilation (morphological image processing), i.e. an opening filter.
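% A minimal sketch of the opening filter, assuming a 3x3 square structuring
% element and treating pixels outside the image as background. The original
% may use a different element shape or size.

```python
def _morph(binary, combine):
    """Apply `combine` (all or any) over each pixel's 3x3 neighbourhood."""
    height = len(binary)
    width = len(binary[0]) if height else 0

    def at(y, x):
        # Pixels outside the image count as background.
        return binary[y][x] if 0 <= y < height and 0 <= x < width else 0

    return [
        [int(combine(at(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
         for x in range(width)]
        for y in range(height)
    ]


def erode(binary):
    """A pixel survives only if its whole 3x3 neighbourhood is foreground."""
    return _morph(binary, all)


def dilate(binary):
    """A pixel is set if any pixel in its 3x3 neighbourhood is foreground."""
    return _morph(binary, any)


def open_filter(binary):
    """Opening = erosion followed by dilation; removes small specks."""
    return dilate(erode(binary))
```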
% Separate out the sentences (OPTIONAL: this helps with noisy images;
% it is not needed if the image is not noisy)
https://stackoverflow.com/questions/28935983/preprocessing-image-for-tesseract-ocr-with-opencv