Unitlab AI: Easiest Way To Label Datasets For Machine Learning

Data Annotation with UnitLab for Machine Learning

Key Concepts:

Data Annotation: The process of labeling data (images, text, audio, etc.) to train machine learning models.
OCR (Optical Character Recognition): Technology to convert images of text into machine-readable text.
UnitLab: A platform for data annotation and model deployment.
Data Augmentation: Techniques to artificially increase the size of a dataset by creating modified versions of existing data.
COCO Format: A common JSON-based format for storing annotation data.
Named Entity Recognition (NER): Identifying and classifying named entities (people, organizations, locations) in text.
Automation (in UnitLab): Utilizing pre-trained or custom models to automatically annotate data.
MMOCR: The framework used to train the OCR model (DBNet for text detection, AITE for text recognition).

1. Importance of Data Quality in Machine Learning

The video's central claim is that data quality decides whether a machine learning project succeeds. Even with ideal architecture, training logic, and hardware, training fails if the data is missing, low in quality, or poorly labeled. The video therefore focuses on demonstrating efficient, professional data annotation. It starts from raw data (here, images of Vietnamese text), labels a subset, trains and deploys a model on those labels, and then uses that model to label a larger dataset automatically. The same approach applies to other data types, including audio and health data.

2. Project Setup and Data Upload in UnitLab

The demonstration begins with two folders: one containing 200 images and another with 2,000 images of Vietnamese text. The workflow involves manually annotating the 200-image set and then training a model to automatically annotate the remaining 2,000 images. This is achieved using UnitLab.

Project Creation: A new project named "tutorial Vietnamese OCR" is created in UnitLab and configured for the "image OCR" annotation type.
Data Upload: The 200 images are uploaded to the project. The user is confirmed as the annotator.
Transparency Note: The video acknowledges sponsorship from UnitLab and notes that the platform offers a free-forever plan, with the limitation that it does not support private datasets.

3. Manual and Automated Annotation Techniques

The video contrasts manual and automated annotation methods within UnitLab.

Manual Annotation (Not Recommended): The user could manually draw bounding boxes around text and type the corresponding text, but this is deemed inefficient.
Automated Annotation with Pre-trained Model: The preferred method involves utilizing UnitLab’s automation feature. A pre-trained "document OCR" model is selected. This model automatically annotates the images when the user drags a "magic crop tool" over the text areas. Manual adjustments are still possible to correct any errors. This semi-automated approach significantly speeds up the annotation process.

4. Data Annotation Workflow & Results

The annotation process involves dragging the "magic crop tool" across the images. The model automatically detects and labels the text. The user can zoom, select multiple images, and apply the annotation to all of them, then manually correct any inaccuracies. Once completed, the dataset is fully annotated and ready for export.

5. UnitLab’s Versatility: Beyond Text OCR

The video showcases UnitLab’s capabilities beyond text OCR, demonstrating its application to other data types:

Audio Segmentation: Labeling segments of audio with speaker identification (e.g., Speaker 1, Speaker 2) and transcription of the spoken content. An example is provided with a recording of someone saying, "Harry quickly climbed onto the back of the boat and hid."
Medical Annotation: 3D annotation of medical scans, specifically highlighting the spine and lungs. The "magic touch tool" allows for highlighting structures across different views of the scan.
Named Entity Recognition (NER): Labeling different entities within text, such as people, titles, and locations. An example using text from "Hamlet" labels characters as "person" and also tags locations.
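
To make the NER labeling idea concrete, here is a minimal Python sketch of how entity labels are commonly stored as character-offset spans. It is not UnitLab's export format, which the video does not describe. The EntitySpan class, the tag_entities helper, the phrase dictionary, and the sample sentence are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    start: int   # inclusive character offset into the text
    end: int     # exclusive character offset into the text
    label: str   # entity type, e.g. "PERSON" or "LOCATION"


def tag_entities(text, phrases):
    """Return one span for every occurrence of each known phrase.

    `phrases` maps a surface string to its label. This simple lookup
    stands in for the manual highlighting an annotator would do.
    """
    spans = []
    for phrase, label in phrases.items():
        pos = text.find(phrase)
        while pos != -1:
            spans.append(EntitySpan(pos, pos + len(phrase), label))
            pos = text.find(phrase, pos + len(phrase))
    return sorted(spans, key=lambda s: s.start)


# Illustrative sentence, not a quotation from the play or the video.
text = "Hamlet returns to Elsinore and confronts Claudius."
phrases = {"Hamlet": "PERSON", "Claudius": "PERSON", "Elsinore": "LOCATION"}

spans = tag_entities(text, phrases)
for span in spans:
    print(text[span.start:span.end], span.label)
```

Storing offsets rather than the matched strings keeps each label tied to an exact position, so repeated names in the same text can be labeled independently.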

6. Exporting Data and Data Augmentation

Once the initial dataset is annotated, the next step is to export it for model training.

Export Format: The data is exported in the COCO format, a widely used JSON-based format for annotation data.
Data Augmentation: Before exporting, data augmentation is performed to artificially increase the dataset size. This involves applying transformations like blurring, cropping, rotation, saturation adjustments, and brightness changes. The goal is to create more robust models by exposing them to variations in the data. In this example, rotation and flipping are disabled, and approximately 1,000 augmented images are generated.
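
As a rough illustration of the two items above, the following Python sketch builds a tiny COCO-style record and applies one augmentation (a crop) to its bounding boxes. The keys "images", "annotations", "categories", and the [x, y, width, height] "bbox" layout follow the COCO convention. The "text" field for the transcription, the file name, and the crop_dataset helper are assumptions made for this example, and UnitLab's actual export may differ.

```python
import copy

# Minimal COCO-style dataset for one image. The "text" field holding the
# transcription is a hypothetical extension, not part of core COCO.
dataset = {
    "images": [
        {"id": 1, "file_name": "sample_001.jpg", "width": 640, "height": 480}
    ],
    "categories": [{"id": 1, "name": "text"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 50, 200, 40],   # [x, y, width, height] in pixels
         "text": "Xin chào"},
        {"id": 2, "image_id": 1, "category_id": 1,
         "bbox": [500, 400, 120, 60],
         "text": "Việt Nam"},
    ],
}


def crop_dataset(ds, left, top, right, bottom):
    """Return a copy of ds as if every image were cropped to one window.

    Boxes are shifted into the crop's coordinate system and clipped to it.
    Boxes that fall entirely outside the window are dropped.
    """
    out = copy.deepcopy(ds)
    for img in out["images"]:
        img["width"], img["height"] = right - left, bottom - top
    kept = []
    for ann in out["annotations"]:
        x, y, w, h = ann["bbox"]
        x0, y0 = max(x, left), max(y, top)
        x1, y1 = min(x + w, right), min(y + h, bottom)
        if x1 <= x0 or y1 <= y0:
            continue  # no overlap with the crop window
        ann["bbox"] = [x0 - left, y0 - top, x1 - x0, y1 - y0]
        kept.append(ann)
    out["annotations"] = kept
    return out


cropped = crop_dataset(dataset, left=50, top=20, right=450, bottom=300)
print(cropped["images"][0]["width"], cropped["images"][0]["height"])
print(cropped["annotations"])
```

The sketch only adjusts annotations; cropping the pixels themselves would need an image library. A box that is partly clipped keeps its full transcription here, so a real pipeline would have to decide whether to drop or re-label such boxes.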

7. Model Training and Deployment (Overview)

The video provides a high-level overview of the model training and deployment process.

Training Framework: The MMOCR framework is used, specifically DBNet for text detection and AITE for text recognition.
Deployment: The trained model is deployed to an endpoint (e.g., on a server).
UnitLab Integration: The endpoint URL is integrated into UnitLab as a custom AI model.
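
The deployment step can be pictured with a minimal Python sketch of an HTTP endpoint built on the standard library. The video does not describe the request and response contract UnitLab expects from a custom model, so everything about the JSON shape here (the "predictions", "bbox", "text", and "score" keys) is hypothetical, and predict is a placeholder rather than a real MMOCR call.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(image_bytes):
    """Placeholder for the trained OCR model; returns a fixed dummy result."""
    return [{"bbox": [0, 0, 10, 10], "text": "placeholder", "score": 1.0}]


def build_response(image_bytes):
    """Wrap the model's predictions in a JSON body (hypothetical schema)."""
    payload = {"predictions": predict(image_bytes)}
    return json.dumps(payload).encode("utf-8")


class OCRHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw request body, assumed here to be the image bytes.
        length = int(self.headers.get("Content-Length", 0))
        body = build_response(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer((host, port), OCRHandler).serve_forever()
```

Calling serve() starts the server locally. In a real deployment the predict function would load the trained weights once at startup and run inference per request, and the response would follow whatever schema the annotation platform documents.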

8. Automated Annotation with Custom Model

After deploying the custom model, it can be used for automated annotation within UnitLab.

Model Integration: The endpoint URL of the deployed model is added to UnitLab as a new AI model.
Validation: A test image is uploaded to validate the model’s functionality.
Batch Annotation: The custom model is then used to automatically annotate the 2,000 remaining images in the dataset. The user can review and manually correct any errors.
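
One way to organize the review step is to look at the model's least certain predictions first. The Python sketch below assumes each prediction carries a confidence score, which the video does not state. The field names, sample values, and the 0.9 threshold are all hypothetical.

```python
def split_for_review(predictions, threshold=0.9):
    """Partition predictions into auto-accepted and needs-review lists.

    Items below the threshold are returned lowest score first, so the
    annotator sees the most doubtful labels at the top of the queue.
    """
    accepted = [p for p in predictions if p["score"] >= threshold]
    review = sorted(
        (p for p in predictions if p["score"] < threshold),
        key=lambda p: p["score"],
    )
    return accepted, review


# Made-up sample predictions for illustration only.
predictions = [
    {"image": "img_0001.jpg", "text": "Hà Nội", "score": 0.98},
    {"image": "img_0002.jpg", "text": "phở bò", "score": 0.62},
    {"image": "img_0003.jpg", "text": "cà phê", "score": 0.85},
]

accepted, review = split_for_review(predictions, threshold=0.9)
print("accepted:", [p["image"] for p in accepted])
print("review:", [p["image"] for p in review])
```

The threshold trades review effort against label quality: a higher value sends more images to the human annotator, a lower value trusts the model more.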

9. Key Quote

“Data is arguably the most important part when it comes to a machine learning process. You can have the perfect architecture, the perfect training logic, the most modern hardware. If your data is of low quality, or if it's non-existent, if the labeling of your data is of low quality, every effort is futile.” – Speaker, emphasizing the critical role of data quality.

Conclusion:

This video demonstrates data annotation with UnitLab from raw images to a labeled dataset. The workflow is semi-automated: pre-trained and custom-trained models produce the initial labels, and a human reviews and corrects them. The video stresses that high-quality data is essential in machine learning and shows UnitLab being used to create labeled datasets for OCR, audio segmentation, medical imaging, and named entity recognition. Combining manual review, automated annotation, and data augmentation is presented as the way to build a reliable dataset for training effective machine learning models.