CIC-YNU-IoTMal: Multilayer Dataset for Static & Dynamic Analysis of IoT Malware Behaviour
By Canadian Institute for Cybersecurity (CIC)
Key Concepts
- IoT Security: The practice of protecting internet-connected devices from malicious exploitation.
- CIC-YNU-IoT-MAL-2026: A multi-layer, cross-architectural dataset for IoT malware research.
- Static Analysis: Examining code (e.g., byte code, strings) without executing it.
- Dynamic Analysis: Observing malware behavior during execution in a controlled environment.
- Heterogeneous Architectures: IoT devices running on diverse CPU architectures (ARM, MIPS, MIPSEL, x86).
- Polymorphism/Packing: Techniques used by malware to change its appearance or structure to evade detection.
- Zero-Day Attack: An attack that exploits a previously unknown vulnerability or uses a previously unseen malware variant.
- Honeypot: A decoy system used to attract and analyze malicious activity.
1. Main Topics and Key Points
The presentation addresses the critical security challenges posed by the rapid expansion of the IoT ecosystem. With over 21.3 billion connected devices in 2025 (projected to reach 40 billion by 2030), the attack surface has grown substantially.
- The Problem: Traditional static analysis is often defeated by malware packing and polymorphism. Dynamic analysis, while effective, is resource-intensive.
- The Solution: The CIC-YNU-IoT-MAL-2026 dataset, which integrates both static and dynamic analysis across four major architectures (ARM, MIPS, MIPSEL, x86).
- Key Statistics: Malware attacks on IoT devices surged from 33 million in 2018 to 129 million in 2023. Approximately 57% of connected devices are considered vulnerable.
2. Real-World Applications
- Cross-Architecture Generalization: Researchers can use the dataset to test if a mitigation technique developed for one architecture (e.g., ARM) is effective on another (e.g., MIPS).
- Zero-Day Detection: By training models on specific malware families present in one architecture and testing them on others, researchers can evaluate the ability of algorithms to detect unseen variants.
- Adversarial Robustness: The dataset allows for testing how well security detectors perform against packed or mutated malware.
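The cross-architecture and zero-day evaluations described above amount to holding out one architecture at a time. The following is a minimal sketch of such a split, not the team's actual protocol. The record fields (`arch`, `family`, `features`), the sample values, and the function names are all hypothetical and do not reflect the dataset's real schema.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative only and are
# not taken from the dataset's actual schema.
SAMPLES = [
    {"arch": "ARM", "family": "Mirai", "features": [0.9, 0.1]},
    {"arch": "ARM", "family": "benign", "features": [0.1, 0.8]},
    {"arch": "MIPS", "family": "Mirai", "features": [0.8, 0.2]},
    {"arch": "MIPSEL", "family": "other_family", "features": [0.7, 0.3]},
    {"arch": "x86", "family": "benign", "features": [0.2, 0.9]},
]


def leave_one_architecture_out(samples):
    """Yield (held_out_arch, train, test), holding out one architecture per fold."""
    by_arch = defaultdict(list)
    for sample in samples:
        by_arch[sample["arch"]].append(sample)
    for held_out in sorted(by_arch):
        train = [
            s
            for arch, group in by_arch.items()
            if arch != held_out
            for s in group
        ]
        yield held_out, train, by_arch[held_out]


def unseen_families(train, test):
    """Return families present in the test fold but absent from training."""
    return {s["family"] for s in test} - {s["family"] for s in train}


for arch, train, test in leave_one_architecture_out(SAMPLES):
    print(arch, len(train), len(test), sorted(unseen_families(train, test)))
```

A fold whose `unseen_families` set is non-empty approximates the zero-day setting, because the model is scored on families it never saw during training.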
3. Methodology and Framework
The development of the dataset followed a rigorous, automated three-tier protocol:
- Management Tier: Controls the workflow, manages the task queue, and orchestrates the execution of malware samples.
- Worker (Execution) Tier: Uses a Python orchestrator, VM console bridge, and network monitor to execute malware within a safe, isolated environment.
- Isolation Tier: Utilizes QEMU (Quick Emulator) with OpenWrt as the guest OS to simulate real-world IoT device behavior without risking external network infection.
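The division of labour between the three tiers can be sketched as a task queue that hands samples to a worker, which in turn plans an isolated run. This is an illustrative outline only. The class and function names, the run-plan fields, the timeout, and the mapping from architecture to QEMU system emulator binary are assumptions rather than details from the presentation, and no VM is actually started.

```python
import queue
from dataclasses import dataclass

# Assumed mapping from target architecture to a QEMU system emulator binary.
QEMU_BINARIES = {
    "ARM": "qemu-system-arm",
    "MIPS": "qemu-system-mips",
    "MIPSEL": "qemu-system-mipsel",
    "x86": "qemu-system-i386",
}


@dataclass
class Task:
    sample_path: str  # hypothetical path to the binary under analysis
    arch: str


class ManagementTier:
    """Owns the task queue and releases tasks in first-in, first-out order."""

    def __init__(self):
        self.tasks = queue.Queue()

    def submit(self, task):
        if task.arch not in QEMU_BINARIES:
            raise ValueError(f"unsupported architecture: {task.arch}")
        self.tasks.put(task)


def plan_run(task, timeout_s=60):
    """Worker tier: describe one isolated run without launching anything."""
    return {
        "emulator": QEMU_BINARIES[task.arch],  # isolation tier: QEMU
        "guest_os": "OpenWrt",                 # guest OS named in the talk
        "sample": task.sample_path,
        "timeout_s": timeout_s,                # assumed execution budget
        "collect": ["pcap", "strace", "sar"],
    }


def drain(manager):
    """Pull every queued task and return the corresponding run plans."""
    plans = []
    while not manager.tasks.empty():
        plans.append(plan_run(manager.tasks.get()))
    return plans


manager = ManagementTier()
manager.submit(Task("samples/a.elf", "ARM"))
manager.submit(Task("samples/b.elf", "MIPSEL"))
for plan in drain(manager):
    print(plan["emulator"], plan["sample"])
```

Keeping the plan as plain data separates orchestration from execution, so the queueing logic can be tested without touching live malware.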
Benign Sample Generation: Since benign binaries are scarce, the team utilized an automated pipeline integrated with ChatGPT to generate benign samples that mimic real-world traffic behavior.
4. Key Arguments and Evidence
Daniel argues that current IoT security is hindered by "data scarcity" and "architectural heterogeneity."
- Evidence: He highlights that most existing datasets are restricted, static-only, or limited to a single architecture.
- Supporting Data: The team performed extensive feature extraction, including 40 features from PCAP files and 461 features from system-call traces (strace) and system activity reports (SAR).
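To give a sense of what a trace-derived feature might look like, the sketch below turns strace-style output into normalized system-call frequencies. The talk does not enumerate the 461 features, so this is one plausible feature family rather than the team's extraction code. The trace text, the vocabulary, and the function names are hypothetical, and the parser assumes lines without timestamp prefixes.

```python
import re
from collections import Counter

# Matches an optional "[pid N] " or "N " prefix followed by "syscall_name(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")

# Hypothetical trace excerpt, written in the usual strace output style.
EXAMPLE_TRACE = """\
execve("/bin/sample", ["/bin/sample"], 0x7ffd) = 0
openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3
[pid  412] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
[pid  412] connect(4, {sa_family=AF_INET}, 16) = -1 ECONNREFUSED (Connection refused)
413 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
--- SIGCHLD {si_signo=SIGCHLD} ---
+++ exited with 0 +++
"""


def syscall_counts(trace_text):
    """Count system calls by name, skipping signal and exit marker lines."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = SYSCALL_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def to_feature_vector(counts, vocabulary):
    """Map counts to relative frequencies in a fixed vocabulary order."""
    total = sum(counts.values()) or 1  # avoid division by zero on empty traces
    return [counts.get(name, 0) / total for name in vocabulary]


counts = syscall_counts(EXAMPLE_TRACE)
print(dict(counts))
print(to_feature_vector(counts, ["socket", "connect", "fork"]))
```

A fixed vocabulary keeps vectors the same length across samples, which is what most tabular classifiers expect.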
5. Notable Quotes
- "The heterogeneity that exists in the architectures of their design... makes the malware to adapt to the architecture. If we develop a mitigation technique for one architecture, it is unable to adapt to another." — Daniel
- "We are looking at developing a malware [dataset] that is enriched cross-architectural that can be provided free of cost and publicly available to the community." — Daniel
6. Data Processing and Validation
- Feature Extraction: The team used tcpdump for network traffic, Linux strace for system calls, and sar for system activity.
- Validation: Eight machine learning algorithms (including Decision Trees, Random Forest, and XGBoost) were used to validate the dataset.
- Handling Imbalance: To address the high volume of specific malware families (like Mirai), the team employed SMOTE (Synthetic Minority Over-sampling Technique) and class-weighting in their training models.
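The two imbalance remedies mentioned above can be illustrated in a few lines of standard-library Python. This is a simplified sketch, not the team's training code. The weight formula is the common "balanced" heuristic (total samples divided by the number of classes times the class count), and `smote_like` shows only the core SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours. The function names, parameters, and toy data are hypothetical.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Return weight_c = n_samples / (n_classes * count_c) for each class."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Each new point lies on the segment between a randomly chosen base sample
    and one of its k nearest minority neighbours. Needs at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy labels: one dominant family and a small minority class.
labels = ["Mirai"] * 8 + ["other"] * 2
print(balanced_class_weights(labels))

minority_points = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(smote_like(minority_points, n_new=4))
```

Class weighting leaves the data untouched and changes only the loss, while oversampling adds synthetic rows, so oversampling is normally applied to the training split only, never to the test data.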
7. Synthesis and Conclusion
The CIC-YNU-IoT-MAL-2026 dataset represents a significant advancement in IoT security research by providing a publicly available, multi-architectural, and multi-behavioral resource. By combining static and dynamic analysis, the dataset enables researchers to build more robust, generalized, and adaptive intrusion detection systems. The project is open for collaboration, with future goals to expand the number of malware families included in the repository.