The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total of 101k images. The labels for the test images have been manually cleaned, while the training set contains some noise.
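For reference, the dataset can be loaded directly with torchvision's built-in wrapper; the sketch below simply restates the split sizes given above (torchvision >= 0.12 is an assumption, since the version is not stated).

```python
# Minimal sketch: loading Food-101 via torchvision's built-in dataset class.
# Assumes torchvision >= 0.12, which ships datasets.Food101; download=True
# fetches the ~5 GB archive on first use.
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

train_set = datasets.Food101(root="data", split="train",
                             transform=to_tensor, download=True)
test_set = datasets.Food101(root="data", split="test",
                            transform=to_tensor, download=True)

# 101 classes x (750 train + 250 test) images = 101,000 images in total.
assert len(train_set.classes) == 101
print(len(train_set), len(test_set))  # 75750 25250
```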
PyTorch (pytorch.org, accessed on December 1, 2023) was used for the code implementation, and the University of Arizona's high-performance computing platform was used for all training.

Performance metrics. The dataset was used to build a classification system, and four main analytical metrics were created to evaluate it.
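The excerpt is cut off before naming the four metrics. A common quartet for multi-class classification is accuracy, precision, recall, and F1-score; the sketch below is a hypothetical reconstruction under that assumption, using scikit-learn.

```python
# Hypothetical sketch: the four metrics are assumed to be accuracy,
# precision, recall, and F1-score; the source text is truncated before
# naming them.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def classification_metrics(y_true, y_pred):
    """Return the four metrics, macro-averaged over the 101 classes."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall":    recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1":        f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```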
However, pretraining on the Food2K dataset still yielded performance gains. This is because the food image retrieval task is relatively sensitive to differences between the target datasets (ETH Food-101 and Vireo Food-172), and it indirectly indicates that the diversity of image categories and scales in Food2K improves the generalization of food image retrieval.
Online notebooks for training Faster R-CNN and RetinaNet models on the dataset using Google Colaboratory are available here:
Faster R-CNN (PyTorch): https://drive.google.com/open?id=1CDQ5cIA8qsdm-OinbfPKM5DuoI6ewvZH
RetinaNet (TensorFlow): https://drive.google.com/open?id=1KxP-j0TSQ_PY7xkJ4JNRyMnLv7kj...
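The notebooks themselves are external, but for orientation, torchvision ships reference implementations of both detector families. The sketch below is an independent PyTorch example, not the notebooks' code, and num_classes is a placeholder since the dataset's class count is not stated here.

```python
# Independent sketch, not the notebooks' code. num_classes is hypothetical
# (object classes + background for Faster R-CNN); weights are left unset so
# the example runs offline.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, retinanet_resnet50_fpn

faster_rcnn = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None, num_classes=2)
retinanet = retinanet_resnet50_fpn(weights=None, weights_backbone=None, num_classes=2)

images = [torch.rand(3, 512, 512)]  # detectors take a list of CHW tensors
faster_rcnn.eval()
with torch.no_grad():
    preds = faster_rcnn(images)     # list of {"boxes", "labels", "scores"}
print(preds[0]["boxes"].shape)
```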
The software environment was Python 3.7 with the PyTorch framework. We used pretrained weights from the ImageNet-22k dataset for parameter initialization. During the training phase, each input image was randomly cropped (1-crop) to 224 × 224, followed by random horizontal flipping for data augmentation.
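A minimal sketch of this preprocessing pipeline with torchvision transforms follows; RandomResizedCrop is one common reading of "randomly cropped to 224 × 224", and the ImageNet normalization statistics are an assumption not stated in the text.

```python
# Sketch of the training-time pipeline described above. RandomResizedCrop
# is one plausible reading of the random 1-crop; the normalization values
# are the standard ImageNet statistics, an assumption.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random crop to 224 x 224
    transforms.RandomHorizontalFlip(),   # random horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```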
For the Food-101 dataset, the results are summarized in Table 1. We observe that the multimodal approach, which combines image and text features, consistently outperforms the image-only and text-only methods across all model variants. Specifically, the ViT-L/14 model achieves the highest overall performance.
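The fusion itself is not shown in the excerpt. As one illustrative possibility, the sketch below assumes the ViT-L/14 backbone is CLIP's (openai/clip-vit-large-patch14 on Hugging Face) and fuses the two modalities by simple concatenation; the paper's actual scheme may differ.

```python
# Illustrative sketch of image-text feature fusion. Assumptions: the
# ViT-L/14 backbone is CLIP's, and fusion is plain concatenation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.new("RGB", (224, 224))        # placeholder for a food photo
caption = "a photo of spaghetti carbonara"  # hypothetical paired text

inputs = processor(text=[caption], images=image, return_tensors="pt")
with torch.no_grad():
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Concatenate the two modalities into one feature for a classifier head.
fused = torch.cat([img_feat, txt_feat], dim=-1)  # shape: (1, 768 + 768)
print(fused.shape)
```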
For instance, DCL [35] performed worse on the ETH Food-101 dataset than on other fine-grained datasets, possibly because it accounts for neither the texture information captured by shallow layers of the network nor the differences in feature distribution within the same category. Additionally, some fine-...