After obtaining sampled predictions from the previous step, we use SFT so that the model learns to recognize the question-answering task. Rather than training on human-annotated answers or pseudo-optimal responses generated by GPT-4, we treat a self-generated response as the labeled answer to remove the dependency on resources ...
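As a minimal sketch of how a self-generated response could serve as the SFT label, the snippet below builds one training example from a question and its sampled predictions. The names `build_sft_example`, `pick_first`, and `toy_tokenize` are hypothetical. The selection rule (taking the first sample) and the masking of prompt tokens are assumptions made for illustration, not details stated in the text. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import Callable, Dict, List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default


def pick_first(samples: List[str]) -> str:
    """Hypothetical selection rule (assumption): take the first sampled response."""
    return samples[0]


def build_sft_example(
    question: str,
    samples: List[str],
    tokenize: Callable[[str], List[int]],
    select: Callable[[List[str]], str] = pick_first,
) -> Dict[str, List[int]]:
    """Turn a question and its sampled predictions into one SFT example.

    The selected self-generated response plays the role of the labeled answer;
    no human annotation or external model output is involved.
    """
    prompt_ids = tokenize(question)
    answer_ids = tokenize(select(samples))
    input_ids = prompt_ids + answer_ids
    # Assumption: the loss is computed on answer tokens only, so prompt
    # positions are masked with IGNORE_INDEX.
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}


def toy_tokenize(text: str) -> List[int]:
    """Toy stand-in for a real tokenizer: one id (the word length) per word."""
    return [len(word) for word in text.split()]
```

In practice `toy_tokenize` would be replaced by the model's own tokenizer, and `select` by whatever rule is used to choose among the sampled predictions.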