These annotations form the basis for quality filtering during the filtering step.

7) URL Block-Listing: Block-listing identifies documents that must be excluded from IBM's curated pre-training dataset. The block list is continuously maintained and includes URLs known for disseminating pirated...
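As an illustration of the block-listing step, the following is a minimal sketch, not the pipeline's actual implementation. It assumes each document carries its source URL in a `url` field and that the block list holds both whole domains and URL prefixes. The names `BLOCKED_DOMAINS`, `BLOCKED_URL_PREFIXES`, `is_blocked`, and `filter_documents` are hypothetical, and the listed entries are placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical block list; these entries are placeholders, not real sources.
BLOCKED_DOMAINS = {"blocked.example", "pirate.example"}
BLOCKED_URL_PREFIXES = ("http://mixed.example/warez/",)


def normalize_host(url: str) -> str:
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_blocked(url: str) -> bool:
    """True if the URL's host, a parent domain, or a URL prefix is block-listed."""
    parts = normalize_host(url).split(".")
    # Check the host and every parent domain: a.b.example, b.example, example.
    for i in range(len(parts)):
        if ".".join(parts[i:]) in BLOCKED_DOMAINS:
            return True
    # Fall back to exact prefix matching for partially blocked sites.
    return url.startswith(BLOCKED_URL_PREFIXES)


def filter_documents(docs):
    """Keep only documents whose source URL is not block-listed."""
    return [doc for doc in docs if not is_blocked(doc["url"])]
```

Matching on parent domains means a blocked domain also covers its subdomains, while a prefix entry blocks only part of a site. Whether the real block list works at domain, prefix, or full-URL granularity is not stated in the text, so both forms here are assumptions.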