Nearby tokens should be important because, in a sentence (sequence of words), the current word is highly dependent on neighboring past & future tokens. This intuition is the idea behind the concept of sliding attention.>>> # considering `window_size = 3`, we will consider 1 token to left...