Building Effective Data Pipelines for Large Language Models in Chemistry