Job Title: Software Engineer - Big Data
Department: IDP
About Us
HG Insights is the global leader in technology intelligence, delivering actionable AI-driven insights through advanced data science and scalable big data solutions. Our Big Data Insights Platform processes billions of unstructured documents and powers a vast data lake, enabling enterprises to make strategic, data-driven decisions. Join our team to solve complex data challenges at scale and shape the future of B2B intelligence.
What You’ll Do:
- Build and optimize large-scale distributed data pipelines for processing billions of unstructured documents using Databricks, Apache Spark, and cloud-native big data tools.
- Scale enterprise-grade big data systems, including data lakes, ETL/ELT workflows, and syndication platforms for customer-facing Insights-as-a-Service (InaaS) products.
- Implement cutting-edge solutions for data ingestion, transformation, and analytics using Hadoop/Spark ecosystems, Elasticsearch, and cloud services (AWS EC2, S3, EMR).
- Drive system reliability through automation, CI/CD pipelines (Docker, Kubernetes, Terraform), and infrastructure-as-code practices.
- Implement data orchestration strategies using Airflow to manage multi-cloud workflows across AWS/Azure/GCP, Kubernetes clusters, and hybrid environments.
What You’ll Be Responsible For
- Building and troubleshooting complex big data pipelines, including performance tuning of Spark jobs, query optimization, and data quality enforcement.
- Collaborating in agile workflows (daily stand-ups, sprint planning) to deliver features rapidly while maintaining system stability.
- Ensuring security and compliance across data workflows, including access controls, encryption, and governance policies.
What You’ll Need
- BS/MS/Ph.D. in Computer Science or a related field, with 5+ years of experience building production-grade big data systems.
- Extensive experience in Scala/Java for Spark development, including optimization of batch/streaming jobs and debugging distributed workflows.
- Airflow orchestration (DAGs, operators, sensors) and integration with Spark/Databricks.
- Distributed workflow scheduling and dependency management.
- Performance tuning of Airflow DAGs and Spark jobs in multi-tenant environments.
- Proven track record with:
  - Databricks, Hadoop/Spark ecosystems, and SQL/NoSQL databases (MySQL, Elasticsearch).
  - Cloud platforms (AWS EC2, S3, EMR) and infrastructure-as-code tools (Terraform, Kubernetes).
  - RESTful APIs, microservices architectures, and CI/CD automation.
- 5+ years of designing, modeling, and building big data pipelines in an enterprise work setting.