Joining two JSONs with Dataflow on Google Cloud Platform


I want to find the female employees from two different JSON files, select only the fields we are interested in, and write the output to another JSON file.

I am trying to implement this with Dataflow on Google Cloud Platform. Can someone provide some sample Java code that would achieve this result?

Employee JSON

{"emp_id":"OrgEmp#1","emp_name":"Adam","emp_dept":"OrgDept#1","emp_country":"USA","emp_gender":"female","emp_birth_year":"1980","emp_salary":"$100000"}
{"emp_id":"OrgEmp#1","emp_name":"Scott","emp_dept":"OrgDept#3","emp_country":"USA","emp_gender":"male","emp_birth_year":"1985","emp_salary":"$105000"}

Department JSON

{"dept_id":"OrgDept#1","dept_name":"Account","dept_start_year":"1950"}
{"dept_id":"OrgDept#2","dept_name":"IT","dept_start_year":"1990"}
{"dept_id":"OrgDept#3","dept_name":"HR","dept_start_year":"1950"}

The expected output JSON file should look like this:

{"emp_id":"OrgEmp#1","emp_name":"Adam","dept_name":"Account","emp_salary":"$100000"}

2 Answers

You can do this with CoGroupByKey (which involves a shuffle), or, if your departments collection is significantly smaller, with a side input.

I will write the code in Python, but you can build the same pipeline in Java.


With a side input, you would:

  1. Convert your departments PCollection into a map from dept_id to the department JSON dictionary.

  2. Then process the employees PCollection as the main input, and for each employee use its department id to look up the department's JSON in the side input.

Like this:

import apache_beam as beam

# LoadDepts / LoadEmps stand for whatever transforms read and parse the JSON files.
departments = (p | LoadDepts()
                 | 'key_dept' >> beam.Map(lambda dept: (dept['dept_id'], dept)))

# Turn the keyed departments into a dict-type side input.
deps_si = beam.pvalue.AsDict(departments)

employees = (p | LoadEmps())

def join_emp_dept(employee, dept_dict):
  # dict.update returns None, so update in place and return the employee.
  employee.update(dept_dict[employee['dept_id']])
  return employee

joined_dicts = employees | beam.Map(join_emp_dept, dept_dict=deps_si)

With CoGroupByKey, you group both collections using dept_id as the key. This results in a PCollection of key-value pairs where the key is a department id and the value contains two iterables: the departments with that id and the employees of that department.

import itertools

departments = (p | LoadDepts()
               | 'key_dept' >> beam.Map(lambda dept: (dept['dept_id'], dept)))

employees = (p | LoadEmps()
               | 'key_emp' >> beam.Map(lambda emp: (emp['dept_id'], emp)))

def join_lists(kv):
  # kv is (dept_id, {'employees': [...], 'departments': [...]}).
  _, grouped = kv
  return itertools.product(grouped['employees'], grouped['departments'])

def merge_dicts(pair):
  emp_dict, dept_dict = pair
  merged = dict(emp_dict)
  merged.update(dept_dict)
  return merged

def filter_fields(row):
  # Keep only the fields shown in the expected output.
  return {k: row[k] for k in ('emp_id', 'emp_name', 'dept_name', 'emp_salary')}

joined_dicts = (
    {'employees': employees, 'departments': departments}
    | beam.CoGroupByKey()
    | beam.FlatMap(join_lists)
    | 'mergedicts' >> beam.Map(merge_dicts)
    | 'filterfields' >> beam.Map(filter_fields)
)

Someone asked for a Java-based solution to this question. Here is the Java code for it. It is more verbose, but essentially the same.

// First we want to load all departments, and put them into a PCollection
// of key-value pairs, where the Key is their identifier. We assume that it is String-type.
PCollection<KV<String, Department>> departments = 
    p.apply(new LoadDepts())
     .apply("getKey", MapElements
         .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(Department.class)))
         .via((Department dept) -> KV.of(dept.getId(), dept)));

// We then convert this PCollection into a map-type PCollectionView.
// We can access this map directly within a ParDo.
PCollectionView<Map<String, Department>> departmentSideInput = 
    departments.apply("ToMapSideInput", View.<String, Department>asMap());

// We load the PCollection of employees
PCollection<Employee> employees = p.apply(new LoadEmployees());

// Let us suppose that we will *extend* an employee information with their
// Department information. I have assumed the existence of an ExtendedEmployee
// class to represent an employee extended with department information.
class JoinDeptEmployeeDoFn extends DoFn<Employee, ExtendedEmployee> {

  @ProcessElement
  public void processElement(ProcessContext c) {
    // We obtain the Map-type side input with department information.
    Map<String, Department> departmentMap = c.sideInput(departmentSideInput);
    Employee empl = c.element();
    Department dept = departmentMap.get(empl.getDepartmentId());
    if (dept == null) return;

    ExtendedEmployee result = empl.extendWith(dept);
    c.output(result);
  }

}

// We apply the ParDo to extend the employee with department information
// and specify that it takes in a departmentSideInput.
PCollection<ExtendedEmployee> extendedEmployees = 
    employees.apply(
        ParDo.of(new JoinDeptEmployeeDoFn()).withSideInputs(departmentSideInput));
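
Both Java snippets assume LoadDepts and LoadEmployees transforms that already produce parsed Department and Employee objects. As a rough sketch of what such a transform could look like (this is an assumption, not part of the original answer), one could read the newline-delimited JSON with TextIO and parse each line with Gson into a POJO whose field names match the JSON keys; the bucket path below is a placeholder:

// Sketch only; imports shown for completeness.
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import com.google.gson.Gson;

// Hypothetical implementation of the LoadDepts placeholder used above.
// Assumes Department is a serializable POJO whose field names match the JSON keys.
class LoadDepts extends PTransform<PBegin, PCollection<Department>> {
  @Override
  public PCollection<Department> expand(PBegin input) {
    return input
        // Read one JSON object per line from the input file (placeholder path).
        .apply("ReadDeptFile", TextIO.read().from("gs://my-bucket/departments.json"))
        // Parse each line into a Department object.
        .apply("ParseJson", MapElements
            .into(TypeDescriptor.of(Department.class))
            .via((String line) -> new Gson().fromJson(line, Department.class)));
  }
}

A LoadEmployees transform would look the same, with Employee in place of Department.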

With CoGroupByKey, you can group both collections using the department id as the key. In the Beam Java SDK this comes out as a CoGbkResult.

// We load the departments, and make them a key-value collection, to Join them
// later with employees.
PCollection<KV<String, Department>> departments = 
    p.apply(new LoadDepts())
     .apply("getKey", MapElements
         .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(Department.class)))
         .via((Department dept) -> KV.of(dept.getId(), dept)));

// Because we will perform a join, employees also need to be put into
// key-value pairs, where their key is their *department id*.
PCollection<KV<String, Employee>> employees = 
    p.apply(new LoadEmployees())
     .apply("getKey", MapElements
         .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(Employee.class)))
         .via((Employee empl) -> KV.of(empl.getDepartmentId(), empl)));

// We define a DoFn that is able to join a single department with multiple
// employees.
class JoinEmployeesWithDepartments extends DoFn<KV<String, CoGbkResult>, ExtendedEmployee> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    KV<String, CoGbkResult> elm = c.element();
    // We assume one department with the same ID, and assume that
    // employees always have a department available.
    Department dept = elm.getValue().getOnly(departmentsTag);
    Iterable<Employee> employees = elm.getValue().getAll(employeesTag);

    for (Employee empl : employees) {
      ExtendedEmployee result = empl.extendWith(dept);
      c.output(result);
    }
  }
}

// The syntax for a CoGroupByKey operation is a bit verbose.
// In this step we define a TupleTag, which serves as identifier for a
// PCollection.
// (These tags must be in scope before the JoinEmployeesWithDepartments DoFn that uses them.)
final TupleTag<Employee> employeesTag = new TupleTag<>();
final TupleTag<Department> departmentsTag = new TupleTag<>();

// We use the PCollection tuple-tags to join the two PCollections.
PCollection<KV<String, CoGbkResult>> results =
    KeyedPCollectionTuple.of(departmentsTag, departments)
        .and(employeesTag, employees)
        .apply(CoGroupByKey.create());

// Finally, we convert the joined PCollections into a kind that
// we can use: ExtendedEmployee.
PCollection<ExtendedEmployee> extendedEmployees =
    results.apply("ExtendInformation", ParDo.of(new JoinEmployeesWithDepartments()));
