Batch-renaming files/folders using an "index"

0 votes
5 answers
788 views
Asked 2025-04-17 16:29

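Batch-renaming files and folders is a common question, but after some searching I could not find anyone who has asked something similar to mine.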

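Background: we send biological samples to a service provider, who returns files with unique names together with a plain-text table that lists each file name, the corresponding sample, and other information: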

head samples.txt
fq_file Sample_ID   Sample_name Library_ID  FC_Number   Track_Lanes_Pos
L2369_Track-3885_R1.fastq.gz    S1746_B_7_t B 7 t   L2369_B_7_t 163 6
L2349_Track-3865_R1.fastq.gz    S1726_A_3_t A 3 t   L2349_A_3_t 163 5
L2354_Track-3870_R1.fastq.gz    S1731_A_GFP_c   A GFP c L2354_A_GFP_c   163 5
L2377_Track-3893_R1.fastq.gz    S1754_B_7_c B 7 c   L2377_B_7_c 163 7
L2362_Track-3878_R1.fastq.gz    S1739_B_GFP_t   B GFP t L2362_B_GFP_t   163 6
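Reading the table programmatically is straightforward if it really is tab-delimited. The following is a minimal Python sketch under that assumption, using the column names shown above. `load_samples` is a hypothetical helper name chosen for illustration, and the two data rows are copied from the listing with explicit tabs between columns.

```python
import csv
import io


def load_samples(handle):
    """Map each fq_file to its Sample_name, reading a tab-delimited table.

    Assumes a header row containing 'fq_file' and 'Sample_name' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["fq_file"]: row["Sample_name"] for row in reader}


# Two rows from the table above, with explicit tabs (an assumption about the file).
example = io.StringIO(
    "fq_file\tSample_ID\tSample_name\tLibrary_ID\tFC_Number\tTrack_Lanes_Pos\n"
    "L2369_Track-3885_R1.fastq.gz\tS1746_B_7_t\tB 7 t\tL2369_B_7_t\t163\t6\n"
    "L2349_Track-3865_R1.fastq.gz\tS1726_A_3_t\tA 3 t\tL2349_A_3_t\t163\t5\n"
)
samples = load_samples(example)
print(samples["L2369_Track-3885_R1.fastq.gz"])  # B 7 t
```

With the real file, the same function could be called as `load_samples(open("samples.txt", newline=""))`.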

Directory structure (34 directories in total):

L2369_Track-3885_
   accepted_hits.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info
L2349_Track-3865_
   accepted_hits.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info

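Goal: because these file names are meaningless and hard to read, I want to rename the files ending in .bam (keeping the suffix) and their folders after the corresponding sample name, with its parts reordered more sensibly. The final result should look like this: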

7_t_B
   7_t_B.bam      
   deletions.bed   
   junctions.bed         
   logs
   7_t_B.bam.bai  
   insertions.bed  
   left_kept_reads.info
3_t_A
   3_t_A.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info
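The reordering itself is a small string operation: the three space-separated parts of Sample_name ("B 7 t") are rotated so that the first part moves to the end ("7_t_B"). A minimal Python sketch of just that step follows; `target_name` is a hypothetical name introduced here, and the three-part layout is an assumption based on the sample rows shown.

```python
def target_name(sample_name):
    """Turn a Sample_name such as 'B 7 t' into the new base name '7_t_B'."""
    parts = sample_name.split()
    if len(parts) != 3:
        # Fail loudly rather than produce a misleading name.
        raise ValueError("expected three space-separated parts: %r" % sample_name)
    first, second, third = parts
    return "_".join([second, third, first])


print(target_name("B 7 t"))    # 7_t_B
print(target_name("A GFP c"))  # GFP_c_A
```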

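I cobbled together a solution with bash and Python (I'm a beginner), but it feels overly complicated. My question is whether there is a simpler or more elegant approach that I have overlooked. Solutions can be in Python, bash, or R, or in awk, since I am trying to learn it. Being a relative beginner does tend to make one overcomplicate things.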

Here is my solution:

A wrapper that ties everything together and gives an idea of the workflow:

#! /bin/bash

# select columns of interest and write them to a file - basenames
# (> rather than >> so that re-running does not append duplicate rows)
tail -n +2 samples.txt |  cut -d$'\t' -f1,3 > BAMfilames.txt 

# call my little python script that creates a new .sh with the renaming commands
./renameBamFiles.py

# finally do the renaming
./renameBam.sh

# and the folders too
./renameBamFolder.sh

renameBamFiles.py:

#! /usr/bin/env python
import re

# Read in the data sample file and create a bash file that will rename the tophat output 
# the renaming will be as follows:
# mv L2377_Track-3893_R1_ L2377_Track-3893_R1_SRSF7_cyto_B
# 

# Set the input file name
# (The program must be run from within the directory 
#  that contains this data file)
InFileName = 'BAMfilames.txt'


### Rename BAM files

# Open the input file for reading
InFile = open(InFileName, 'r')


# Open the output file for writing
OutFileName= 'renameBam.sh'

OutFile=open(OutFileName,'w') # 'w' overwrites on each run; 'a' would append instead

OutFile.write("#! /bin/bash"+"\n")
OutFile.write(" "+"\n")


# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line=Line.strip('\n')

    ## Separate the line into a list of its tab-delimited components
    ElementList=Line.split('\t')

    # separate the folder string from the experimental name
    fileroot=ElementList[1]
    fileroot=fileroot.split()

    # create variable names using regex
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName=folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])

    command= "for file in %s/accepted_hits.*; do mv $file ${file/accepted_hits/%s}; done" % (folderName, fileName)

    print(command)  # parenthesized form works in both Python 2 and 3
    OutFile.write(command+"\n")  


# After the loop is completed, close the files
InFile.close()
OutFile.close()


### Rename folders

# Open the input file for reading
InFile = open(InFileName, 'r')


# Open the output file for writing
OutFileName= 'renameBamFolder.sh'

OutFile=open(OutFileName,'w') 

OutFile.write("#! /bin/bash"+"\n")
OutFile.write(" "+"\n")


# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line=Line.strip('\n')

    ## Separate the line into a list of its tab-delimited components
    ElementList=Line.split('\t')

    # separate the folder string from the experimental name
    fileroot=ElementList[1]
    fileroot=fileroot.split()

    # create variable names using regex
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName=folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])

    command= "mv %s %s" % (folderName, fileName)

    print(command)  # parenthesized form works in both Python 2 and 3

    OutFile.write(command+"\n")  


# After the loop is completed, close the files
InFile.close()
OutFile.close()

renameBam.sh, created by the Python script above:

#! /bin/bash

for file in L2369_Track-3885_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/7_t_B}; done
for file in L2349_Track-3865_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/3_t_A}; done
for file in L2354_Track-3870_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/GFP_c_A}; done
(..)

renameBamFolder.sh is very similar:

mv L2369_Track-3885_R1_ 7_t_B
mv L2349_Track-3865_R1_ 3_t_A
mv L2354_Track-3870_R1_ GFP_c_A
mv L2377_Track-3893_R1_ 7_c_B

Since I'm learning, I think it would be really helpful to see different solutions and ways of thinking about the problem.

5 Answers

0

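It looks like you can simply read the required fields from the index file with a while loop. The structure of the file is not obvious, so I am assuming that it is whitespace-separated and that Sample_Id actually spans four fields (a complex sample_id followed by the three components of the name). Perhaps you have a tab-delimited file in which the Sample_Id field contains embedded spaces? Either way, this should be easy to adapt if my assumption is wrong.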

# Skip the annoying field names
tail -n +2 samples.txt |
while read fq _ c a b chaff; do
    dir=${fq%R1.fastq.gz}
    new="${a}_${b}_$c"
    echo mv "$dir"/accepted_hits.bam "$dir/$new".bam
    echo mv "$dir"/accepted_hits.bam.bai "$dir/$new".bam.bai
    echo mv "$dir" "$new"
done

If the output looks like what you want, remove the echo.
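The same "preview first, then execute" habit carries over to Python. The sketch below assumes a hypothetical helper `move` with a `dry_run` flag, which is not part of the answer above. It relies only on `shutil.move` from the standard library and runs inside a throwaway temporary directory so that nothing real is touched.

```python
import os
import shutil
import tempfile


def move(src, dst, dry_run=True):
    """Print the planned move; perform it only when dry_run is False."""
    print("mv %s %s" % (src, dst))
    if not dry_run:
        shutil.move(src, dst)


# Demonstration in a temporary directory (directory names taken from the question).
workdir = tempfile.mkdtemp()
old = os.path.join(workdir, "L2369_Track-3885_")
new = os.path.join(workdir, "7_t_B")
os.mkdir(old)

move(old, new)                 # preview only: the directory is left in place
still_old = os.path.isdir(old)
move(old, new, dry_run=False)  # now the rename really happens
renamed = os.path.isdir(new) and not os.path.exists(old)
shutil.rmtree(workdir)
```

Defaulting `dry_run` to True means a forgotten argument results in a harmless preview rather than an irreversible rename.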

0

Here is a way to do it with a single script. You would run it like this:

script.sh /path/to/samples.txt /path/to/data

Here are the contents of script.sh:

# add directory names to an array
while IFS= read -r -d '' dir; do

    dirs+=("$dir")

done < <(find "$2"/* -type d -print0)


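# process the sample list; splitting on any whitespace puts the three
# parts of Sample_name ("B 7 t") into list[2], list[3] and list[4]
while read -r -a list; do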

    for i in "${dirs[@]}"; do

        # if the directory is in the sample list
        if [ "${i##*/}" == "${list[0]%R1.fastq.gz}" ]; then

            tag="${list[3]}_${list[4]}_${list[2]}"
            new="${i%/*}/$tag"
            bam="$i/accepted_hits.bam"

            # only change name if there's a bam file
            if [ -f "$bam" ]; then

                mv "$i" "$new"
                mv "$new/accepted_hits.bam" "$new/$tag.bam"
            fi
        fi
    done

done < <(tail -n +2 "$1")

2

There is a simple way to do this in bash:

find . -type d -print |
while IFS= read -r oldPath; do

   parent=$(dirname "$oldPath")
   old=$(basename "$oldPath")
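   # literal prefix match on column 1 (as a regex, the "." that find prints first would match every row)
   new=$(awk -v old="$old" 'index($1, old) == 1 {print $4"_"$5"_"$3}' samples.txt)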

   if [ -n "$new" ]; then
      newPath="${parent}/${new}"
      echo mv "$oldPath" "$newPath"
      echo mv "${newPath}/accepted_hits.bam" "${newPath}/${new}.bam"
   fi
done

After an initial test, remove the "echo" so that the "mv" commands are actually executed.
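The lookup above pairs each directory with the row whose first column begins with the directory name. The same idea is shown below as a Python sketch. `find_sample` is a hypothetical name, and the rows are copied from the question. A literal `startswith` test is used so that characters such as "." or "-" in a directory name are never interpreted as pattern syntax.

```python
def find_sample(dirname, rows):
    """Return the Sample_name whose fq_file starts with `dirname`, or None.

    `rows` is an iterable of (fq_file, sample_name) pairs.
    """
    matches = [name for fq, name in rows if fq.startswith(dirname)]
    # Refuse missing or ambiguous prefixes rather than guessing.
    return matches[0] if len(matches) == 1 else None


rows = [
    ("L2369_Track-3885_R1.fastq.gz", "B 7 t"),
    ("L2349_Track-3865_R1.fastq.gz", "A 3 t"),
]
print(find_sample("L2369_Track-3885_", rows))  # B 7 t
print(find_sample("logs", rows))               # None
```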

If your target directories are all at the same level, as @triplee's answer assumes, it is even simpler. Just cd into their parent directory and run:

awk 'NR>1{sub(/[^_]+$/,"",$1); print $1" "$4"_"$5"_"$3}' samples.txt |
while read -r old new; do
   echo mv "$old" "$new"
   echo mv "${new}/accepted_hits.bam" "${new}/${new}.bam"
done

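In your expected output, you renamed the ".bai" file in one place but not in the other, so it is unclear whether you want it renamed. If you do, just add the following to whichever solution you choose:

echo mv "${new}/accepted_hits.bam.bai" "${new}/${new}.bam.bai"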
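An alternative to listing each file explicitly is to rename everything that shares the accepted_hits prefix, much as the generated renameBam.sh in the question does with its glob. Below is a minimal Python sketch of that approach using `pathlib`. `rename_hits` is a hypothetical helper, and the demonstration runs on empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def rename_hits(directory, new_name, dry_run=True):
    """Rename every accepted_hits.* file in `directory` to new_name.*.

    Returns the list of (old, new) paths so the plan can be inspected.
    """
    directory = Path(directory)
    plan = []
    # sorted() materializes the listing before any file is renamed.
    for old in sorted(directory.glob("accepted_hits.*")):
        # Keep everything after the prefix (.bam, .bam.bai, ...).
        suffix = old.name[len("accepted_hits"):]
        new = old.with_name(new_name + suffix)
        plan.append((old, new))
        if not dry_run:
            old.rename(new)
    return plan


with tempfile.TemporaryDirectory() as tmp:
    for name in ("accepted_hits.bam", "accepted_hits.bam.bai", "deletions.bed"):
        (Path(tmp) / name).touch()
    rename_hits(tmp, "7_t_B", dry_run=False)
    result = sorted(p.name for p in Path(tmp).iterdir())
    print(result)  # ['7_t_B.bam', '7_t_B.bam.bai', 'deletions.bed']
```

Files that do not start with accepted_hits, such as deletions.bed, are left untouched.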
