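Batch renaming files/folders using an "index"
Batch renaming files and folders is a common question, but after some searching I could not find anyone asking something similar to mine.
Background: we send biological samples to a service provider, who returns files with unique names plus a table in text format that lists, among other information, each file name and the sample it came from: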
head samples.txt
fq_file Sample_ID Sample_name Library_ID FC_Number Track_Lanes_Pos
L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6
L2349_Track-3865_R1.fastq.gz S1726_A_3_t A 3 t L2349_A_3_t 163 5
L2354_Track-3870_R1.fastq.gz S1731_A_GFP_c A GFP c L2354_A_GFP_c 163 5
L2377_Track-3893_R1.fastq.gz S1754_B_7_c B 7 c L2377_B_7_c 163 7
L2362_Track-3878_R1.fastq.gz S1739_B_GFP_t B GFP t L2362_B_GFP_t 163 6
Directory structure (34 directories in total):
L2369_Track-3885_
accepted_hits.bam
deletions.bed
junctions.bed
logs
accepted_hits.bam.bai
insertions.bed
left_kept_reads.info
L2349_Track-3865_
accepted_hits.bam
deletions.bed
junctions.bed
logs
accepted_hits.bam.bai
insertions.bed
left_kept_reads.info
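Goal: because the file names are meaningless and hard to interpret, I want to rename the files ending in .bam (keeping the suffix) and their folders after the corresponding sample name, with the parts of the name rearranged in a more sensible order. The end result should look like this: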
7_t_B
7_t_B.bam
deletions.bed
junctions.bed
logs
7_t_B.bam.bai
insertions.bed
left_kept_reads.info
3_t_A
3_t_A.bam
deletions.bed
junctions.bed
logs
accepted_hits.bam.bai
insertions.bed
left_kept_reads.info
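I cobbled together a solution with bash and python (I am a newbie), but it feels overly complicated. My question is whether there is a simpler or more elegant way that I have not thought of. Solutions can be in python, bash, or R, or in awk, since I am trying to learn it. Being a relative beginner does tend to make things complicated.
Here is my solution:
A wrapper ties everything together and gives an idea of the workflow: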
#! /bin/bash
# select columns of interest and write them to a file - basenames
tail -n +2 samples.txt | cut -d$'\t' -f1,3 >> BAMfilames.txt
# call my little python script that creates a new .sh with the renaming commands
./renameBamFiles.py
# finally do the renaming
./renameBam.sh
# and the folders too
./renameBamFolder.sh
renameBamFiles.py:
#! /usr/bin/env python
import re
# Read in the data sample file and create a bash file that will rename the tophat output
# the renaming will be as follows:
# mv L2377_Track-3893_R1_ L2377_Track-3893_R1_SRSF7_cyto_B
#
# Set the input file name
# (The program must be run from within the directory
# that contains this data file)
InFileName = 'BAMfilames.txt'
### Rename BAM files
# Open the input file for reading
InFile = open(InFileName, 'r')
# Open the output file for writing
OutFileName= 'renameBam.sh'
OutFile=open(OutFileName,'w') # You can append instead with 'a'
OutFile.write("#! /bin/bash"+"\n")
OutFile.write(" "+"\n")
# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line=Line.strip('\n')
    ## Separate the line into a list of its tab-delimited components
    ElementList=Line.split('\t')
    # separate the folder string from the experimental name
    fileroot=ElementList[1]
    fileroot=fileroot.split()
    # create variable names using regex
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName=folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])
    command= "for file in %s/accepted_hits.*; do mv $file ${file/accepted_hits/%s}; done" % (folderName, fileName)
    print(command)
    OutFile.write(command+"\n")
# After the loop is completed, close the files
InFile.close()
OutFile.close()
### Rename folders
# Open the input file for reading
InFile = open(InFileName, 'r')
# Open the output file for writing
OutFileName= 'renameBamFolder.sh'
OutFile=open(OutFileName,'w')
OutFile.write("#! /bin/bash"+"\n")
OutFile.write(" "+"\n")
# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line=Line.strip('\n')
    ## Separate the line into a list of its tab-delimited components
    ElementList=Line.split('\t')
    # separate the folder string from the experimental name
    fileroot=ElementList[1]
    fileroot=fileroot.split()
    # create variable names using regex
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName=folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])
    command= "mv %s %s" % (folderName, fileName)
    print(command)
    OutFile.write(command+"\n")
# After the loop is completed, close the files
InFile.close()
OutFile.close()
RenameBam.sh - created by the python script above:
#! /bin/bash
for file in L2369_Track-3885_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/7_t_B}; done
for file in L2349_Track-3865_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/3_t_A}; done
for file in L2354_Track-3870_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/GFP_c_A}; done
(..)
renameBamFolder.sh is very similar:
mv L2369_Track-3885_R1_ 7_t_B
mv L2349_Track-3865_R1_ 3_t_A
mv L2354_Track-3870_R1_ GFP_c_A
mv L2377_Track-3893_R1_ 7_c_B
Because I am learning, I think seeing different solutions and ways of thinking about the problem would be very helpful.
5 Answers
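It looks like you can simply use a while loop to read the required fields from the index file. The structure of the file is not obvious, so I am assuming it is whitespace-separated and that Sample_Id actually spans four fields (a complex sample_id followed by the three parts of the name). Perhaps you have a tab-delimited file and the Sample_Id field contains spaces? In any case, this should be easy to adapt if my assumption is wrong.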
# Skip the annoying field names (start output at line 2)
tail -n +2 samples.txt |
while read -r fq _ c a b chaff; do
    dir=${fq%R1.fastq.gz}
    new="${a}_${b}_$c"
    echo mv "$dir"/accepted_hits.bam "$dir/$new".bam
    echo mv "$dir"/accepted_hits.bam.bai "$dir/$new".bam.bai
    echo mv "$dir" "$new"
done
If the output looks like what you want, remove the echo.
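The same "print first, rename later" idea can be sketched in Python. This is only an illustration under the same assumption as the loop above (whitespace-separated rows, with the name parts in fields 3 to 5); `plan_renames` and `apply_plan` are hypothetical helper names, not part of any library.

```python
from pathlib import Path

SUFFIX = "R1.fastq.gz"  # assumed trailing part of every fq_file entry


def plan_renames(rows, root):
    """Return (source, destination) pairs; nothing on disk is touched here."""
    plan = []
    for line in rows:
        fields = line.split()
        if len(fields) < 5 or fields[0] == "fq_file":
            continue  # skip the header row and blank or short lines
        fq, _sample_id, c, a, b = fields[:5]
        stem = fq[:-len(SUFFIX)] if fq.endswith(SUFFIX) else fq
        old_dir = Path(root) / stem
        new = f"{a}_{b}_{c}"
        # Files first, directory last, so the file paths are still valid.
        plan.append((old_dir / "accepted_hits.bam", old_dir / f"{new}.bam"))
        plan.append((old_dir / "accepted_hits.bam.bai", old_dir / f"{new}.bam.bai"))
        plan.append((old_dir, Path(root) / new))
    return plan


def apply_plan(plan, dry_run=True):
    """Print the planned moves, or perform them when dry_run is False."""
    for src, dst in plan:
        if dry_run:
            print("mv", src, dst)
        elif src.exists():
            src.rename(dst)


rows = [
    "fq_file Sample_ID Sample_name Library_ID FC_Number Track_Lanes_Pos",
    "L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6",
]
apply_plan(plan_renames(rows, "."))  # dry run: only prints the mv commands
```

Calling `apply_plan(plan, dry_run=False)` plays the role of removing the echo.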
Here is one way to do it with a single script. You can run it like this:
script.sh /path/to/samples.txt /path/to/data
Contents of script.sh:
#!/bin/bash
# add directory names to an array
while IFS= read -r -d '' dir; do
    dirs+=("$dir")
done < <(find "$2"/* -type d -print0)
# process the sample list
while IFS=$'\t' read -r -a list; do
    for i in "${dirs[@]}"; do
        # if the directory is in the sample list
        if [ "${i##*/}" == "${list[0]%R1.fastq.gz}" ]; then
            tag="${list[3]}_${list[4]}_${list[2]}"
            new="${i%/*}/$tag"
            bam="$i/accepted_hits.bam"
            # only change name if there's a bam file
            if [ -f "$bam" ]; then
                mv "$i" "$new"
                mv "$new/accepted_hits.bam" "$new/$tag.bam"
            fi
        fi
    done
done < <(tail -n +2 "$1")
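For comparison, the matching step can be written with a dictionary lookup instead of a nested loop. The following is a rough Python sketch, assuming (as the script above does) a tab-delimited file whose fields 3, 4, and 5 hold the three name parts; `rename_matching_dirs` is a made-up name for illustration.

```python
import csv
from pathlib import Path

SUFFIX = "R1.fastq.gz"  # assumed trailing part of every fq_file entry


def rename_matching_dirs(samples_path, data_dir):
    """Rename each sample directory that contains accepted_hits.bam."""
    # Index the immediate subdirectories by name for direct lookup.
    dirs = {p.name: p for p in Path(data_dir).iterdir() if p.is_dir()}
    renamed = []
    with open(samples_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) < 5:
                continue
            fq = row[0]
            key = fq[:-len(SUFFIX)] if fq.endswith(SUFFIX) else fq
            old = dirs.get(key)
            # Only change names if the directory exists and holds a bam file.
            if old is None or not (old / "accepted_hits.bam").is_file():
                continue
            tag = f"{row[3]}_{row[4]}_{row[2]}"
            new = old.with_name(tag)
            old.rename(new)
            (new / "accepted_hits.bam").rename(new / f"{tag}.bam")
            renamed.append(new)
    return renamed
```

If the Sample_name column is really a single tab-delimited field containing spaces, the field indices would need adjusting.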
There is a simple way to do this in bash:
# -mindepth 1 skips "." itself, whose name would match every row as a regex
find . -mindepth 1 -type d -print |
while IFS= read -r oldPath; do
    parent=$(dirname "$oldPath")
    old=$(basename "$oldPath")
    new=$(awk -v old="$old" '$1~"^"old{print $4"_"$5"_"$3}' samples.txt)
    if [ -n "$new" ]; then
        newPath="${parent}/${new}"
        echo mv "$oldPath" "$newPath"
        echo mv "${newPath}/accepted_hits.bam" "${newPath}/${new}.bam"
    fi
done
After initial testing, remove the echo commands so that the mv commands are actually executed.
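One thing worth knowing about the awk test `$1~"^"old` is that it treats the directory name as a regular expression, so characters such as `.` act as wildcards. A literal prefix comparison avoids that. The sketch below is one possible Python version of the lookup, assuming whitespace-separated rows; `new_name_for` is a hypothetical helper, and it returns nothing when zero or several rows match rather than guessing.

```python
def new_name_for(dirname, sample_lines):
    """Return the new name for dirname, or None if the match is not unique."""
    if not dirname:
        return None
    matches = []
    for line in list(sample_lines)[1:]:  # drop the header row
        fields = line.split()
        # str.startswith compares literally, unlike the awk regex match.
        if len(fields) >= 5 and fields[0].startswith(dirname):
            matches.append(f"{fields[3]}_{fields[4]}_{fields[2]}")
    return matches[0] if len(matches) == 1 else None


lines = [
    "fq_file Sample_ID Sample_name Library_ID FC_Number Track_Lanes_Pos",
    "L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6",
    "L2349_Track-3865_R1.fastq.gz S1726_A_3_t A 3 t L2349_A_3_t 163 5",
]
print(new_name_for("L2369_Track-3885_", lines))
```

A subdirectory such as logs matches no row, so it is left alone.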
If your target directories are all at the same level, as in @triplee's answer, it is even simpler. Just cd into their parent directory and run:
awk 'NR>1{sub(/[^_]+$/,"",$1); print $1" "$4"_"$5"_"$3}' samples.txt |
while read -r old new; do
    echo mv "$old" "$new"
    echo mv "${new}/accepted_hits.bam" "${new}/${new}.bam"
done
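For readers more comfortable with Python than awk, the awk step above (strip everything after the last underscore of the first field, then reorder fields 3 to 5) might look roughly like this. It is a sketch assuming whitespace-separated rows; `old_and_new` is an illustrative name only.

```python
import re


def old_and_new(line):
    """Return (old directory name, new name) for one data row."""
    fields = line.split()
    # Same pattern as the awk sub(): drop the trailing run of non-underscores.
    old = re.sub(r"[^_]+$", "", fields[0])
    new = f"{fields[3]}_{fields[4]}_{fields[2]}"
    return old, new


row = "L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6"
print(old_and_new(row))
```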
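In your expected output you renamed the .bai file in one place but not in the other, so it is unclear whether you want it renamed. If you do, just add the following to whichever solution you choose:
echo mv "${new}/accepted_hits.bam.bai" "${new}/${new}.bam.bai"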
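An alternative that covers both the .bam and the .bam.bai file in one pass is to rename every accepted_hits.* file in a directory while keeping whatever follows the stem, much like the `${file/accepted_hits/...}` loop in the question. A minimal Python sketch, with `rename_accepted_hits` as a hypothetical helper name:

```python
from pathlib import Path

STEM = "accepted_hits"


def rename_accepted_hits(directory, new_stem):
    """Rename accepted_hits.* to new_stem.*, preserving the full suffix."""
    renamed = []
    for src in sorted(Path(directory).glob(STEM + ".*")):
        # Keep everything after the stem, e.g. ".bam" or ".bam.bai".
        dst = src.with_name(new_stem + src.name[len(STEM):])
        src.rename(dst)
        renamed.append(dst.name)
    return renamed
```

Called as `rename_accepted_hits("7_t_B", "7_t_B")` after the folder has been renamed, it leaves files such as deletions.bed untouched.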